* [PATCH 0/14] oom detection rework v6
@ 2016-04-20 19:47 Michal Hocko
  2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
                   ` (14 more replies)
  0 siblings, 15 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

Hi,

This is v6 of the series. The previous version was posted [1]. The
code hasn't changed much since then. I have found one long standing
bug (patch 1) which just got much more severe and visible with this
series. Other than that I have reorganized the series and put the
compaction feedback abstraction to the front just in case we find out
that parts of the series would have to be reverted later on for some
reason. The premature oom killer invocation reported by Hugh [2] seems
to be addressed.

We have discussed this series at LSF/MM summit in Raleigh and there
didn't seem to be any concerns/objections to go on with the patch set
and target it for the next merge window. 

Motivation:
As pointed out by Linus [3][4], relying on zone_reclaimable as a way to
communicate the reclaim progress is rather dubious. I tend to agree,
not only is it really obscure, it is also not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
making any progress because their gfp_mask wouldn't allow them to get that
page anyway (e.g. a single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in practice, but the current logic is obscure,
hard to follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather than
spread out between the allocator and the reclaim. More on the
implementation is described in the first patch.

I have tested several different scenarios but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually only a tiny gap between almost OOM and full blown OOM, which is
often time sensitive. Anyway, I have tested the following 2 scenarios and
I would appreciate suggestions for more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs without
any swap to make the OOM behavior more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
   file size, removing the files and starting over again) running in
   parallel for 10s to build up a lot of dirty pages, then 100 parallel
   mem_eaters (anon private populated mmap which waits until it gets a
   signal), 80M each.

   This causes an OOM flood of course and I have compared both patched
   and unpatched kernels. The test is considered finished once no more
   OOM conditions are detected. This should tell us whether there are any
   excessive or premature kills (e.g. due to dirty pages). The numbers are
   below, after a minimal sketch of the mem_eater program.
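
A minimal mem_eater sketch (my reconstruction of what such a program could
look like, not the exact one used for these runs):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 80UL << 20;	/* 80M per instance */
	void *p;

	/* anon private mapping, pre-faulted thanks to MAP_POPULATE */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pause();	/* keep the memory pinned until a signal arrives */
	return 0;
}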

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
$ grep "Out of memory:" patched-oom-run2.log | wc -l
77

$ grep "Kill process" patched-oom-run1.log | tail -n1
[  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel ran noticeably longer while invoking the OOM killer the
same number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. The free pages stats
show that neither kernel went OOM too early most of the time, though. I
guess the difference is in the backoff: retries without any progress sleep
for a while if there is memory under writeback or dirty memory, which is
highly likely considering the parallel IO.
Both kernels have seen races where a zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.

2) 2 writers again with a 10s run and then 10 mem_eaters to consume as much
   memory as possible without triggering the OOM killer. This required a lot
   of tuning but I've considered 3 consecutive runs in three different boots
   without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
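
(For reference: the only difference between the two formulas is the per
mem_eater reserve, 16M vs. 12M, i.e. 4M for each of the 10 mem_eaters.)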

That means 40M more memory was usable without triggering the OOM killer. The
base kernel sometimes managed to handle the same amount as the patched one
but it wasn't consistent and failed in at least one of the 3 runs. This seems
like a minor improvement.

I was also testing __GFP_REPEAT costly requests (hugetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see a huge improvement there.

I am certainly interested in other usecases as well as any feedback,
especially those which require higher order requests.

* Changes since v5
- added "vmscan: consider classzone_idx in compaction_ready"
- added "mm, oom, compaction: prevent from should_compact_retry looping
  for ever for costly orders"
- acked-bys from Vlastimil
- integrated feedback from review
* Changes since v4
- dropped __GFP_REPEAT for costly allocation as it is now replaced by
  the compaction based feedback logic
- !costly high order requests are retried based on the compaction feedback
- compaction feedback has been tweaked to give us useful information
  for making decisions in the page allocator
- rebased on the current mmotm-2016-04-01-16-24 with the previous version
  of the rework reverted

* Changes since v3
- factor out the new heuristic into its own function as suggested by
  Johannes (no functional changes)

* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
  wait_iff_congested related changes which needed refresh in
  patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
  the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- __GFP_NOFAIL high order might fail fixed - theoretical bug

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils
[3] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[4] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com


* [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  3:32   ` Hillf Danton
  2016-05-04 13:56   ` Michal Hocko
  2016-04-20 19:47 ` [PATCH 02/14] mm, compaction: change COMPACT_ constants into enum Michal Hocko
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

While playing with the oom detection rework [1] I have noticed
that my heavy order-9 (hugetlb) load close to OOM ended up in an
endless loop where the reclaim hasn't made any progress but
did_some_progress didn't reflect that and compaction_suitable
was backing off because no zone is above low wmark + 1 << order.

It turned out that this is in fact a long standing bug in compaction_ready
which ignores the requested_highidx and does the watermark check for
classzone_idx 0. This succeeds for zone DMA most of the time as the zone
is mostly unused because of lowmem protection. This also means that the
OOM killer wouldn't be triggered for higher order requests even when
there is no reclaim progress and we essentially rely on order-0 requests
to find this out. This has been broken in one way or another since
fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
are sufficient free pages available") but it is only since 7335084d446b ("mm:
vmscan: do not OOM if aborting reclaim to start compaction") that we do not
invoke the OOM killer based on the wrong calculation.

Propagate requested_highidx down to compaction_ready and use it for both
the watermark check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c839adc13efd..3e6347e2a5fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
  * Returns true if compaction should go ahead for a high-order request, or
  * the high-order allocation would succeed without compaction.
  */
-static inline bool compaction_ready(struct zone *zone, int order)
+static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
 	unsigned long balance_gap, watermark;
 	bool watermark_ok;
@@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
 			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
 	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
 	/*
 	 * If compaction is deferred, reclaim up to a point where
@@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	 * If compaction is not ready to start and allocation is not likely
 	 * to succeed without it, then keep reclaiming.
 	 */
-	if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
+	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
 		return false;
 
 	return watermark_ok;
@@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
 			    zonelist_zone_idx(z) <= requested_highidx &&
-			    compaction_ready(zone, sc->order)) {
+			    compaction_ready(zone, sc->order, requested_highidx)) {
 				sc->compaction_ready = true;
 				continue;
 			}
-- 
2.8.0.rc3


* [PATCH 02/14] mm, compaction: change COMPACT_ constants into enum
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
  2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-20 19:47 ` [PATCH 03/14] mm, compaction: cover all compaction mode in compact_zone Michal Hocko
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long

but there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value
so it really makes sense to define an enum for them and make it clear
that no other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
 mm/compaction.c            | 27 ++++++++++++++-------------
 mm/page_alloc.c            |  2 +-
 3 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d7c8de583a23..4458fd94170f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
 #define _LINUX_COMPACTION_H
 
 /* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED	0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED		1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE	2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL		3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE	4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE	5
-#define COMPACT_NOT_SUITABLE_ZONE	6
-#define COMPACT_CONTENDED		7
 /* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+	/* compaction didn't start as it was deferred due to past failures */
+	COMPACT_DEFERRED,
+	/*
+	 * compaction didn't start as it was not possible or direct reclaim
+	 * was more suitable
+	 */
+	COMPACT_SKIPPED,
+	/* compaction should continue to another pageblock */
+	COMPACT_CONTINUE,
+	/*
+	 * direct compaction partially compacted a zone and there are suitable
+	 * pages
+	 */
+	COMPACT_PARTIAL,
+	/* The full zone was compacted */
+	COMPACT_COMPLETE,
+	/* For more detailed tracepoint output */
+	COMPACT_NO_SUITABLE_PAGE,
+	COMPACT_NOT_SUITABLE_ZONE,
+	COMPACT_CONTENDED,
+};
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
 /* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+			unsigned int order,
 			int alloc_flags, const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx);
 
 extern void defer_compaction(struct zone *zone, int order);
@@ -57,7 +66,7 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
 			const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended)
@@ -73,7 +82,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
 
-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
 	return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8cc495042303..8ae7b1c46c72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1281,7 +1281,7 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
 	unsigned int order;
@@ -1344,8 +1344,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
 	return COMPACT_NO_SUITABLE_PAGE;
 }
 
-static int compact_finished(struct zone *zone, struct compact_control *cc,
-			    const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+			struct compact_control *cc,
+			const int migratetype)
 {
 	int ret;
 
@@ -1364,7 +1365,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
  *   COMPACT_PARTIAL  - If the allocation would succeed without compaction
  *   COMPACT_CONTINUE - If compaction should run now
  */
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
 	int fragindex;
@@ -1409,10 +1410,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 	return COMPACT_CONTINUE;
 }
 
-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
-	unsigned long ret;
+	enum compact_result ret;
 
 	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
 	trace_mm_compaction_suitable(zone, order, ret);
@@ -1422,9 +1423,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
 	return ret;
 }
 
-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
 {
-	int ret;
+	enum compact_result ret;
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(zone);
 	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1588,11 +1589,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
 		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
 		int alloc_flags, int classzone_idx)
 {
-	unsigned long ret;
+	enum compact_result ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -1631,7 +1632,7 @@ int sysctl_extfrag_threshold = 500;
  *
  * This is the main entry point for direct page compaction.
  */
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			int alloc_flags, const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended)
 {
@@ -1639,7 +1640,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
-	int rc = COMPACT_DEFERRED;
+	enum compact_result rc = COMPACT_DEFERRED;
 	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
 
 	*contended = COMPACT_CONTENDED_NONE;
@@ -1653,7 +1654,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
-		int status;
+		enum compact_result status;
 		int zone_contended;
 
 		if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4efafc38273..06af8a757d52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2947,7 +2947,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum migrate_mode mode, int *contended_compaction,
 		bool *deferred_compaction)
 {
-	unsigned long compact_result;
+	enum compact_result compact_result;
 	struct page *page;
 
 	if (!order)
-- 
2.8.0.rc3


* [PATCH 03/14] mm, compaction: cover all compaction mode in compact_zone
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
  2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
  2016-04-20 19:47 ` [PATCH 02/14] mm, compaction: change COMPACT_ constants into enum Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-20 19:47 ` [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED Michal Hocko
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

the compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"

mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
  switch (ret) {
  ^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
---
 mm/compaction.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8ae7b1c46c72..b06de27b7f72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,15 +1433,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 
 	ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
 							cc->classzone_idx);
-	switch (ret) {
-	case COMPACT_PARTIAL:
-	case COMPACT_SKIPPED:
-		/* Compaction is likely to fail */
+	/* Compaction is likely to fail */
+	if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
 		return ret;
-	case COMPACT_CONTINUE:
-		/* Fall through to compaction */
-		;
-	}
+
+	/* huh, compaction_suitable is returning something unexpected */
+	VM_BUG_ON(ret != COMPACT_CONTINUE);
 
 	/*
 	 * Clear pageblock skip if there were failures recently and compaction
-- 
2.8.0.rc3


* [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (2 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 03/14] mm, compaction: cover all compaction mode in compact_zone Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  7:08   ` Hillf Danton
  2016-04-20 19:47 ` [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE Michal Hocko
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

try_to_compact_pages can currently return COMPACT_SKIPPED even when the
compaction is deferred for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their
ordering in the enum. This means that COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.

As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we will bail out rather than reclaim.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h        | 7 +++++--
 include/trace/events/compaction.h | 2 +-
 mm/compaction.c                   | 8 +++++---
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4458fd94170f..7e177d111c39 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,13 +4,16 @@
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
-	/* compaction didn't start as it was deferred due to past failures */
-	COMPACT_DEFERRED,
 	/*
 	 * compaction didn't start as it was not possible or direct reclaim
 	 * was more suitable
 	 */
 	COMPACT_SKIPPED,
+	/* compaction didn't start as it was deferred due to past failures */
+	COMPACT_DEFERRED,
+	/* compaction not active last round */
+	COMPACT_INACTIVE = COMPACT_DEFERRED,
+
 	/* compaction should continue to another pageblock */
 	COMPACT_CONTINUE,
 	/*
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index e215bf68f521..6ba16c86d7db 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -10,8 +10,8 @@
 #include <trace/events/mmflags.h>
 
 #define COMPACTION_STATUS					\
-	EM( COMPACT_DEFERRED,		"deferred")		\
 	EM( COMPACT_SKIPPED,		"skipped")		\
+	EM( COMPACT_DEFERRED,		"deferred")		\
 	EM( COMPACT_CONTINUE,		"continue")		\
 	EM( COMPACT_PARTIAL,		"partial")		\
 	EM( COMPACT_COMPLETE,		"complete")		\
diff --git a/mm/compaction.c b/mm/compaction.c
index b06de27b7f72..13709e33a2fc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
-	enum compact_result rc = COMPACT_DEFERRED;
+	enum compact_result rc = COMPACT_SKIPPED;
 	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
 
 	*contended = COMPACT_CONTENDED_NONE;
@@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_result status;
 		int zone_contended;
 
-		if (compaction_deferred(zone, order))
+		if (compaction_deferred(zone, order)) {
+			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
+		}
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
 				&zone_contended, alloc_flags,
@@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	 * If at least one zone wasn't deferred or skipped, we report if all
 	 * zones that were tried were lock contended.
 	 */
-	if (rc > COMPACT_SKIPPED && all_zones_contended)
+	if (rc > COMPACT_INACTIVE && all_zones_contended)
 		*contended = COMPACT_CONTENDED_LOCK;
 
 	return rc;
-- 
2.8.0.rc3


* [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (3 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  6:39   ` Hillf Danton
  2016-04-20 19:47 ` [PATCH 06/14] mm, compaction: Update compaction_result ordering Michal Hocko
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

COMPACT_COMPLETE now means that the migration and free scanners met. This is
not very useful information if somebody just wants to use this feedback
and make any decisions based on it. The current caller might be a poor
guy who just happened to scan a tiny portion of the zone and that could be
the reason why no suitable pages were compacted. Make sure we distinguish
full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic about retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to
use COMPACT_PARTIAL_SKIPPED as well but some of them should probably be
reconsidered to defer the compaction only for COMPACT_COMPLETE with the
new semantics.

This patch shouldn't introduce any functional changes.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h        | 10 +++++++++-
 include/trace/events/compaction.h |  1 +
 mm/compaction.c                   | 14 +++++++++++---
 mm/internal.h                     |  1 +
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7e177d111c39..7c4de92d12cc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -21,7 +21,15 @@ enum compact_result {
 	 * pages
 	 */
 	COMPACT_PARTIAL,
-	/* The full zone was compacted */
+	/*
+	 * direct compaction has scanned part of the zone but wasn't successfull
+	 * to compact suitable pages.
+	 */
+	COMPACT_PARTIAL_SKIPPED,
+	/*
+	 * The full zone was compacted scanned but wasn't successfull to compact
+	 * suitable pages.
+	 */
 	COMPACT_COMPLETE,
 	/* For more detailed tracepoint output */
 	COMPACT_NO_SUITABLE_PAGE,
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6ba16c86d7db..36e2d6fb1360 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -14,6 +14,7 @@
 	EM( COMPACT_DEFERRED,		"deferred")		\
 	EM( COMPACT_CONTINUE,		"continue")		\
 	EM( COMPACT_PARTIAL,		"partial")		\
+	EM( COMPACT_PARTIAL_SKIPPED,	"partial_skipped")	\
 	EM( COMPACT_COMPLETE,		"complete")		\
 	EM( COMPACT_NO_SUITABLE_PAGE,	"no_suitable_page")	\
 	EM( COMPACT_NOT_SUITABLE_ZONE,	"not_suitable_zone")	\
diff --git a/mm/compaction.c b/mm/compaction.c
index 13709e33a2fc..e2e487cea5ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
 		if (cc->direct_compaction)
 			zone->compact_blockskip_flush = true;
 
-		return COMPACT_COMPLETE;
+		if (cc->whole_zone)
+			return COMPACT_COMPLETE;
+		else
+			return COMPACT_PARTIAL_SKIPPED;
 	}
 
 	if (is_via_compact_memory(cc->order))
@@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
 		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
 	}
+
+	if (cc->migrate_pfn == start_pfn)
+		cc->whole_zone = true;
+
 	cc->last_migrated_pfn = 0;
 
 	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
@@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			goto break_loop;
 		}
 
-		if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
+		if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+					status == COMPACT_PARTIAL_SKIPPED)) {
 			/*
 			 * We think that allocation won't succeed in this zone
 			 * so we defer compaction there. If it ends up
@@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
 						cc.classzone_idx, 0)) {
 			success = true;
 			compaction_defer_reset(zone, cc.order, false);
-		} else if (status == COMPACT_COMPLETE) {
+		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
 			/*
 			 * We use sync migration mode here, so we defer like
 			 * sync direct compaction does.
diff --git a/mm/internal.h b/mm/internal.h
index e9aacea1a0d1..4423dfe69382 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -182,6 +182,7 @@ struct compact_control {
 	enum migrate_mode mode;		/* Async or sync migration mode */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
 	bool direct_compaction;		/* False from kcompactd or /proc/... */
+	bool whole_zone;		/* Whole zone has been scanned */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	const int alloc_flags;		/* alloc flags of a direct compactor */
-- 
2.8.0.rc3


* [PATCH 06/14] mm, compaction: Update compaction_result ordering
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (4 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  6:45   ` Hillf Danton
  2016-04-20 19:47 ` [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface Michal Hocko
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

compaction_result will be used as the primary feedback channel for
compaction users. At the same time try_to_compact_pages (and potentially
others) assume a certain ordering where a more specific feedback takes
precedence. This gets a bit awkward when we have conflicting feedback
from different zones. E.g. one zone returning COMPACT_COMPLETE, meaning the
full zone has been scanned without any outcome, while another returns
COMPACT_PARTIAL, aka made some progress. The caller should get
COMPACT_PARTIAL because that means that the compaction can still make
some progress. The same applies for COMPACT_PARTIAL vs.
COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so that
the larger the value the more progress has been made.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7c4de92d12cc..a7b9091ff349 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,6 +4,8 @@
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
+	/* For more detailed tracepoint output - internal to compaction */
+	COMPACT_NOT_SUITABLE_ZONE,
 	/*
 	 * compaction didn't start as it was not possible or direct reclaim
 	 * was more suitable
@@ -11,30 +13,34 @@ enum compact_result {
 	COMPACT_SKIPPED,
 	/* compaction didn't start as it was deferred due to past failures */
 	COMPACT_DEFERRED,
+
 	/* compaction not active last round */
 	COMPACT_INACTIVE = COMPACT_DEFERRED,
 
+	/* For more detailed tracepoint output - internal to compaction */
+	COMPACT_NO_SUITABLE_PAGE,
 	/* compaction should continue to another pageblock */
 	COMPACT_CONTINUE,
+
 	/*
-	 * direct compaction partially compacted a zone and there are suitable
-	 * pages
+	 * The full zone was compacted scanned but wasn't successfull to compact
+	 * suitable pages.
 	 */
-	COMPACT_PARTIAL,
+	COMPACT_COMPLETE,
 	/*
 	 * direct compaction has scanned part of the zone but wasn't successfull
 	 * to compact suitable pages.
 	 */
 	COMPACT_PARTIAL_SKIPPED,
+
+	/* compaction terminated prematurely due to lock contentions */
+	COMPACT_CONTENDED,
+
 	/*
-	 * The full zone was compacted scanned but wasn't successfull to compact
-	 * suitable pages.
+	 * direct compaction partially compacted a zone and there might be
+	 * suitable pages
 	 */
-	COMPACT_COMPLETE,
-	/* For more detailed tracepoint output */
-	COMPACT_NO_SUITABLE_PAGE,
-	COMPACT_NOT_SUITABLE_ZONE,
-	COMPACT_CONTENDED,
+	COMPACT_PARTIAL,
 };
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
-- 
2.8.0.rc3


* [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (5 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 06/14] mm, compaction: Update compaction_result ordering Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  6:50   ` Hillf Danton
  2016-04-20 19:47 ` [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers Michal Hocko
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_direct_compact communicates a potential back off by two
variables:
	- deferred_compaction tells us that the compaction returned
	  COMPACT_DEFERRED
	- contended_compaction is set when there is contention on
	  zone->lock or zone->lru_lock

__alloc_pages_slowpath then backs off for THP allocation requests to
prevent long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details in __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 31 insertions(+), 36 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06af8a757d52..350d13f3709b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum migrate_mode mode, enum compact_result *compact_result)
 {
-	enum compact_result compact_result;
 	struct page *page;
+	int contended_compaction;
 
 	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
-	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						mode, contended_compaction);
+	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+						mode, &contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
-	switch (compact_result) {
-	case COMPACT_DEFERRED:
-		*deferred_compaction = true;
-		/* fall-through */
-	case COMPACT_SKIPPED:
+	if (*compact_result <= COMPACT_INACTIVE)
 		return NULL;
-	default:
-		break;
-	}
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTFAIL);
 
+	/*
+	 * In all zones where compaction was attempted (and not
+	 * deferred or skipped), lock contention has been detected.
+	 * For THP allocation we do not want to disrupt the others
+	 * so we fallback to base pages instead.
+	 */
+	if (contended_compaction == COMPACT_CONTENDED_LOCK)
+		*compact_result = COMPACT_CONTENDED;
+
+	/*
+	 * If compaction was aborted due to need_resched(), we do not
+	 * want to further increase allocation latency, unless it is
+	 * khugepaged trying to collapse.
+	 */
+	if (contended_compaction == COMPACT_CONTENDED_SCHED
+		&& !(current->flags & PF_KTHREAD))
+		*compact_result = COMPACT_CONTENDED;
+
 	cond_resched();
 
 	return NULL;
@@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum migrate_mode mode, enum compact_result *compact_result)
 {
 	return NULL;
 }
@@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
-	bool deferred_compaction = false;
-	int contended_compaction = COMPACT_CONTENDED_NONE;
+	enum compact_result compact_result;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
-					&contended_compaction,
-					&deferred_compaction);
+					&compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * to heavily disrupt the system, so we fail the allocation
 		 * instead of entering direct reclaim.
 		 */
-		if (deferred_compaction)
-			goto nopage;
-
-		/*
-		 * In all zones where compaction was attempted (and not
-		 * deferred or skipped), lock contention has been detected.
-		 * For THP allocation we do not want to disrupt the others
-		 * so we fallback to base pages instead.
-		 */
-		if (contended_compaction == COMPACT_CONTENDED_LOCK)
+		if (compact_result == COMPACT_DEFERRED)
 			goto nopage;
 
 		/*
-		 * If compaction was aborted due to need_resched(), we do not
-		 * want to further increase allocation latency, unless it is
-		 * khugepaged trying to collapse.
+		 * Compaction is contended so rather back off than cause
+		 * excessive stalls.
 		 */
-		if (contended_compaction == COMPACT_CONTENDED_SCHED
-			&& !(current->flags & PF_KTHREAD))
+		if(compact_result == COMPACT_CONTENDED)
 			goto nopage;
 	}
 
@@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
 					    ac, migration_mode,
-					    &contended_compaction,
-					    &deferred_compaction);
+					    &compact_result);
 	if (page)
 		goto got_pg;
 nopage:
-- 
2.8.0.rc3


* [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (6 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  6:57   ` Hillf Danton
  2016-04-28  8:47   ` Vlastimil Babka
  2016-04-20 19:47 ` [PATCH 09/14] mm: use compaction feedback for thp backoff conditions Michal Hocko
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Compaction can provide a wide variety of feedback to the caller. Much
of it is implementation specific and the caller of the compaction
(especially the page allocator) shouldn't be bound to specifics of the
current implementation.

This patch abstracts the feedback into three basic types:
	- compaction_made_progress - compaction was active and made some
	  progress.
	- compaction_failed - compaction failed and further attempts to
	  invoke it would most probably fail and therefore it is not
	  worth retrying
	- compaction_withdrawn - compaction wasn't invoked for
	  implementation specific reasons. In the current implementation
	  it means that the compaction was deferred, contended or the
	  page scanners met too early without any progress. Retrying is
	  still worthwhile.
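
Purely as an illustration (this hunk is not part of the series), an
allocator side caller could consume the three helpers along these lines:

	if (compaction_made_progress(compact_result))
		/* some pageblocks were compacted, a retry can succeed */;
	else if (compaction_failed(compact_result))
		/* whole zones were scanned without result, head towards OOM */;
	else if (compaction_withdrawn(compact_result))
		/* deferred/contended/partial scan, retrying still makes sense */;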

[vbabka@suse.cz: do not change thp back off behavior]
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a7b9091ff349..a002ca55c513 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
 
+/* Compaction has made some progress and retrying makes sense */
+static inline bool compaction_made_progress(enum compact_result result)
+{
+	/*
+	 * Even though this might sound confusing this in fact tells us
+	 * that the compaction successfully isolated and migrated some
+	 * pageblocks.
+	 */
+	if (result == COMPACT_PARTIAL)
+		return true;
+
+	return false;
+}
+
+/* Compaction has failed and it doesn't make much sense to keep retrying. */
+static inline bool compaction_failed(enum compact_result result)
+{
+	/* All zones where scanned completely and still not result. */
+	if (result == COMPACT_COMPLETE)
+		return true;
+
+	return false;
+}
+
+/*
+ * Compaction  has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+	/*
+	 * Compaction backed off due to watermark checks for order-0
+	 * so the regular reclaim has to try harder and reclaim something.
+	 */
+	if (result == COMPACT_SKIPPED)
+		return true;
+
+	/*
+	 * If compaction is deferred for high-order allocations, it is
+	 * because sync compaction recently failed. If this is the case
+	 * and the caller requested a THP allocation, we do not want
+	 * to heavily disrupt the system, so we fail the allocation
+	 * instead of entering direct reclaim.
+	 */
+	if (result == COMPACT_DEFERRED)
+		return true;
+
+	/*
+	 * If compaction in async mode encounters contention or blocks higher
+	 * priority task we back off early rather than cause stalls.
+	 */
+	if (result == COMPACT_CONTENDED)
+		return true;
+
+	/*
+	 * Page scanners have met but we haven't scanned full zones so this
+	 * is a back off in fact.
+	 */
+	if (result == COMPACT_PARTIAL_SKIPPED)
+		return true;
+
+	return false;
+}
+
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
@@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return true;
 }
 
+static inline bool compaction_made_progress(enum compact_result result)
+{
+	return false;
+}
+
+static inline bool compaction_failed(enum compact_result result)
+{
+	return false;
+}
+
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+	return true;
+}
+
 static inline int kcompactd_run(int nid)
 {
 	return 0;
-- 
2.8.0.rc3


* [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (7 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  7:05   ` Hillf Danton
  2016-04-28  8:53   ` Vlastimil Babka
  2016-04-20 19:47 ` [PATCH 10/14] mm, oom: rework oom detection Michal Hocko
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THP requests skip the direct reclaim if the compaction is either
deferred or contended to reduce stalls which wouldn't help the
allocation success anyway. These checks are ignoring other potential
feedback modes which we have available now.

It clearly doesn't make much sense to go and reclaim a few pages if the
previous compaction has failed.

We can also simplify the check by using compaction_withdrawn which
checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check,
however, covers more reasons why the compaction was withdrawn.
None of them should be a problem for the THP case though.

It is safe to back off if we see COMPACT_SKIPPED because that means
that compaction_suitable failed and a single round of the reclaim is
unlikely to make any difference here. We would have to be close to
the low watermark to reclaim enough and even then there is no guarantee
that the compaction would make any progress while the direct reclaim
would have caused the stall.

COMPACT_PARTIAL_SKIPPED is slightly different because it means that we
have only seen a part of the zone so a retry would make some sense. But
it would be a compaction retry, not a reclaim retry, to perform. We are
not doing that and that might indeed lead to situations where THP fails
but this should happen only rarely and it would be really hard to
measure.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 350d13f3709b..d551fe326c33 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3257,25 +3257,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
-	/* Checks for THP-specific high-order allocations */
-	if (is_thp_gfp_mask(gfp_mask)) {
-		/*
-		 * If compaction is deferred for high-order allocations, it is
-		 * because sync compaction recently failed. If this is the case
-		 * and the caller requested a THP allocation, we do not want
-		 * to heavily disrupt the system, so we fail the allocation
-		 * instead of entering direct reclaim.
-		 */
-		if (compact_result == COMPACT_DEFERRED)
-			goto nopage;
-
-		/*
-		 * Compaction is contended so rather back off than cause
-		 * excessive stalls.
-		 */
-		if(compact_result == COMPACT_CONTENDED)
-			goto nopage;
-	}
+	/*
+	 * Checks for THP-specific high-order allocations and back off
+	 * if the the compaction backed off or failed
+	 */
+	if (is_thp_gfp_mask(gfp_mask) &&
+			(compaction_withdrawn(compact_result) ||
+			 compaction_failed(compact_result)))
+		goto nopage;
 
 	/*
 	 * It can become very expensive to allocate transparent hugepages at
-- 
2.8.0.rc3


* [PATCH 10/14] mm, oom: rework oom detection
  2016-04-20 19:47 [PATCH 0/14] oom detection rework v6 Michal Hocko
                   ` (8 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 09/14] mm: use compaction feedback for thp backoff conditions Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-20 19:47 ` [PATCH 11/14] mm: throttle on IO only when there are too many dirty and writeback pages Michal Hocko
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress, to prevent
a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several
times and restarting if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps the
allocator looping but the current task will not be able to allocate that
single page. The OOM killer would be more appropriate than looping without
any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of shrink_zone
which is too low a level to be appropriate for any high level decisions such
as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath which
knows how many attempts have been done and what the progress was so far,
therefore it is more appropriate to implement this logic there.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset the counter for successful reclaim rounds.
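
For illustration only (the numbers are made up, not taken from the tests
above): with MAX_RECLAIM_RETRIES = 16 and roughly 512k reclaimable pages in
a zone, the retry target degrades per no-progress round as

	no_progress_loops =  1: free + 480k pages
	no_progress_loops =  8: free + 256k pages
	no_progress_loops = 16: free pages only

so after at most 16 rounds without any progress the watermark check is done
against the free pages alone and we proceed to the OOM path.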

Costly high order pages mostly preserve their semantics and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches the equivalent of the
requested order. The only difference is that if there was no progress
during the reclaim we rely on the zone watermark check. This is a more
logical thing to do than the previous 1<<order attempts which were a result
of zone_reclaimable faking the progress.

[vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/swap.h |   1 +
 mm/page_alloc.c      | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c          |  25 +++----------
 3 files changed, 97 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..b14a2bb33514 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d551fe326c33..38302c2041a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3145,6 +3145,77 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+		     struct alloc_context *ac, int alloc_flags,
+		     bool did_some_progress, unsigned long pages_reclaimed,
+		     int no_progress_loops)
+{
+	struct zone *zone;
+	struct zoneref *z;
+
+	/*
+	 * Make sure we converge to OOM if we cannot make any progress
+	 * several times in the row.
+	 */
+	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+		return false;
+
+	if (order > PAGE_ALLOC_COSTLY_ORDER) {
+		if (pages_reclaimed >= (1<<order))
+			return false;
+
+		if (did_some_progress)
+			return true;
+	}
+
+	/*
+	 * Keep reclaiming pages while there is a chance this will lead somewhere.
+	 * If none of the target zones can satisfy our allocation request even
+	 * if all reclaimable pages are considered then we are screwed and have
+	 * to go OOM.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+					ac->nodemask) {
+		unsigned long available;
+
+		available = zone_reclaimable_pages(zone);
+		available -= DIV_ROUND_UP(no_progress_loops * available,
+					  MAX_RECLAIM_RETRIES);
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/*
+		 * Would the allocation succeed if we reclaimed the whole
+		 * available?
+		 */
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac->high_zoneidx, alloc_flags, available)) {
+			/* Wait for some write requests to complete then retry */
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3156,6 +3227,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	enum compact_result compact_result;
+	int no_progress_loops = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3284,23 +3356,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	/* Keep reclaiming pages as long as there is reasonable progress */
-	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto retry;
+	/*
+	 * Do not retry costly high order allocations unless they are
+	 * __GFP_REPEAT
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+		goto noretry;
+
+	if (did_some_progress) {
+		no_progress_loops = 0;
+		pages_reclaimed += did_some_progress;
+	} else {
+		no_progress_loops++;
 	}
 
+	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+				 did_some_progress > 0, pages_reclaimed,
+				 no_progress_loops))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		no_progress_loops = 0;
 		goto retry;
+	}
 
 noretry:
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3e6347e2a5fc..a2ba60aa7b88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,7 +191,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
@@ -2530,10 +2530,8 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2541,7 +2539,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
-	bool reclaimable = false;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2606,17 +2603,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
-			if (nr_soft_reclaimed)
-				reclaimable = true;
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
-			reclaimable = true;
-
-		if (global_reclaim(sc) &&
-		    !reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 	}
 
 	/*
@@ -2624,8 +2614,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * promoted it to __GFP_HIGHMEM.
 	 */
 	sc->gfp_mask = orig_mask;
-
-	return reclaimable;
 }
 
 /*
@@ -2650,7 +2638,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
-	bool zones_reclaimable;
 retry:
 	delayacct_freepages_start();
 
@@ -2661,7 +2648,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		zones_reclaimable = shrink_zones(zonelist, sc);
+		shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2708,10 +2695,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		goto retry;
 	}
 
-	/* Any of the zones still reclaimable?  Don't OOM. */
-	if (zones_reclaimable)
-		return 1;
-
 	return 0;
 }
 
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 11/14] mm: throttle on IO only when there are too many dirty and writeback pages
  2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
                   ` (9 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 10/14] mm, oom: rework oom detection Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-20 19:47 ` [PATCH 12/14] mm, oom: protect !costly allocations some more Michal Hocko
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wait_iff_congested has been used to throttle the allocator before it
retried another round of direct reclaim, to allow the writeback to make
some progress and prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback and yet the bdi is not congested
so this ends up being just a cond_resched and can trigger a premature
OOM.

This patch replaces the unconditional wait_iff_congested with
congestion_wait, which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce the stalls
fixed by 0e093d99763e because congestion_wait is called only when things
look hopeless and sleeping is a better choice than OOM with many pages
under IO.
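
The new rule boils down to a simple predicate; a minimal sketch (the
function name is made up and the inputs mirror the zone counters used by
the patch below; this is not the patch itself):

#include <stdbool.h>

/*
 * Throttle on IO only when the last direct reclaim round made no
 * progress and more than half of the zone's reclaimable pages are
 * dirty or under writeback.
 */
static bool should_throttle_on_io(bool did_some_progress,
				  unsigned long reclaimable,
				  unsigned long dirty,
				  unsigned long writeback)
{
	return !did_some_progress && 2 * (writeback + dirty) > reclaimable;
}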

We have to preserve the logic introduced by 373ccbe59270 ("mm, vmstat: allow
WQ concurrency to discover memory reclaim doesn't make any progress")
in __alloc_pages_slowpath now that wait_iff_congested is not
used anymore.  As the only remaining user of wait_iff_congested is
shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep only needs to be done once per
allocation retry cycle.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/backing-dev.c | 20 +++-----------------
 mm/page_alloc.c  | 39 ++++++++++++++++++++++++++++++++++++---
 2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index bfbd7096b6ed..08e3a58628ed 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -979,20 +978,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-
-		/*
-		 * Memory allocation/reclaim might be called from a WQ
-		 * context and the current implementation of the WQ
-		 * concurrency control doesn't recognize that a particular
-		 * WQ is congested if the worker thread is looping without
-		 * ever sleeping. Therefore we have to do a short sleep
-		 * here rather than calling cond_resched().
-		 */
-		if (current->flags & PF_WQ_WORKER)
-			schedule_timeout_uninterruptible(1);
-		else
-			cond_resched();
-
+		cond_resched();
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38302c2041a3..3b78936eca70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3195,8 +3195,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 					ac->nodemask) {
 		unsigned long available;
+		unsigned long reclaimable;
 
-		available = zone_reclaimable_pages(zone);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -3207,8 +3208,40 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, available)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (!did_some_progress) {
+				unsigned long writeback;
+				unsigned long dirty;
+
+				writeback = zone_page_state_snapshot(zone,
+								     NR_WRITEBACK);
+				dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+				if (2*(writeback + dirty) > reclaimable) {
+					congestion_wait(BLK_RW_ASYNC, HZ/10);
+					return true;
+				}
+			}
+
+			/*
+			 * Memory allocation/reclaim might be called from a WQ
+			 * context and the current implementation of the WQ
+			 * concurrency control doesn't recognize that
+			 * a particular WQ is congested if the worker thread is
+			 * looping without ever sleeping. Therefore we have to
+			 * do a short sleep here rather than calling
+			 * cond_resched().
+			 */
+			if (current->flags & PF_WQ_WORKER)
+				schedule_timeout_uninterruptible(1);
+			else
+				cond_resched();
+
 			return true;
 		}
 	}
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
                   ` (10 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 11/14] mm: throttle on IO only when there are too many dirty and writeback pages Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  8:03   ` Hillf Danton
  2016-05-04  6:01   ` Joonsoo Kim
  2016-04-20 19:47 ` [PATCH 13/14] mm: consider compaction feedback also for costly allocation Michal Hocko
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help, but we should at least
make sure that compaction was active before we go OOM, and keep retrying
even if should_reclaim_retry tells us to OOM when
	- the last compaction round backed off, or
	- we haven't completed at least MAX_COMPACT_RETRIES active
	  compaction rounds.

The first rule ensures that the very last compaction attempt was not
ignored while the second guarantees that compaction has done some work.
Multiple retries might be needed because other contexts may occasionally
piggyback on the reclaim and steal the compacted pages before the current
context manages to retry the allocation.

compaction_failed() is taken as the final word from compaction that
the retry doesn't make much sense. We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up. With this logic in place we do not have
to increase the migration mode unconditionally and rather do it only if
the compaction failed for the weaker mode. A nice side effect is that
the stronger migration mode is used only when really needed, so this has
the potential for smaller latencies in some cases.

Please note that compaction doesn't tell us much about how successful
it was when returning compaction_made_progress so we just have to
blindly trust that another retry is worthwhile and cap the number to
something reasonable to guarantee convergence.

If the given number of successful retries is not sufficient for
reasonable workloads we should focus on the collected compaction
tracepoint data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b78936eca70..bb4df1be0d43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+
+/*
+ * Maximum number of compaction retries wit a progress before OOM
+ * killer is consider as the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
 #ifdef CONFIG_COMPACTION
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
@@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     enum migrate_mode *migrate_mode,
+		     int compaction_retries)
+{
+	if (!order)
+		return false;
+
+	/*
+	 * compaction considers all the zone as desperately out of memory
+	 * so it doesn't really make much sense to retry except when the
+	 * failure could be caused by weak migration mode.
+	 */
+	if (compaction_failed(compact_result)) {
+		if (*migrate_mode == MIGRATE_ASYNC) {
+			*migrate_mode = MIGRATE_SYNC_LIGHT;
+			return true;
+		}
+		return false;
+	}
+
+	/*
+	 * !costly allocations are really important and we have to make sure
+	 * the compaction wasn't deferred or didn't bail out early due to locks
+	 * contention before we go OOM. Still cap the reclaim retry loops with
+	 * progress to prevent from looping forever and potential trashing.
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+		if (compaction_withdrawn(compact_result))
+			return true;
+		if (compaction_retries <= MAX_COMPACT_RETRIES)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 {
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     enum migrate_mode *migrate_mode,
+		     int compaction_retries)
+{
+	return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 /* Perform direct synchronous page reclaim */
@@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	enum compact_result compact_result;
+	int compaction_retries = 0;
 	int no_progress_loops = 0;
 
 	/*
@@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 compaction_failed(compact_result)))
 		goto nopage;
 
-	/*
-	 * It can become very expensive to allocate transparent hugepages at
-	 * fault, so use asynchronous memory compaction for THP unless it is
-	 * khugepaged trying to collapse.
-	 */
-	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
-		migration_mode = MIGRATE_SYNC_LIGHT;
+	if (order && compaction_made_progress(compact_result))
+		compaction_retries++;
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
@@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				 no_progress_loops))
 		goto retry;
 
+	/*
+	 * It doesn't make any sense to retry for the compaction if the order-0
+	 * reclaim is not able to make any progress because the current
+	 * implementation of the compaction depends on the sufficient amount
+	 * of free memory (see __compaction_suitable)
+	 */
+	if (did_some_progress > 0 &&
+			should_compact_retry(order, compact_result,
+				&migration_mode, compaction_retries))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 noretry:
 	/*
-	 * High-order allocations do not necessarily loop after
-	 * direct reclaim and reclaim/compaction depends on compaction
-	 * being called after reclaim so call directly if necessary
+	 * High-order allocations do not necessarily loop after direct reclaim
+	 * and reclaim/compaction depends on compaction being called after
+	 * reclaim so call directly if necessary.
+	 * It can become very expensive to allocate transparent hugepages at
+	 * fault, so use asynchronous memory compaction for THP unless it is
+	 * khugepaged trying to collapse. All other requests should tolerate
+	 * at least light sync migration.
 	 */
+	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+		migration_mode = MIGRATE_ASYNC;
+	else
+		migration_mode = MIGRATE_SYNC_LIGHT;
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
 					    ac, migration_mode,
 					    &compact_result);
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 13/14] mm: consider compaction feedback also for costly allocation
  2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
                   ` (11 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 12/14] mm, oom: protect !costly allocations some more Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  8:13   ` Hillf Danton
  2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
  2016-05-04  5:45 ` [PATCH 0.14] oom detection rework v6 Joonsoo Kim
  14 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

PAGE_ALLOC_COSTLY_ORDER retry logic is currently handled mostly inside
should_reclaim_retry: we stop retrying once at least an order worth of
pages has been reclaimed and, if the reclaim hasn't made any progress,
we keep retrying only as long as the watermark check for at least one
zone would succeed after reclaiming all of its reclaimable pages.
Compaction feedback is mostly ignored and we just try to make sure that
compaction did at least something before giving up.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim
has been removed quite some time ago so the assumption doesn't hold
anymore. Remove the check for the number of reclaimed pages and rely
solely on the compaction feedback. should_reclaim_retry now only
makes sure that we keep retrying reclaim for high order pages if they
are hidden by the watermarks, so that order-0 reclaim really makes sense.

should_compact_retry now keeps retrying even for costly allocations.
The number of retries is reduced compared to !costly requests because
they are less important and harder to grant and so their pressure
shouldn't cause contention for other requests or cause over-reclaim. We
also do not reset no_progress_loops for costly requests to make sure we
do not keep reclaiming too aggressively.
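
Put together, the costly-order policy after this patch boils down to the
following sketch (illustrative only, with MAX_COMPACT_RETRIES assumed to
stay at 16 and PAGE_ALLOC_COSTLY_ORDER at 3; the authoritative code is in
the diff below):

#include <stdbool.h>

#define MAX_COMPACT_RETRIES	16
#define PAGE_ALLOC_COSTLY_ORDER	3

/* costly orders get only a quarter of the retries with progress */
static int max_compact_retries(unsigned int order)
{
	int max_retries = MAX_COMPACT_RETRIES;

	if (order > PAGE_ALLOC_COSTLY_ORDER)
		max_retries /= 4;	/* 4 rounds for costly requests */
	return max_retries;
}

/* costly orders never reset the no-progress counter */
static bool resets_no_progress_loops(bool did_some_progress, unsigned int order)
{
	return did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER;
}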

This has been tested by running a process which fragments memory:
	- compact memory
	- mmap a large portion of the memory (1920M on a 2G RAM machine
	  with 2G of swap space)
	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
	  steps until certain amount of memory is freed (250M in my test)
	  and reduce the step to (step / 2) + 1 after reaching the end of
	  the mapping
	- then run a script which populates 2G (MemTotal) of page cache
	  from /dev/zero to a new file
And then tries to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437

* This is the /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0

* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0

* /proc/buddyinfo after page cache got eaten

Trying to allocate 129
129

* 129 hugepages requested and all of them granted.

Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0

* /proc/buddyinfo after hugetlb allocation.

10 runs will behave as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129

So basically 100% success for all 10 attempts.
Without the patch the numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37

Just for completeness, the base kernel without the oom detection rework
looks as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8

As we can see, the success rate is much more volatile and lower without
this patch. So the patch not only makes the retry logic for costly
requests more sensible, the success rate is even higher.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb4df1be0d43..d5a938f12554 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
 		     enum migrate_mode *migrate_mode,
 		     int compaction_retries)
 {
+	int max_retries = MAX_COMPACT_RETRIES;
+
 	if (!order)
 		return false;
 
@@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
 	}
 
 	/*
-	 * !costly allocations are really important and we have to make sure
-	 * the compaction wasn't deferred or didn't bail out early due to locks
-	 * contention before we go OOM. Still cap the reclaim retry loops with
-	 * progress to prevent from looping forever and potential trashing.
+	 * make sure the compaction wasn't deferred or didn't bail out early
+	 * due to locks contention before we declare that we should give up.
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
-		if (compaction_withdrawn(compact_result))
-			return true;
-		if (compaction_retries <= MAX_COMPACT_RETRIES)
-			return true;
-	}
+	if (compaction_withdrawn(compact_result))
+		return true;
+
+	/*
+	 * !costly requests are much more important than __GFP_REPEAT
+	 * costly ones because they are de facto nofail and invoke OOM
+	 * killer to move on while costly can fail and users are ready
+	 * to cope with that. 1/4 retries is rather arbitrary but we
+	 * would need much more detailed feedback from compaction to
+	 * make a better decision.
+	 */
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		max_retries /= 4;
+	if (compaction_retries <= max_retries)
+		return true;
 
 	return false;
 }
@@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress, unsigned long pages_reclaimed,
-		     int no_progress_loops)
+		     bool did_some_progress, int no_progress_loops)
 {
 	struct zone *zone;
 	struct zoneref *z;
@@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	if (no_progress_loops > MAX_RECLAIM_RETRIES)
 		return false;
 
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		if (pages_reclaimed >= (1<<order))
-			return false;
-
-		if (did_some_progress)
-			return true;
-	}
-
 	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
@@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
-	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	enum compact_result compact_result;
@@ -3444,16 +3443,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
 		goto noretry;
 
-	if (did_some_progress) {
+	/*
+	 * Costly allocations might have made a progress but this doesn't mean
+	 * their order will become available due to high fragmentation so
+	 * always increment the no progress counter for them
+	 */
+	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
 		no_progress_loops = 0;
-		pages_reclaimed += did_some_progress;
-	} else {
+	else
 		no_progress_loops++;
-	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, pages_reclaimed,
-				 no_progress_loops))
+				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
 	/*
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
                   ` (12 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 13/14] mm: consider compaction feedback also for costly allocation Michal Hocko
@ 2016-04-20 19:47 ` Michal Hocko
  2016-04-21  8:24   ` Hillf Danton
                     ` (2 more replies)
  2016-05-04  5:45 ` [PATCH 0.14] oom detection rework v6 Joonsoo Kim
  14 siblings, 3 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-20 19:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders. While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic. The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap. If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g. hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark. The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.

I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.

The reason why compaction requires being over the low rather than the min
watermark is not clear to me. This check has been there essentially since
56de7263fcf3 ("mm: compaction: direct compact when a high-order
allocation fails"). It is clearly an implementation detail though and we
shouldn't pull it into the generic retry logic; we should be able to cope
with such an eventuality. The only place in should_compact_retry
where we retry without any upper bound is the compaction_withdrawn()
case.

Introduce the compaction_zonelist_suitable function which checks the given
zonelist and returns true only if there is at least one zone which would
unblock __compaction_suitable if more memory got reclaimed. In
this implementation it checks __compaction_suitable with NR_FREE_PAGES
plus part of the reclaimable memory as the target for the watermark check.
The reclaimable memory is reduced linearly by the allocation order. The
idea is that we do not want to reclaim all the remaining memory for a
single allocation request just to unblock __compaction_suitable, which
doesn't guarantee we will make further progress.

The new helper is then used if compaction_withdrawn() feedback was
provided so we do not retry if there is no outlook for further
progress. !costly requests shouldn't be affected much - e.g. order-2
pages would require at least 64kB on the reclaimable LRUs while
order-9 would need at least 32M, which should be enough to not lock up.
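
As a back-of-the-envelope check (my own sketch, not part of the patch; it
ignores the low watermark and the pages that are already free, and assumes
4kB pages), the reclaimable LRUs alone need roughly order * (2 << order)
pages before compaction_zonelist_suitable lets us keep retrying, which is
where the figures above come from:

#include <stdio.h>

int main(void)
{
	unsigned int order;

	/* orders of interest from the text: 2 (fork stacks) and 9 (hugetlb) */
	for (order = 1; order <= 9; order++) {
		unsigned long pages = (unsigned long)order * (2UL << order);

		printf("order-%u: ~%lu reclaimable pages (~%lu kB)\n",
		       order, pages, pages * 4);
	}
	return 0;
}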

[vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in
compaction_zonelist_suitable]
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h |  4 ++++
 include/linux/mmzone.h     |  3 +++
 mm/compaction.c            | 42 +++++++++++++++++++++++++++++++++++++++---
 mm/page_alloc.c            | 18 +++++++++++-------
 4 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a002ca55c513..7bbdbf729757 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
 	return false;
 }
 
+
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+					int alloc_flags);
+
 extern int kcompactd_run(int nid);
 extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 150c6049f961..0bf13c7cd8cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags,
+			long free_pages);
 bool zone_watermark_ok(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index e2e487cea5ea..0a7ca578af97 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
  *   COMPACT_CONTINUE - If compaction should run now
  */
 static enum compact_result __compaction_suitable(struct zone *zone, int order,
-					int alloc_flags, int classzone_idx)
+					int alloc_flags, int classzone_idx,
+					unsigned long wmark_target)
 {
 	int fragindex;
 	unsigned long watermark;
@@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	 * allocated and for a short time, the footprint is higher
 	 */
 	watermark += (2UL << order);
-	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
+	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
+				 alloc_flags, wmark_target))
 		return COMPACT_SKIPPED;
 
 	/*
@@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 {
 	enum compact_result ret;
 
-	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
+	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
+				    zone_page_state(zone, NR_FREE_PAGES));
 	trace_mm_compaction_suitable(zone, order, ret);
 	if (ret == COMPACT_NOT_SUITABLE_ZONE)
 		ret = COMPACT_SKIPPED;
@@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 	return ret;
 }
 
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+		int alloc_flags)
+{
+	struct zone *zone;
+	struct zoneref *z;
+
+	/*
+	 * Make sure at least one zone would pass __compaction_suitable if we continue
+	 * retrying the reclaim.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+					ac->nodemask) {
+		unsigned long available;
+		enum compact_result compact_result;
+
+		/*
+		 * Do not consider all the reclaimable memory because we do not
+		 * want to trash just for a single high order allocation which
+		 * is even not guaranteed to appear even if __compaction_suitable
+		 * is happy about the watermark check.
+		 */
+		available = zone_reclaimable_pages(zone) / order;
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		compact_result = __compaction_suitable(zone, order, alloc_flags,
+				ac->classzone_idx, available);
+		if (compact_result != COMPACT_SKIPPED &&
+				compact_result != COMPACT_NOT_SUITABLE_ZONE)
+			return true;
+	}
+
+	return false;
+}
+
 static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
 {
 	enum compact_result ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d5a938f12554..6757d6df2160 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * one free page of a suitable size. Checking now avoids taking the zone lock
  * to check in the allocation paths if no pages are free.
  */
-static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
 			long free_pages)
 {
@@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
-		     enum migrate_mode *migrate_mode,
+should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
 		     int compaction_retries)
 {
 	int max_retries = MAX_COMPACT_RETRIES;
@@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
 	/*
 	 * make sure the compaction wasn't deferred or didn't bail out early
 	 * due to locks contention before we declare that we should give up.
+	 * But do not retry if the given zonelist is not suitable for
+	 * compaction.
 	 */
 	if (compaction_withdrawn(compact_result))
-		return true;
+		return compaction_zonelist_suitable(ac, order, alloc_flags);
 
 	/*
 	 * !costly requests are much more important than __GFP_REPEAT
@@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
+should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
+		     enum compact_result compact_result,
 		     enum migrate_mode *migrate_mode,
 		     int compaction_retries)
 {
@@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * of free memory (see __compaction_suitable)
 	 */
 	if (did_some_progress > 0 &&
-			should_compact_retry(order, compact_result,
-				&migration_mode, compaction_retries))
+			should_compact_retry(ac, order, alloc_flags,
+				compact_result, &migration_mode,
+				compaction_retries))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready
  2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
@ 2016-04-21  3:32   ` Hillf Danton
  2016-05-04 13:56   ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  3:32 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> while playing with the oom detection rework [1] I have noticed
> that my heavy order-9 (hugetlb) load close to OOM ended up in an
> endless loop where the reclaim hasn't made any progress but
> did_some_progress didn't reflect that and compaction_suitable
> was backing off because no zone is above low wmark + 1 << order.
> 
> It turned out that this is in fact a long-standing bug in compaction_ready
> which ignored the requested_highidx and did the watermark check for
> 0 classzone_idx. This succeeds for zone DMA most of the time as the zone
> is mostly unused because of lowmem protection. This also means that the
> OOM killer wouldn't be triggered for higher order requests even when
> there is no reclaim progress and we essentially rely on order-0 request
> to find this out. 

Thanks.

> This has been broken in one way or another since
> fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
> are sufficient free pages available") but only since 7335084d446b ("mm:
> vmscan: do not OOM if aborting reclaim to start compaction") we are not
> invoking the OOM killer based on the wrong calculation.
> 
> Propagate requested_highidx down to compaction_ready and use it for both
> the watermark check and compaction_suitable to fix this issue.
> 
> [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c839adc13efd..3e6347e2a5fc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>   * Returns true if compaction should go ahead for a high-order request, or
>   * the high-order allocation would succeed without compaction.
>   */
> -static inline bool compaction_ready(struct zone *zone, int order)
> +static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long balance_gap, watermark;
>  	bool watermark_ok;
> @@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
>  	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
>  			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
>  	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
> -	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
> +	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
> 
>  	/*
>  	 * If compaction is deferred, reclaim up to a point where
> @@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
>  	 * If compaction is not ready to start and allocation is not likely
>  	 * to succeed without it, then keep reclaiming.
>  	 */
> -	if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
> +	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
>  		return false;
> 
>  	return watermark_ok;
> @@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			if (IS_ENABLED(CONFIG_COMPACTION) &&
>  			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
>  			    zonelist_zone_idx(z) <= requested_highidx &&
> -			    compaction_ready(zone, sc->order)) {
> +			    compaction_ready(zone, sc->order, requested_highidx)) {
>  				sc->compaction_ready = true;
>  				continue;
>  			}
> --
> 2.8.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE
  2016-04-20 19:47 ` [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE Michal Hocko
@ 2016-04-21  6:39   ` Hillf Danton
  0 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  6:39 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> COMPACT_COMPLETE now means that compaction and free scanner met. This is
> not very useful information if somebody just wants to use this feedback
> and make any decisions based on that. The current caller might be a poor
> guy who just happened to scan a tiny portion of the zone and that could be
> the reason no suitable pages were compacted. Make sure we distinguish
> the full and partial zone walks.
> 
> Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
> and be optimistic in retrying.
> 
> The existing users of COMPACT_COMPLETE are conservatively changed to
> use COMPACT_PARTIAL_SKIPPED as well but some of them should probably be
> reconsidered and only defer the compaction for COMPACT_COMPLETE
> with the new semantic.
> 
> This patch shouldn't introduce any functional changes.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/compaction.h        | 10 +++++++++-
>  include/trace/events/compaction.h |  1 +
>  mm/compaction.c                   | 14 +++++++++++---
>  mm/internal.h                     |  1 +
>  4 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7e177d111c39..7c4de92d12cc 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -21,7 +21,15 @@ enum compact_result {
>  	 * pages
>  	 */
>  	COMPACT_PARTIAL,
> -	/* The full zone was compacted */
> +	/*
> +	 * direct compaction has scanned part of the zone but wasn't successfull
> +	 * to compact suitable pages.
> +	 */
> +	COMPACT_PARTIAL_SKIPPED,
> +	/*
> +	 * The full zone was compacted scanned but wasn't successfull to compact
> +	 * suitable pages.
> +	 */
>  	COMPACT_COMPLETE,
>  	/* For more detailed tracepoint output */
>  	COMPACT_NO_SUITABLE_PAGE,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 6ba16c86d7db..36e2d6fb1360 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -14,6 +14,7 @@
>  	EM( COMPACT_DEFERRED,		"deferred")		\
>  	EM( COMPACT_CONTINUE,		"continue")		\
>  	EM( COMPACT_PARTIAL,		"partial")		\
> +	EM( COMPACT_PARTIAL_SKIPPED,	"partial_skipped")	\
>  	EM( COMPACT_COMPLETE,		"complete")		\
>  	EM( COMPACT_NO_SUITABLE_PAGE,	"no_suitable_page")	\
>  	EM( COMPACT_NOT_SUITABLE_ZONE,	"not_suitable_zone")	\
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13709e33a2fc..e2e487cea5ea 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
>  		if (cc->direct_compaction)
>  			zone->compact_blockskip_flush = true;
> 
> -		return COMPACT_COMPLETE;
> +		if (cc->whole_zone)
> +			return COMPACT_COMPLETE;
> +		else
> +			return COMPACT_PARTIAL_SKIPPED;
>  	}
> 
>  	if (is_via_compact_memory(cc->order))
> @@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>  		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
>  	}
> +
> +	if (cc->migrate_pfn == start_pfn)
> +		cc->whole_zone = true;
> +
>  	cc->last_migrated_pfn = 0;
> 
>  	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> @@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  			goto break_loop;
>  		}
> 
> -		if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
> +		if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
> +					status == COMPACT_PARTIAL_SKIPPED)) {
>  			/*
>  			 * We think that allocation won't succeed in this zone
>  			 * so we defer compaction there. If it ends up
> @@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  						cc.classzone_idx, 0)) {
>  			success = true;
>  			compaction_defer_reset(zone, cc.order, false);
> -		} else if (status == COMPACT_COMPLETE) {
> +		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
>  			/*
>  			 * We use sync migration mode here, so we defer like
>  			 * sync direct compaction does.
> diff --git a/mm/internal.h b/mm/internal.h
> index e9aacea1a0d1..4423dfe69382 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -182,6 +182,7 @@ struct compact_control {
>  	enum migrate_mode mode;		/* Async or sync migration mode */
>  	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
>  	bool direct_compaction;		/* False from kcompactd or /proc/... */
> +	bool whole_zone;		/* Whole zone has been scanned */
>  	int order;			/* order a direct compactor needs */
>  	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
>  	const int alloc_flags;		/* alloc flags of a direct compactor */
> --
> 2.8.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/14] mm, compaction: Update compaction_result ordering
  2016-04-20 19:47 ` [PATCH 06/14] mm, compaction: Update compaction_result ordering Michal Hocko
@ 2016-04-21  6:45   ` Hillf Danton
  0 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  6:45 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> compaction_result will be used as the primary feedback channel for
> compaction users. At the same time try_to_compact_pages (and potentially
> others) assume a certain ordering where a more specific feedback takes
> precedence. This gets a bit awkward when we have conflicting feedback
> from different zones. E.g. one zone returning COMPACT_COMPLETE meaning the
> full zone has been scanned without any outcome while another returns
> COMPACT_PARTIAL aka made some progress. The caller should get
> COMPACT_PARTIAL because that means that the compaction still can make
> some progress. The same applies for COMPACT_PARTIAL vs.
> COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
> larger the value is the more progress we have done.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/compaction.h | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7c4de92d12cc..a7b9091ff349 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -4,6 +4,8 @@
>  /* Return values for compact_zone() and try_to_compact_pages() */
>  /* When adding new states, please adjust include/trace/events/compaction.h */
>  enum compact_result {
> +	/* For more detailed tracepoint output - internal to compaction */
> +	COMPACT_NOT_SUITABLE_ZONE,
>  	/*
>  	 * compaction didn't start as it was not possible or direct reclaim
>  	 * was more suitable
> @@ -11,30 +13,34 @@ enum compact_result {
>  	COMPACT_SKIPPED,
>  	/* compaction didn't start as it was deferred due to past failures */
>  	COMPACT_DEFERRED,
> +
>  	/* compaction not active last round */
>  	COMPACT_INACTIVE = COMPACT_DEFERRED,
> 
> +	/* For more detailed tracepoint output - internal to compaction */
> +	COMPACT_NO_SUITABLE_PAGE,
>  	/* compaction should continue to another pageblock */
>  	COMPACT_CONTINUE,
> +
>  	/*
> -	 * direct compaction partially compacted a zone and there are suitable
> -	 * pages
> +	 * The full zone was compacted scanned but wasn't successfull to compact
> +	 * suitable pages.
>  	 */
> -	COMPACT_PARTIAL,
> +	COMPACT_COMPLETE,
>  	/*
>  	 * direct compaction has scanned part of the zone but wasn't successfull
>  	 * to compact suitable pages.
>  	 */
>  	COMPACT_PARTIAL_SKIPPED,
> +
> +	/* compaction terminated prematurely due to lock contentions */
> +	COMPACT_CONTENDED,
> +
>  	/*
> -	 * The full zone was compacted scanned but wasn't successfull to compact
> -	 * suitable pages.
> +	 * direct compaction partially compacted a zone and there might be
> +	 * suitable pages
>  	 */
> -	COMPACT_COMPLETE,
> -	/* For more detailed tracepoint output */
> -	COMPACT_NO_SUITABLE_PAGE,
> -	COMPACT_NOT_SUITABLE_ZONE,
> -	COMPACT_CONTENDED,
> +	COMPACT_PARTIAL,
>  };
> 
>  /* Used to signal whether compaction detected need_sched() or lock contention */
> --
> 2.8.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface
  2016-04-20 19:47 ` [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface Michal Hocko
@ 2016-04-21  6:50   ` Hillf Danton
  0 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  6:50 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_direct_compact communicates potential back off by two
> variables:
> 	- deferred_compaction tells that the compaction returned
> 	  COMPACT_DEFERRED
> 	- contended_compaction is set when there is a contention on
> 	  zone->lock or zone->lru_lock locks
> 
> __alloc_pages_slowpath then backs off for THP allocation requests to
> prevent long stalls. This is rather messy and it would be much
> cleaner to return a single compact result value and hide all the nasty
> details into __alloc_pages_direct_compact.
> 
> This patch shouldn't introduce any functional changes.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
>  1 file changed, 31 insertions(+), 36 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 06af8a757d52..350d13f3709b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
> -		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum migrate_mode mode, enum compact_result *compact_result)
>  {
> -	enum compact_result compact_result;
>  	struct page *page;
> +	int contended_compaction;
> 
>  	if (!order)
>  		return NULL;
> 
>  	current->flags |= PF_MEMALLOC;
> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> -						mode, contended_compaction);
> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +						mode, &contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
> 
> -	switch (compact_result) {
> -	case COMPACT_DEFERRED:
> -		*deferred_compaction = true;
> -		/* fall-through */
> -	case COMPACT_SKIPPED:
> +	if (*compact_result <= COMPACT_INACTIVE)
>  		return NULL;
> -	default:
> -		break;
> -	}
> 
>  	/*
>  	 * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	count_vm_event(COMPACTFAIL);
> 
> +	/*
> +	 * In all zones where compaction was attempted (and not
> +	 * deferred or skipped), lock contention has been detected.
> +	 * For THP allocation we do not want to disrupt the others
> +	 * so we fallback to base pages instead.
> +	 */
> +	if (contended_compaction == COMPACT_CONTENDED_LOCK)
> +		*compact_result = COMPACT_CONTENDED;
> +
> +	/*
> +	 * If compaction was aborted due to need_resched(), we do not
> +	 * want to further increase allocation latency, unless it is
> +	 * khugepaged trying to collapse.
> +	 */
> +	if (contended_compaction == COMPACT_CONTENDED_SCHED
> +		&& !(current->flags & PF_KTHREAD))
> +		*compact_result = COMPACT_CONTENDED;
> +
>  	cond_resched();
> 
>  	return NULL;
> @@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
> -		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum migrate_mode mode, enum compact_result *compact_result)
>  {
>  	return NULL;
>  }
> @@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long pages_reclaimed = 0;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -	bool deferred_compaction = false;
> -	int contended_compaction = COMPACT_CONTENDED_NONE;
> +	enum compact_result compact_result;
> 
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
> -					&contended_compaction,
> -					&deferred_compaction);
> +					&compact_result);
>  	if (page)
>  		goto got_pg;
> 
> @@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * to heavily disrupt the system, so we fail the allocation
>  		 * instead of entering direct reclaim.
>  		 */
> -		if (deferred_compaction)
> -			goto nopage;
> -
> -		/*
> -		 * In all zones where compaction was attempted (and not
> -		 * deferred or skipped), lock contention has been detected.
> -		 * For THP allocation we do not want to disrupt the others
> -		 * so we fallback to base pages instead.
> -		 */
> -		if (contended_compaction == COMPACT_CONTENDED_LOCK)
> +		if (compact_result == COMPACT_DEFERRED)
>  			goto nopage;
> 
>  		/*
> -		 * If compaction was aborted due to need_resched(), we do not
> -		 * want to further increase allocation latency, unless it is
> -		 * khugepaged trying to collapse.
> +		 * Compaction is contended so rather back off than cause
> +		 * excessive stalls.
>  		 */
> -		if (contended_compaction == COMPACT_CONTENDED_SCHED
> -			&& !(current->flags & PF_KTHREAD))
> +		if(compact_result == COMPACT_CONTENDED)
>  			goto nopage;
>  	}
> 
> @@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
> -					    &contended_compaction,
> -					    &deferred_compaction);
> +					    &compact_result);
>  	if (page)
>  		goto got_pg;
>  nopage:
> --
> 2.8.0.rc3
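
Just to illustrate the simplified contract, not the kernel code itself: with a
single enum out-parameter the caller no longer needs the deferred/contended
booleans, and everything at or below COMPACT_INACTIVE means "compaction did
not run at all". The toy_* names below are made up for the sketch; the enum is
again trimmed:

#include <stdbool.h>
#include <stdio.h>

enum compact_result {
	COMPACT_SKIPPED,
	COMPACT_DEFERRED,
	COMPACT_INACTIVE = COMPACT_DEFERRED,
	COMPACT_PARTIAL,
};

/* stand-in for try_to_compact_pages(): pretend every zone was deferred */
static enum compact_result toy_compact(void)
{
	return COMPACT_DEFERRED;
}

/* shape of __alloc_pages_direct_compact() after the patch (toy version) */
static bool toy_direct_compact(enum compact_result *compact_result)
{
	*compact_result = toy_compact();

	/* nothing was attempted, so there is no page to look for */
	if (*compact_result <= COMPACT_INACTIVE)
		return false;

	return true;
}

int main(void)
{
	enum compact_result res;

	if (!toy_direct_compact(&res))
		printf("compaction inactive (%d): back off without reclaim\n", res);
	return 0;
}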

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers
  2016-04-20 19:47 ` [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers Michal Hocko
@ 2016-04-21  6:57   ` Hillf Danton
  2016-04-28  8:47   ` Vlastimil Babka
  1 sibling, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  6:57 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> Compaction can provide a wide variety of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
> 
> This patch abstracts the feedback into three basic types:
> 	- compaction_made_progress - compaction was active and made some
> 	  progress.
> 	- compaction_failed - compaction failed and further attempts to
> 	  invoke it would most probably fail and therefore it is not
> 	  worth retrying
> 	- compaction_withdrawn - compaction wasn't invoked for
>           implementation specific reasons. In the current implementation
>           it means that the compaction was deferred, contended or the
>           page scanners met too early without any progress. Retrying is
>           still worthwhile.
> 
> [vbabka@suse.cz: do not change thp back off behavior]
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 79 insertions(+)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a7b9091ff349..a002ca55c513 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
> 
> +/* Compaction has made some progress and retrying makes sense */
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	/*
> +	 * Even though this might sound confusing this in fact tells us
> +	 * that the compaction successfully isolated and migrated some
> +	 * pageblocks.
> +	 */
> +	if (result == COMPACT_PARTIAL)
> +		return true;
> +
> +	return false;
> +}
> +
> +/* Compaction has failed and it doesn't make much sense to keep retrying. */
> +static inline bool compaction_failed(enum compact_result result)
> +{
> +	/* All zones where scanned completely and still not result. */

s/where/were/

> +	if (result == COMPACT_COMPLETE)
> +		return true;
> +
> +	return false;
> +}
> +
> +/*
> + * Compaction  has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> +	/*
> +	 * Compaction backed off due to watermark checks for order-0
> +	 * so the regular reclaim has to try harder and reclaim something.
> +	 */
> +	if (result == COMPACT_SKIPPED)
> +		return true;
> +
> +	/*
> +	 * If compaction is deferred for high-order allocations, it is
> +	 * because sync compaction recently failed. If this is the case
> +	 * and the caller requested a THP allocation, we do not want
> +	 * to heavily disrupt the system, so we fail the allocation
> +	 * instead of entering direct reclaim.
> +	 */
> +	if (result == COMPACT_DEFERRED)
> +		return true;
> +
> +	/*
> +	 * If compaction in async mode encounters contention or blocks higher
> +	 * priority task we back off early rather than cause stalls.
> +	 */
> +	if (result == COMPACT_CONTENDED)
> +		return true;
> +
> +	/*
> +	 * Page scanners have met but we haven't scanned full zones so this
> +	 * is a back off in fact.
> +	 */
> +	if (result == COMPACT_PARTIAL_SKIPPED)
> +		return true;
> +
> +	return false;
> +}
> +
>  extern int kcompactd_run(int nid);
>  extern void kcompactd_stop(int nid);
>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> @@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
> 
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	return false;
> +}
> +
> +static inline bool compaction_failed(enum compact_result result)
> +{
> +	return false;
> +}
> +
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> +	return true;
> +}
> +
>  static inline int kcompactd_run(int nid)
>  {
>  	return 0;
> --
> 2.8.0.rc3
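
For what it is worth, this is roughly how a caller-side retry decision reads
once only the three helpers are consulted. A hedged userspace sketch: the
predicate bodies merely approximate the mapping described in the changelog and
the enum is simplified, so treat it as an illustration rather than the real
page allocator logic:

#include <stdbool.h>
#include <stdio.h>

enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_COMPLETE,
		      COMPACT_PARTIAL_SKIPPED, COMPACT_CONTENDED, COMPACT_PARTIAL };

static bool compaction_made_progress(enum compact_result r) { return r == COMPACT_PARTIAL; }
static bool compaction_failed(enum compact_result r)        { return r == COMPACT_COMPLETE; }
static bool compaction_withdrawn(enum compact_result r)
{
	return r == COMPACT_SKIPPED || r == COMPACT_DEFERRED ||
	       r == COMPACT_CONTENDED || r == COMPACT_PARTIAL_SKIPPED;
}

int main(void)
{
	enum compact_result samples[] = { COMPACT_DEFERRED, COMPACT_PARTIAL, COMPACT_COMPLETE };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		enum compact_result r = samples[i];

		if (compaction_made_progress(r))
			printf("%u: progress -> retrying the allocation makes sense\n", i);
		else if (compaction_failed(r))
			printf("%u: failed -> stop retrying, consider OOM\n", i);
		else if (compaction_withdrawn(r))
			printf("%u: withdrawn -> retrying is still worthwhile\n", i);
	}
	return 0;
}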

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-20 19:47 ` [PATCH 09/14] mm: use compaction feedback for thp backoff conditions Michal Hocko
@ 2016-04-21  7:05   ` Hillf Danton
  2016-04-28  8:53   ` Vlastimil Babka
  1 sibling, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  7:05 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> THP requests skip the direct reclaim if the compaction is either
> deferred or contended to reduce stalls which wouldn't help the
> allocation success anyway. These checks are ignoring other potential
> feedback modes which we have available now.
> 
> It clearly doesn't make much sense to go and reclaim few pages if the
> previous compaction has failed.
> 
> We can also simplify the check by using compaction_withdrawn which
> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> is however covering more reasons why the compaction was withdrawn.
> None of them should be a problem for the THP case though.
> 
> It is safe to back off if we see COMPACT_SKIPPED because that means
> that compaction_suitable failed and a single round of the reclaim is
> unlikely to make any difference here. We would have to be close to
> the low watermark to reclaim enough and even then there is no guarantee
> that the compaction would make any progress while the direct reclaim
> would have caused the stall.
> 
> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> have only seen a part of the zone so a retry would make some sense. But
> it would be a compaction retry not a reclaim retry to perform. We are
> not doing that and that might indeed lead to situations where THP fails
> but this should happen only rarely and it would be really hard to
> measure.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/page_alloc.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 350d13f3709b..d551fe326c33 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3257,25 +3257,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (page)
>  		goto got_pg;
> 
> -	/* Checks for THP-specific high-order allocations */
> -	if (is_thp_gfp_mask(gfp_mask)) {
> -		/*
> -		 * If compaction is deferred for high-order allocations, it is
> -		 * because sync compaction recently failed. If this is the case
> -		 * and the caller requested a THP allocation, we do not want
> -		 * to heavily disrupt the system, so we fail the allocation
> -		 * instead of entering direct reclaim.
> -		 */
> -		if (compact_result == COMPACT_DEFERRED)
> -			goto nopage;
> -
> -		/*
> -		 * Compaction is contended so rather back off than cause
> -		 * excessive stalls.
> -		 */
> -		if(compact_result == COMPACT_CONTENDED)
> -			goto nopage;
> -	}
> +	/*
> +	 * Checks for THP-specific high-order allocations and back off
> +	 * if the the compaction backed off or failed
> +	 */

Alternatively,
	/*
	 * Check THP allocations and back off
	 * if the compaction bailed out or failed
	 */
> +	if (is_thp_gfp_mask(gfp_mask) &&
> +			(compaction_withdrawn(compact_result) ||
> +			 compaction_failed(compact_result)))
> +		goto nopage;
> 
>  	/*
>  	 * It can become very expensive to allocate transparent hugepages at
> --
> 2.8.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
  2016-04-20 19:47 ` [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED Michal Hocko
@ 2016-04-21  7:08   ` Hillf Danton
  0 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  7:08 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> try_to_compact_pages can currently return COMPACT_SKIPPED even when the
> compaction is deferred for some zone just because zone DMA is skipped
> in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
> basically unusable for the page allocator as a feedback mechanism.
> 
> Make sure we distinguish those two states properly and switch their
> ordering in the enum. This would mean that the COMPACT_SKIPPED will be
> returned only when all eligible zones are skipped.
> 
> As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
> will be more precise and we would bail out rather than reclaim.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/compaction.h        | 7 +++++--
>  include/trace/events/compaction.h | 2 +-
>  mm/compaction.c                   | 8 +++++---
>  3 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4458fd94170f..7e177d111c39 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -4,13 +4,16 @@
>  /* Return values for compact_zone() and try_to_compact_pages() */
>  /* When adding new states, please adjust include/trace/events/compaction.h */
>  enum compact_result {
> -	/* compaction didn't start as it was deferred due to past failures */
> -	COMPACT_DEFERRED,
>  	/*
>  	 * compaction didn't start as it was not possible or direct reclaim
>  	 * was more suitable
>  	 */
>  	COMPACT_SKIPPED,
> +	/* compaction didn't start as it was deferred due to past failures */
> +	COMPACT_DEFERRED,
> +	/* compaction not active last round */
> +	COMPACT_INACTIVE = COMPACT_DEFERRED,
> +
>  	/* compaction should continue to another pageblock */
>  	COMPACT_CONTINUE,
>  	/*
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index e215bf68f521..6ba16c86d7db 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -10,8 +10,8 @@
>  #include <trace/events/mmflags.h>
> 
>  #define COMPACTION_STATUS					\
> -	EM( COMPACT_DEFERRED,		"deferred")		\
>  	EM( COMPACT_SKIPPED,		"skipped")		\
> +	EM( COMPACT_DEFERRED,		"deferred")		\
>  	EM( COMPACT_CONTINUE,		"continue")		\
>  	EM( COMPACT_PARTIAL,		"partial")		\
>  	EM( COMPACT_COMPLETE,		"complete")		\
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b06de27b7f72..13709e33a2fc 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  	int may_perform_io = gfp_mask & __GFP_IO;
>  	struct zoneref *z;
>  	struct zone *zone;
> -	enum compact_result rc = COMPACT_DEFERRED;
> +	enum compact_result rc = COMPACT_SKIPPED;
>  	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
> 
>  	*contended = COMPACT_CONTENDED_NONE;
> @@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  		enum compact_result status;
>  		int zone_contended;
> 
> -		if (compaction_deferred(zone, order))
> +		if (compaction_deferred(zone, order)) {
> +			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>  			continue;
> +		}
> 
>  		status = compact_zone_order(zone, order, gfp_mask, mode,
>  				&zone_contended, alloc_flags,
> @@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  	 * If at least one zone wasn't deferred or skipped, we report if all
>  	 * zones that were tried were lock contended.
>  	 */
> -	if (rc > COMPACT_SKIPPED && all_zones_contended)
> +	if (rc > COMPACT_INACTIVE && all_zones_contended)
>  		*contended = COMPACT_CONTENDED_LOCK;
> 
>  	return rc;
> --
> 2.8.0.rc3

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-04-20 19:47 ` [PATCH 12/14] mm, oom: protect !costly allocations some more Michal Hocko
@ 2016-04-21  8:03   ` Hillf Danton
  2016-05-04  6:01   ` Joonsoo Kim
  1 sibling, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  8:03 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations were the high-order request (e.g.

s/were/where/

> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee that
> further reclaim/compaction attempts would help. But at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> 	- the last compaction round backed off or
> 	- we haven't completed at least MAX_COMPACT_RETRIES active
> 	  compaction rounds.
> 
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> piggy backing of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
> 
> compaction_failed() is taken as a final word from the compaction that
> the retry doesn't make much sense. We have to be careful though because
> the first compaction round is MIGRATE_ASYNC which is rather weak as it
> ignores pages under writeback and gives up too easily in other
> situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> has been used before we give up. With this logic in place we do not have
> to increase the migration mode unconditionally and rather do it only if
> the compaction failed for the weaker mode. A nice side effect is that
> the stronger migration mode is used only when really needed so this has
> a potential of smaller latencies in some cases.
> 
> Please note that the compaction doesn't tell us much about how
> successful it was when returning compaction_made_progress so we just
> have to blindly trust that another retry is worthwhile and cap the
> number to something reasonable to guarantee a convergence.
> 
> If the given number of successful retries is not sufficient for
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 77 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b78936eca70..bb4df1be0d43 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
> 
> +
> +/*
> + * Maximum number of compaction retries with progress before the OOM
> + * killer is considered the only way to move forward.
> + */
> +#define MAX_COMPACT_RETRIES 16
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
> @@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> 
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     enum migrate_mode *migrate_mode,
> +		     int compaction_retries)
> +{
> +	if (!order)
> +		return false;
> +
> +	/*
> +	 * compaction considers all the zone as desperately out of memory
> +	 * so it doesn't really make much sense to retry except when the
> +	 * failure could be caused by weak migration mode.
> +	 */
> +	if (compaction_failed(compact_result)) {
> +		if (*migrate_mode == MIGRATE_ASYNC) {
> +			*migrate_mode = MIGRATE_SYNC_LIGHT;
> +			return true;
> +		}
> +		return false;
> +	}
> +
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> +	 * contention before we go OOM. Still cap the reclaim retry loops with
> +	 * progress to prevent from looping forever and potential trashing.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compaction_withdrawn(compact_result))
> +			return true;
> +		if (compaction_retries <= MAX_COMPACT_RETRIES)
> +			return true;
> +	}
> +
> +	return false;
> +}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  {
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     enum migrate_mode *migrate_mode,
> +		     int compaction_retries)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
> 
>  /* Perform direct synchronous page reclaim */
> @@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	enum compact_result compact_result;
> +	int compaction_retries = 0;
>  	int no_progress_loops = 0;
> 
>  	/*
> @@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			 compaction_failed(compact_result)))
>  		goto nopage;
> 
> -	/*
> -	 * It can become very expensive to allocate transparent hugepages at
> -	 * fault, so use asynchronous memory compaction for THP unless it is
> -	 * khugepaged trying to collapse.
> -	 */
> -	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> -		migration_mode = MIGRATE_SYNC_LIGHT;
> +	if (order && compaction_made_progress(compact_result))
> +		compaction_retries++;
> 
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 no_progress_loops))
>  		goto retry;
> 
> +	/*
> +	 * It doesn't make any sense to retry for the compaction if the order-0
> +	 * reclaim is not able to make any progress because the current
> +	 * implementation of the compaction depends on the sufficient amount
> +	 * of free memory (see __compaction_suitable)
> +	 */
> +	if (did_some_progress > 0 &&
> +			should_compact_retry(order, compact_result,
> +				&migration_mode, compaction_retries))
> +		goto retry;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 
>  noretry:
>  	/*
> -	 * High-order allocations do not necessarily loop after
> -	 * direct reclaim and reclaim/compaction depends on compaction
> -	 * being called after reclaim so call directly if necessary
> +	 * High-order allocations do not necessarily loop after direct reclaim
> +	 * and reclaim/compaction depends on compaction being called after
> +	 * reclaim so call directly if necessary.
> +	 * It can become very expensive to allocate transparent hugepages at
> +	 * fault, so use asynchronous memory compaction for THP unless it is
> +	 * khugepaged trying to collapse. All other requests should tolerate
> +	 * at least light sync migration.
>  	 */
> +	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> +		migration_mode = MIGRATE_ASYNC;
> +	else
> +		migration_mode = MIGRATE_SYNC_LIGHT;
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &compact_result);
> --
> 2.8.0.rc3
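
A hedged userspace sketch of the retry decision for !costly orders described
above: escalate the migration mode once when a full scan failed, otherwise
keep retrying while compaction backed off or the progress-capped budget lasts.
The names mirror the patch but the bodies are simplified, and
PAGE_ALLOC_COSTLY_ORDER is assumed to be 3:

#include <stdbool.h>
#include <stdio.h>

#define MAX_COMPACT_RETRIES 16

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };
enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_COMPLETE,
		      COMPACT_PARTIAL_SKIPPED, COMPACT_CONTENDED, COMPACT_PARTIAL };

static bool compaction_failed(enum compact_result r) { return r == COMPACT_COMPLETE; }
static bool compaction_withdrawn(enum compact_result r)
{
	return r == COMPACT_SKIPPED || r == COMPACT_DEFERRED ||
	       r == COMPACT_CONTENDED || r == COMPACT_PARTIAL_SKIPPED;
}

static bool should_compact_retry(unsigned int order, enum compact_result result,
				 enum migrate_mode *mode, int retries)
{
	if (!order)
		return false;

	/* a full scan failed: only a stronger migration mode can change the outcome */
	if (compaction_failed(result)) {
		if (*mode == MIGRATE_ASYNC) {
			*mode = MIGRATE_SYNC_LIGHT;
			return true;
		}
		return false;
	}

	/* !costly requests: retry while compaction backed off or budget remains */
	if (order <= 3) {	/* PAGE_ALLOC_COSTLY_ORDER assumed */
		if (compaction_withdrawn(result))
			return true;
		if (retries <= MAX_COMPACT_RETRIES)
			return true;
	}
	return false;
}

int main(void)
{
	enum migrate_mode mode = MIGRATE_ASYNC;

	printf("order-2, COMPLETE, async : %d (escalates to sync-light)\n",
	       should_compact_retry(2, COMPACT_COMPLETE, &mode, 0));
	printf("order-2, COMPLETE, sync  : %d (gives up)\n",
	       should_compact_retry(2, COMPACT_COMPLETE, &mode, 0));
	return 0;
}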

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/14] mm: consider compaction feedback also for costly allocation
  2016-04-20 19:47 ` [PATCH 13/14] mm: consider compaction feedback also for costly allocation Michal Hocko
@ 2016-04-21  8:13   ` Hillf Danton
  0 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  8:13 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
> should_reclaim_retry currently where we decide to not retry after at
> least order worth of pages were reclaimed or the watermark check for at
> least one zone would succeed after reclaiming all pages if the reclaim
> hasn't made any progress. Compaction feedback is mostly ignored and we
> just try to make sure that the compaction did at least something before
> giving up.
> 
> The first condition was added by a41f24ea9fd6 ("page allocator: smarter
> retry of costly-order allocations") and it assumed that lumpy reclaim
> could have created a page of the sufficient order. Lumpy reclaim
> has been removed quite some time ago so the assumption doesn't hold
> anymore. Remove the check for the number of reclaimed pages and rely
> on the compaction feedback solely. should_reclaim_retry now only
> makes sure that we keep retrying reclaim for high order pages only
> if they are hidden by watermarks so order-0 reclaim really makes sense.
> 
> should_compact_retry now keeps retrying even for the costly allocations.
> The number of retries is reduced wrt. !costly requests because they are
> less important and harder to grant and so their pressure shouldn't cause
> contention for other requests or cause an over reclaim. We also do not
> reset no_progress_loops for costly requests to make sure we do not keep
> reclaiming too aggressively.
> 
> This has been tested by running a process which fragments memory:
> 	- compact memory
> 	- mmap large portion of the memory (1920M on 2GRAM machine with 2G
> 	  of swapspace)
> 	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
> 	  steps until certain amount of memory is freed (250M in my test)
> 	  and reduce the step to (step / 2) + 1 after reaching the end of
> 	  the mapping
> 	- then run a script which populates the page cache 2G (MemTotal)
> 	  from /dev/zero to a new file
> And then tries to allocate
> nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
> huge pages.
> 
> root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run
> /root/alloc_hugepages.sh 1920M 250M
> Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
> Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437
> 
> * This is the /proc/buddyinfo after the compaction
> 
> Done fragmenting. size=2013265920 freed=262144000
> Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
> Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0
> 
> * /proc/buddyinfo after memory got fragmented
> 
> Executing "/root/alloc_hugepages.sh"
> Eating some pagecache
> 508623+0 records in
> 508623+0 records out
> 2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
> Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
> Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0
> 
> * /proc/buddyinfo after page cache got eaten
> 
> Trying to allocate 129
> 129
> 
> * 129 hugepages requested and all of them granted.
> 
> Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
> Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0
> 
> * /proc/buddyinfo after hugetlb allocation.
> 
> 10 runs will behave as follows:
> Trying to allocate 130
> 130
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 132
> 132
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
> 
> So basically 100% success for all 10 attempts.
> Without the patch numbers looked much worse:
> Trying to allocate 128
> 12
> --
> Trying to allocate 129
> 14
> --
> Trying to allocate 129
> 7
> --
> Trying to allocate 129
> 16
> --
> Trying to allocate 129
> 30
> --
> Trying to allocate 129
> 38
> --
> Trying to allocate 129
> 19
> --
> Trying to allocate 129
> 37
> --
> Trying to allocate 129
> 28
> --
> Trying to allocate 129
> 37
> 
> Just for completeness the base kernel without oom detection rework looks
> as follows:
> Trying to allocate 127
> 30
> --
> Trying to allocate 129
> 12
> --
> Trying to allocate 129
> 52
> --
> Trying to allocate 128
> 32
> --
> Trying to allocate 129
> 12
> --
> Trying to allocate 129
> 10
> --
> Trying to allocate 129
> 32
> --
> Trying to allocate 128
> 14
> --
> Trying to allocate 128
> 16
> --
> Trying to allocate 129
> 8
> 
> As we can see the success rate is much more volatile and smaller without
> this patch. So the patch not only makes the retry logic for costly
> requests more sensible, the success rate is also higher.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
>  1 file changed, 32 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb4df1be0d43..d5a938f12554 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
>  		     enum migrate_mode *migrate_mode,
>  		     int compaction_retries)
>  {
> +	int max_retries = MAX_COMPACT_RETRIES;
> +
>  	if (!order)
>  		return false;
> 
> @@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
>  	}
> 
>  	/*
> -	 * !costly allocations are really important and we have to make sure
> -	 * the compaction wasn't deferred or didn't bail out early due to locks
> -	 * contention before we go OOM. Still cap the reclaim retry loops with
> -	 * progress to prevent from looping forever and potential trashing.
> +	 * make sure the compaction wasn't deferred or didn't bail out early
> +	 * due to locks contention before we declare that we should give up.
>  	 */
> -	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> -		if (compaction_withdrawn(compact_result))
> -			return true;
> -		if (compaction_retries <= MAX_COMPACT_RETRIES)
> -			return true;
> -	}
> +	if (compaction_withdrawn(compact_result))
> +		return true;
> +
> +	/*
> +	 * !costly requests are much more important than __GFP_REPEAT
> +	 * costly ones because they are de facto nofail and invoke OOM
> +	 * killer to move on while costly can fail and users are ready
> +	 * to cope with that. 1/4 retries is rather arbitrary but we
> +	 * would need much more detailed feedback from compaction to
> +	 * make a better decision.
> +	 */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		max_retries /= 4;
> +	if (compaction_retries <= max_retries)
> +		return true;
> 
>  	return false;
>  }
> @@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> - * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
> - * pages) and no_progress_loops (number of reclaim rounds without any progress
> - * in a row) is considered as well as the reclaimable pages on the applicable
> - * zone list (with a backoff mechanism which is a function of no_progress_loops).
> + * the last reclaim round) and no_progress_loops (number of reclaim rounds without
> + * any progress in a row) is considered as well as the reclaimable pages on the
> + * applicable zone list (with a backoff mechanism which is a function of
> + * no_progress_loops).
>   *
>   * Returns true if a retry is viable or false to enter the oom path.
>   */
>  static inline bool
>  should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		     struct alloc_context *ac, int alloc_flags,
> -		     bool did_some_progress, unsigned long pages_reclaimed,
> -		     int no_progress_loops)
> +		     bool did_some_progress, int no_progress_loops)
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> @@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  	if (no_progress_loops > MAX_RECLAIM_RETRIES)
>  		return false;
> 
> -	if (order > PAGE_ALLOC_COSTLY_ORDER) {
> -		if (pages_reclaimed >= (1<<order))
> -			return false;
> -
> -		if (did_some_progress)
> -			return true;
> -	}
> -
>  	/*
>  	 * Keep reclaiming pages while there is a chance this will lead somewhere.
>  	 * If none of the target zones can satisfy our allocation request even
> @@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>  	struct page *page = NULL;
>  	int alloc_flags;
> -	unsigned long pages_reclaimed = 0;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	enum compact_result compact_result;
> @@ -3444,16 +3443,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>  		goto noretry;
> 
> -	if (did_some_progress) {
> +	/*
> +	 * Costly allocations might have made a progress but this doesn't mean
> +	 * their order will become available due to high fragmentation so
> +	 * always increment the no progress counter for them
> +	 */
> +	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>  		no_progress_loops = 0;
> -		pages_reclaimed += did_some_progress;
> -	} else {
> +	else
>  		no_progress_loops++;
> -	}
> 
>  	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> -				 did_some_progress > 0, pages_reclaimed,
> -				 no_progress_loops))
> +				 did_some_progress > 0, no_progress_loops))
>  		goto retry;
> 
>  	/*
> --
> 2.8.0.rc3
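
The fragmenting step described in the changelog above can be reconstructed
roughly as follows. This is only a sketch of the test, not the original
program: MAX_ORDER=11, the 1920M/250M sizes and the "+1 page" step reduction
are assumptions taken from the description.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1920UL << 20;		/* ~1920M mapping, as in the test */
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = len / page;
	size_t target = (250UL << 20) / page;	/* stop after ~250M was freed */
	size_t step = (1UL << 11) - 1;		/* (1UL << MAX_ORDER) - 1 pages, MAX_ORDER assumed 11 */
	size_t freed = 0;
	char *map;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	while (freed < target) {
		size_t off;

		/* punch single-page holes every 'step' pages to fragment free memory */
		for (off = 0; off < npages && freed < target; off += step) {
			madvise(map + off * page, page, MADV_DONTNEED);
			freed++;
		}
		/* reduce the step after each pass over the mapping */
		step = step / 2 + 1;
	}

	printf("Done fragmenting. size=%zu freed=%zu\n", len, freed * page);
	pause();	/* keep the fragmented mapping alive for the hugepage test */
	return 0;
}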

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
@ 2016-04-21  8:24   ` Hillf Danton
  2016-04-28  8:59   ` Vlastimil Babka
  2016-05-04  6:27   ` Joonsoo Kim
  2 siblings, 0 replies; 60+ messages in thread
From: Hillf Danton @ 2016-04-21  8:24 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'Joonsoo Kim',
	'Vlastimil Babka', linux-mm, 'LKML',
	'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> "mm: consider compaction feedback also for costly allocation" has
> removed the upper bound for the reclaim/compaction retries based on the
> number of reclaimed pages for costly orders. While this is desirable
> the patch did miss a mis interaction between reclaim, compaction and the
> retry logic. The direct reclaim tries to get zones over min watermark
> while compaction backs off and returns COMPACT_SKIPPED when all zones
> are below low watermark + 1<<order gap. If we are getting really close
> to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED for
> a high order request (e.g. hugetlb order-9) while the reclaim is not able
> to release enough pages to get us over low watermark. The reclaim is
> still able to make some progress (usually trashing over few remaining
> pages) so we are not able to break out from the loop.
> 
> I have seen this happening with the same test described in "mm: consider
> compaction feedback also for costly allocation" on a swapless system.
> The original problem got resolved by "vmscan: consider classzone_idx in
> compaction_ready" but it shows how things might go wrong when we
> approach the oom event horizon.
> 
> The reason why compaction requires being over low rather than min
> watermark is not clear to me. This check was there essentially since
> 56de7263fcf3 ("mm: compaction: direct compact when a high-order
> allocation fails"). It is clearly an implementation detail though and we
> shouldn't pull it into the generic retry logic while we should be able
> to cope with such eventuality. The only place in should_compact_retry
> where we retry without any upper bound is for compaction_withdrawn()
> case.
> 
> Introduce compaction_zonelist_suitable function which checks the given
> zonelist and returns true only if there is at least one zone which would
> unblock __compaction_suitable if more memory got reclaimed. In
> this implementation it checks __compaction_suitable with NR_FREE_PAGES
> plus part of the reclaimable memory as the target for the watermark check.
> The reclaimable memory is reduced linearly by the allocation order. The
> idea is that we do not want to reclaim all the remaining memory for a
> single allocation request just to unblock __compaction_suitable which
> doesn't guarantee we will make a further progress.
> 
> The new helper is then used if compaction_withdrawn() feedback was
> provided so we do not retry if there is no outlook for a further
> progress. !costly requests shouldn't be affected much - e.g. order-2
> pages would require to have at least 64kB on the reclaimable LRUs while
> order-9 would need at least 32M which should be enough to not lock up.
> 
> [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in
> compaction_zonelist_suitable]
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

> ---
>  include/linux/compaction.h |  4 ++++
>  include/linux/mmzone.h     |  3 +++
>  mm/compaction.c            | 42 +++++++++++++++++++++++++++++++++++++++---
>  mm/page_alloc.c            | 18 +++++++++++-------
>  4 files changed, 57 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a002ca55c513..7bbdbf729757 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
>  	return false;
>  }
> 
> +
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> +					int alloc_flags);
> +
>  extern int kcompactd_run(int nid);
>  extern void kcompactd_stop(int nid);
>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 150c6049f961..0bf13c7cd8cd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
>  extern struct mutex zonelists_mutex;
>  void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
>  void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
> +			unsigned long mark, int classzone_idx, int alloc_flags,
> +			long free_pages);
>  bool zone_watermark_ok(struct zone *z, unsigned int order,
>  		unsigned long mark, int classzone_idx, int alloc_flags);
>  bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e2e487cea5ea..0a7ca578af97 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
>   *   COMPACT_CONTINUE - If compaction should run now
>   */
>  static enum compact_result __compaction_suitable(struct zone *zone, int order,
> -					int alloc_flags, int classzone_idx)
> +					int alloc_flags, int classzone_idx,
> +					unsigned long wmark_target)
>  {
>  	int fragindex;
>  	unsigned long watermark;
> @@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
>  	 * allocated and for a short time, the footprint is higher
>  	 */
>  	watermark += (2UL << order);
> -	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
> +	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
> +				 alloc_flags, wmark_target))
>  		return COMPACT_SKIPPED;
> 
>  	/*
> @@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
>  {
>  	enum compact_result ret;
> 
> -	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
> +	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
> +				    zone_page_state(zone, NR_FREE_PAGES));
>  	trace_mm_compaction_suitable(zone, order, ret);
>  	if (ret == COMPACT_NOT_SUITABLE_ZONE)
>  		ret = COMPACT_SKIPPED;
> @@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
>  	return ret;
>  }
> 
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> +		int alloc_flags)
> +{
> +	struct zone *zone;
> +	struct zoneref *z;
> +
> +	/*
> +	 * Make sure at least one zone would pass __compaction_suitable if we continue
> +	 * retrying the reclaim.
> +	 */
> +	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> +					ac->nodemask) {
> +		unsigned long available;
> +		enum compact_result compact_result;
> +
> +		/*
> +		 * Do not consider all the reclaimable memory because we do not
> +		 * want to trash just for a single high order allocation which
> +		 * is even not guaranteed to appear even if __compaction_suitable
> +		 * is happy about the watermark check.
> +		 */
> +		available = zone_reclaimable_pages(zone) / order;
> +		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		compact_result = __compaction_suitable(zone, order, alloc_flags,
> +				ac->classzone_idx, available);
> +		if (compact_result != COMPACT_SKIPPED &&
> +				compact_result != COMPACT_NOT_SUITABLE_ZONE)
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
>  {
>  	enum compact_result ret;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5a938f12554..6757d6df2160 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   * one free page of a suitable size. Checking now avoids taking the zone lock
>   * to check in the allocation paths if no pages are free.
>   */
> -static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  			unsigned long mark, int classzone_idx, int alloc_flags,
>  			long free_pages)
>  {
> @@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  }
> 
>  static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> -		     enum migrate_mode *migrate_mode,
> +should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> +		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
>  		     int compaction_retries)
>  {
>  	int max_retries = MAX_COMPACT_RETRIES;
> @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
>  	/*
>  	 * make sure the compaction wasn't deferred or didn't bail out early
>  	 * due to locks contention before we declare that we should give up.
> +	 * But do not retry if the given zonelist is not suitable for
> +	 * compaction.
>  	 */
>  	if (compaction_withdrawn(compact_result))
> -		return true;
> +		return compaction_zonelist_suitable(ac, order, alloc_flags);
> 
>  	/*
>  	 * !costly requests are much more important than __GFP_REPEAT
> @@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  }
> 
>  static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> +should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
> +		     enum compact_result compact_result,
>  		     enum migrate_mode *migrate_mode,
>  		     int compaction_retries)
>  {
> @@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * of free memory (see __compaction_suitable)
>  	 */
>  	if (did_some_progress > 0 &&
> -			should_compact_retry(order, compact_result,
> -				&migration_mode, compaction_retries))
> +			should_compact_retry(ac, order, alloc_flags,
> +				compact_result, &migration_mode,
> +				compaction_retries))
>  		goto retry;
> 
>  	/* Reclaim has failed us, start killing things */
> --
> 2.8.0.rc3
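
A quick back-of-the-envelope check of the order-2/order-9 numbers quoted in
the changelog, ignoring the base low watermark term: available =
free + reclaimable/order has to cover an extra 2<<order pages, so the
reclaimable LRUs need roughly order * (2<<order) pages. This is only an
illustration of the changelog arithmetic (4kB pages assumed), not the exact
kernel check:

#include <stdio.h>

int main(void)
{
	int orders[] = { 2, 9 };
	unsigned int i;

	for (i = 0; i < sizeof(orders) / sizeof(orders[0]); i++) {
		int order = orders[i];
		unsigned long pages = (unsigned long)order * (2UL << order);

		/* order-2: 16 pages ~ 64kB, order-9: 9216 pages ~ 36MB */
		printf("order-%d: ~%lu reclaimable pages (~%lu kB)\n",
		       order, pages, pages * 4);
	}
	return 0;
}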

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers
  2016-04-20 19:47 ` [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers Michal Hocko
  2016-04-21  6:57   ` Hillf Danton
@ 2016-04-28  8:47   ` Vlastimil Babka
  1 sibling, 0 replies; 60+ messages in thread
From: Vlastimil Babka @ 2016-04-28  8:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, linux-mm, LKML,
	Michal Hocko

On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Compaction can provide a wide variety of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> 	- compaction_made_progress - compaction was active and made some
> 	  progress.
> 	- compaction_failed - compaction failed and further attempts to
> 	  invoke it would most probably fail and therefore it is not
> 	  worth retrying
> 	- compaction_withdrawn - compaction wasn't invoked for
>            implementation specific reasons. In the current implementation
>            it means that the compaction was deferred, contended or the
>            page scanners met too early without any progress. Retrying is
>            still worthwhile.
>
> [vbabka@suse.cz: do not change thp back off behavior]
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-20 19:47 ` [PATCH 09/14] mm: use compaction feedback for thp backoff conditions Michal Hocko
  2016-04-21  7:05   ` Hillf Danton
@ 2016-04-28  8:53   ` Vlastimil Babka
  2016-04-28 12:35     ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Vlastimil Babka @ 2016-04-28  8:53 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, linux-mm, LKML,
	Michal Hocko

On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> THP requests skip the direct reclaim if the compaction is either
> deferred or contended to reduce stalls which wouldn't help the
> allocation success anyway. These checks are ignoring other potential
> feedback modes which we have available now.
>
> It clearly doesn't make much sense to go and reclaim few pages if the
> previous compaction has failed.
>
> We can also simplify the check by using compaction_withdrawn which
> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> is however covering more reasons why the compaction was withdrawn.
> None of them should be a problem for the THP case though.
>
> It is safe to back off if we see COMPACT_SKIPPED because that means
> that compaction_suitable failed and a single round of the reclaim is
> unlikely to make any difference here. We would have to be close to
> the low watermark to reclaim enough and even then there is no guarantee
> that the compaction would make any progress while the direct reclaim
> would have caused the stall.
>
> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> have only seen a part of the zone so a retry would make some sense. But
> it would be a compaction retry not a reclaim retry to perform. We are
> not doing that and that might indeed lead to situations where THP fails
> but this should happen only rarely and it would be really hard to
> measure.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

THP's don't compact by default in page fault path anymore, so we don't 
need to restrict them even more. And hopefully we'll replace the 
is_thp_gfp_mask() hack with something better soon, so this might be just 
extra code churn. But I don't feel strongly enough to nack it.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
  2016-04-21  8:24   ` Hillf Danton
@ 2016-04-28  8:59   ` Vlastimil Babka
  2016-04-28 12:39     ` Michal Hocko
  2016-05-04  6:27   ` Joonsoo Kim
  2 siblings, 1 reply; 60+ messages in thread
From: Vlastimil Babka @ 2016-04-28  8:59 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, linux-mm, LKML,
	Michal Hocko

On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> "mm: consider compaction feedback also for costly allocation" has
> removed the upper bound for the reclaim/compaction retries based on the
> number of reclaimed pages for costly orders. While this is desirable
> the patch did miss a mis interaction between reclaim, compaction and the
> retry logic.

Hmm perhaps reversing the order of patches 13 and 14 would be a bit 
safer wrt future bisections then? Add compaction_zonelist_suitable() 
first with the reasoning, and then immediately use it in the other patch.

> The direct reclaim tries to get zones over min watermark
> while compaction backs off and returns COMPACT_SKIPPED when all zones
> are below low watermark + 1<<order gap. If we are getting really close
> to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED for
> a high order request (e.g. hugetlb order-9) while the reclaim is not able
> to release enough pages to get us over low watermark. The reclaim is
> still able to make some progress (usually trashing over few remaining
> pages) so we are not able to break out from the loop.
>
> I have seen this happening with the same test described in "mm: consider
> compaction feedback also for costly allocation" on a swapless system.
> The original problem got resolved by "vmscan: consider classzone_idx in
> compaction_ready" but it shows how things might go wrong when we
> approach the oom event horizon.
>
> The reason why compaction requires being over low rather than min
> watermark is not clear to me. This check was there essentially since
> 56de7263fcf3 ("mm: compaction: direct compact when a high-order
> allocation fails"). It is clearly an implementation detail though and we
> shouldn't pull it into the generic retry logic while we should be able
> to cope with such eventuality. The only place in should_compact_retry
> where we retry without any upper bound is for compaction_withdrawn()
> case.
>
> Introduce compaction_zonelist_suitable function which checks the given
> zonelist and returns true only if there is at least one zone which would
> unblock __compaction_suitable if more memory got reclaimed. In
> this implementation it checks __compaction_suitable with NR_FREE_PAGES
> plus part of the reclaimable memory as the target for the watermark check.
> The reclaimable memory is reduced linearly by the allocation order. The
> idea is that we do not want to reclaim all the remaining memory for a
> single allocation request just to unblock __compaction_suitable which
> doesn't guarantee we will make a further progress.
>
> The new helper is then used if compaction_withdrawn() feedback was
> provided so we do not retry if there is no outlook for a further
> progress. !costly requests shouldn't be affected much - e.g. order-2
> pages would require to have at least 64kB on the reclaimable LRUs while
> order-9 would need at least 32M which should be enough to not lock up.
>
> [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in
> compaction_zonelist_suitable]
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-28  8:53   ` Vlastimil Babka
@ 2016-04-28 12:35     ` Michal Hocko
  2016-04-29  9:16       ` Vlastimil Babka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-04-28 12:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Joonsoo Kim, Hillf Danton,
	linux-mm, LKML

On Thu 28-04-16 10:53:18, Vlastimil Babka wrote:
> On 04/20/2016 09:47 PM, Michal Hocko wrote:
> >From: Michal Hocko <mhocko@suse.com>
> >
> >THP requests skip the direct reclaim if the compaction is either
> >deferred or contended to reduce stalls which wouldn't help the
> >allocation success anyway. These checks are ignoring other potential
> >feedback modes which we have available now.
> >
> >It clearly doesn't make much sense to go and reclaim a few pages if the
> >previous compaction has failed.
> >
> >We can also simplify the check by using compaction_withdrawn which
> >checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> >is however covering more reasons why the compaction was withdrawn.
> >None of them should be a problem for the THP case though.
> >
> >It is safe to back off if we see COMPACT_SKIPPED because that means
> >that compaction_suitable failed and a single round of the reclaim is
> >unlikely to make any difference here. We would have to be close to
> >the low watermark to reclaim enough and even then there is no guarantee
> >that the compaction would make any progress while the direct reclaim
> >would have caused the stall.
> >
> >COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> >have only seen a part of the zone so a retry would make some sense. But
> >it would be a compaction retry not a reclaim retry to perform. We are
> >not doing that and that might indeed lead to situations where THP fails
> >but this should happen only rarely and it would be really hard to
> >measure.
> >
> >Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> THP's don't compact by default in page fault path anymore, so we don't need
> to restrict them even more. And hopefully we'll replace the
> is_thp_gfp_mask() hack with something better soon, so this might be just
> extra code churn. But I don't feel strongly enough to nack it.

My main point was to simplify the code and get rid of as many compaction
specific hacks as possible. We might very well drop this later on but it
would at least be less code to grasp. I do not have any problem
with dropping this but I think this shouldn't collide with other patches
much so reducing the number of lines is worth it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-04-28  8:59   ` Vlastimil Babka
@ 2016-04-28 12:39     ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-28 12:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Joonsoo Kim, Hillf Danton,
	linux-mm, LKML

On Thu 28-04-16 10:59:22, Vlastimil Babka wrote:
> On 04/20/2016 09:47 PM, Michal Hocko wrote:
> >From: Michal Hocko <mhocko@suse.com>
> >
> >"mm: consider compaction feedback also for costly allocation" has
> >removed the upper bound for the reclaim/compaction retries based on the
> >number of reclaimed pages for costly orders. While this is desirable,
> >the patch missed an interaction between reclaim, compaction and the
> >retry logic.
> 
> Hmm perhaps reversing the order of patches 13 and 14 would be a bit safer
> wrt future bisections then? Add compaction_zonelist_suitable() first with
> the reasoning, and then immediately use it in the other patch.

Hmm, I do not think the risk is high. This would require allocating
GFP_REPEAT large orders down to the last drop, which is not usual. I found
this ordering more logical to argue about because this patch will be mostly
a noop for costly orders without patch 13, and !costly allocations retry
endlessly anyway. So I would prefer this ordering even though there is
a window where an extreme load can lock up. I do not expect people to
shoot themselves in the head during a bisection.

[...]
> >
> >[vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in
> >compaction_zonelist_suitable]
> >Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-28 12:35     ` Michal Hocko
@ 2016-04-29  9:16       ` Vlastimil Babka
  2016-04-29  9:28         ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Vlastimil Babka @ 2016-04-29  9:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Joonsoo Kim, Hillf Danton,
	linux-mm, LKML, Andrea Arcangeli

On 04/28/2016 02:35 PM, Michal Hocko wrote:
> On Thu 28-04-16 10:53:18, Vlastimil Babka wrote:
>> On 04/20/2016 09:47 PM, Michal Hocko wrote:
>>> From: Michal Hocko <mhocko@suse.com>
>>>
>>> THP requests skip the direct reclaim if the compaction is either
>>> deferred or contended to reduce stalls which wouldn't help the
>>> allocation success anyway. These checks are ignoring other potential
>>> feedback modes which we have available now.
>>>
>>> It clearly doesn't make much sense to go and reclaim a few pages if the
>>> previous compaction has failed.
>>>
>>> We can also simplify the check by using compaction_withdrawn which
>>> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
>>> is however covering more reasons why the compaction was withdrawn.
>>> None of them should be a problem for the THP case though.
>>>
>>> It is safe to back off if we see COMPACT_SKIPPED because that means
>>> that compaction_suitable failed and a single round of the reclaim is
>>> unlikely to make any difference here. We would have to be close to

Hmm this is actually incorrect, as should_continue_reclaim() will keep 
shrink_zone() going as much as needed for compaction to become enabled, 
so it doesn't reclaim just SWAP_CLUSTER_MAX.
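
For context, this is roughly the check in question (paraphrased from the
mm/vmscan.c of this era with details trimmed, so take it as a sketch rather
than the exact code):

	/* should_continue_reclaim(), abridged */
	pages_for_compaction = 2UL << sc->order;
	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
	if (get_nr_swap_pages() > 0)
		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
	if (sc->nr_reclaimed < pages_for_compaction &&
			inactive_lru_pages > pages_for_compaction)
		return true;	/* keep shrink_zone() looping */

	/* stop only once compaction says it can take over */
	switch (compaction_suitable(zone, sc->order, 0, 0)) {
	case COMPACT_PARTIAL:
	case COMPACT_CONTINUE:
		return false;
	default:
		return true;
	}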

>>> the low watermark to reclaim enough and even then there is no guarantee
>>> that the compaction would make any progress while the direct reclaim
>>> would have caused the stall.
>>>
>>> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
>>> have only seen a part of the zone so a retry would make some sense. But
>>> it would be a compaction retry not a reclaim retry to perform. We are
>>> not doing that and that might indeed lead to situations where THP fails
>>> but this should happen only rarely and it would be really hard to
>>> measure.
>>>
>>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>>
>> THP's don't compact by default in page fault path anymore, so we don't need
>> to restrict them even more. And hopefully we'll replace the
>> is_thp_gfp_mask() hack with something better soon, so this might be just
>> extra code churn. But I don't feel strongly enough to nack it.
>
> My main point was to simplify the code and get rid of as many compaction
> specific hacks as possible. We might very well drop this later on but it
> would at least be less code to grasp. I do not have any problem
> with dropping this but I think this shouldn't collide with other patches
> much so reducing the number of lines is worth it.

I just realized it also affects khugepaged, and not just THP page 
faults, so it may potentially cripple THP's completely. My main issue is 
that the reasons to bail out include COMPACT_SKIPPED, and for a wrong 
reason (see the comment above). It also goes against the comment below 
the noretry label:

  * High-order allocations do not necessarily loop after direct reclaim
  * and reclaim/compaction depends on compaction being called after
  * reclaim so call directly if necessary.

Given that THP's are large, I expect reclaim would indeed be quite often 
necessary before compaction, and the first optimistic async compaction 
attempt will just return SKIPPED. After this patch, there will be no 
more reclaim/compaction attempts for THP's, including khugepaged. And 
given the change of THP page fault defaults, even crippling that path 
should no longer be necessary.

So I would just drop this for now indeed.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 09/14] mm: use compaction feedback for thp backoff conditions
  2016-04-29  9:16       ` Vlastimil Babka
@ 2016-04-29  9:28         ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-04-29  9:28 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, linux-mm, LKML,
	Andrea Arcangeli

On Fri 29-04-16 11:16:44, Vlastimil Babka wrote:
> On 04/28/2016 02:35 PM, Michal Hocko wrote:
[...]
> >My main point was to simplify the code and get rid of as many compaction
> >specific hacks as possible. We might very well drop this later on but it
> >would at least be less code to grasp. I do not have any problem
> >with dropping this but I think this shouldn't collide with other patches
> >much so reducing the number of lines is worth it.

Good point, I have completely missed this part.

> I just realized it also affects khugepaged, and not just THP page faults, so
> it may potentially cripple THP's completely. My main issue is that the
> reasons to bail out include COMPACT_SKIPPED, and for a wrong reason (see
> the comment above). It also goes against the comment below the noretry
> label:
> 
>  * High-order allocations do not necessarily loop after direct reclaim
>  * and reclaim/compaction depends on compaction being called after
>  * reclaim so call directly if necessary.
> 
> Given that THP's are large, I expect reclaim would indeed be quite often
> necessary before compaction, and the first optimistic async compaction
> attempt will just return SKIPPED. After this patch, there will be no more
> reclaim/compaction attempts for THP's, including khugepaged. And given the
> change of THP page fault defaults, even crippling that path should no longer
> be necessary.
> 
> So I would just drop this for now indeed.

Agreed, thanks for catching this. Andrew, could you drop this patch
please? It was supposed to be a mere clean up without any effect on the
oom detection.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
                   ` (13 preceding siblings ...)
  2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
@ 2016-05-04  5:45 ` Joonsoo Kim
  2016-05-04  8:12   ` Vlastimil Babka
  2016-05-04  8:47   ` Michal Hocko
  14 siblings, 2 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04  5:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed, Apr 20, 2016 at 03:47:13PM -0400, Michal Hocko wrote:
> Hi,
> 
> This is v6 of the series. The previous version was posted [1]. The
> code hasn't changed much since then. I have found one old standing
> bug (patch 1) which just got much more severe and visible with this
> series. Other than that I have reorganized the series and put the
> compaction feedback abstraction to the front just in case we find out
> that parts of the series would have to be reverted later on for some
> reason. The premature oom killer invocation reported by Hugh [2] seems
> to be addressed.
> 
> We have discussed this series at LSF/MM summit in Raleigh and there
> didn't seem to be any concerns/objections to go on with the patch set
> and target it for the next merge window. 

I still don't agree with some parts of this patchset that deal with
!costly orders. As you know, there were two regression reports from Hugh
and Aaron and you fixed them by ensuring that compaction is triggered. I
think that these show the problem of this patchset. The previous kernel
doesn't need to ensure that compaction is triggered and just works fine in
any case. Your series makes compaction necessary for all. OOM handling
is an essential part of MM but compaction isn't. OOM handling should not
depend on compaction. I tested my own benchmark without
CONFIG_COMPACTION and found that premature OOM happens.

I hope that you try to test something without CONFIG_COMPACTION.

Thanks.

> 
> Motivation:
> As pointed by Linus [3][4] relying on zone_reclaimable as a way to
> communicate the reclaim progress is rater dubious. I tend to agree,
> not only it is really obscure, it is not hard to imagine cases where a
> single page freed in the loop keeps all the reclaimers looping without
> getting any progress because their gfp_mask wouldn't allow to get that
> page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
> rare so it doesn't happen in the practice but the current logic which we
> have is rather obscure and hard to follow a also non-deterministic.
> 
> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress which is implemented at the page allocator layer rather spread
> out between the allocator and the reclaim. The more on the implementation
> is described in the first patch.
> 
> I have tested several different scenarios but it should be clear that
> testing OOM killer is quite hard to be representative. There is usually
> a tiny gap between almost OOM and full blown OOM which is often time
> sensitive. Anyway, I have tested the following 2 scenarios and I would
> appreciate if there are more to test.
> 
> Testing environment: a virtual machine with 2G of RAM and 2CPUs without
> any swap to make the OOM more deterministic.
> 
> 1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
>    file size, removes the files and starts over again) running in
>    parallel for 10s to build up a lot of dirty pages when 100 parallel
>    mem_eaters (anon private populated mmap which waits until it gets
>    signal) with 80M each.
> 
>    This causes an OOM flood of course and I have compared both patched
>    and unpatched kernels. The test is considered finished after there
>    are no OOM conditions detected. This should tell us whether there are
>    any excessive kills or some of them premature (e.g. due to dirty pages):
> 
> I have performed two runs this time each after a fresh boot.
> 
> * base kernel
> $ grep "Out of memory:" base-oom-run1.log | wc -l
> 78
> $ grep "Out of memory:" base-oom-run2.log | wc -l
> 78
> 
> $ grep "Kill process" base-oom-run1.log | tail -n1
> [   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
> $ grep "Kill process" base-oom-run2.log | tail -n1
> [   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
> 
> $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
> min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
> $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
> min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
> 
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
> 1
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
> 3
> 
> * patched kernel
> $ grep "Out of memory:" patched-oom-run1.log | wc -l
> 78
> miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
> 77
> 
> e grep "Kill process" patched-oom-run1.log | tail -n1
> [  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
> $ grep "Kill process" patched-oom-run2.log | tail -n1
> [  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
> 
> $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
> min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
> $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk 
> min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
> 
> e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> 2
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> 3
> 
> The patched kernel ran noticeably longer while invoking the OOM killer the
> same number of times. This means that the original implementation is much
> more aggressive and triggers the OOM killer sooner. The free pages stats
> show that neither kernel went OOM too early most of the time, though. I
> guess the difference is in the backoff, where retries without any progress
> sleep for a while if there is memory under writeback or dirty memory, which
> is highly likely considering the parallel IO.
> Both kernels have seen races where a zone wasn't marked unreclaimable
> and we still hit the OOM killer. This is most likely a race where
> a task managed to exit between the last allocation attempt and the oom
> killer invocation.
> 
> 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
>    memory as possible without triggering the OOM killer. This required a lot
>    of tuning but I've considered 3 consecutive runs in three different boots
>    without OOM as a success.
> 
> * base kernel
> size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
> 
> * patched kernel
> size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
> 
> That means 40M more memory was usable without triggering OOM killer. The
> base kernel sometimes managed to handle the same as patched but it
> wasn't consistent and failed in at least one of the 3 runs. This seems
> like a minor improvement.
> 
> I was also testing GFP_REPEAT costly requests (hugetlb) with fragmented
> memory and under memory pressure. The results are in patch 11 where the
> logic is implemented. In short I can see huge improvement there.
> 
> I am certainly interested in other usecases as well as any
> feedback. Especially those which require higher order requests.
> 
> * Changes since v5
> - added "vmscan: consider classzone_idx in compaction_ready"
> - added "mm, oom, compaction: prevent from should_compact_retry looping
>   for ever for costly orders"
> - acked-bys from Vlastimil
> - integrated feedback from review
> * Changes since v4
> - dropped __GFP_REPEAT for costly allocation as it is now replaced by
>   the compaction based feedback logic
> - !costly high order requests are retried based on the compaction feedback
> - compaction feedback has been tweaked to give us useful information
>   to make decisions in the page allocator
> - rebased on the current mmotm-2016-04-01-16-24 with the previous version
>   of the rework reverted
> 
> * Changes since v3
> - factor out the new heuristic into its own function as suggested by
>   Johannes (no functional changes)
> 
> * Changes since v2
> - rebased on top of mmotm-2015-11-25-17-08 which includes
>   wait_iff_congested related changes which needed refresh in
>   patch#1 and patch#2
> - use zone_page_state_snapshot for NR_FREE_PAGES per David
> - shrink_zones doesn't need to return anything per David
> - retested because the major kernel version has changed since
>   the last time (4.2 -> 4.3 based kernel + mmotm patches)
> 
> * Changes since v1
> - backoff calculation was de-obfuscated by using DIV_ROUND_UP
> - __GFP_NOFAIL high order might fail fixed - theoretical bug
> 
> [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
> [2] http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils
> [3] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
> [4] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-04-20 19:47 ` [PATCH 12/14] mm, oom: protect !costly allocations some more Michal Hocko
  2016-04-21  8:03   ` Hillf Danton
@ 2016-05-04  6:01   ` Joonsoo Kim
  2016-05-04  6:31     ` Joonsoo Kim
  2016-05-04  8:53     ` Michal Hocko
  1 sibling, 2 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04  6:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> 	- the last compaction round backed off or
> 	- we haven't completed at least MAX_COMPACT_RETRIES active
> 	  compaction rounds.
> 
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> piggy backing of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
> 
> compaction_failed() is taken as a final word from the compaction that
> the retry doesn't make much sense. We have to be careful though because
> the first compaction round is MIGRATE_ASYNC which is rather weak as it
> ignores pages under writeback and gives up too easily in other
> situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> has been used before we give up. With this logic in place we do not have
> to increase the migration mode unconditionally and rather do it only if
> the compaction failed for the weaker mode. A nice side effect is that
> the stronger migration mode is used only when really needed so this has
> a potential of smaller latencies in some cases.
> 
> Please note that the compaction doesn't tell us much about how
> successful it was when returning compaction_made_progress so we just
> have to blindly trust that another retry is worthwhile and cap the
> number to something reasonable to guarantee convergence.
> 
> If the given number of successful retries is not sufficient for
> reasonable workloads, we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 77 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b78936eca70..bb4df1be0d43 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> +
> +/*
> + * Maximum number of compaction retries with progress before the OOM
> + * killer is considered the only way to move forward.
> + */
> +#define MAX_COMPACT_RETRIES 16
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
> @@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     enum migrate_mode *migrate_mode,
> +		     int compaction_retries)
> +{
> +	if (!order)
> +		return false;
> +
> +	/*
> +	 * compaction considers all the zone as desperately out of memory
> +	 * so it doesn't really make much sense to retry except when the
> +	 * failure could be caused by weak migration mode.
> +	 */
> +	if (compaction_failed(compact_result)) {

IIUC, this compaction_failed() means that at least one zone was
compacted and failed. This is not the same as your assumption in the
comment. If compaction is done and failed on ZONE_DMA, it would be
a premature decision.
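
For reference, the helper being discussed (as introduced earlier in this
series; reproduced from memory, so treat the details as approximate) only
sees the single aggregated result passed up from the compaction attempt:

	static inline bool compaction_failed(enum compact_result result)
	{
		/* All zones were scanned completely and still no suitable page. */
		if (result == COMPACT_COMPLETE)
			return true;

		return false;
	}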

> +		if (*migrate_mode == MIGRATE_ASYNC) {
> +			*migrate_mode = MIGRATE_SYNC_LIGHT;
> +			return true;
> +		}
> +		return false;
> +	}
> +
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> +	 * contention before we go OOM. Still cap the reclaim retry loops with
> +	 * progress to prevent from looping forever and potential trashing.
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compaction_withdrawn(compact_result))
> +			return true;
> +		if (compaction_retries <= MAX_COMPACT_RETRIES)
> +			return true;
> +	}
> +
> +	return false;
> +}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  {
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     enum migrate_mode *migrate_mode,
> +		     int compaction_retries)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
> @@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
>  	enum compact_result compact_result;
> +	int compaction_retries = 0;
>  	int no_progress_loops = 0;
>  
>  	/*
> @@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			 compaction_failed(compact_result)))
>  		goto nopage;
>  
> -	/*
> -	 * It can become very expensive to allocate transparent hugepages at
> -	 * fault, so use asynchronous memory compaction for THP unless it is
> -	 * khugepaged trying to collapse.
> -	 */
> -	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> -		migration_mode = MIGRATE_SYNC_LIGHT;
> +	if (order && compaction_made_progress(compact_result))
> +		compaction_retries++;
>  
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 no_progress_loops))
>  		goto retry;
>  
> +	/*
> +	 * It doesn't make any sense to retry for the compaction if the order-0
> +	 * reclaim is not able to make any progress because the current
> +	 * implementation of the compaction depends on the sufficient amount
> +	 * of free memory (see __compaction_suitable)
> +	 */
> +	if (did_some_progress > 0 &&
> +			should_compact_retry(order, compact_result,
> +				&migration_mode, compaction_retries))

Checking did_some_progress on each round has a subtle corner case. Think
about the following situation.

round, compaction, did_some_progress
0, defer, 1
0, defer, 1
0, defer, 1
0, defer, 1
0, defer, 0

In this case, compaction has enough of a chance to succeed since freepages
increase, but compaction will not be triggered.

Thanks.

> +		goto retry;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  
>  noretry:
>  	/*
> -	 * High-order allocations do not necessarily loop after
> -	 * direct reclaim and reclaim/compaction depends on compaction
> -	 * being called after reclaim so call directly if necessary
> +	 * High-order allocations do not necessarily loop after direct reclaim
> +	 * and reclaim/compaction depends on compaction being called after
> +	 * reclaim so call directly if necessary.
> +	 * It can become very expensive to allocate transparent hugepages at
> +	 * fault, so use asynchronous memory compaction for THP unless it is
> +	 * khugepaged trying to collapse. All other requests should tolerate
> +	 * at least light sync migration.
>  	 */
> +	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> +		migration_mode = MIGRATE_ASYNC;
> +	else
> +		migration_mode = MIGRATE_SYNC_LIGHT;
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &compact_result);
> -- 
> 2.8.0.rc3
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
  2016-04-21  8:24   ` Hillf Danton
  2016-04-28  8:59   ` Vlastimil Babka
@ 2016-05-04  6:27   ` Joonsoo Kim
  2016-05-04  9:04     ` Michal Hocko
  2 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04  6:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

On Wed, Apr 20, 2016 at 03:47:27PM -0400, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> "mm: consider compaction feedback also for costly allocation" has
> removed the upper bound for the reclaim/compaction retries based on the
> number of reclaimed pages for costly orders. While this is desirable,
> the patch missed an interaction between reclaim, compaction and the
> retry logic. The direct reclaim tries to get zones over min watermark
> while compaction backs off and returns COMPACT_SKIPPED when all zones
> are below low watermark + 1<<order gap. If we are getting really close
> to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED for a
> high order request (e.g. hugetlb order-9) while the reclaim is not able
> to release enough pages to get us over low watermark. The reclaim is
> still able to make some progress (usually thrashing over the few remaining
> pages) so we are not able to break out from the loop.
> 
> I have seen this happening with the same test described in "mm: consider
> compaction feedback also for costly allocation" on a swapless system.
> The original problem got resolved by "vmscan: consider classzone_idx in
> compaction_ready" but it shows how things might go wrong when we
> approach the oom event horizon.
> 
> The reason why compaction requires being over low rather than min
> watermark is not clear to me. This check was there essentially since
> 56de7263fcf3 ("mm: compaction: direct compact when a high-order
> allocation fails"). It is clearly an implementation detail though and we
> shouldn't pull it into the generic retry logic while we should be able
> to cope with such eventuality. The only place in should_compact_retry
> where we retry without any upper bound is for compaction_withdrawn()
> case.
> 
> Introduce compaction_zonelist_suitable function which checks the given
> zonelist and returns true only if there is at least one zone which would
> unblock __compaction_suitable if more memory got reclaimed. In
> this implementation it checks __compaction_suitable with NR_FREE_PAGES
> plus part of the reclaimable memory as the target for the watermark check.
> The reclaimable memory is reduced linearly by the allocation order. The
> idea is that we do not want to reclaim all the remaining memory for a
> single allocation request just to unblock __compaction_suitable, which
> doesn't guarantee we will make further progress.
> 
> The new helper is then used if compaction_withdrawn() feedback was
> provided so we do not retry if there is no outlook for further
> progress. !costly requests shouldn't be affected much - e.g. order-2
> pages would require at least 64kB on the reclaimable LRUs while
> order-9 would need at least 32M which should be enough to not lock up.
> 
> [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in
> compaction_zonelist_suitable]
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h |  4 ++++
>  include/linux/mmzone.h     |  3 +++
>  mm/compaction.c            | 42 +++++++++++++++++++++++++++++++++++++++---
>  mm/page_alloc.c            | 18 +++++++++++-------
>  4 files changed, 57 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a002ca55c513..7bbdbf729757 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
>  	return false;
>  }
>  
> +
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> +					int alloc_flags);
> +
>  extern int kcompactd_run(int nid);
>  extern void kcompactd_stop(int nid);
>  extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 150c6049f961..0bf13c7cd8cd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
>  extern struct mutex zonelists_mutex;
>  void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
>  void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
> +			unsigned long mark, int classzone_idx, int alloc_flags,
> +			long free_pages);
>  bool zone_watermark_ok(struct zone *z, unsigned int order,
>  		unsigned long mark, int classzone_idx, int alloc_flags);
>  bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e2e487cea5ea..0a7ca578af97 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
>   *   COMPACT_CONTINUE - If compaction should run now
>   */
>  static enum compact_result __compaction_suitable(struct zone *zone, int order,
> -					int alloc_flags, int classzone_idx)
> +					int alloc_flags, int classzone_idx,
> +					unsigned long wmark_target)
>  {
>  	int fragindex;
>  	unsigned long watermark;
> @@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
>  	 * allocated and for a short time, the footprint is higher
>  	 */
>  	watermark += (2UL << order);
> -	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
> +	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
> +				 alloc_flags, wmark_target))
>  		return COMPACT_SKIPPED;
>  
>  	/*
> @@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
>  {
>  	enum compact_result ret;
>  
> -	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
> +	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
> +				    zone_page_state(zone, NR_FREE_PAGES));
>  	trace_mm_compaction_suitable(zone, order, ret);
>  	if (ret == COMPACT_NOT_SUITABLE_ZONE)
>  		ret = COMPACT_SKIPPED;
> @@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
>  	return ret;
>  }
>  
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> +		int alloc_flags)
> +{
> +	struct zone *zone;
> +	struct zoneref *z;
> +
> +	/*
> +	 * Make sure at least one zone would pass __compaction_suitable if we continue
> +	 * retrying the reclaim.
> +	 */
> +	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> +					ac->nodemask) {
> +		unsigned long available;
> +		enum compact_result compact_result;
> +
> +		/*
> +		 * Do not consider all the reclaimable memory because we do not
> +		 * want to trash just for a single high order allocation which
> +		 * is even not guaranteed to appear even if __compaction_suitable
> +		 * is happy about the watermark check.
> +		 */
> +		available = zone_reclaimable_pages(zone) / order;

I can't understand why '/ order' is needed here. Think about a specific
example.

zone_reclaimable_pages = 100 MB
NR_FREE_PAGES = 20 MB
watermark = 40 MB
order = 10

I think that compaction should run in this situation but your logic
doesn't allow it. We should be conservative when guessing, so that we do
not give up on something prematurely.
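
Working those numbers through the proposed check, assuming 4kB pages and
taking the quoted 40 MB as the base watermark: the 2UL << order gap for
order-10 is 2048 pages, i.e. 8 MB, so the target is 48 MB, while
available = 20 MB + 100 MB / 10 = 30 MB, which means __compaction_suitable()
keeps returning COMPACT_SKIPPED and the retry is given up. Without the
'/ order' scaling the check would see 20 MB + 100 MB = 120 MB and keep the
reclaim/compaction retries going.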

> +		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +		compact_result = __compaction_suitable(zone, order, alloc_flags,
> +				ac->classzone_idx, available);

This misses the tracepoint emitted in compaction_suitable().

> +		if (compact_result != COMPACT_SKIPPED &&
> +				compact_result != COMPACT_NOT_SUITABLE_ZONE)

It's undesirable to use COMPACT_NOT_SUITABLE_ZONE here. It is just for
detailed tracepoint output.

> +			return true;
> +	}
> +
> +	return false;
> +}
> +
>  static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
>  {
>  	enum compact_result ret;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5a938f12554..6757d6df2160 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   * one free page of a suitable size. Checking now avoids taking the zone lock
>   * to check in the allocation paths if no pages are free.
>   */
> -static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  			unsigned long mark, int classzone_idx, int alloc_flags,
>  			long free_pages)
>  {
> @@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  }
>  
>  static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> -		     enum migrate_mode *migrate_mode,
> +should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> +		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
>  		     int compaction_retries)
>  {
>  	int max_retries = MAX_COMPACT_RETRIES;
> @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
>  	/*
>  	 * make sure the compaction wasn't deferred or didn't bail out early
>  	 * due to locks contention before we declare that we should give up.
> +	 * But do not retry if the given zonelist is not suitable for
> +	 * compaction.
>  	 */
>  	if (compaction_withdrawn(compact_result))
> -		return true;
> +		return compaction_zonelist_suitable(ac, order, alloc_flags);

I think that compaction_zonelist_suitable() should be checked first.
If compaction_zonelist_suitable() returns false, it's useless to
retry since it means that compaction cannot run even if all reclaimable
pages are reclaimed. The logic should be as follows.

if (!compaction_zonelist_suitable())
        return false;

if (compaction_withdrawn())
        return true;

....

Thanks.

>  
>  	/*
>  	 * !costly requests are much more important than __GFP_REPEAT
> @@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  }
>  
>  static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> +should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
> +		     enum compact_result compact_result,
>  		     enum migrate_mode *migrate_mode,
>  		     int compaction_retries)
>  {
> @@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * of free memory (see __compaction_suitable)
>  	 */
>  	if (did_some_progress > 0 &&
> -			should_compact_retry(order, compact_result,
> -				&migration_mode, compaction_retries))
> +			should_compact_retry(ac, order, alloc_flags,
> +				compact_result, &migration_mode,
> +				compaction_retries))
>  		goto retry;
>  
>  	/* Reclaim has failed us, start killing things */
> -- 
> 2.8.0.rc3
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04  6:01   ` Joonsoo Kim
@ 2016-05-04  6:31     ` Joonsoo Kim
  2016-05-04  8:56       ` Michal Hocko
  2016-05-04  8:53     ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04  6:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML, Michal Hocko

On Wed, May 04, 2016 at 03:01:24PM +0900, Joonsoo Kim wrote:
> On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and there is no guarantee
> > further reclaim/compaction attempts would help but at least make sure
> > that the compaction was active before we go OOM and keep retrying even
> > if should_reclaim_retry tells us to oom if
> > 	- the last compaction round backed off or
> > 	- we haven't completed at least MAX_COMPACT_RETRIES active
> > 	  compaction rounds.
> > 
> > The first rule ensures that the very last attempt for compaction
> > was not ignored while the second guarantees that the compaction has done
> > some work. Multiple retries might be needed to prevent occasional
> > piggy backing of other contexts to steal the compacted pages before
> > the current context manages to retry to allocate them.
> > 
> > compaction_failed() is taken as a final word from the compaction that
> > the retry doesn't make much sense. We have to be careful though because
> > the first compaction round is MIGRATE_ASYNC which is rather weak as it
> > ignores pages under writeback and gives up too easily in other
> > situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> > has been used before we give up. With this logic in place we do not have
> > to increase the migration mode unconditionally and rather do it only if
> > the compaction failed for the weaker mode. A nice side effect is that
> > the stronger migration mode is used only when really needed so this has
> > a potential of smaller latencies in some cases.
> > 
> > Please note that the compaction doesn't tell us much about how
> > successful it was when returning compaction_made_progress so we just
> > have to blindly trust that another retry is worthwhile and cap the
> > number to something reasonable to guarantee convergence.
> > 
> > If the given number of successful retries is not sufficient for
> > reasonable workloads, we should focus on the collected compaction
> > tracepoints data and try to address the issue in the compaction code.
> > If this is not feasible we can increase the retries limit.
> > 
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 77 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3b78936eca70..bb4df1be0d43 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  	return page;
> >  }
> >  
> > +
> > +/*
> > + * Maximum number of compaction retries with progress before the OOM
> > + * killer is considered the only way to move forward.
> > + */
> > +#define MAX_COMPACT_RETRIES 16
> > +
> >  #ifdef CONFIG_COMPACTION
> >  /* Try memory compaction for high-order allocations before reclaim */
> >  static struct page *
> > @@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  
> >  	return NULL;
> >  }
> > +
> > +static inline bool
> > +should_compact_retry(unsigned int order, enum compact_result compact_result,
> > +		     enum migrate_mode *migrate_mode,
> > +		     int compaction_retries)
> > +{
> > +	if (!order)
> > +		return false;
> > +
> > +	/*
> > +	 * compaction considers all the zone as desperately out of memory
> > +	 * so it doesn't really make much sense to retry except when the
> > +	 * failure could be caused by weak migration mode.
> > +	 */
> > +	if (compaction_failed(compact_result)) {
> 
> IIUC, this compaction_failed() means that at least one zone was
> compacted and failed. This is not the same as your assumption in the
> comment. If compaction is done and failed on ZONE_DMA, it would be
> a premature decision.
> 
> > +		if (*migrate_mode == MIGRATE_ASYNC) {
> > +			*migrate_mode = MIGRATE_SYNC_LIGHT;
> > +			return true;
> > +		}
> > +		return false;
> > +	}
> > +
> > +	/*
> > +	 * !costly allocations are really important and we have to make sure
> > +	 * the compaction wasn't deferred or didn't bail out early due to locks
> > +	 * contention before we go OOM. Still cap the reclaim retry loops with
> > +	 * progress to prevent from looping forever and potential trashing.
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> > +		if (compaction_withdrawn(compact_result))
> > +			return true;
> > +		if (compaction_retries <= MAX_COMPACT_RETRIES)
> > +			return true;
> > +	}
> > +
> > +	return false;
> > +}
> >  #else
> >  static inline struct page *
> >  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > @@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  {
> >  	return NULL;
> >  }
> > +
> > +static inline bool
> > +should_compact_retry(unsigned int order, enum compact_result compact_result,
> > +		     enum migrate_mode *migrate_mode,
> > +		     int compaction_retries)
> > +{
> > +	return false;
> > +}
> >  #endif /* CONFIG_COMPACTION */
> >  
> >  /* Perform direct synchronous page reclaim */
> > @@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	unsigned long did_some_progress;
> >  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> >  	enum compact_result compact_result;
> > +	int compaction_retries = 0;
> >  	int no_progress_loops = 0;
> >  
> >  	/*
> > @@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  			 compaction_failed(compact_result)))
> >  		goto nopage;
> >  
> > -	/*
> > -	 * It can become very expensive to allocate transparent hugepages at
> > -	 * fault, so use asynchronous memory compaction for THP unless it is
> > -	 * khugepaged trying to collapse.
> > -	 */
> > -	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> > -		migration_mode = MIGRATE_SYNC_LIGHT;
> > +	if (order && compaction_made_progress(compact_result))
> > +		compaction_retries++;
> >  
> >  	/* Try direct reclaim and then allocating */
> >  	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> > @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  				 no_progress_loops))
> >  		goto retry;
> >  
> > +	/*
> > +	 * It doesn't make any sense to retry for the compaction if the order-0
> > +	 * reclaim is not able to make any progress because the current
> > +	 * implementation of the compaction depends on the sufficient amount
> > +	 * of free memory (see __compaction_suitable)
> > +	 */
> > +	if (did_some_progress > 0 &&
> > +			should_compact_retry(order, compact_result,
> > +				&migration_mode, compaction_retries))
> 
> Checking did_some_progress on each round has a subtle corner case. Think
> about the following situation.
> 
> round, compaction, did_some_progress
> 0, defer, 1
> 0, defer, 1
> 0, defer, 1
> 0, defer, 1
> 0, defer, 0

Oops... the example should be the one below.

0, defer, 1
1, defer, 1
2, defer, 1
3, defer, 1
4, defer, 0

> 
> In this case, compaction has enough of a chance to succeed since freepages
> increase, but compaction will not be triggered.
> 
> 
> Thanks.
> 
> > +		goto retry;
> > +
> >  	/* Reclaim has failed us, start killing things */
> >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> >  	if (page)
> > @@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  
> >  noretry:
> >  	/*
> > -	 * High-order allocations do not necessarily loop after
> > -	 * direct reclaim and reclaim/compaction depends on compaction
> > -	 * being called after reclaim so call directly if necessary
> > +	 * High-order allocations do not necessarily loop after direct reclaim
> > +	 * and reclaim/compaction depends on compaction being called after
> > +	 * reclaim so call directly if necessary.
> > +	 * It can become very expensive to allocate transparent hugepages at
> > +	 * fault, so use asynchronous memory compaction for THP unless it is
> > +	 * khugepaged trying to collapse. All other requests should tolerate
> > +	 * at least light sync migration.
> >  	 */
> > +	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> > +		migration_mode = MIGRATE_ASYNC;
> > +	else
> > +		migration_mode = MIGRATE_SYNC_LIGHT;
> >  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> >  					    ac, migration_mode,
> >  					    &compact_result);
> > -- 
> > 2.8.0.rc3
> > 
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04  5:45 ` [PATCH 0.14] oom detection rework v6 Joonsoo Kim
@ 2016-05-04  8:12   ` Vlastimil Babka
  2016-05-04  8:32     ` Joonsoo Kim
  2016-05-04  8:50     ` Michal Hocko
  2016-05-04  8:47   ` Michal Hocko
  1 sibling, 2 replies; 60+ messages in thread
From: Vlastimil Babka @ 2016-05-04  8:12 UTC (permalink / raw)
  To: Joonsoo Kim, Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, linux-mm, LKML

On 05/04/2016 07:45 AM, Joonsoo Kim wrote:
> I still don't agree with some parts of this patchset that deal with
> !costly orders. As you know, there were two regression reports from Hugh
> and Aaron and you fixed them by ensuring that compaction is triggered. I
> think that these show the problem of this patchset. The previous kernel
> doesn't need to ensure that compaction is triggered and just works fine in
> any case.

IIRC previous kernel somehow subtly never OOM'd for !costly orders. So 
anything that introduces the possibility of OOM may look like regression 
for some corner case workloads. But I don't think that it's OK to not 
OOM for e.g. kernel stack allocations?

> Your series makes compaction necessary for all. OOM handling
> is an essential part of MM but compaction isn't. OOM handling should not
> depend on compaction. I tested my own benchmark without
> CONFIG_COMPACTION and found that premature OOM happens.
>
> I hope that you try to test something without CONFIG_COMPACTION.

Hmm a valid point, !CONFIG_COMPACTION should be considered. But reclaim 
cannot guarantee forming an order>0 page. But neither does OOM. So would 
you suggest we keep reclaiming without OOM as before, to prevent these 
regressions? Or where to draw the line here?

> Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04  8:12   ` Vlastimil Babka
@ 2016-05-04  8:32     ` Joonsoo Kim
  2016-05-04  8:50     ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04  8:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton, linux-mm,
	LKML

On Wed, May 04, 2016 at 10:12:43AM +0200, Vlastimil Babka wrote:
> On 05/04/2016 07:45 AM, Joonsoo Kim wrote:
> >I still don't agree with some parts of this patchset that deal with
> >!costly orders. As you know, there were two regression reports from Hugh
> >and Aaron and you fixed them by ensuring that compaction is triggered. I
> >think that these show the problem of this patchset. The previous kernel
> >doesn't need to ensure that compaction is triggered and just works fine in
> >any case.
> 
> IIRC previous kernel somehow subtly never OOM'd for !costly orders.

IIRC, it would not OOM in the thrashing case. But it could OOM in other
cases.

> So anything that introduces the possibility of OOM may look like
> regression for some corner case workloads. But I don't think that
> it's OK to not OOM for e.g. kernel stack allocations?

Sorry, the double negation is hard for me to understand since I'm not a
native speaker. So you think that it's OK to OOM for kernel stack allocation?
I think so, too. But I don't want to OOM prematurely.

> >Your series makes compaction necessary for all. OOM handling
> >is an essential part of MM but compaction isn't. OOM handling should not
> >depend on compaction. I tested my own benchmark without
> >CONFIG_COMPACTION and found that premature OOM happens.
> >
> >I hope that you try to test something without CONFIG_COMPACTION.
> 
> Hmm a valid point, !CONFIG_COMPACTION should be considered. But
> reclaim cannot guarantee forming an order>0 page. But neither does
> OOM. So would you suggest we keep reclaiming without OOM as before,
> to prevent these regressions? Or where to draw the line here?

I suggested memorizing the number of reclaimable pages when entering the
allocation slowpath and trying to reclaim at least that amount. Thrashing
is effectively prevented by this algorithm and we don't trigger OOM
prematurely.
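
A rough sketch of that idea, just to illustrate it (the variable names and
placement are made up here, this is not an actual patch):

	/* on entry to __alloc_pages_slowpath(): remember what was reclaimable */
	unsigned long reclaimable_snapshot = 0, total_reclaimed = 0;

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask)
		reclaimable_snapshot += zone_reclaimable_pages(zone);

	/* ... and in the retry loop, after each direct reclaim round ... */
	total_reclaimed += did_some_progress;
	if (!page && total_reclaimed < reclaimable_snapshot)
		goto retry;	/* no OOM until at least that much was reclaimed */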

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04  5:45 ` [PATCH 0.14] oom detection rework v6 Joonsoo Kim
  2016-05-04  8:12   ` Vlastimil Babka
@ 2016-05-04  8:47   ` Michal Hocko
  2016-05-04 14:32     ` Joonsoo Kim
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-04  8:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed 04-05-16 14:45:02, Joonsoo Kim wrote:
> On Wed, Apr 20, 2016 at 03:47:13PM -0400, Michal Hocko wrote:
> > Hi,
> > 
> > This is v6 of the series. The previous version was posted [1]. The
> > code hasn't changed much since then. I have found one old standing
> > bug (patch 1) which just got much more severe and visible with this
> > series. Other than that I have reorganized the series and put the
> > compaction feedback abstraction to the front just in case we find out
> > that parts of the series would have to be reverted later on for some
> > reason. The premature oom killer invocation reported by Hugh [2] seems
> > to be addressed.
> > 
> > We have discussed this series at LSF/MM summit in Raleigh and there
> > didn't seem to be any concerns/objections to go on with the patch set
> > and target it for the next merge window. 
> 
> I still don't agree with some part of this patchset that deal with
> !costly order. As you know, there was two regression reports from Hugh
> and Aaron and you fixed them by ensuring to trigger compaction. I
> think that these show the problem of this patchset. Previous kernel
> doesn't need to ensure to trigger compaction and just works fine in
> any case. Your series make compaction necessary for all. OOM handling
> is essential part in MM but compaction isn't. OOM handling should not
> depend on compaction. I tested my own benchmark without
> CONFIG_COMPACTION and found that premature OOM happens.

High order allocations without compaction are basically a lost game. You
can wait an unbounded amount of time and still have no guarantee of any
progress. What is the usual reason to disable compaction in the first
place?

Anyway if this is _really_ a big issue then we can do something like the
following to emulate the previous behavior. We are losing the
determinism but if you really think that the !COMPACTION workloads have
already reconciled themselves to it I can live with that.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e7e26c5d3ba..f48b9e9b1869 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 		     enum migrate_mode *migrate_mode,
 		     int compaction_retries)
 {
+	struct zone *zone;
+	struct zoneref *z;
+
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return false;
+
+	/*
+	 * There are setups with compaction disabled which would prefer to loop
+	 * inside the allocator rather than hit the oom killer prematurely. Let's
+	 * give them a good hope and keep retrying while the order-0 watermarks
+	 * are OK.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+					ac->nodemask) {
+		if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
+					ac->high_zoneidx, alloc_flags))
+			return true;
+	}
 	return false;
 }
 #endif /* CONFIG_COMPACTION */
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04  8:12   ` Vlastimil Babka
  2016-05-04  8:32     ` Joonsoo Kim
@ 2016-05-04  8:50     ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-04  8:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton, linux-mm,
	LKML

On Wed 04-05-16 10:12:43, Vlastimil Babka wrote:
> On 05/04/2016 07:45 AM, Joonsoo Kim wrote:
> >I still don't agree with some part of this patchset that deal with
> >!costly order. As you know, there was two regression reports from Hugh
> >and Aaron and you fixed them by ensuring to trigger compaction. I
> >think that these show the problem of this patchset. Previous kernel
> >doesn't need to ensure to trigger compaction and just works fine in
> >any case.
> 
> IIRC previous kernel somehow subtly never OOM'd for !costly orders. So
> anything that introduces the possibility of OOM may look like regression for
> some corner case workloads.

The bug fixed by this series was COMPACTION specific because
compaction_ready is not considered otherwise.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04  6:01   ` Joonsoo Kim
  2016-05-04  6:31     ` Joonsoo Kim
@ 2016-05-04  8:53     ` Michal Hocko
  2016-05-04 14:39       ` Joonsoo Kim
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-04  8:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed 04-05-16 15:01:24, Joonsoo Kim wrote:
> On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
[...]

Please try to trim your responses; it makes it much easier to follow the
discussion.

> > +static inline bool
> > +should_compact_retry(unsigned int order, enum compact_result compact_result,
> > +		     enum migrate_mode *migrate_mode,
> > +		     int compaction_retries)
> > +{
> > +	if (!order)
> > +		return false;
> > +
> > +	/*
> > +	 * compaction considers all the zone as desperately out of memory
> > +	 * so it doesn't really make much sense to retry except when the
> > +	 * failure could be caused by weak migration mode.
> > +	 */
> > +	if (compaction_failed(compact_result)) {
> 
> IIUC, this compaction_failed() means that at least one zone is
> compacted and failed. This is not same with your assumption in the
> comment. If compaction is done and failed on ZONE_DMA, it would be
> premature decision.

Not really, because if other zones are making some progress then their
result will override COMPACT_COMPLETE

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04  6:31     ` Joonsoo Kim
@ 2016-05-04  8:56       ` Michal Hocko
  2016-05-04 14:57         ` Joonsoo Kim
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-04  8:56 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed 04-05-16 15:31:12, Joonsoo Kim wrote:
> On Wed, May 04, 2016 at 03:01:24PM +0900, Joonsoo Kim wrote:
> > On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
[...]
> > > @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  				 no_progress_loops))
> > >  		goto retry;
> > >  
> > > +	/*
> > > +	 * It doesn't make any sense to retry for the compaction if the order-0
> > > +	 * reclaim is not able to make any progress because the current
> > > +	 * implementation of the compaction depends on the sufficient amount
> > > +	 * of free memory (see __compaction_suitable)
> > > +	 */
> > > +	if (did_some_progress > 0 &&
> > > +			should_compact_retry(order, compact_result,
> > > +				&migration_mode, compaction_retries))
> > 
> > Checking did_some_progress on each round have subtle corner case. Think
> > about following situation.
> > 
> > round, compaction, did_some_progress, compaction
> > 0, defer, 1
> > 0, defer, 1
> > 0, defer, 1
> > 0, defer, 1
> > 0, defer, 0
> 
> Oops...Example should be below one.
> 
> 0, defer, 1
> 1, defer, 1
> 2, defer, 1
> 3, defer, 1
> 4, defer, 0

I am not sure I understand. The point of the check is that if the
reclaim doesn't make _any_ progress then checking the result of the
compaction after it didn't lead to a successful allocation just doesn't
make any sense. If the compaction deferred all the time then we have a
bug in the compaction. Vlastimil is already working on code which
should make the compaction more ready for !costly requests but that is a
separate topic IMO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-05-04  6:27   ` Joonsoo Kim
@ 2016-05-04  9:04     ` Michal Hocko
  2016-05-04 15:14       ` Joonsoo Kim
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-04  9:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed 04-05-16 15:27:48, Joonsoo Kim wrote:
> On Wed, Apr 20, 2016 at 03:47:27PM -0400, Michal Hocko wrote:
[...]
> > +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> > +		int alloc_flags)
> > +{
> > +	struct zone *zone;
> > +	struct zoneref *z;
> > +
> > +	/*
> > +	 * Make sure at least one zone would pass __compaction_suitable if we continue
> > +	 * retrying the reclaim.
> > +	 */
> > +	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> > +					ac->nodemask) {
> > +		unsigned long available;
> > +		enum compact_result compact_result;
> > +
> > +		/*
> > +		 * Do not consider all the reclaimable memory because we do not
> > +		 * want to trash just for a single high order allocation which
> > +		 * is even not guaranteed to appear even if __compaction_suitable
> > +		 * is happy about the watermark check.
> > +		 */
> > +		available = zone_reclaimable_pages(zone) / order;
> 
> I can't understand why '/ order' is needed here. Think about specific
> example.
> 
> zone_reclaimable_pages = 100 MB
> NR_FREE_PAGES = 20 MB
> watermark = 40 MB
> order = 10
> 
> I think that compaction should run in this situation and your logic
> doesn't. We should be conservative when guessing not to do something
> prematurely.

I do not mind changing this. But pushing really hard on reclaim for
order-10 pages doesn't sound like a good idea. So we should somehow
reduce the target. I am open for any better suggestions.

> > +		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > +		compact_result = __compaction_suitable(zone, order, alloc_flags,
> > +				ac->classzone_idx, available);
> 
> It misses tracepoint in compaction_suitable().

Why do you think the tracepoint would be useful? I have considered it more
confusing than helpful so I have intentionally not added it.

> 
> > +		if (compact_result != COMPACT_SKIPPED &&
> > +				compact_result != COMPACT_NOT_SUITABLE_ZONE)
> 
> It's undesirable to use COMPACT_NOT_SUITABLE_ZONE here. It is just for
> detailed tracepoint output.

Well, this is compaction code so I considered it acceptable. If you
consider it a big deal I can extract a wrapper and hide this detail.

[...]

> > @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
> >  	/*
> >  	 * make sure the compaction wasn't deferred or didn't bail out early
> >  	 * due to locks contention before we declare that we should give up.
> > +	 * But do not retry if the given zonelist is not suitable for
> > +	 * compaction.
> >  	 */
> >  	if (compaction_withdrawn(compact_result))
> > -		return true;
> > +		return compaction_zonelist_suitable(ac, order, alloc_flags);
> 
> I think that compaction_zonelist_suitable() should be checked first.
> If compaction_zonelist_suitable() returns false, it's useless to
> retry since it means that compaction cannot run if all reclaimable
> pages are reclaimed. Logic should be as following.
> 
> if (!compaction_zonelist_suitable())
>         return false;
> 
> if (compaction_withdrawn())
>         return true;

That is certainly an option as well. The logic above is that I really
wanted to have a terminal condition for the case when compaction can
return compaction_withdrawn forever. Normally we are bound by a
number of successful reclaim rounds. Before we go and change that I
would like to see where it makes a real difference though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready
  2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
  2016-04-21  3:32   ` Hillf Danton
@ 2016-05-04 13:56   ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-04 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Joonsoo Kim, Hillf Danton, Vlastimil Babka,
	linux-mm, LKML

On Wed 20-04-16 15:47:14, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> while playing with the oom detection rework [1] I have noticed
> that my heavy order-9 (hugetlb) load close to OOM ended up in an
> endless loop where the reclaim hasn't made any progress but
> did_some_progress didn't reflect that and compaction_suitable
> was backing off because no zone is above low wmark + 1 << order.
> 
> It turned out that this is in fact an old standing bug in compaction_ready
> which ignores the requested_highidx and did the watermark check for
> 0 classzone_idx. This succeeds for zone DMA most of the time as the zone
> is mostly unused because of lowmem protection.

so far so good

>  This also means that the
> OOM killer wouldn't be triggered for higher order requests even when
> there is no reclaim progress and we essentially rely on order-0 request
> to find this out. This has been broken in one way or another since
> fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
> are sufficient free pages available") but only since 7335084d446b ("mm:
> vmscan: do not OOM if aborting reclaim to start compaction") we are not
> invoking the OOM killer based on the wrong calculation.

but now that I was looking at the code again I realize I have missed one
important thing:
shrink_zones()
                        if (IS_ENABLED(CONFIG_COMPACTION) &&
                            sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                            zonelist_zone_idx(z) <= requested_highidx &&
                            compaction_ready(zone, sc->order, requested_highidx)) {
                                sc->compaction_ready = true;
                                continue;
                        }

so the whole argument about OOM is bogus because this whole thing is
done only for costly requests.

So the bug has not been that serious before and it started to matter
only after the oom detection rework (especially after patch 13) where we
really need even costly allocations to not lie about the progress.

Andrew, could you update the changelog to the following please?
"
while playing with the oom detection rework [1] I have noticed that my
heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
where the reclaim hasn't made any progress but did_some_progress didn't
reflect that and compaction_suitable was backing off because no zone is
above low wmark + 1 << order.

It turned out that this is in fact an old standing bug in compaction_ready
which ignores the requested_highidx and did the watermark check for
0 classzone_idx. This succeeds for zone DMA most of the time as the zone
is mostly unused because of lowmem protection. As a result costly high
order allocations always report successful progress even when there
was none. This wasn't a problem so far because these allocations usually
fail quite early or retry only a few times with __GFP_REPEAT but this will
change after a later patch in this series so make sure to not lie about
the progress and propagate requested_highidx down to compaction_ready
and use it for both the watermark check and compaction_suitable to fix
this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
"

Thanks and sorry for the confusion!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04  8:47   ` Michal Hocko
@ 2016-05-04 14:32     ` Joonsoo Kim
  2016-05-04 18:16       ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04 14:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 04-05-16 14:45:02, Joonsoo Kim wrote:
>> On Wed, Apr 20, 2016 at 03:47:13PM -0400, Michal Hocko wrote:
>> > Hi,
>> >
>> > This is v6 of the series. The previous version was posted [1]. The
>> > code hasn't changed much since then. I have found one old standing
>> > bug (patch 1) which just got much more severe and visible with this
>> > series. Other than that I have reorganized the series and put the
>> > compaction feedback abstraction to the front just in case we find out
>> > that parts of the series would have to be reverted later on for some
>> > reason. The premature oom killer invocation reported by Hugh [2] seems
>> > to be addressed.
>> >
>> > We have discussed this series at LSF/MM summit in Raleigh and there
>> > didn't seem to be any concerns/objections to go on with the patch set
>> > and target it for the next merge window.
>>
>> I still don't agree with some part of this patchset that deal with
>> !costly order. As you know, there was two regression reports from Hugh
>> and Aaron and you fixed them by ensuring to trigger compaction. I
>> think that these show the problem of this patchset. Previous kernel
>> doesn't need to ensure to trigger compaction and just works fine in
>> any case. Your series make compaction necessary for all. OOM handling
>> is essential part in MM but compaction isn't. OOM handling should not
>> depend on compaction. I tested my own benchmark without
>> CONFIG_COMPACTION and found that premature OOM happens.
>
> High order allocations without compaction are basically a lost game. You

I don't think that order 1 or 2 allocations have big trouble without compaction.
They can be satisfied by the buddy algorithm, which keeps high order freepages
intact as long as possible.

> can wait unbounded amount of time and still have no guarantee of any

I know that there is no guarantee. But that doesn't mean it's better to
give up early. Since OOM could cause serious problems, if there is
reclaimable memory, we need to reclaim all of it at least once,
hoping for a high order page, before triggering OOM. Optimizing
this situation by incomplete guessing is a dangerous idea.

> progress. What is the usual reason to disable compaction in the first
> place?

I don't disable it. But who knows who disables compaction? It has *not*
been a long time since CONFIG_COMPACTION became enabled by default. Maybe 3 years?

> Anyway if this is _really_ a big issue then we can do something like the
> following to emulate the previous behavior. We are losing the
> determinism but if you really thing that the !COMPACTION workloads
> already reconcile with it I can live with that.
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e7e26c5d3ba..f48b9e9b1869 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>                      enum migrate_mode *migrate_mode,
>                      int compaction_retries)
>  {
> +       struct zone *zone;
> +       struct zoneref *z;
> +
> +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> +               return false;
> +
> +       /*
> +        * There are setups with compaction disabled which would prefer to loop
> +        * inside the allocator rather than hit the oom killer prematurely. Let's
> +        * give them a good hope and keep retrying while the order-0 watermarks
> +        * are OK.
> +        */
> +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> +                                       ac->nodemask) {
> +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> +                                       ac->high_zoneidx, alloc_flags))
> +                       return true;
> +       }
>         return false;

I hope that this kind of logic is added to should_reclaim_retry() so that
it is applied in any setup. should_compact_retry() should not become a
fundamental criterion for determining OOM. What compaction does can change
in the future and it's undesirable for such a change to greatly affect the
OOM condition.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04  8:53     ` Michal Hocko
@ 2016-05-04 14:39       ` Joonsoo Kim
  2016-05-04 18:20         ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04 14:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

2016-05-04 17:53 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 04-05-16 15:01:24, Joonsoo Kim wrote:
>> On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> [...]
>
> Please try to trim your responses it makes it much easier to follow the
> discussion

Okay.

>> > +static inline bool
>> > +should_compact_retry(unsigned int order, enum compact_result compact_result,
>> > +                enum migrate_mode *migrate_mode,
>> > +                int compaction_retries)
>> > +{
>> > +   if (!order)
>> > +           return false;
>> > +
>> > +   /*
>> > +    * compaction considers all the zone as desperately out of memory
>> > +    * so it doesn't really make much sense to retry except when the
>> > +    * failure could be caused by weak migration mode.
>> > +    */
>> > +   if (compaction_failed(compact_result)) {
>>
>> IIUC, this compaction_failed() means that at least one zone is
>> compacted and failed. This is not same with your assumption in the
>> comment. If compaction is done and failed on ZONE_DMA, it would be
>> premature decision.
>
> Not really, because if other zones are making some progress then their
> result will override COMPACT_COMPLETE

Think about the situation where the DMA zone fails to compact and
the other zones are deferred or skipped. In this case, COMPACT_COMPLETE
will be returned as the final result and should_compact_retry() returns false.
I don't think that means all the zones are desperately out of memory.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04  8:56       ` Michal Hocko
@ 2016-05-04 14:57         ` Joonsoo Kim
  2016-05-04 18:19           ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04 14:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

2016-05-04 17:56 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 04-05-16 15:31:12, Joonsoo Kim wrote:
>> On Wed, May 04, 2016 at 03:01:24PM +0900, Joonsoo Kim wrote:
>> > On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> [...]
>> > > @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > >                            no_progress_loops))
>> > >           goto retry;
>> > >
>> > > + /*
>> > > +  * It doesn't make any sense to retry for the compaction if the order-0
>> > > +  * reclaim is not able to make any progress because the current
>> > > +  * implementation of the compaction depends on the sufficient amount
>> > > +  * of free memory (see __compaction_suitable)
>> > > +  */
>> > > + if (did_some_progress > 0 &&
>> > > +                 should_compact_retry(order, compact_result,
>> > > +                         &migration_mode, compaction_retries))
>> >
>> > Checking did_some_progress on each round have subtle corner case. Think
>> > about following situation.
>> >
>> > round, compaction, did_some_progress, compaction
>> > 0, defer, 1
>> > 0, defer, 1
>> > 0, defer, 1
>> > 0, defer, 1
>> > 0, defer, 0
>>
>> Oops...Example should be below one.
>>
>> 0, defer, 1
>> 1, defer, 1
>> 2, defer, 1
>> 3, defer, 1
>> 4, defer, 0
>
> I am not sure I understand. The point of the check is that if the
> reclaim doesn't make _any_ progress then checking the result of the
> compaction after it didn't lead to a successful allocation just doesn't
> make any sense.

Even if this round (#4) doesn't reclaim any pages, previous rounds
(#0, #1, #2, #3) could have reclaimed enough pages for a future
compaction attempt to succeed.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-05-04  9:04     ` Michal Hocko
@ 2016-05-04 15:14       ` Joonsoo Kim
  2016-05-04 19:22         ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-04 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

2016-05-04 18:04 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 04-05-16 15:27:48, Joonsoo Kim wrote:
>> On Wed, Apr 20, 2016 at 03:47:27PM -0400, Michal Hocko wrote:
> [...]
>> > +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
>> > +           int alloc_flags)
>> > +{
>> > +   struct zone *zone;
>> > +   struct zoneref *z;
>> > +
>> > +   /*
>> > +    * Make sure at least one zone would pass __compaction_suitable if we continue
>> > +    * retrying the reclaim.
>> > +    */
>> > +   for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
>> > +                                   ac->nodemask) {
>> > +           unsigned long available;
>> > +           enum compact_result compact_result;
>> > +
>> > +           /*
>> > +            * Do not consider all the reclaimable memory because we do not
>> > +            * want to trash just for a single high order allocation which
>> > +            * is even not guaranteed to appear even if __compaction_suitable
>> > +            * is happy about the watermark check.
>> > +            */
>> > +           available = zone_reclaimable_pages(zone) / order;
>>
>> I can't understand why '/ order' is needed here. Think about specific
>> example.
>>
>> zone_reclaimable_pages = 100 MB
>> NR_FREE_PAGES = 20 MB
>> watermark = 40 MB
>> order = 10
>>
>> I think that compaction should run in this situation and your logic
>> doesn't. We should be conservative when guessing not to do something
>> prematurely.
>
> I do not mind changing this. But pushing really hard on reclaim for
> order-10 pages doesn't sound like a good idea. So we should somehow
> reduce the target. I am open for any better suggestions.

If the situation is changed to order-2, it doesn't look good, either.
I think that some reduction of zone_reclaimable_pages() is needed since
it's not possible to free all of them in certain cases. But reduction by order
doesn't make any sense. If we need to consider order to guess the probability of
compaction, it should be considered in __compaction_suitable() instead of
as a reduction here.

I think that the following calculation, which is used in should_reclaim_retry(),
would be good here.

available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES)

Any thought?
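
Just to illustrate with made-up numbers (MAX_RECLAIM_RETRIES is 16 in the
series, the starting amount of reclaimable memory is hypothetical), that
reduction scales the target down linearly with no_progress_loops instead of
dividing by the allocation order:

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long reclaimable = 100 * 256;	/* ~100MB in 4kB pages */
	int no_progress_loops;

	for (no_progress_loops = 0; no_progress_loops <= MAX_RECLAIM_RETRIES;
	     no_progress_loops += 4) {
		unsigned long available = reclaimable;

		available -= DIV_ROUND_UP(no_progress_loops * available,
					  MAX_RECLAIM_RETRIES);
		printf("no_progress_loops=%2d -> available=%lu pages\n",
		       no_progress_loops, available);
	}
	return 0;
}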

>> > +           available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>> > +           compact_result = __compaction_suitable(zone, order, alloc_flags,
>> > +                           ac->classzone_idx, available);
>>
>> It misses tracepoint in compaction_suitable().
>
> Why do you think the check would be useful. I have considered it more
> confusing than halpful to I have intentionally not added it.

What confusion do you have in mind?
If we try to analyze an OOM, we need to know why should_compact_retry()
returned false, and a tracepoint here could be helpful.

>>
>> > +           if (compact_result != COMPACT_SKIPPED &&
>> > +                           compact_result != COMPACT_NOT_SUITABLE_ZONE)
>>
>> It's undesirable to use COMPACT_NOT_SUITABLE_ZONE here. It is just for
>> detailed tracepoint output.
>
> Well this is a compaction code so I considered it acceptable. If you
> consider it a big deal I can extract a wrapper and hide this detail.

It is not a big deal.

> [...]
>
>> > @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
>> >     /*
>> >      * make sure the compaction wasn't deferred or didn't bail out early
>> >      * due to locks contention before we declare that we should give up.
>> > +    * But do not retry if the given zonelist is not suitable for
>> > +    * compaction.
>> >      */
>> >     if (compaction_withdrawn(compact_result))
>> > -           return true;
>> > +           return compaction_zonelist_suitable(ac, order, alloc_flags);
>>
>> I think that compaction_zonelist_suitable() should be checked first.
>> If compaction_zonelist_suitable() returns false, it's useless to
>> retry since it means that compaction cannot run if all reclaimable
>> pages are reclaimed. Logic should be as following.
>>
>> if (!compaction_zonelist_suitable())
>>         return false;
>>
>> if (compaction_withdrawn())
>>         return true;
>
> That is certainly an option as well. The logic above is that I really
> wanted to have a terminal condition when compaction can return
> compaction_withdrawn for ever basically. Normally we are bound by a
> number of successful reclaim rounds. Before we go an change there I
> would like to see where it makes real change though.

It would not make a real difference because the !compaction_withdrawn() and
!compaction_zonelist_suitable() case doesn't happen easily.

But the change makes the code more understandable so it's worth doing, IMO.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04 14:32     ` Joonsoo Kim
@ 2016-05-04 18:16       ` Michal Hocko
  2016-05-10  6:41         ` Joonsoo Kim
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-04 18:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 04-05-16 14:45:02, Joonsoo Kim wrote:
> >> On Wed, Apr 20, 2016 at 03:47:13PM -0400, Michal Hocko wrote:
> >> > Hi,
> >> >
> >> > This is v6 of the series. The previous version was posted [1]. The
> >> > code hasn't changed much since then. I have found one old standing
> >> > bug (patch 1) which just got much more severe and visible with this
> >> > series. Other than that I have reorganized the series and put the
> >> > compaction feedback abstraction to the front just in case we find out
> >> > that parts of the series would have to be reverted later on for some
> >> > reason. The premature oom killer invocation reported by Hugh [2] seems
> >> > to be addressed.
> >> >
> >> > We have discussed this series at LSF/MM summit in Raleigh and there
> >> > didn't seem to be any concerns/objections to go on with the patch set
> >> > and target it for the next merge window.
> >>
> >> I still don't agree with some part of this patchset that deal with
> >> !costly order. As you know, there was two regression reports from Hugh
> >> and Aaron and you fixed them by ensuring to trigger compaction. I
> >> think that these show the problem of this patchset. Previous kernel
> >> doesn't need to ensure to trigger compaction and just works fine in
> >> any case. Your series make compaction necessary for all. OOM handling
> >> is essential part in MM but compaction isn't. OOM handling should not
> >> depend on compaction. I tested my own benchmark without
> >> CONFIG_COMPACTION and found that premature OOM happens.
> >
> > High order allocations without compaction are basically a lost game. You
> 
> I don't think that order 1 or 2 allocation has a big trouble without compaction.
> They can be made by buddy algorithm that keeps high order freepages
> as long as possible.
> 
> > can wait unbounded amount of time and still have no guarantee of any
> 
> I know that it has no guarantee. But, it doesn't mean that it's better to
> give up early. Since OOM could causes serious problem, if there is
> reclaimable memory, we need to reclaim all of them at least once
> with praying for high order page before triggering OOM. Optimizing
> this situation by incomplete guessing is a dangerous idea.
> 
> > progress. What is the usual reason to disable compaction in the first
> > place?
> 
> I don't disable it. But, who knows who disable compaction? It's been *not*
> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?

I would really like to hear about a real life usecase before we go and
cripple otherwise deterministic algorithms. It might very well be
possible that those configurations simply do not have problems with high
order allocations because they are too specific.

> > Anyway if this is _really_ a big issue then we can do something like the
> > following to emulate the previous behavior. We are losing the
> > determinism but if you really thing that the !COMPACTION workloads
> > already reconcile with it I can live with that.
> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e7e26c5d3ba..f48b9e9b1869 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> >                      enum migrate_mode *migrate_mode,
> >                      int compaction_retries)
> >  {
> > +       struct zone *zone;
> > +       struct zoneref *z;
> > +
> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +               return false;
> > +
> > +       /*
> > +        * There are setups with compaction disabled which would prefer to loop
> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
> > +        * give them a good hope and keep retrying while the order-0 watermarks
> > +        * are OK.
> > +        */
> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> > +                                       ac->nodemask) {
> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > +                                       ac->high_zoneidx, alloc_flags))
> > +                       return true;
> > +       }
> >         return false;
> 
> I hope that this kind of logic is added to should_reclaim_retry() so
> that this logic is
> applied in any setup. should_compact_retry() should not become a fundamental
> criteria to determine OOM. What compaction does can be changed in the future
> and it's undesirable that it's change affects OOM condition greatly.

I disagree. High order allocations relying on reclaim are a bad idea
because there is no guarantee that reclaiming more memory leads to
success. This is the whole idea of the oom detection rework. So the
whole point of should_reclaim_retry is to get over watermarks while
should_compact_retry is about retrying when high order allocations might
make progress. I really hate to tweak this for a configuration which
relies on pure luck. So if we really need to do something
non-deterministic then the !COMPACTION should_compact_retry is the place where
it should be done.

If you are able to reproduce premature OOMs with !COMPACTION then I
would really appreciate it if you could test with this patch so that I can
prepare a full patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04 14:57         ` Joonsoo Kim
@ 2016-05-04 18:19           ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-04 18:19 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

On Wed 04-05-16 23:57:50, Joonsoo Kim wrote:
> 2016-05-04 17:56 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 04-05-16 15:31:12, Joonsoo Kim wrote:
> >> On Wed, May 04, 2016 at 03:01:24PM +0900, Joonsoo Kim wrote:
> >> > On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> > [...]
> >> > > @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > >                            no_progress_loops))
> >> > >           goto retry;
> >> > >
> >> > > + /*
> >> > > +  * It doesn't make any sense to retry for the compaction if the order-0
> >> > > +  * reclaim is not able to make any progress because the current
> >> > > +  * implementation of the compaction depends on the sufficient amount
> >> > > +  * of free memory (see __compaction_suitable)
> >> > > +  */
> >> > > + if (did_some_progress > 0 &&
> >> > > +                 should_compact_retry(order, compact_result,
> >> > > +                         &migration_mode, compaction_retries))
> >> >
> >> > Checking did_some_progress on each round have subtle corner case. Think
> >> > about following situation.
> >> >
> >> > round, compaction, did_some_progress, compaction
> >> > 0, defer, 1
> >> > 0, defer, 1
> >> > 0, defer, 1
> >> > 0, defer, 1
> >> > 0, defer, 0
> >>
> >> Oops...Example should be below one.
> >>
> >> 0, defer, 1
> >> 1, defer, 1
> >> 2, defer, 1
> >> 3, defer, 1
> >> 4, defer, 0
> >
> > I am not sure I understand. The point of the check is that if the
> > reclaim doesn't make _any_ progress then checking the result of the
> > compaction after it didn't lead to a successful allocation just doesn't
> > make any sense.
> 
> Even if this round (#4) doesn't reclaim any pages, previous rounds
> (#0, #1, #2, #3) would reclaim enough pages to succeed future
> compaction attempt.

Then the compaction shouldn't back off and I would consider it a
compaction bug. I haven't seen this happening though. Vlastimil is
already working on patches which would simply guarantee that really
important allocations will not defer.

So unless I can see an example of a real issue with this I think it is
just a theoretical issue which shouldn't block the patch as is.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 12/14] mm, oom: protect !costly allocations some more
  2016-05-04 14:39       ` Joonsoo Kim
@ 2016-05-04 18:20         ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-04 18:20 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

On Wed 04-05-16 23:39:14, Joonsoo Kim wrote:
> 2016-05-04 17:53 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 04-05-16 15:01:24, Joonsoo Kim wrote:
> >> On Wed, Apr 20, 2016 at 03:47:25PM -0400, Michal Hocko wrote:
> > [...]
> >
> > Please try to trim your responses it makes it much easier to follow the
> > discussion
> 
> Okay.
> 
> >> > +static inline bool
> >> > +should_compact_retry(unsigned int order, enum compact_result compact_result,
> >> > +                enum migrate_mode *migrate_mode,
> >> > +                int compaction_retries)
> >> > +{
> >> > +   if (!order)
> >> > +           return false;
> >> > +
> >> > +   /*
> >> > +    * compaction considers all the zone as desperately out of memory
> >> > +    * so it doesn't really make much sense to retry except when the
> >> > +    * failure could be caused by weak migration mode.
> >> > +    */
> >> > +   if (compaction_failed(compact_result)) {
> >>
> >> IIUC, this compaction_failed() means that at least one zone is
> >> compacted and failed. This is not same with your assumption in the
> >> comment. If compaction is done and failed on ZONE_DMA, it would be
> >> premature decision.
> >
> > Not really, because if other zones are making some progress then their
> > result will override COMPACT_COMPLETE
> 
> Think about the situation that DMA zone fails to compact and
> the other zones are deferred or skipped. In this case, COMPACT_COMPLETE
> will be returned as a final result and should_compact_retry() return false.
> I don't think that it means all the zones are desperately out of memory.

But that would mean that ZONE_DMA would be eligible for compaction,
no? And considering the watermark check this zone should report COMPACT_SKIPPED
for most allocation requests. Or am I missing something?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
  2016-05-04 15:14       ` Joonsoo Kim
@ 2016-05-04 19:22         ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-04 19:22 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

On Thu 05-05-16 00:14:51, Joonsoo Kim wrote:
> 2016-05-04 18:04 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 04-05-16 15:27:48, Joonsoo Kim wrote:
> >> On Wed, Apr 20, 2016 at 03:47:27PM -0400, Michal Hocko wrote:
> > [...]
> >> > +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> >> > +           int alloc_flags)
> >> > +{
> >> > +   struct zone *zone;
> >> > +   struct zoneref *z;
> >> > +
> >> > +   /*
> >> > +    * Make sure at least one zone would pass __compaction_suitable if we continue
> >> > +    * retrying the reclaim.
> >> > +    */
> >> > +   for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> >> > +                                   ac->nodemask) {
> >> > +           unsigned long available;
> >> > +           enum compact_result compact_result;
> >> > +
> >> > +           /*
> >> > +            * Do not consider all the reclaimable memory because we do not
> >> > +            * want to trash just for a single high order allocation which
> >> > +            * is even not guaranteed to appear even if __compaction_suitable
> >> > +            * is happy about the watermark check.
> >> > +            */
> >> > +           available = zone_reclaimable_pages(zone) / order;
> >>
> >> I can't understand why '/ order' is needed here. Think about specific
> >> example.
> >>
> >> zone_reclaimable_pages = 100 MB
> >> NR_FREE_PAGES = 20 MB
> >> watermark = 40 MB
> >> order = 10
> >>
> >> I think that compaction should run in this situation and your logic
> >> doesn't. We should be conservative when guessing not to do something
> >> prematurely.
> >
> > I do not mind changing this. But pushing really hard on reclaim for
> > order-10 pages doesn't sound like a good idea. So we should somehow
> > reduce the target. I am open for any better suggestions.
> 
> If the situation is changed to order-2, it doesn't look good, either.

Why not? If we are not able to get over the compaction_suitable watermark
check even after we consider half of the reclaimable memory then we are really
close to oom. This will trigger only when the reclaimable LRUs are
really _tiny_. We are (very roughly) talking about:
low_wmark + 2<<order >= NR_FREE_PAGES + reclaimable/order - 1<<order
where low_wmark would be close to NR_FREE_PAGES so in the end we are asking
for order * 3<<order >= reclaimable and that sounds quite conservative
to me. Originally I wanted a much more aggressive back off.
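
Plugging hypothetical numbers into that bound (4kB pages, low_wmark roughly
equal to NR_FREE_PAGES as assumed above), the back off for the !costly orders
only fires once the reclaimable LRUs shrink to a few dozen pages:

#include <stdio.h>

int main(void)
{
	unsigned int order;

	for (order = 1; order <= 3; order++) {
		/* order * 3<<order from the inequality above */
		unsigned long bound = order * (3UL << order);

		printf("order-%u: retry stops only once reclaimable < ~%lu pages (%lu kB)\n",
		       order, bound, bound * 4);
	}
	return 0;
}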

> I think that some reduction on zone_reclaimable_page() are needed since
> it's not possible to free all of them in certain case. But, reduction by order
> doesn't make any sense. if we need to consider order to guess probability of
> compaction, it should be considered in __compaction_suitable() instead of
> reduction from here.

I do agree that a more clever algorithm would be better and I also agree
that __compaction_suitable would be a better place for such an algorithm.
I just wanted to have something simple first, more as a
safety net to stop endless retries (this has proven to work before I
found the real culprit compaction_ready patch). A more rigorous approach
would require a much deeper analysis of what the actual compaction capacity
of the reclaimable memory really is. This is quite a hard problem and I
am not really convinced it is really needed.

> I think that following code that is used in should_reclaim_retry() would be
> good for here.
> 
> available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES)
> 
> Any thought?

I would prefer not to mix reclaim retry logic in here. Moreover it can
be argued that this is kind of arbitrary as well because it has no
relevance to the compaction capacity of the reclaimable memory. If I
have to choose then I would rather go with the simpler calculation than
something that is complex and that we are not even sure works any better.

> >> > +           available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> >> > +           compact_result = __compaction_suitable(zone, order, alloc_flags,
> >> > +                           ac->classzone_idx, available);
> >>
> >> It misses tracepoint in compaction_suitable().
> >
> > Why do you think the check would be useful. I have considered it more
> > confusing than halpful to I have intentionally not added it.
> 
> What confusing do you have in mind?
> If we try to analyze OOM, we need to know why should_compact_retry()
> return false and and tracepoint here could be helpful.

Because then you could easily confuse compaction_suitable calls made for the
compaction decisions with those made for the allocation retries. This code
path definitely deserves a specific tracepoint and I plan to prepare one
along with others in the allocation path.

[...]
> >> > @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
> >> >     /*
> >> >      * make sure the compaction wasn't deferred or didn't bail out early
> >> >      * due to locks contention before we declare that we should give up.
> >> > +    * But do not retry if the given zonelist is not suitable for
> >> > +    * compaction.
> >> >      */
> >> >     if (compaction_withdrawn(compact_result))
> >> > -           return true;
> >> > +           return compaction_zonelist_suitable(ac, order, alloc_flags);
> >>
> >> I think that compaction_zonelist_suitable() should be checked first.
> >> If compaction_zonelist_suitable() returns false, it's useless to
> >> retry since it means that compaction cannot run if all reclaimable
> >> pages are reclaimed. Logic should be as following.
> >>
> >> if (!compaction_zonelist_suitable())
> >>         return false;
> >>
> >> if (compaction_withdrawn())
> >>         return true;
> >
> > That is certainly an option as well. The logic above is that I really
> > wanted to have a terminal condition when compaction can return
> > compaction_withdrawn for ever basically. Normally we are bound by a
> > number of successful reclaim rounds. Before we go an change there I
> > would like to see where it makes real change though.
> 
> It would not make real change because !compaction_withdrawn() and
> !compaction_zonelist_suitable() case doesn't happen easily.
> 
> But, change makes code more understandable so it's worth doing, IMO.

I dunno. I might be really biased here but I consider the current
ordering more appropriate for the reasons described above. It acts as a
terminal condition for a potentially endless compaction_withdrawn() rather
than as a terminal condition on its own. Anyway, I am not really sure this
is something crucial; or do you consider this particular part really
important? I would prefer not to sneak in last minute changes before the
upcoming merge window just for readability, which is even non-trivial.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-04 18:16       ` Michal Hocko
@ 2016-05-10  6:41         ` Joonsoo Kim
  2016-05-10  7:09           ` Vlastimil Babka
  2016-05-10  9:43           ` Michal Hocko
  0 siblings, 2 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-10  6:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

2016-05-05 3:16 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
>> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 04-05-16 14:45:02, Joonsoo Kim wrote:
>> >> On Wed, Apr 20, 2016 at 03:47:13PM -0400, Michal Hocko wrote:
>> >> > Hi,
>> >> >
>> >> > This is v6 of the series. The previous version was posted [1]. The
>> >> > code hasn't changed much since then. I have found one old standing
>> >> > bug (patch 1) which just got much more severe and visible with this
>> >> > series. Other than that I have reorganized the series and put the
>> >> > compaction feedback abstraction to the front just in case we find out
>> >> > that parts of the series would have to be reverted later on for some
>> >> > reason. The premature oom killer invocation reported by Hugh [2] seems
>> >> > to be addressed.
>> >> >
>> >> > We have discussed this series at LSF/MM summit in Raleigh and there
>> >> > didn't seem to be any concerns/objections to go on with the patch set
>> >> > and target it for the next merge window.
>> >>
>> >> I still don't agree with some part of this patchset that deal with
>> >> !costly order. As you know, there was two regression reports from Hugh
>> >> and Aaron and you fixed them by ensuring to trigger compaction. I
>> >> think that these show the problem of this patchset. Previous kernel
>> >> doesn't need to ensure to trigger compaction and just works fine in
>> >> any case. Your series make compaction necessary for all. OOM handling
>> >> is essential part in MM but compaction isn't. OOM handling should not
>> >> depend on compaction. I tested my own benchmark without
>> >> CONFIG_COMPACTION and found that premature OOM happens.
>> >
>> > High order allocations without compaction are basically a lost game. You
>>
>> I don't think that order 1 or 2 allocation has a big trouble without compaction.
>> They can be made by buddy algorithm that keeps high order freepages
>> as long as possible.
>>
>> > can wait unbounded amount of time and still have no guarantee of any
>>
>> I know that it has no guarantee. But, it doesn't mean that it's better to
>> give up early. Since OOM could causes serious problem, if there is
>> reclaimable memory, we need to reclaim all of them at least once
>> with praying for high order page before triggering OOM. Optimizing
>> this situation by incomplete guessing is a dangerous idea.
>>
>> > progress. What is the usual reason to disable compaction in the first
>> > place?
>>
>> I don't disable it. But, who knows who disable compaction? It's been *not*
>> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?
>
> I would really like to hear about real life usecase before we go and
> cripple otherwise deterministic algorithms. It might be very well
> possible that those configurations simply do not have problems with high
> order allocations because they are too specific.
>
>> > Anyway if this is _really_ a big issue then we can do something like the
>> > following to emulate the previous behavior. We are losing the
>> > determinism but if you really thing that the !COMPACTION workloads
>> > already reconcile with it I can live with that.
>> > ---
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 2e7e26c5d3ba..f48b9e9b1869 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>> >                      enum migrate_mode *migrate_mode,
>> >                      int compaction_retries)
>> >  {
>> > +       struct zone *zone;
>> > +       struct zoneref *z;
>> > +
>> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
>> > +               return false;
>> > +
>> > +       /*
>> > +        * There are setups with compaction disabled which would prefer to loop
>> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
>> > +        * give them a good hope and keep retrying while the order-0 watermarks
>> > +        * are OK.
>> > +        */
>> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
>> > +                                       ac->nodemask) {
>> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
>> > +                                       ac->high_zoneidx, alloc_flags))
>> > +                       return true;
>> > +       }
>> >         return false;
>>
>> I hope that this kind of logic is added to should_reclaim_retry() so
>> that this logic is
>> applied in any setup. should_compact_retry() should not become a fundamental
>> criteria to determine OOM. What compaction does can be changed in the future
>> and it's undesirable that it's change affects OOM condition greatly.
>
> I disagree. High order allocations relying on the reclaim is a bad idea
> because there is no guarantee that reclaiming more memory leads to the
> success. This is the whole idea of the oom detection rework. So the
> whole point of should_reclaim_retry is to get over watermarks while
> should_compact_retry is about retrying when high order allocations might
> make a progress. I really hate to tweak this for a configuration which
> relies on the pure luck. So if we really need to do something
> undeterministic then !COMPACTION should_compact_retry is the place where
> it should be done.
>
> If you are able to reproduce pre mature OOMs with !COMPACTION then I
> would really appreciate if you could test with this patch so that I can
> prepare a full patch.

My benchmark is too specific so I made another one. It does very
simple things.

1) Run the system with 256 MB memory and 2 GB swap
2) Run a memory-hogger which takes 256 MB of anonymous memory
3) Make 1000 new processes by fork (this will take 16 MB of order-2 pages)

You can do it yourself with the above instructions.
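
For reference, a rough userspace sketch of steps 2) and 3) could look like the
following (hypothetical code, not the actual benchmark; step 1) is just the
machine and swap configuration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define HOG_BYTES	(256UL << 20)	/* ~256 MB of anonymous memory */
#define NR_CHILDREN	1000

int main(void)
{
	char *hog = malloc(HOG_BYTES);
	int i;

	if (!hog)
		return 1;
	/* dirty the whole allocation so it really occupies memory/swap */
	memset(hog, 0xaa, HOG_BYTES);

	for (i = 0; i < NR_CHILDREN; i++) {
		pid_t pid = fork();	/* needs the order-2 pages mentioned in 3) */

		if (pid == 0) {
			pause();	/* child just sits there */
			_exit(0);
		} else if (pid < 0) {
			perror("fork");
			break;
		}
	}
	pause();	/* keep the memory pressure in place */
	return 0;
}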

On the current upstream kernel without CONFIG_COMPACTION, OOM doesn't happen.
On the next-20160509 kernel without CONFIG_COMPACTION, OOM happens when
roughly *500* processes have been forked.

With CONFIG_COMPACTION, OOM doesn't happen on any kernel.

Other kernels don't trigger OOM even if I make 10000 new processes.

This example is very intuitive and reasonable; I don't think it's artificial.
There is enough swap space, so OOM should not happen.

This failure shows that a fundamental assumption of your patch is
wrong. You trigger OOM even when there is enough reclaimable memory but
no high order freepage, based on the fact that we can't guarantee
that reclaiming that memory will produce a high order page. Yes, we
can't guarantee it, but we also don't know whether it is possible or
not. We should not stop reclaiming until this estimation is proven.
Otherwise, it would be a premature OOM.

You applied a band-aid for CONFIG_COMPACTION and fixed some reported
problems, but it is also fragile. Assume almost all pageblocks' skip bits are
set. In this case, compaction easily returns COMPACT_COMPLETE and your
logic will stop retrying. Compaction isn't designed to report the accurate
fragmentation state of the system, so depending on its return value
for OOM is fragile.

Please fix your fundamental assumption and don't add band-aids using
compaction.

I have said the same thing again and again and still can't convince you.
I'm not sure what more I can do.

Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-10  6:41         ` Joonsoo Kim
@ 2016-05-10  7:09           ` Vlastimil Babka
  2016-05-10  8:00             ` Joonsoo Kim
  2016-05-10  9:43           ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:09 UTC (permalink / raw)
  To: Joonsoo Kim, Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Linux Memory Management List, LKML

On 05/10/2016 08:41 AM, Joonsoo Kim wrote:
> You applied band-aid for CONFIG_COMPACTION and fixed some reported
> problem but it is also fragile. Assume almost pageblock's skipbit are
> set. In this case, compaction easily returns COMPACT_COMPLETE and your
> logic will stop retry. Compaction isn't designed to report accurate
> fragmentation state of the system so depending on it's return value
> for OOM is fragile.

Guess I'll just post an RFC now, even though it's not much tested...

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0.14] oom detection rework v6
  2016-05-10  7:09           ` Vlastimil Babka
@ 2016-05-10  8:00             ` Joonsoo Kim
  2016-05-10  9:44               ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-10  8:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, Linux Memory Management List, LKML

2016-05-10 16:09 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 05/10/2016 08:41 AM, Joonsoo Kim wrote:
>>
>> You applied band-aid for CONFIG_COMPACTION and fixed some reported
>> problem but it is also fragile. Assume almost pageblock's skipbit are
>> set. In this case, compaction easily returns COMPACT_COMPLETE and your
>> logic will stop retry. Compaction isn't designed to report accurate
>> fragmentation state of the system so depending on it's return value
>> for OOM is fragile.
>
>
> Guess I'll just post a RFC now, even though it's not much tested...

I will look at it later. But I'd like to say something first.
Even if compaction returned a more accurate fragmentation state, it's
not a good idea to depend on compaction's result to decide on OOM. We
have pages which are reclaimable but not migratable, and depending on
compaction's result cannot deal with that case.

For example, assume that all of the system memory is filled with THP
pages or reclaimable slab pages. They cannot be migrated, but we can
reclaim them.

Thanks.


* Re: [PATCH 0.14] oom detection rework v6
  2016-05-10  6:41         ` Joonsoo Kim
  2016-05-10  7:09           ` Vlastimil Babka
@ 2016-05-10  9:43           ` Michal Hocko
  2016-05-12  2:23             ` Joonsoo Kim
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-05-10  9:43 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	Vlastimil Babka, Linux Memory Management List, LKML

On Tue 10-05-16 15:41:04, Joonsoo Kim wrote:
> 2016-05-05 3:16 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
> >> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
[...]
> >> > progress. What is the usual reason to disable compaction in the first
> >> > place?
> >>
> >> I don't disable it. But, who knows who disable compaction? It's been *not*
> >> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?
> >
> > I would really like to hear about real life usecase before we go and
> > cripple otherwise deterministic algorithms. It might be very well
> > possible that those configurations simply do not have problems with high
> > order allocations because they are too specific.

Sorry for insisting but I would really like to hear some answer for
this, please.

[...]
> >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> > index 2e7e26c5d3ba..f48b9e9b1869 100644
> >> > --- a/mm/page_alloc.c
> >> > +++ b/mm/page_alloc.c
> >> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> >> >                      enum migrate_mode *migrate_mode,
> >> >                      int compaction_retries)
> >> >  {
> >> > +       struct zone *zone;
> >> > +       struct zoneref *z;
> >> > +
> >> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> >> > +               return false;
> >> > +
> >> > +       /*
> >> > +        * There are setups with compaction disabled which would prefer to loop
> >> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
> >> > +        * give them a good hope and keep retrying while the order-0 watermarks
> >> > +        * are OK.
> >> > +        */
> >> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> >> > +                                       ac->nodemask) {
> >> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> >> > +                                       ac->high_zoneidx, alloc_flags))
> >> > +                       return true;
> >> > +       }
> >> >         return false;
[...]
> My benchmark is too specific so I make another one. It does very
> simple things.
> 
> 1) Run the system with 256 MB memory and 2 GB swap
> 2) Run memory-hogger which takes (anonymous memory) 256 MB
> 3) Make 1000 new processes by fork (It will take 16 MB order-2 pages)
> 
> You can do it yourself with above instructions.
> 
> On current upstream kernel without CONFIG_COMPACTION, OOM doesn't happen.
> On next-20160509 kernel without CONFIG_COMPACTION, OOM happens when
> roughly *500* processes forked.
> 
> With CONFIG_COMPACTION, OOM doesn't happen on any kernel.

Did the patch I posted help?

> Other kernels doesn't trigger OOM even if I make 10000 new processes.

Is this a usual load on !CONFIG_COMPACTION configurations?

> This example is very intuitive and reasonable. I think that it's not
> artificial.  It has enough swap space so OOM should not happen.

I am not really convinced this is actually true. You can have an
arbitrary amount of swap space, yet it still won't help you, because
more reclaimed memory simply doesn't imply more contiguous memory. This
is a fundamental problem. So I think that relying on a !CONFIG_COMPACTION
configuration for heavy fork (or other high-order) loads simply never
works reliably.

> This failure shows that fundamental assumption of your patch is
> wrong. You triggers OOM even if there is enough reclaimable memory but
> no high order freepage depending on the fact that we can't guarantee
> that we can make high order page with reclaiming these reclaimable
> memory. Yes, we can't guarantee it but we also doesn't know if it
> can be possible or not. We should not stop reclaim until this
> estimation is is proved. Otherwise, it would be premature OOM.

We've been through this before and you keep repeating this argument.
I have tried to explain that deterministic behavior is more reasonable
than random retry loops which pretty much depend on timing and which
can hugely over-reclaim, which might be even worse than an OOM killer
invocation targeting a single process.

I do agree that relying solely on compaction is not the right way, but
combining the two (reclaim & compaction) should work reasonably well in
practice. The only regression I have heard of so far resulted from the
lack of compaction feedback.

> You applied band-aid for CONFIG_COMPACTION and fixed some reported
> problem but it is also fragile. Assume almost pageblock's skipbit are
> set. In this case, compaction easily returns COMPACT_COMPLETE and your
> logic will stop retry. Compaction isn't designed to report accurate
> fragmentation state of the system so depending on it's return value
> for OOM is fragile.

Which is a deficiency of compaction, and one which is being worked on,
as Vlastimil already said. Even with that deficiency, I am not able to
trigger a premature OOM, so it sounds more like a theoretical than a
real issue. I am convinced that deeper surgery into compaction is
really due, as it has mostly been designed for the THP case, completely
ignoring !costly allocations.

> Please fix your fundamental assumption and don't add band-aid using
> compaction.

I do not consider the compaction feedback design a "band-aid". There
is no other reliable source of high-order pages than compaction.

> I said same thing again and again and I can't convince you until now.
> I'm not sure what I can do more.

Yes, and yet I haven't seen from you any real-life cases where this
feedback mechanism doesn't work. You keep claiming that more reclaiming
_might_ be useful, without any grounds for that statement. Even when
more reclaim would help to survive a particular case, we have to weigh
the pros and cons of the over-reclaim and potential thrashing, which is
sometimes worse than an OOM killer (staring at a machine you can ping
but cannot even log in to...).

Considering that we are in clear disagreement on the compaction aspect,
I think we need others to either back your concern, or you to show a
clear justification why compaction feedback is not a viable way long
term, even after we make further changes which make it less THP oriented.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0.14] oom detection rework v6
  2016-05-10  8:00             ` Joonsoo Kim
@ 2016-05-10  9:44               ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-10  9:44 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, Linux Memory Management List, LKML

On Tue 10-05-16 17:00:08, Joonsoo Kim wrote:
> 2016-05-10 16:09 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> > On 05/10/2016 08:41 AM, Joonsoo Kim wrote:
> >>
> >> You applied band-aid for CONFIG_COMPACTION and fixed some reported
> >> problem but it is also fragile. Assume almost pageblock's skipbit are
> >> set. In this case, compaction easily returns COMPACT_COMPLETE and your
> >> logic will stop retry. Compaction isn't designed to report accurate
> >> fragmentation state of the system so depending on it's return value
> >> for OOM is fragile.
> >
> >
> > Guess I'll just post a RFC now, even though it's not much tested...
> 
> I will look at it later. But, I'd like to say something first.
> Even if compaction returns more accurate fragmentation states, it's not a good
> idea to depend on compaction's result to decide OOM. We have reclaimable but
> not migratable pages. Depending on compaction's result cannot deal
> with this case.
> 
> For example, please assume that all of the system memory are filled
> with THP pages
> or reclaimable slab pages. They cannot be migrated but we can reclaim them.

Direct reclaim should split those THP pages or shrink those slabs. And
we make sure to reclaim before we consider the compaction feedback's
final call to fail. If this is the vast majority of memory, we should
hit it pretty reliably AFAICS.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0.14] oom detection rework v6
  2016-05-10  9:43           ` Michal Hocko
@ 2016-05-12  2:23             ` Joonsoo Kim
  2016-05-12  5:19               ` Joonsoo Kim
  2016-05-12 10:59               ` Michal Hocko
  0 siblings, 2 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-12  2:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	Linux Memory Management List, LKML

On Tue, May 10, 2016 at 11:43:48AM +0200, Michal Hocko wrote:
> On Tue 10-05-16 15:41:04, Joonsoo Kim wrote:
> > 2016-05-05 3:16 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > > On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
> > >> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> [...]
> > >> > progress. What is the usual reason to disable compaction in the first
> > >> > place?
> > >>
> > >> I don't disable it. But, who knows who disable compaction? It's been *not*
> > >> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?
> > >
> > > I would really like to hear about real life usecase before we go and
> > > cripple otherwise deterministic algorithms. It might be very well
> > > possible that those configurations simply do not have problems with high
> > > order allocations because they are too specific.
> 
> Sorry for insisting but I would really like to hear some answer for
> this, please.

I don't know. Who knows? How can you be sure about that? And I don't
like the fixup below. Theoretically, it could retry forever.

> 
> [...]
> > >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >> > index 2e7e26c5d3ba..f48b9e9b1869 100644
> > >> > --- a/mm/page_alloc.c
> > >> > +++ b/mm/page_alloc.c
> > >> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > >> >                      enum migrate_mode *migrate_mode,
> > >> >                      int compaction_retries)
> > >> >  {
> > >> > +       struct zone *zone;
> > >> > +       struct zoneref *z;
> > >> > +
> > >> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> > >> > +               return false;
> > >> > +
> > >> > +       /*
> > >> > +        * There are setups with compaction disabled which would prefer to loop
> > >> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
> > >> > +        * give them a good hope and keep retrying while the order-0 watermarks
> > >> > +        * are OK.
> > >> > +        */
> > >> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> > >> > +                                       ac->nodemask) {
> > >> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > >> > +                                       ac->high_zoneidx, alloc_flags))
> > >> > +                       return true;
> > >> > +       }
> > >> >         return false;
> [...]
> > My benchmark is too specific so I make another one. It does very
> > simple things.
> > 
> > 1) Run the system with 256 MB memory and 2 GB swap
> > 2) Run memory-hogger which takes (anonymous memory) 256 MB
> > 3) Make 1000 new processes by fork (It will take 16 MB order-2 pages)
> > 
> > You can do it yourself with above instructions.
> > 
> > On current upstream kernel without CONFIG_COMPACTION, OOM doesn't happen.
> > On next-20160509 kernel without CONFIG_COMPACTION, OOM happens when
> > roughly *500* processes forked.
> > 
> > With CONFIG_COMPACTION, OOM doesn't happen on any kernel.
> 
> Does the patch I have posted helped?

I guess it will help, but please test it yourself. It's simple.

> > Other kernels doesn't trigger OOM even if I make 10000 new processes.
> 
> Is this an usual load on !CONFIG_COMPACTION configurations?

I don't know. User-space developers don't care about the kernel
configuration, and forking 500 times when memory is full doesn't look
like a corner case to me.

> > This example is very intuitive and reasonable. I think that it's not
> > artificial.  It has enough swap space so OOM should not happen.
> 
> I am not really convinced this is true actually. You can have an
> arbitrary amount of the swap space yet it still won't help you
> because more reclaimed memory simply doesn't imply a more continuous
> memory. This is a fundamental problem. So I think that relying on
> !CONFIG_COMPACTION for heavy fork (or other high order) loads simply
> never works reliably.

I think you don't understand how powerful reclaim and compaction are.
On a system with a large swap device, whatever compaction can do is
also possible for reclaim. Reclaim can do more.

Think about the following example.

_: free
U: used (unmovable)
M: used (migratable and reclaimable)

_MUU _U_U MMMM MMMM

With compaction (assuming a theoretically optimal algorithm), only a
3-page contiguous region can be made, like the following:

MMUU MUMU ___M MMMM

With reclaim, we can make an 8-page contiguous region:

__UU _U_U ____ ____

Reclaim is easily affected by thrashing, but it is fundamentally more
powerful than compaction.
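
To make the arithmetic of this example explicit, a toy user-space sketch
(an illustration of the reasoning above, not kernel code; it assumes an
ideal migration algorithm that can move any M page into any free slot)
would be:

#include <stdio.h>

int main(void)
{
	const char *map = "_MUU _U_U MMMM MMMM";
	int free_pages = 0, run = 0, longest = 0;
	int i;

	for (i = 0; map[i]; i++) {
		if (map[i] == ' ')
			continue;	/* spaces are just visual grouping */
		if (map[i] == '_')
			free_pages++;
		if (map[i] == 'U')
			run = 0;	/* an unmovable page breaks the run */
		else if (++run > longest)
			longest = run;
	}

	/* compaction can only shuffle M pages into the existing free slots */
	printf("ideal compaction: %d contiguous free pages\n",
	       longest < free_pages ? longest : free_pages);
	printf("reclaim:          %d contiguous free pages\n", longest);
	return 0;
}

This prints 3 for compaction and 8 for reclaim, matching the layouts
above.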

Moreover, there are pages which are reclaimable but not migratable,
and those weaken the power of compaction.

> > This failure shows that fundamental assumption of your patch is
> > wrong. You triggers OOM even if there is enough reclaimable memory but
> > no high order freepage depending on the fact that we can't guarantee
> > that we can make high order page with reclaiming these reclaimable
> > memory. Yes, we can't guarantee it but we also doesn't know if it
> > can be possible or not. We should not stop reclaim until this
> > estimation is is proved. Otherwise, it would be premature OOM.
> 
> We've been through this before and you keep repeating this argument. 
> I have tried to explain that a deterministic behavior is more reasonable
> than a random retry loops which pretty much depends on timing and which
> can hugely over-reclaim which might be even worse than an OOM killer
> invocation which would target a single process.

I didn't say that deterministic behavior is less reasonable. I like
it. What I am insisting is that your criterion for the deterministic
behavior is wrong; please use a different criterion. That's what I
want.

> I do agree that relying solely on the compaction is not the right way
> but combining the two (reclaim & compaction) should work reasonably well
> in practice. The only regression I have heard so far resulted from the
> lack of compaction feedback.

I agree that combining is needed. But the base criterion doesn't look
reasonable to me.

> > You applied band-aid for CONFIG_COMPACTION and fixed some reported
> > problem but it is also fragile. Assume almost pageblock's skipbit are
> > set. In this case, compaction easily returns COMPACT_COMPLETE and your
> > logic will stop retry. Compaction isn't designed to report accurate
> > fragmentation state of the system so depending on it's return value
> > for OOM is fragile.
> 
> Which is a deficiency of compaction. And the one which is worked on as
> already said by Vlastimil. Even with that deficiency, I am not able
> to trigger pre-mature OOM so it sounds more theoretical than a real
> issue. I am convinced that deeper surgery into compaction is really due
> as it has been mostly designed for THP case completely ignoring !costly
> allocations.
> 
> > Please fix your fundamental assumption and don't add band-aid using
> > compaction.
> 
> I do not consider compaction feedback design as a "band-aid". There is
> no other reliable source of high order pages except for compaction.
> 
> > I said same thing again and again and I can't convince you until now.
> > I'm not sure what I can do more.
> 
> Yes and yet I haven't seen any real life cases where this feedback
> mechanism doesn't work from you. You keep claiming that more reclaiming
> _might_ be useful without any grounds for that statement. Even when the
> more reclaim would help to survive a particular case we have to weigh
> pros and cons of the over reclaim and potential trashing which is worse
> than an OOM killer sometimes (staring at your machine you can ping but
> you cannot even log in...).

I didn't say that your OOM rework is totally wrong. I just said that
you should fix the !costly order case, since your criterion doesn't
make sense for it.

> 
> Considering that we are in a clear disagreement in the compaction aspect
> I think we need others to either back your concern or you show a clear
> justification why compaction feedback is not viable way longterm even
> after we make further changes which would make it less THP oriented.

I don't understand why I need to convince you. Conventionally, the
patch author needs to convince the reviewer. Anyway, the example above
should be helpful to understand the limitation of compaction.

If the explanation in this reply doesn't convince you either, I won't
insist further. Discussing this topic any more would not be productive
for us.

Thanks.


* Re: [PATCH 0.14] oom detection rework v6
  2016-05-12  2:23             ` Joonsoo Kim
@ 2016-05-12  5:19               ` Joonsoo Kim
  2016-05-12 10:59               ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Joonsoo Kim @ 2016-05-12  5:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	Linux Memory Management List, LKML

On Thu, May 12, 2016 at 11:23:34AM +0900, Joonsoo Kim wrote:
> On Tue, May 10, 2016 at 11:43:48AM +0200, Michal Hocko wrote:
> > On Tue 10-05-16 15:41:04, Joonsoo Kim wrote:
> > > 2016-05-05 3:16 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > > > On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
> > > >> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > [...]
> > > >> > progress. What is the usual reason to disable compaction in the first
> > > >> > place?
> > > >>
> > > >> I don't disable it. But, who knows who disable compaction? It's been *not*
> > > >> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?
> > > >
> > > > I would really like to hear about real life usecase before we go and
> > > > cripple otherwise deterministic algorithms. It might be very well
> > > > possible that those configurations simply do not have problems with high
> > > > order allocations because they are too specific.
> > 
> > Sorry for insisting but I would really like to hear some answer for
> > this, please.
> 
> I don't know. Who knows? How you can make sure that? And, I don't like
> below fixup. Theoretically, it could retry forever.
> 
> > 
> > [...]
> > > >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > >> > index 2e7e26c5d3ba..f48b9e9b1869 100644
> > > >> > --- a/mm/page_alloc.c
> > > >> > +++ b/mm/page_alloc.c
> > > >> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > > >> >                      enum migrate_mode *migrate_mode,
> > > >> >                      int compaction_retries)
> > > >> >  {
> > > >> > +       struct zone *zone;
> > > >> > +       struct zoneref *z;
> > > >> > +
> > > >> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> > > >> > +               return false;
> > > >> > +
> > > >> > +       /*
> > > >> > +        * There are setups with compaction disabled which would prefer to loop
> > > >> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
> > > >> > +        * give them a good hope and keep retrying while the order-0 watermarks
> > > >> > +        * are OK.
> > > >> > +        */
> > > >> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> > > >> > +                                       ac->nodemask) {
> > > >> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > > >> > +                                       ac->high_zoneidx, alloc_flags))
> > > >> > +                       return true;
> > > >> > +       }
> > > >> >         return false;
> > [...]
> > > My benchmark is too specific so I make another one. It does very
> > > simple things.
> > > 
> > > 1) Run the system with 256 MB memory and 2 GB swap
> > > 2) Run memory-hogger which takes (anonymous memory) 256 MB
> > > 3) Make 1000 new processes by fork (It will take 16 MB order-2 pages)
> > > 
> > > You can do it yourself with above instructions.
> > > 
> > > On current upstream kernel without CONFIG_COMPACTION, OOM doesn't happen.
> > > On next-20160509 kernel without CONFIG_COMPACTION, OOM happens when
> > > roughly *500* processes forked.
> > > 
> > > With CONFIG_COMPACTION, OOM doesn't happen on any kernel.
> > 
> > Does the patch I have posted helped?
> 
> I guess that it will help but please do it by yourself. It's simple.
> 
> > > Other kernels doesn't trigger OOM even if I make 10000 new processes.
> > 
> > Is this an usual load on !CONFIG_COMPACTION configurations?
> 
> I don't know. User-space developer doesn't take care about kernel
> configuration and it seems that fork 500 times when memory is full is
> not a corner case to me.
> 
> > > This example is very intuitive and reasonable. I think that it's not
> > > artificial.  It has enough swap space so OOM should not happen.
> > 
> > I am not really convinced this is true actually. You can have an
> > arbitrary amount of the swap space yet it still won't help you
> > because more reclaimed memory simply doesn't imply a more continuous
> > memory. This is a fundamental problem. So I think that relying on
> > !CONFIG_COMPACTION for heavy fork (or other high order) loads simply
> > never works reliably.
> 
> I think that you don't understand how powerful the reclaim and
> compaction are. In the system with large disk swap, what compaction can do
> is also possible for reclaim. Reclaim can do more.
> 
> Think about following examples.
> 
> _: free
> U: used(unmovable)
> M: used(migratable and reclaimable)
> 
> _MUU _U_U MMMM MMMM
> 
> With compaction (assume theoretically best algorithm),
> just 3 contiguous region can be made like as following:
> 
> MMUU MUMU ___M MMMM
> 
> With reclaim, we can make 8 contiguous region.
> 
> __UU _U_U ____ ____
> 
> Reclaim can be easily affected by thrashing but it is fundamentally
> more powerful than compaction.

Hmm... I used a wrong example here, because if there are enough free
pages, compaction with a theoretically optimal algorithm can make a big
enough contiguous region. We ensure that by the watermark check, so
there is no problem like the above. Anyway, you can see that reclaim
has at least enough power to make a high-order page. That's what I'd
like to express with this example.

> 
> Even, there are not migratable but reclaimable pages and it could weak
> power of the compaction.

This is still true and is one of the limitations of compaction.

Thanks.


* Re: [PATCH 0.14] oom detection rework v6
  2016-05-12  2:23             ` Joonsoo Kim
  2016-05-12  5:19               ` Joonsoo Kim
@ 2016-05-12 10:59               ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-05-12 10:59 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, Vlastimil Babka,
	Linux Memory Management List, LKML

On Thu 12-05-16 11:23:34, Joonsoo Kim wrote:
> On Tue, May 10, 2016 at 11:43:48AM +0200, Michal Hocko wrote:
> > On Tue 10-05-16 15:41:04, Joonsoo Kim wrote:
> > > 2016-05-05 3:16 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > > > On Wed 04-05-16 23:32:31, Joonsoo Kim wrote:
> > > >> 2016-05-04 17:47 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > [...]
> > > >> > progress. What is the usual reason to disable compaction in the first
> > > >> > place?
> > > >>
> > > >> I don't disable it. But, who knows who disable compaction? It's been *not*
> > > >> a long time that CONFIG_COMPACTION is default enable. Maybe, 3 years?
> > > >
> > > > I would really like to hear about real life usecase before we go and
> > > > cripple otherwise deterministic algorithms. It might be very well
> > > > possible that those configurations simply do not have problems with high
> > > > order allocations because they are too specific.
> > 
> > Sorry for insisting but I would really like to hear some answer for
> > this, please.
> 
> I don't know. Who knows? How you can make sure that?

This is pretty much a corner-case configuration. I would assume that
somebody who wants to save memory by disabling such an important
feature for high-order allocations would have very specific workloads.

> And, I don't like below fixup. Theoretically, it could retry forever.

Sure, it can retry forever if we are constantly over the watermark and
reclaim makes progress. This is the primary thing I hate about the
current implementation, and the follow-up fix reintroduces that
behavior for the !COMPACTION case. It will OOM as soon as there is no
reclaim progress or none of the available zones passes the watermark
check, so there shouldn't be any regressions.

> > [...]
> > > >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > >> > index 2e7e26c5d3ba..f48b9e9b1869 100644
> > > >> > --- a/mm/page_alloc.c
> > > >> > +++ b/mm/page_alloc.c
> > > >> > @@ -3319,6 +3319,24 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
> > > >> >                      enum migrate_mode *migrate_mode,
> > > >> >                      int compaction_retries)
> > > >> >  {
> > > >> > +       struct zone *zone;
> > > >> > +       struct zoneref *z;
> > > >> > +
> > > >> > +       if (order > PAGE_ALLOC_COSTLY_ORDER)
> > > >> > +               return false;
> > > >> > +
> > > >> > +       /*
> > > >> > +        * There are setups with compaction disabled which would prefer to loop
> > > >> > +        * inside the allocator rather than hit the oom killer prematurely. Let's
> > > >> > +        * give them a good hope and keep retrying while the order-0 watermarks
> > > >> > +        * are OK.
> > > >> > +        */
> > > >> > +       for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> > > >> > +                                       ac->nodemask) {
> > > >> > +               if(zone_watermark_ok(zone, 0, min_wmark_pages(zone),
> > > >> > +                                       ac->high_zoneidx, alloc_flags))
> > > >> > +                       return true;
> > > >> > +       }
> > > >> >         return false;
> > [...]
> > > My benchmark is too specific so I make another one. It does very
> > > simple things.
> > > 
> > > 1) Run the system with 256 MB memory and 2 GB swap
> > > 2) Run memory-hogger which takes (anonymous memory) 256 MB
> > > 3) Make 1000 new processes by fork (It will take 16 MB order-2 pages)
> > > 
> > > You can do it yourself with above instructions.
> > > 
> > > On current upstream kernel without CONFIG_COMPACTION, OOM doesn't happen.
> > > On next-20160509 kernel without CONFIG_COMPACTION, OOM happens when
> > > roughly *500* processes forked.
> > > 
> > > With CONFIG_COMPACTION, OOM doesn't happen on any kernel.
> > 
> > Does the patch I have posted helped?
> 
> I guess that it will help but please do it by yourself. It's simple.

Fair enough. I have prepared a similar setup (a virtual machine with
2 CPUs, 256M RAM, 2G of swap space, CONFIG_COMPACTION disabled and the
current mmotm tree). mem_eater does a MAP_POPULATE of 512MB of private
anonymous memory, and then I start an aggressive fork test which forks
short-lived children (which exit after a short <1s random timeout) as
quickly as possible while making sure there are always 1000 children
running, all of it racing with the mem_eater. This is the test I was
originally using to test the oom rework with COMPACTION enabled.
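
A rough sketch of the fork load (an illustration only, not the exact
program used; the <1s child timeout and the 1000-children target are
the parameters mentioned above):

#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_CHILDREN 1000

int main(void)
{
	int running = 0;

	srandom(getpid());
	for (;;) {			/* runs until interrupted */
		/* reap children which have already exited */
		while (waitpid(-1, NULL, WNOHANG) > 0)
			running--;

		/* top up to NR_CHILDREN live children */
		while (running < NR_CHILDREN) {
			pid_t pid = fork();

			if (pid == 0) {
				/* child exits after a short (<1s) random timeout */
				usleep(random() % 1000000);
				_exit(0);
			}
			if (pid < 0)
				break;	/* fork failed, retry next round */
			running++;
		}
		usleep(10000);		/* do not busy-loop when saturated */
	}
	return 0;
}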

This triggered OOM for order-2 allocation requests. With the patch
applied the test survived.
             total       used       free     shared    buffers     cached
Mem:        232572     228748       3824          0       1164       2480
-/+ buffers/cache:     225104       7468
Swap:      2097148     348320    1748828
Node 0, zone      DMA
  pages free     282
        min      33
        low      41
--
Node 0, zone    DMA32
  pages free     610
        min      441
        low      551
nr_children:1000
^CCreated 11494416 children

I will post the patch shortly.

[...]

> I think that you don't understand how powerful the reclaim and
> compaction are. In the system with large disk swap, what compaction can do
> is also possible for reclaim. Reclaim can do more.
> 
> Think about following examples.
> 
> _: free
> U: used(unmovable)
> M: used(migratable and reclaimable)
> 
> _MUU _U_U MMMM MMMM
> 
> With compaction (assume theoretically best algorithm),
> just 3 contiguous region can be made like as following:
> 
> MMUU MUMU ___M MMMM
> 
> With reclaim, we can make 8 contiguous region.
> 
> __UU _U_U ____ ____
> 
> Reclaim can be easily affected by thrashing but it is fundamentally
> more powerful than compaction.

OK, it seems I was ambiguous in my previous statements, sorry about
that. Of course reclaiming all (or a large portion of) the memory will
free up more high-order slots. But that is quite unreasonable behavior
just to get a few !costly blocks of memory, because it affects most of
the processes. There should be some balance there.

> Even, there are not migratable but reclaimable pages and it could weak
> power of the compaction.
> 
> > > This failure shows that fundamental assumption of your patch is
> > > wrong. You triggers OOM even if there is enough reclaimable memory but
> > > no high order freepage depending on the fact that we can't guarantee
> > > that we can make high order page with reclaiming these reclaimable
> > > memory. Yes, we can't guarantee it but we also doesn't know if it
> > > can be possible or not. We should not stop reclaim until this
> > > estimation is is proved. Otherwise, it would be premature OOM.
> > 
> > We've been through this before and you keep repeating this argument. 
> > I have tried to explain that a deterministic behavior is more reasonable
> > than a random retry loops which pretty much depends on timing and which
> > can hugely over-reclaim which might be even worse than an OOM killer
> > invocation which would target a single process.
> 
> I didn't say that deterministic behavior is less reasonable. I like
> it. What I insist is the your criteria for deterministic behavior
> is wrong and please use another criteria for deterministic behavior.
> That's what I want.

I have structured my current criteria to be as independent of both
reclaim and compaction as possible while staying understandable at the
same time. I simply do not see how I would do it differently at this
moment. The current code behaves reasonably well with the workloads I
was testing. I am not claiming it will never need some surgery later
on, but I would rather see OOM reports and tweak the current
implementation in incremental steps than over-engineer something from
the very beginning for theoretical issues which I cannot get rid of
completely anyway. This is a _heuristic_ and as such it can handle
certain classes of workloads better than others. This is the case with
the current implementation as well.

[...]

> > Considering that we are in a clear disagreement in the compaction aspect
> > I think we need others to either back your concern or you show a clear
> > justification why compaction feedback is not viable way longterm even
> > after we make further changes which would make it less THP oriented.
> 
> I can't understand why I need to convince you. Conventionally, patch
> author needs to convince reviewer.

I am trying my best to clarify/justify my changes, but it is really
hard when you disagree with some core principles on the basis of
theoretical problems which I do not see in practice. This is a
heuristic, and as such it will never cover 100% of cases. I aim for it
to be as good as possible, and the results so far look reasonable to me.

> Anyway, above exmaple would be helpful to understand limitation of the
> compaction.

I understand that compaction is not omnipotent and I can see there
will be corner cases, but there have always been some in this area; I
am just replacing the current ones with less fuzzy ones.
 
> If explanation in this reply also would not convince you, I
> won't insist more. Discussing more on this topic would not be
> productive for us.

I am afraid I haven't heard a strong enough argument to re-evaluate my
current position. As I've said, we might need some tweaks here and
there in the future, but at least we can build on solid and
deterministic grounds, which I find the most important aspect of the
new implementation.
-- 
Michal Hocko
SUSE Labs


end of thread

Thread overview: 60+ messages
2016-04-20 19:47 [PATCH 0.14] oom detection rework v6 Michal Hocko
2016-04-20 19:47 ` [PATCH 01/14] vmscan: consider classzone_idx in compaction_ready Michal Hocko
2016-04-21  3:32   ` Hillf Danton
2016-05-04 13:56   ` Michal Hocko
2016-04-20 19:47 ` [PATCH 02/14] mm, compaction: change COMPACT_ constants into enum Michal Hocko
2016-04-20 19:47 ` [PATCH 03/14] mm, compaction: cover all compaction mode in compact_zone Michal Hocko
2016-04-20 19:47 ` [PATCH 04/14] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED Michal Hocko
2016-04-21  7:08   ` Hillf Danton
2016-04-20 19:47 ` [PATCH 05/14] mm, compaction: distinguish between full and partial COMPACT_COMPLETE Michal Hocko
2016-04-21  6:39   ` Hillf Danton
2016-04-20 19:47 ` [PATCH 06/14] mm, compaction: Update compaction_result ordering Michal Hocko
2016-04-21  6:45   ` Hillf Danton
2016-04-20 19:47 ` [PATCH 07/14] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface Michal Hocko
2016-04-21  6:50   ` Hillf Danton
2016-04-20 19:47 ` [PATCH 08/14] mm, compaction: Abstract compaction feedback to helpers Michal Hocko
2016-04-21  6:57   ` Hillf Danton
2016-04-28  8:47   ` Vlastimil Babka
2016-04-20 19:47 ` [PATCH 09/14] mm: use compaction feedback for thp backoff conditions Michal Hocko
2016-04-21  7:05   ` Hillf Danton
2016-04-28  8:53   ` Vlastimil Babka
2016-04-28 12:35     ` Michal Hocko
2016-04-29  9:16       ` Vlastimil Babka
2016-04-29  9:28         ` Michal Hocko
2016-04-20 19:47 ` [PATCH 10/14] mm, oom: rework oom detection Michal Hocko
2016-04-20 19:47 ` [PATCH 11/14] mm: throttle on IO only when there are too many dirty and writeback pages Michal Hocko
2016-04-20 19:47 ` [PATCH 12/14] mm, oom: protect !costly allocations some more Michal Hocko
2016-04-21  8:03   ` Hillf Danton
2016-05-04  6:01   ` Joonsoo Kim
2016-05-04  6:31     ` Joonsoo Kim
2016-05-04  8:56       ` Michal Hocko
2016-05-04 14:57         ` Joonsoo Kim
2016-05-04 18:19           ` Michal Hocko
2016-05-04  8:53     ` Michal Hocko
2016-05-04 14:39       ` Joonsoo Kim
2016-05-04 18:20         ` Michal Hocko
2016-04-20 19:47 ` [PATCH 13/14] mm: consider compaction feedback also for costly allocation Michal Hocko
2016-04-21  8:13   ` Hillf Danton
2016-04-20 19:47 ` [PATCH 14/14] mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders Michal Hocko
2016-04-21  8:24   ` Hillf Danton
2016-04-28  8:59   ` Vlastimil Babka
2016-04-28 12:39     ` Michal Hocko
2016-05-04  6:27   ` Joonsoo Kim
2016-05-04  9:04     ` Michal Hocko
2016-05-04 15:14       ` Joonsoo Kim
2016-05-04 19:22         ` Michal Hocko
2016-05-04  5:45 ` [PATCH 0.14] oom detection rework v6 Joonsoo Kim
2016-05-04  8:12   ` Vlastimil Babka
2016-05-04  8:32     ` Joonsoo Kim
2016-05-04  8:50     ` Michal Hocko
2016-05-04  8:47   ` Michal Hocko
2016-05-04 14:32     ` Joonsoo Kim
2016-05-04 18:16       ` Michal Hocko
2016-05-10  6:41         ` Joonsoo Kim
2016-05-10  7:09           ` Vlastimil Babka
2016-05-10  8:00             ` Joonsoo Kim
2016-05-10  9:44               ` Michal Hocko
2016-05-10  9:43           ` Michal Hocko
2016-05-12  2:23             ` Joonsoo Kim
2016-05-12  5:19               ` Joonsoo Kim
2016-05-12 10:59               ` Michal Hocko
