* [PATCH 0/3] OOM detection rework v4
@ 2015-12-15 18:19 ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

Hi,

This is v4 of the series. The previous version was posted [1]. I have
dropped the RFC because this has been sitting and waiting for
fundamental objections for quite some time and there were none. I still
do not think we should rush this; it should be merged no sooner than
4.6. Having it in mmotm and thus linux-next would expose it to much
larger testing coverage. I will iron out issues as they come, but
hopefully there will be no serious ones.

* Changes since v3
- factor out the new heuristic into its own function as suggested by
  Johannes (no functional changes)
* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
  wait_iff_congested related changes which needed refresh in
  patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
  the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- fixed a theoretical bug where __GFP_NOFAIL high-order allocations might fail

As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
communicate reclaim progress is rather dubious. I tend to agree. Not
only is it really obscure, it is also not hard to imagine cases where a
single page freed in a loop keeps all the reclaimers looping without
making any progress, because their gfp_mask wouldn't allow them to get
that page anyway (e.g. a single GFP_ATOMIC alloc and free loop). This is
rare enough that it doesn't happen in practice, but the current logic is
obscure, hard to follow and also non-deterministic.
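
For illustration only, here is a hypothetical kernel-space sketch (not
part of this series) of the pathological pattern mentioned above: a
context that keeps allocating and immediately freeing a single page from
the atomic reserves, so some page is always being "freed" even though
nothing changes for reclaimers whose gfp_mask cannot reach it.

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical illustration, not part of the series: one page is taken
 * from the atomic reserves and given straight back, over and over.  The
 * constantly recycled page is enough to make zone_reclaimable()-style
 * checks look like forward progress for everybody else.
 */
static void atomic_alloc_free_loop(void)
{
        struct page *page;

        for (;;) {
                page = alloc_page(GFP_ATOMIC);
                if (page)
                        __free_page(page);
        }
}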

This is an attempt to make the OOM detection more deterministic and
easier to follow, because each reclaimer basically tracks its own
progress, and this tracking is implemented at the page allocator layer
rather than spread out between the allocator and the reclaim path. More
on the implementation is described in the first patch.

I have tested several different scenarios, but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually only a tiny gap between almost-OOM and full-blown OOM, and it is
often time sensitive. Anyway, I have tested the following 3 scenarios
and I would appreciate suggestions for more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs,
without any swap to make the OOM behavior more deterministic.

1) 2 writers (each doing dd with 4M blocks to a 1G xfs partition,
   removing the files and starting over again) running in parallel for
   10s to build up a lot of dirty pages, at which point 100 parallel
   mem_eaters (anon private populated mmap which waits until it gets a
   signal) with 80M each are started; a minimal mem_eater sketch follows
   below.

   This causes an OOM flood of course, and I have compared both patched
   and unpatched kernels. The test is considered finished once no
   further OOM conditions are detected. This should tell us whether
   there are any excessive kills or whether some of them were premature:
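
   For reference, a minimal userspace sketch of what such a mem_eater
   might look like (the actual tool used for these runs is not included
   in this posting; size handling is simplified):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* default to 80M as used in the test above */
        size_t size = (argc > 1) ? strtoull(argv[1], NULL, 0) : 80UL << 20;
        void *mem;

        /* anon private mapping, populated up front so the memory is really used */
        mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        pause();        /* block until a signal arrives */
        return 0;
}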

I have performed two runs this time, each after a fresh boot.

* base kernel
$ grep "Killed process" base-oom-run1.log | tail -n1
[  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
$ grep "Killed process" base-oom-run2.log | tail -n1
[  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB

$ grep "invoked oom-killer" base-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" base-oom-run2.log | wc -l
76

The number of OOM invocations is consistent with my last measurements,
but the runtime is way too different (it took 800+s). One thing that
could have skewed the results was that I was running tail -f on the
serial log on the host system to watch the progress. I have stopped
doing that. The results are more consistent now but still too different
from the last time. This is really weird, so I've retested with the last
4.2 mmotm again and I am getting a consistent ~220s, which is really
close to the above. If I apply the WQ vmstat patch on top I am getting
close to 160s, so the stale vmstat counters made a difference, which is
to be expected. I have a new SSD in my laptop which might have made a
difference, but I wouldn't expect it to be that large.

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
4
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
1

* patched kernel
$ grep "Killed process" patched-oom-run1.log | tail -n1
[  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
$ grep "Killed process" patched-oom-run2.log | tail -n1
[  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

$ grep "invoked oom-killer" patched-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" patched-oom-run2.log | wc -l
77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
0

So the number of OOM killer invocations is the same, but the overall
runtime of the test was much longer with the patched kernel. This can be
attributed to more retries in general. The results from the base kernel
are quite inconsistent and I think that consistency is better here.


2) 2 writers again, running for 10s, and then 10 mem_eaters to consume as
   much memory as possible without triggering the OOM killer. This required
   a lot of tuning, but I've considered 3 consecutive runs without an OOM as
   a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(15*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(9*1024)}' /proc/meminfo)

It was -14M for the base 4.2 kernel and -7500M for the patched 4.2 kernel in
my last measurements.
The patched kernel handled the low memory conditions better and fired
the OOM killer later.

3) Costly high-order allocations with a limited amount of memory.
   Start 10 mem_eaters in parallel, each with
   size=$(awk '/MemTotal/{printf "%d\n", $2/10}' /proc/meminfo)
   This will cause an OOM which kills one of them (freeing up 200M), then
   try to use all the remaining space for hugetlb pages and see how many
   of them succeed. Kill everything, wait 2s and try again; a hypothetical
   sketch of the hugetlb step follows below.
   This tests whether we do not fail __GFP_REPEAT costly allocations too
   early now.
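
   For illustration, a hypothetical sketch of the hugetlb step (the exact
   test script is not part of this posting), assuming the pool is sized
   through /proc/sys/vm/nr_hugepages: request a number of huge pages and
   read back how many the kernel actually managed to allocate, which is
   what the counts below summarize.

#include <stdio.h>

int main(void)
{
        const char *path = "/proc/sys/vm/nr_hugepages";
        int want = 73, got = 0; /* 73 matches the "Trying to allocate 73" lines below */
        FILE *f;

        printf("Trying to allocate %d\n", want);

        /* growing the pool triggers costly high-order (__GFP_REPEAT) allocations */
        f = fopen(path, "w");
        if (!f)
                return 1;
        fprintf(f, "%d\n", want);
        fclose(f);

        /* read back how many huge pages were actually allocated */
        f = fopen(path, "r");
        if (!f)
                return 1;
        if (fscanf(f, "%d", &got) == 1)
                printf("%d\n", got);
        fclose(f);

        return 0;
}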
* base kernel
$ sort base-hugepages.log | uniq -c
      1 64
     13 65
      6 66
     20 Trying to allocate 73

* patched kernel
$ sort patched-hugepages.log | uniq -c
     17 65
      3 66
     20 Trying to allocate 73

This doesn't look bad either, but this particular test is quite timing
sensitive.

The above results do seem optimistic, but more loads should obviously
be tested. I would really appreciate feedback on the approach I have
chosen before I go into more tuning. Is this a viable way to go?

[1] http://lkml.kernel.org/r/1448974607-10208-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[3] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com


^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 1/3] mm, oom: rework oom detection
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath has traditionally relied on direct reclaim and
did_some_progress as an indicator that it makes sense to retry the
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress, to prevent a
premature OOM killer invocation - the LRU might be full of dirty or
writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several times
and restarting if a page is freed. This is really subtle behavior and it
might lead to a livelock when a single freed page keeps the allocator
looping while the current task will never be able to allocate that
single page. The OOM killer would be more appropriate than looping
without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of
shrink_zone, which sits at too low a level for high-level decisions such
as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath
which knows how many attempts have been made and what progress has been
achieved so far, so it is the more appropriate place to implement this
logic.

The new heuristic is implemented in the should_reclaim_retry helper
called from __alloc_pages_slowpath. It tries to be more deterministic
and easier to follow. It builds on the assumption that retrying makes
sense only if the currently reclaimable memory plus the free pages would
allow the current allocation request to succeed (as per
__zone_watermark_ok) for at least one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because writeback might get
stuck and reclaimable pages might be pinned for a really long time or
even depend on the current allocation context. Therefore a feedback
mechanism is implemented which reduces the reclaim target after each
reclaim round that makes no progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target, fail the
watermark check and proceed to OOM. The backoff is simple and linear,
dropping 1/16 of the reclaimable pages for each round without any
progress. We are optimistic and reset the counter after successful
reclaim rounds.
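
For illustration, a small standalone sketch of the backoff arithmetic
described above, with made-up numbers (the real implementation works on
per-zone counters and is shown in the patch below):

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
        /* made-up example: 100000 reclaimable pages and 2000 free pages in a zone */
        unsigned long reclaimable = 100000, free_pages = 2000;
        int loops;

        for (loops = 0; loops <= MAX_RECLAIM_RETRIES; loops++) {
                unsigned long available = reclaimable;

                /* linear backoff: drop 1/16 of the reclaimable pages per no-progress round */
                available -= DIV_ROUND_UP(loops * available, MAX_RECLAIM_RETRIES);
                available += free_pages;

                /* by round 16 only the free pages remain and the watermark check decides */
                printf("no_progress_loops=%2d available=%lu\n", loops, available);
        }
        return 0;
}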

Costly high-order requests mostly preserve their semantics: those
without __GFP_REPEAT fail right away, while those which have the flag
set back off once the number of reclaimed pages reaches the equivalent
of the requested order. The only difference is that if there was no
progress during the reclaim we rely on the zone watermark check. This is
a more logical thing to do than the previous 1<<order attempts, which
were a result of zone_reclaimable faking the progress.

[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

factor out the retry logic into a separate function - per Johannes
---
 include/linux/swap.h |  1 +
 mm/page_alloc.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c          | 25 +++------------
 3 files changed, 88 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..738ae2206635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e267faad4649..f77e283fb8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+		     struct alloc_context *ac, int alloc_flags,
+		     bool did_some_progress, unsigned long pages_reclaimed,
+		     int no_progress_loops)
+{
+	struct zone *zone;
+	struct zoneref *z;
+
+	/*
+	 * Make sure we converge to OOM if we cannot make any progress
+	 * several times in the row.
+	 */
+	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+		return false;
+
+	/* Do not retry high order allocations unless they are __GFP_REPEAT */
+	if (order > PAGE_ALLOC_COSTLY_ORDER) {
+		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
+			return false;
+
+		if (did_some_progress)
+			return true;
+	}
+
+	/*
+	 * Keep reclaiming pages while there is a chance this will lead somewhere.
+	 * If none of the target zones can satisfy our allocation request even
+	 * if all reclaimable pages are considered then we are screwed and have
+	 * to go OOM.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
+		unsigned long available;
+
+		available = zone_reclaimable_pages(zone);
+		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/*
+		 * Would the allocation succeed if we reclaimed the whole available?
+		 */
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac->high_zoneidx, alloc_flags, available)) {
+			/* Wait for some write requests to complete then retry */
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2996,6 +3065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int no_progress_loops = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3155,23 +3225,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	/* Keep reclaiming pages as long as there is reasonable progress */
-	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto retry;
+	if (did_some_progress) {
+		no_progress_loops = 0;
+		pages_reclaimed += did_some_progress;
+	} else {
+		no_progress_loops++;
 	}
 
+	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+				 did_some_progress > 0, pages_reclaimed,
+				 no_progress_loops))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		no_progress_loops = 0;
 		goto retry;
+	}
 
 noretry:
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..489212252cd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,7 +192,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
@@ -2516,10 +2516,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2527,7 +2525,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
-	bool reclaimable = false;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
-			if (nr_soft_reclaimed)
-				reclaimable = true;
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
-			reclaimable = true;
-
-		if (global_reclaim(sc) &&
-		    !reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		shrink_zone(zone, sc, zone_idx(zone));
 	}
 
 	/*
@@ -2610,8 +2600,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * promoted it to __GFP_HIGHMEM.
 	 */
 	sc->gfp_mask = orig_mask;
-
-	return reclaimable;
 }
 
 /*
@@ -2636,7 +2624,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
-	bool zones_reclaimable;
 retry:
 	delayacct_freepages_start();
 
@@ -2647,7 +2634,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		zones_reclaimable = shrink_zones(zonelist, sc);
+		shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2694,10 +2681,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		goto retry;
 	}
 
-	/* Any of the zones still reclaimable?  Don't OOM. */
-	if (zones_reclaimable)
-		return 1;
-
 	return 0;
 }
 
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 299+ messages in thread

* [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wait_iff_congested has been used to throttle the allocator before it
retried another round of direct reclaim, to allow writeback to make some
progress and to prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback, and yet the bdi is not
congested, so this is just a cond_resched in the end and can end up
triggering a premature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait, which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages
amount to more than half of the reclaimable pages on the zone which
might be usable for our target allocation. This shouldn't reintroduce
the stalls fixed by 0e093d99763e because congestion_wait is called only
when we are getting hopeless and sleeping is a better choice than going
OOM with many pages under IO.

We have to preserve the logic introduced by "mm, vmstat: allow WQ
concurrency to discover memory reclaim doesn't make any progress" in
__alloc_pages_slowpath now that wait_iff_congested is not used there
anymore. As the only remaining user of wait_iff_congested is
shrink_inactive_list, we can remove the WQ-specific short sleep from
wait_iff_congested, because the sleep needs to be done only once per
allocation retry cycle.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/backing-dev.c | 19 +++----------------
 mm/page_alloc.c  | 36 +++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7340353f8aea..d2473ce9cc57 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -980,19 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
 
-		/*
-		 * Memory allocation/reclaim might be called from a WQ
-		 * context and the current implementation of the WQ
-		 * concurrency control doesn't recognize that a particular
-		 * WQ is congested if the worker thread is looping without
-		 * ever sleeping. Therefore we have to do a short sleep
-		 * here rather than calling cond_resched().
-		 */
-		if (current->flags & PF_WQ_WORKER)
-			schedule_timeout(1);
-		else
-			cond_resched();
-
+		cond_resched();
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f77e283fb8c6..b2de8c8761ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3034,8 +3034,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
 		unsigned long available;
+		unsigned long reclaimable;
 
-		available = zone_reclaimable_pages(zone);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
@@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, available)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback;
+			unsigned long dirty;
+
+			writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
+			dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent from pre mature OOM
+			 */
+			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				return true;
+			}
+
+			/*
+			 * Memory allocation/reclaim might be called from a WQ
+			 * context and the current implementation of the WQ
+			 * concurrency control doesn't recognize that
+			 * a particular WQ is congested if the worker thread is
+			 * looping without ever sleeping. Therefore we have to
+			 * do a short sleep here rather than calling
+			 * cond_resched().
+			 */
+			if (current->flags & PF_WQ_WORKER)
+				schedule_timeout(1);
+			else
+				cond_resched();
+
 			return true;
 		}
 	}
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 299+ messages in thread

* [PATCH 3/3] mm: use watermark checks for __GFP_REPEAT high order allocations
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath retries costly allocations until at least an
order worth of pages has been reclaimed, or until the watermark check
for at least one zone would succeed after reclaiming all reclaimable
pages, if the reclaim hasn't made any progress.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim has
been removed quite some time ago, so the assumption doesn't hold
anymore. It would be more appropriate to check the compaction progress
instead, but this patch simply removes the check and relies solely on
the watermark check.

To prevent too many retries, no_progress_loops is not reset after a
reclaim which made progress, because we cannot assume it helped the
high-order situation. Only costly allocation requests depended on
pages_reclaimed, so we can drop it.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2de8c8761ad..268de1654128 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2994,17 +2994,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress, unsigned long pages_reclaimed,
+		     bool did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3018,13 +3018,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
-			return false;
-
-		if (did_some_progress)
-			return true;
-	}
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+		return false;
 
 	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
@@ -3090,7 +3085,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
-	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
@@ -3255,16 +3249,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	if (did_some_progress) {
+	/*
+	 * Costly allocations might have made a progress but this doesn't mean
+	 * their order will become available due to high fragmentation so do
+	 * not reset the no progress counter for them
+	 */
+	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
 		no_progress_loops = 0;
-		pages_reclaimed += did_some_progress;
-	} else {
+	else
 		no_progress_loops++;
-	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, pages_reclaimed,
-				 no_progress_loops))
+				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 299+ messages in thread

* [PATCH 3/3] mm: use watermak checks for __GFP_REPEAT high order allocations
@ 2015-12-15 18:19   ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath retries costly allocations until at least
order worth of pages were reclaimed or the watermark check for at least
one zone would succeed after all reclaiming all pages if the reclaim
hasn't made any progress.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations) and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim,
has been removed quite some time ago so the assumption doesn't hold
anymore. It would be more appropriate to check the compaction progress
instead but this patch simply removes the check and relies solely
on the watermark check.

To prevent from too many retries the no_progress_loops is not reseted after
a reclaim which made progress because we cannot assume it helped high
order situation. Only costly allocation requests depended on
pages_reclaimed so we can drop it.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2de8c8761ad..268de1654128 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2994,17 +2994,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress, unsigned long pages_reclaimed,
+		     bool did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3018,13 +3018,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
-			return false;
-
-		if (did_some_progress)
-			return true;
-	}
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+		return false;
 
 	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
@@ -3090,7 +3085,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
-	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
@@ -3255,16 +3249,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	if (did_some_progress) {
+	/*
+	 * Costly allocations might have made a progress but this doesn't mean
+	 * their order will become available due to high fragmentation so do
+	 * not reset the no progress counter for them
+	 */
+	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
 		no_progress_loops = 0;
-		pages_reclaimed += did_some_progress;
-	} else {
+	else
 		no_progress_loops++;
-	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, pages_reclaimed,
-				 no_progress_loops))
+				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
-- 
2.6.2


^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-16 23:35   ` Andrew Morton
  -1 siblings, 0 replies; 299+ messages in thread
From: Andrew Morton @ 2015-12-16 23:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress which is implemented at the page allocator layer rather spread
> out between the allocator and the reclaim. The more on the implementation
> is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well.  This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything).  It's not so easy in the case of
oom-too-late-or-never.  The reporter's machine just hangs or it
twiddles thumbs for five minutes then goes oom.  But there are things
we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
  start going wrong and turns on diagnostics (this would need an enable
  knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
  the diagnostics (a rough sketch of such a knob follows below).

- sprinkle tracepoints everywhere and provide a set of
  instructions/scripts so that people who know nothing about kernel
  internals or tracing can easily gather the info we need to understand
  issues.

- add a sysrq key to turn on diagnostics.  Pretty essential when the
  machine is comatose and doesn't respond to keystrokes.

- something else

So...  please have a think about it?  What can we add in here to make it
as easy as possible for us (ie: you ;)) to get this code working well? 
At this time, too much developer support code will be better than too
little.  We can take it out later on.
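
To make the debugfs option above concrete, the knob could be as small as a
boolean that the allocator/reclaim slowpath consults before dumping extra
state. The sketch below is purely illustrative and not part of this series;
debugfs_create_bool() is the stock debugfs helper, while the names
(oom_debug, oom_diag, oom_diag_enabled) and the call site are made up:

#include <linux/debugfs.h>
#include <linux/printk.h>
#include <linux/init.h>
#include <linux/types.h>

static bool oom_diag_enabled;   /* toggled via /sys/kernel/debug/oom_debug */

static int __init oom_diag_init(void)
{
        debugfs_create_bool("oom_debug", 0644, NULL, &oom_diag_enabled);
        return 0;
}
late_initcall(oom_diag_init);

/* e.g. called from the allocator slowpath around the retry decision */
void oom_diag(const char *what, int no_progress_loops, unsigned long progress)
{
        if (oom_diag_enabled)
                pr_info("oom_diag: %s no_progress_loops=%d did_some_progress=%lu\n",
                        what, no_progress_loops, progress);
}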

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-16 23:58   ` Andrew Morton
  -1 siblings, 0 replies; 299+ messages in thread
From: Andrew Morton @ 2015-12-16 23:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> 
> ...
>
> * base kernel
> $ grep "Killed process" base-oom-run1.log | tail -n1
> [  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> $ grep "Killed process" base-oom-run2.log | tail -n1
> [  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
> 
> $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> 76
> 
> The number of OOM invocations is consistent with my last measurements
> but the runtime is way too different (it took 800+s).

I'm seeing 211 seconds vs 157 seconds?  If so, that's not toooo bad.  I
assume the 800+s is sum-across-multiple-CPUs?  Given that all the CPUs
are pounding away at the same data and the same disk, that doesn't
sound like very interesting info - the overall elapsed time is the
thing to look at in this case.

> One thing that
> could have skewed results was that I was tail -f the serial log on the
> host system to see the progress. I have stopped doing that. The results
> are more consistent now but still too different from the last time.
> This is really weird so I've retested with the last 4.2 mmotm again and
> I am getting consistent ~220s which is really close to the above. If I
> apply the WQ vmstat patch on top I am getting close to 160s so the stale
> vmstat counters made a difference which is to be expected. I have a new
> SSD in my laptop which migh have made a difference but I wouldn't expect
> it to be that large.
> 
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
> 4
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
> 1
> 
> * patched kernel
> $ grep "Killed process" patched-oom-run1.log | tail -n1
> [  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> $ grep "Killed process" patched-oom-run2.log | tail -n1
> [  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

Even better.

> $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> 77
> 
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> 1
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> 0
> 
> So the number of OOM killer invocation is the same but the overall
> runtime of the test was much longer with the patched kernel. This can be
> attributed to more retries in general. The results from the base kernel
> are quite inconsitent and I think that consistency is better here.

It's hard to say how long declaration of oom should take.  Correctness
comes first.  But what is "correct"?  oom isn't a binary condition -
there's a chance that if we keep churning away for another 5 minutes
we'll be able to satisfy this allocation (but probably not the next
one).  There are tradeoffs between promptness-of-declaring-oom and
exhaustiveness-in-avoiding-it.

> 
> 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
>    memory as possible without triggering the OOM killer. This required a lot
>    of tuning but I've considered 3 consecutive runs without OOM as a success.

"a lot of tuning" sounds bad.  It means that the tuning settings you
have now for a particular workload on a particular machine will be
wrong for other workloads and machines.  uh-oh.

> ...

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-16 23:35   ` Andrew Morton
@ 2015-12-18 12:12     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-18 12:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 16-12-15 15:35:13, Andrew Morton wrote:
[...]
> So...  please have a think about it?  What can we add in here to make it
> as easy as possible for us (ie: you ;)) to get this code working well? 
> At this time, too much developer support code will be better than too
> little.  We can take it out later on.

Sure. I will think about this and get back to it early next year. I will
be mostly offline starting next week.

Thanks for looking into this!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-16 23:58   ` Andrew Morton
@ 2015-12-18 13:15     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-18 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > 
> > ...
> >
> > * base kernel
> > $ grep "Killed process" base-oom-run1.log | tail -n1
> > [  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> > $ grep "Killed process" base-oom-run2.log | tail -n1
> > [  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
> > 
> > $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> > 76
> > 
> > The number of OOM invocations is consistent with my last measurements
> > but the runtime is way too different (it took 800+s).
> 
> I'm seeing 211 seconds vs 157 seconds?  If so, that's not toooo bad.  I
> assume the 800+s is sum-across-multiple-CPUs?

This is the time until the oom situation settled down. And I really
suspect that the new SSD made a difference here.

> Given that all the CPUs
> are pounding away at the same data and the same disk, that doesn't
> sound like very interesting info - the overall elapsed time is the
> thing to look at in this case.

Which is what I was looking at when checking the timestamp in the log.
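
FWIW the "runtime" above is just the timestamp of the last OOM kill in each
run, i.e. something along these lines (log names as in the runs quoted
above) extracts it:

$ grep "Killed process" base-oom-run1.log | tail -n1 | sed -e 's/^\[ *//' -e 's/\].*//'
211.824379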

[...]
> > * patched kernel
> > $ grep "Killed process" patched-oom-run1.log | tail -n1
> > [  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> > $ grep "Killed process" patched-oom-run2.log | tail -n1
> > [  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB
> 
> Even better.
> 
> > $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> > 77
> > 
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> > 1
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> > 0
> > 
> > So the number of OOM killer invocation is the same but the overall
> > runtime of the test was much longer with the patched kernel. This can be
> > attributed to more retries in general. The results from the base kernel
> > are quite inconsitent and I think that consistency is better here.
> 
> It's hard to say how long declaration of oom should take.  Correctness
> comes first.  But what is "correct"?  oom isn't a binary condition -
> there's a chance that if we keep churning away for another 5 minutes
> we'll be able to satisfy this allocation (but probably not the next
> one).  There are tradeoffs between promptness-of-declaring-oom and
> exhaustiveness-in-avoiding-it.

Yes, this is really hard to tell. What I wanted to achieve here is
determinism - the same load should give comparable results. It seems
that there is an improvement in this regard. The time to settle is
much more consistent than with the original implementation.
 
> > 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
> >    memory as possible without triggering the OOM killer. This required a lot
> >    of tuning but I've considered 3 consecutive runs without OOM as a success.
> 
> "a lot of tuning" sounds bad.  It means that the tuning settings you
> have now for a particular workload on a particular machine will be
> wrong for other workloads and machines.  uh-oh.

Well, I had to tune the test to see how close to the edge I could get. I
haven't made any decisions based on this test.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-18 13:15     ` Michal Hocko
@ 2015-12-18 16:35       ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2015-12-18 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Fri, Dec 18, 2015 at 02:15:09PM +0100, Michal Hocko wrote:
> On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> > It's hard to say how long declaration of oom should take.  Correctness
> > comes first.  But what is "correct"?  oom isn't a binary condition -
> > there's a chance that if we keep churning away for another 5 minutes
> > we'll be able to satisfy this allocation (but probably not the next
> > one).  There are tradeoffs between promptness-of-declaring-oom and
> > exhaustiveness-in-avoiding-it.
> 
> Yes, this is really hard to tell. What I wanted to achieve here is a
> determinism - the same load should give comparable results. It seems
> that there is an improvement in this regards. The time to settle is 
> much more consistent than with the original implementation.

+1

Before that we couldn't even really make a meaningful statement about
how long we are going to try - "as long as reclaim thinks it can maybe
do some more, depending on heuristics". I think the best thing we can
strive for with OOM is to make the rules simple and predictable.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-24 12:41   ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-24 12:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

I got OOM killers while running heavy disk I/O (extracting kernel source,
running lxr's genxref command). (Environment: 4 CPUs / 2048MB RAM / no swap / XFS)
Do you think these OOM killers are reasonable? Or is the detection too weak
against fragmentation?

[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.432780] kthreadd cpuset=/ mems_allowed=0
[ 3902.433904] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 3902.435463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3902.437541]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 3902.439129]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 3902.440779]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 3902.442436] Call Trace:
[ 3902.443094]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 3902.444188]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 3902.445301]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3902.446656]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 3902.447881]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 3902.449093]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 3902.450266]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 3902.451430]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 3902.452757]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 3902.454468]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 3902.455756]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 3902.457076]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 3902.458396]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 3902.459480]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 3902.460775]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 3902.461894]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 3902.463035]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 3902.464230]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.465502]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 3902.466648]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.467953] Mem-Info:
[ 3902.468537] active_anon:20817 inactive_anon:2098 isolated_anon:0
[ 3902.468537]  active_file:145434 inactive_file:145453 isolated_file:0
[ 3902.468537]  unevictable:0 dirty:20613 writeback:7248 unstable:0
[ 3902.468537]  slab_reclaimable:86363 slab_unreclaimable:14905
[ 3902.468537]  mapped:6670 shmem:2167 pagetables:1497 bounce:0
[ 3902.468537]  free:5422 free_pcp:75 free_cma:0
[ 3902.476541] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3268kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:36kB shmem:216kB slab_reclaimable:3708kB slab_unreclaimable:456kB kernel_stack:48kB pagetables:160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3902.486494] lowmem_reserve[]: 0 1714 1714 1714
[ 3902.487659] Node 0 DMA32 free:13760kB min:5172kB low:6464kB high:7756kB active_anon:80000kB inactive_anon:8192kB active_file:581780kB inactive_file:581848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:82312kB writeback:29588kB mapped:26648kB shmem:8452kB slab_reclaimable:341744kB slab_unreclaimable:59496kB kernel_stack:3456kB pagetables:5828kB unstable:0kB bounce:0kB free_pcp:732kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:560 all_unreclaimable? no
[ 3902.500438] lowmem_reserve[]: 0 0 0 0
[ 3902.502373] Node 0 DMA: 42*4kB (UME) 84*8kB (UM) 57*16kB (UM) 15*32kB (UM) 11*64kB (M) 9*128kB (UME) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 0*2048kB 0*4096kB = 6904kB
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 3902.511236] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3902.513938] 292144 total pagecache pages
[ 3902.515609] 0 pages in swap cache
[ 3902.517139] Swap cache stats: add 0, delete 0, find 0/0
[ 3902.519153] Free swap  = 0kB
[ 3902.520587] Total swap = 0kB
[ 3902.522095] 524157 pages RAM
[ 3902.523511] 0 pages HighMem/MovableOnly
[ 3902.525091] 80441 pages reserved
[ 3902.526580] 0 pages hwpoisoned
[ 3902.528169] Out of memory: Kill process 687 (firewalld) score 11 or sacrifice child
[ 3902.531017] Killed process 687 (firewalld) total-vm:323600kB, anon-rss:17032kB, file-rss:4896kB, shmem-rss:0kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.903629] smbd cpuset=/ mems_allowed=0
[ 5262.904725] CPU: 2 PID: 3935 Comm: smbd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5262.906401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5262.908679]  0000000000000000 00000000eaa24b41 ffff88007c37faf8 ffffffff81395bc3
[ 5262.910459]  0000000000000000 ffff88007c37fb98 ffffffff811babac 0000000000000206
[ 5262.912224]  ffffffff81810470 ffff88007c37fb38 ffffffff810bce29 0000000000000206
[ 5262.914019] Call Trace:
[ 5262.914839]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5262.916118]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5262.917493]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5262.919131]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5262.920690]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5262.922204]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5262.923863]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5262.925386]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5262.927121]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5262.928738]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5262.930438]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.932110]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5262.933410]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.935016]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 5262.936632]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 5262.938383]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 5262.940024]  [<ffffffff8106e5a4>] SyS_clone+0x14/0x20
[ 5262.941465]  [<ffffffff816fc532>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 5262.943137] Mem-Info:
[ 5262.944068] active_anon:37901 inactive_anon:2095 isolated_anon:0
[ 5262.944068]  active_file:134812 inactive_file:135474 isolated_file:0
[ 5262.944068]  unevictable:0 dirty:257 writeback:0 unstable:0
[ 5262.944068]  slab_reclaimable:90770 slab_unreclaimable:12759
[ 5262.944068]  mapped:4223 shmem:2166 pagetables:1428 bounce:0
[ 5262.944068]  free:3738 free_pcp:49 free_cma:0
[ 5262.953176] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:900kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5262.963749] lowmem_reserve[]: 0 1714 1714 1714
[ 5262.965434] Node 0 DMA32 free:8048kB min:5172kB low:6464kB high:7756kB active_anon:150704kB inactive_anon:8180kB active_file:539244kB inactive_file:541892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:1028kB writeback:0kB mapped:16860kB shmem:8448kB slab_reclaimable:357524kB slab_unreclaimable:50324kB kernel_stack:3232kB pagetables:5560kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? no
[ 5262.976879] lowmem_reserve[]: 0 0 0 0
[ 5262.978586] Node 0 DMA: 58*4kB (UME) 60*8kB (UME) 73*16kB (UME) 23*32kB (UME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6904kB
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5262.987124] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5262.989532] 272459 total pagecache pages
[ 5262.991203] 0 pages in swap cache
[ 5262.992583] Swap cache stats: add 0, delete 0, find 0/0
[ 5262.994334] Free swap  = 0kB
[ 5262.995787] Total swap = 0kB
[ 5262.997038] 524157 pages RAM
[ 5262.998270] 0 pages HighMem/MovableOnly
[ 5262.999683] 80441 pages reserved
[ 5263.001153] 0 pages hwpoisoned
[ 5263.002612] Out of memory: Kill process 26226 (genxref) score 54 or sacrifice child
[ 5263.004648] Killed process 26226 (genxref) total-vm:130348kB, anon-rss:94680kB, file-rss:4756kB, shmem-rss:0kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.767289] kthreadd cpuset=/ mems_allowed=0
[ 5269.768904] CPU: 2 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5269.770956] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5269.773754]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 5269.776088]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 5269.778213]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 5269.780497] Call Trace:
[ 5269.781796]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5269.783634]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5269.786116]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5269.788495]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5269.790538]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5269.792755]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5269.794784]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5269.796848]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5269.799038]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 5269.801073]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5269.803186]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5269.805249]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 5269.807374]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5269.809089]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 5269.811146]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 5269.812944]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 5269.814698]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 5269.816330]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.818088]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 5269.819685]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.821399] Mem-Info:
[ 5269.822430] active_anon:14280 inactive_anon:2095 isolated_anon:0
[ 5269.822430]  active_file:134344 inactive_file:134515 isolated_file:0
[ 5269.822430]  unevictable:0 dirty:2 writeback:0 unstable:0
[ 5269.822430]  slab_reclaimable:96214 slab_unreclaimable:22185
[ 5269.822430]  mapped:3512 shmem:2166 pagetables:1368 bounce:0
[ 5269.822430]  free:12388 free_pcp:51 free_cma:0
[ 5269.831310] Node 0 DMA free:6892kB min:44kB low:52kB high:64kB active_anon:856kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:768kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.840580] lowmem_reserve[]: 0 1714 1714 1714
[ 5269.842107] Node 0 DMA32 free:42660kB min:5172kB low:6464kB high:7756kB active_anon:56264kB inactive_anon:8180kB active_file:537372kB inactive_file:538056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:8kB writeback:0kB mapped:14020kB shmem:8448kB slab_reclaimable:379300kB slab_unreclaimable:87972kB kernel_stack:3232kB pagetables:5320kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.852375] lowmem_reserve[]: 0 0 0 0
[ 5269.853784] Node 0 DMA: 67*4kB (ME) 60*8kB (UME) 72*16kB (ME) 22*32kB (ME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6892kB
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB
[ 5269.861551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5269.863676] 271012 total pagecache pages
[ 5269.865100] 0 pages in swap cache
[ 5269.866366] Swap cache stats: add 0, delete 0, find 0/0
[ 5269.867996] Free swap  = 0kB
[ 5269.869363] Total swap = 0kB
[ 5269.870593] 524157 pages RAM
[ 5269.871857] 0 pages HighMem/MovableOnly
[ 5269.873604] 80441 pages reserved
[ 5269.874937] 0 pages hwpoisoned
[ 5269.876207] Out of memory: Kill process 2710 (tuned) score 7 or sacrifice child
[ 5269.878265] Killed process 2710 (tuned) total-vm:553052kB, anon-rss:10596kB, file-rss:2776kB, shmem-rss:0kB

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2015-12-24 12:41   ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-24 12:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

I got OOM killers while running heavy disk I/O (extracting kernel source,
running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
Do you think these OOM killers reasonable? Too weak against fragmentation?

[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.432780] kthreadd cpuset=/ mems_allowed=0
[ 3902.433904] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 3902.435463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3902.437541]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 3902.439129]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 3902.440779]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 3902.442436] Call Trace:
[ 3902.443094]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 3902.444188]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 3902.445301]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3902.446656]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 3902.447881]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 3902.449093]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 3902.450266]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 3902.451430]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 3902.452757]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 3902.454468]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 3902.455756]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 3902.457076]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 3902.458396]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 3902.459480]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 3902.460775]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 3902.461894]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 3902.463035]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 3902.464230]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.465502]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 3902.466648]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.467953] Mem-Info:
[ 3902.468537] active_anon:20817 inactive_anon:2098 isolated_anon:0
[ 3902.468537]  active_file:145434 inactive_file:145453 isolated_file:0
[ 3902.468537]  unevictable:0 dirty:20613 writeback:7248 unstable:0
[ 3902.468537]  slab_reclaimable:86363 slab_unreclaimable:14905
[ 3902.468537]  mapped:6670 shmem:2167 pagetables:1497 bounce:0
[ 3902.468537]  free:5422 free_pcp:75 free_cma:0
[ 3902.476541] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3268kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:36kB shmem:216kB slab_reclaimable:3708kB slab_unreclaimable:456kB kernel_stack:48kB pagetables:160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3902.486494] lowmem_reserve[]: 0 1714 1714 1714
[ 3902.487659] Node 0 DMA32 free:13760kB min:5172kB low:6464kB high:7756kB active_anon:80000kB inactive_anon:8192kB active_file:581780kB inactive_file:581848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:82312kB writeback:29588kB mapped:26648kB shmem:8452kB slab_reclaimable:341744kB slab_unreclaimable:59496kB kernel_stack:3456kB pagetables:5828kB unstable:0kB bounce:0kB free_pcp:732kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:560 all_unreclaimable? no
[ 3902.500438] lowmem_reserve[]: 0 0 0 0
[ 3902.502373] Node 0 DMA: 42*4kB (UME) 84*8kB (UM) 57*16kB (UM) 15*32kB (UM) 11*64kB (M) 9*128kB (UME) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 0*2048kB 0*4096kB = 6904kB
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 3902.511236] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3902.513938] 292144 total pagecache pages
[ 3902.515609] 0 pages in swap cache
[ 3902.517139] Swap cache stats: add 0, delete 0, find 0/0
[ 3902.519153] Free swap  = 0kB
[ 3902.520587] Total swap = 0kB
[ 3902.522095] 524157 pages RAM
[ 3902.523511] 0 pages HighMem/MovableOnly
[ 3902.525091] 80441 pages reserved
[ 3902.526580] 0 pages hwpoisoned
[ 3902.528169] Out of memory: Kill process 687 (firewalld) score 11 or sacrifice child
[ 3902.531017] Killed process 687 (firewalld) total-vm:323600kB, anon-rss:17032kB, file-rss:4896kB, shmem-rss:0kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.903629] smbd cpuset=/ mems_allowed=0
[ 5262.904725] CPU: 2 PID: 3935 Comm: smbd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5262.906401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5262.908679]  0000000000000000 00000000eaa24b41 ffff88007c37faf8 ffffffff81395bc3
[ 5262.910459]  0000000000000000 ffff88007c37fb98 ffffffff811babac 0000000000000206
[ 5262.912224]  ffffffff81810470 ffff88007c37fb38 ffffffff810bce29 0000000000000206
[ 5262.914019] Call Trace:
[ 5262.914839]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5262.916118]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5262.917493]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5262.919131]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5262.920690]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5262.922204]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5262.923863]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5262.925386]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5262.927121]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5262.928738]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5262.930438]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.932110]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5262.933410]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.935016]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 5262.936632]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 5262.938383]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 5262.940024]  [<ffffffff8106e5a4>] SyS_clone+0x14/0x20
[ 5262.941465]  [<ffffffff816fc532>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 5262.943137] Mem-Info:
[ 5262.944068] active_anon:37901 inactive_anon:2095 isolated_anon:0
[ 5262.944068]  active_file:134812 inactive_file:135474 isolated_file:0
[ 5262.944068]  unevictable:0 dirty:257 writeback:0 unstable:0
[ 5262.944068]  slab_reclaimable:90770 slab_unreclaimable:12759
[ 5262.944068]  mapped:4223 shmem:2166 pagetables:1428 bounce:0
[ 5262.944068]  free:3738 free_pcp:49 free_cma:0
[ 5262.953176] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:900kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5262.963749] lowmem_reserve[]: 0 1714 1714 1714
[ 5262.965434] Node 0 DMA32 free:8048kB min:5172kB low:6464kB high:7756kB active_anon:150704kB inactive_anon:8180kB active_file:539244kB inactive_file:541892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:1028kB writeback:0kB mapped:16860kB shmem:8448kB slab_reclaimable:357524kB slab_unreclaimable:50324kB kernel_stack:3232kB pagetables:5560kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? no
[ 5262.976879] lowmem_reserve[]: 0 0 0 0
[ 5262.978586] Node 0 DMA: 58*4kB (UME) 60*8kB (UME) 73*16kB (UME) 23*32kB (UME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6904kB
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5262.987124] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5262.989532] 272459 total pagecache pages
[ 5262.991203] 0 pages in swap cache
[ 5262.992583] Swap cache stats: add 0, delete 0, find 0/0
[ 5262.994334] Free swap  = 0kB
[ 5262.995787] Total swap = 0kB
[ 5262.997038] 524157 pages RAM
[ 5262.998270] 0 pages HighMem/MovableOnly
[ 5262.999683] 80441 pages reserved
[ 5263.001153] 0 pages hwpoisoned
[ 5263.002612] Out of memory: Kill process 26226 (genxref) score 54 or sacrifice child
[ 5263.004648] Killed process 26226 (genxref) total-vm:130348kB, anon-rss:94680kB, file-rss:4756kB, shmem-rss:0kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.767289] kthreadd cpuset=/ mems_allowed=0
[ 5269.768904] CPU: 2 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5269.770956] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5269.773754]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 5269.776088]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 5269.778213]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 5269.780497] Call Trace:
[ 5269.781796]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5269.783634]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5269.786116]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5269.788495]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5269.790538]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5269.792755]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5269.794784]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5269.796848]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5269.799038]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 5269.801073]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5269.803186]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5269.805249]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 5269.807374]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5269.809089]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 5269.811146]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 5269.812944]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 5269.814698]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 5269.816330]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.818088]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 5269.819685]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.821399] Mem-Info:
[ 5269.822430] active_anon:14280 inactive_anon:2095 isolated_anon:0
[ 5269.822430]  active_file:134344 inactive_file:134515 isolated_file:0
[ 5269.822430]  unevictable:0 dirty:2 writeback:0 unstable:0
[ 5269.822430]  slab_reclaimable:96214 slab_unreclaimable:22185
[ 5269.822430]  mapped:3512 shmem:2166 pagetables:1368 bounce:0
[ 5269.822430]  free:12388 free_pcp:51 free_cma:0
[ 5269.831310] Node 0 DMA free:6892kB min:44kB low:52kB high:64kB active_anon:856kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:768kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.840580] lowmem_reserve[]: 0 1714 1714 1714
[ 5269.842107] Node 0 DMA32 free:42660kB min:5172kB low:6464kB high:7756kB active_anon:56264kB inactive_anon:8180kB active_file:537372kB inactive_file:538056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:8kB writeback:0kB mapped:14020kB shmem:8448kB slab_reclaimable:379300kB slab_unreclaimable:87972kB kernel_stack:3232kB pagetables:5320kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.852375] lowmem_reserve[]: 0 0 0 0
[ 5269.853784] Node 0 DMA: 67*4kB (ME) 60*8kB (UME) 72*16kB (ME) 22*32kB (ME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6892kB
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB
[ 5269.861551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5269.863676] 271012 total pagecache pages
[ 5269.865100] 0 pages in swap cache
[ 5269.866366] Swap cache stats: add 0, delete 0, find 0/0
[ 5269.867996] Free swap  = 0kB
[ 5269.869363] Total swap = 0kB
[ 5269.870593] 524157 pages RAM
[ 5269.871857] 0 pages HighMem/MovableOnly
[ 5269.873604] 80441 pages reserved
[ 5269.874937] 0 pages hwpoisoned
[ 5269.876207] Out of memory: Kill process 2710 (tuned) score 7 or sacrifice child
[ 5269.878265] Killed process 2710 (tuned) total-vm:553052kB, anon-rss:10596kB, file-rss:2776kB, shmem-rss:0kB

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-24 12:41   ` Tetsuo Handa
@ 2015-12-28 12:08     ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-28 12:08 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers reasonable? Too weak against fragmentation?

Well, the current patch invokes the OOM killer while more than 75% of memory is
being used for file cache (active_file: + inactive_file:); in the first log line
below, for example, (985160 + 615436) / 2021100 is roughly 79% of managed memory.
I think this will be surprising for administrators, and we want to retry harder
(but not forever, please).

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151228.txt.xz .
----------
[  277.863985] Node 0 DMA32 free:20128kB min:5564kB low:6952kB high:8344kB active_anon:108332kB inactive_anon:8252kB active_file:985160kB inactive_file:615436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5904kB shmem:8524kB slab_reclaimable:52088kB slab_unreclaimable:59748kB kernel_stack:31280kB pagetables:55708kB unstable:0kB bounce:0kB free_pcp:1056kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.331040] Node 0 DMA32 free:29500kB min:5564kB low:6952kB high:8344kB active_anon:126756kB inactive_anon:8252kB active_file:821500kB inactive_file:604016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:12684kB shmem:8524kB slab_reclaimable:56808kB slab_unreclaimable:99804kB kernel_stack:58448kB pagetables:92552kB unstable:0kB bounce:0kB free_pcp:2004kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.897985] Node 0 DMA32 free:28240kB min:5564kB low:6952kB high:8344kB active_anon:79344kB inactive_anon:8248kB active_file:1016568kB inactive_file:604696kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:80kB writeback:0kB mapped:13004kB shmem:8520kB slab_reclaimable:52076kB slab_unreclaimable:64064kB kernel_stack:35168kB pagetables:48552kB unstable:0kB bounce:0kB free_pcp:1384kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.014501] Node 0 DMA32 free:22820kB min:5564kB low:6952kB high:8344kB active_anon:56852kB inactive_anon:11976kB active_file:1142936kB inactive_file:582040kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:160kB writeback:0kB mapped:10796kB shmem:16640kB slab_reclaimable:48608kB slab_unreclaimable:41912kB kernel_stack:16560kB pagetables:30876kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.293371] Node 0 DMA32 free:15244kB min:5564kB low:6952kB high:8344kB active_anon:82496kB inactive_anon:11976kB active_file:1110984kB inactive_file:467400kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:9440kB shmem:16640kB slab_reclaimable:53684kB slab_unreclaimable:72536kB kernel_stack:40048kB pagetables:67672kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12 all_unreclaimable? no
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.774181] Node 0 DMA32 free:19780kB min:5564kB low:6952kB high:8344kB active_anon:68264kB inactive_anon:17816kB active_file:1155724kB inactive_file:470216kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:8kB writeback:0kB mapped:9744kB shmem:24708kB slab_reclaimable:52540kB slab_unreclaimable:63216kB kernel_stack:32464kB pagetables:51856kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.804341] Node 0 DMA32 free:22076kB min:5564kB low:6952kB high:8344kB active_anon:47616kB inactive_anon:17816kB active_file:1063272kB inactive_file:685848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:216kB writeback:0kB mapped:9708kB shmem:24708kB slab_reclaimable:48536kB slab_unreclaimable:36844kB kernel_stack:12048kB pagetables:25992kB unstable:0kB bounce:0kB free_pcp:776kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.828224] Node 0 DMA32 free:15544kB min:5564kB low:6952kB high:8344kB active_anon:63184kB inactive_anon:17784kB active_file:1215752kB inactive_file:468872kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021100kB mlocked:0kB dirty:312kB writeback:0kB mapped:9116kB shmem:24708kB slab_reclaimable:49912kB slab_unreclaimable:50068kB kernel_stack:21600kB pagetables:42384kB unstable:0kB bounce:0kB free_pcp:1364kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.054731] Node 0 DMA32 free:20512kB min:5564kB low:6952kB high:8344kB active_anon:76796kB inactive_anon:23792kB active_file:1053836kB inactive_file:618588kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1656kB writeback:0kB mapped:19768kB shmem:32784kB slab_reclaimable:49000kB slab_unreclaimable:47636kB kernel_stack:21664kB pagetables:37188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.075472] Node 0 DMA32 free:17856kB min:5564kB low:6952kB high:8344kB active_anon:117872kB inactive_anon:25588kB active_file:1022532kB inactive_file:466856kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:420kB writeback:0kB mapped:25300kB shmem:40976kB slab_reclaimable:57804kB slab_unreclaimable:79416kB kernel_stack:46784kB pagetables:78044kB unstable:0kB bounce:0kB free_pcp:1100kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.853099] Node 0 DMA32 free:22524kB min:5564kB low:6952kB high:8344kB active_anon:79156kB inactive_anon:24876kB active_file:872972kB inactive_file:738900kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:25708kB shmem:40976kB slab_reclaimable:50820kB slab_unreclaimable:62880kB kernel_stack:32048kB pagetables:49656kB unstable:0kB bounce:0kB free_pcp:524kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.261759] Node 0 DMA32 free:15888kB min:5564kB low:6952kB high:8344kB active_anon:89928kB inactive_anon:23780kB active_file:1295512kB inactive_file:358284kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1608kB writeback:0kB mapped:25376kB shmem:40976kB slab_reclaimable:47972kB slab_unreclaimable:50848kB kernel_stack:22320kB pagetables:42360kB unstable:0kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.367409] Node 0 DMA32 free:15320kB min:5564kB low:6952kB high:8344kB active_anon:76364kB inactive_anon:28712kB active_file:1061180kB inactive_file:596956kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:20kB writeback:0kB mapped:27700kB shmem:49168kB slab_reclaimable:51236kB slab_unreclaimable:51096kB kernel_stack:22912kB pagetables:40920kB unstable:0kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.207543] Node 0 DMA32 free:15224kB min:5564kB low:6952kB high:8344kB active_anon:115956kB inactive_anon:28392kB active_file:1117532kB inactive_file:359656kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:29348kB shmem:49168kB slab_reclaimable:56028kB slab_unreclaimable:85168kB kernel_stack:48592kB pagetables:81620kB unstable:0kB bounce:0kB free_pcp:1124kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.663881] Node 0 DMA32 free:12820kB min:5564kB low:6952kB high:8344kB active_anon:98924kB inactive_anon:27520kB active_file:1105780kB inactive_file:494760kB unevictable:0kB isolated(anon):4kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1412kB writeback:12kB mapped:29588kB shmem:49168kB slab_reclaimable:49836kB slab_unreclaimable:60524kB kernel_stack:32176kB pagetables:50356kB unstable:0kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:388 all_unreclaimable? no
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.871655] Node 0 DMA32 free:18432kB min:5564kB low:6952kB high:8344kB active_anon:99156kB inactive_anon:26780kB active_file:1150532kB inactive_file:408872kB unevictable:0kB isolated(anon):68kB isolated(file):80kB present:2080640kB managed:2021100kB mlocked:0kB dirty:3492kB writeback:0kB mapped:30924kB shmem:49168kB slab_reclaimable:54236kB slab_unreclaimable:68184kB kernel_stack:37392kB pagetables:63708kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.447006] Node 0 DMA32 free:12684kB min:5564kB low:6952kB high:8344kB active_anon:74296kB inactive_anon:25960kB active_file:1086404kB inactive_file:605660kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:264kB writeback:0kB mapped:30604kB shmem:49168kB slab_reclaimable:50200kB slab_unreclaimable:45212kB kernel_stack:19184kB pagetables:34500kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.426931] Node 0 DMA32 free:15396kB min:5564kB low:6952kB high:8344kB active_anon:98836kB inactive_anon:32120kB active_file:964808kB inactive_file:666224kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:33628kB shmem:57332kB slab_reclaimable:51048kB slab_unreclaimable:51824kB kernel_stack:23328kB pagetables:41896kB unstable:0kB bounce:0kB free_pcp:988kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.780159] Node 0 DMA32 free:8876kB min:5564kB low:6952kB high:8344kB active_anon:86544kB inactive_anon:31516kB active_file:965016kB inactive_file:654444kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:8408kB shmem:57332kB slab_reclaimable:48856kB slab_unreclaimable:61116kB kernel_stack:30224kB pagetables:48636kB unstable:0kB bounce:0kB free_pcp:980kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:260 all_unreclaimable? no
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.087913] Node 0 DMA32 free:22760kB min:5564kB low:6952kB high:8344kB active_anon:94544kB inactive_anon:38936kB active_file:1013576kB inactive_file:564976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:36096kB shmem:65376kB slab_reclaimable:52196kB slab_unreclaimable:60576kB kernel_stack:29888kB pagetables:56364kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.014180] Node 0 DMA32 free:11232kB min:5564kB low:6952kB high:8344kB active_anon:82868kB inactive_anon:38872kB active_file:1189912kB inactive_file:439592kB unevictable:0kB isolated(anon):12kB isolated(file):40kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:1152kB mapped:35948kB shmem:65376kB slab_reclaimable:51224kB slab_unreclaimable:56664kB kernel_stack:27696kB pagetables:43180kB unstable:0kB bounce:0kB free_pcp:380kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731001] Node 0 DMA32 free:15056kB min:5564kB low:6952kB high:8344kB active_anon:90532kB inactive_anon:42716kB active_file:1204248kB inactive_file:377196kB unevictable:0kB isolated(anon):12kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5552kB shmem:73568kB slab_reclaimable:52956kB slab_unreclaimable:68304kB kernel_stack:39936kB pagetables:47472kB unstable:0kB bounce:0kB free_pcp:624kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070851] Node 0 DMA32 free:8852kB min:5564kB low:6952kB high:8344kB active_anon:90412kB inactive_anon:42664kB active_file:1179304kB inactive_file:371316kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5544kB shmem:73568kB slab_reclaimable:55136kB slab_unreclaimable:80080kB kernel_stack:55456kB pagetables:52692kB unstable:0kB bounce:0kB free_pcp:312kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:348 all_unreclaimable? no
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245192] Node 0 DMA32 free:10832kB min:5564kB low:6952kB high:8344kB active_anon:97756kB inactive_anon:42664kB active_file:1082048kB inactive_file:417012kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5248kB shmem:73568kB slab_reclaimable:62816kB slab_unreclaimable:88964kB kernel_stack:61408kB pagetables:62908kB unstable:0kB bounce:0kB free_pcp:696kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB
----------

Since I cannot reproduce the workload that caused December 24's natural OOM
killers, I used the following stressor to generate a similar situation.

fileio.c fills up all memory with file cache and tries to keep that cache in
memory; it also exec()s ./fork, so both programs are expected to be built in
the same directory. fork.c generates a flood of order-2 allocations, because
December 24's OOM killers were triggered by copy_process(), which involves an
order-2 allocation request.

---------- fileio.c start ----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

int main(int argc, char *argv[])
{
	int i;
	static char buffer[4096];
	signal(SIGCHLD, SIG_IGN);
	for (i = 0; i < 2; i++) {
		int fd;
		int j;
		snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
		fd = open(buffer, O_RDWR | O_CREAT, 0600);
		memset(buffer, 0, sizeof(buffer));
		for (j = 0; j < 1048576 * 1000 / 4096; j++) /* 1000 is MemTotal / 2 */
			write(fd, buffer, sizeof(buffer));
		close(fd);
	}
	for (i = 0; i < 2; i++) {
		if (fork() == 0) {
			int fd;
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
			fd = open(buffer, O_RDWR);
			memset(buffer, 0, sizeof(buffer));
			while (fd != EOF) {
				lseek(fd, 0, SEEK_SET);
				while (read(fd, buffer, sizeof(buffer)) == sizeof(buffer));
			}
			_exit(0);
		}
	}
	if (fork() == 0) {
		execl("./fork", "./fork", NULL);
		_exit(1);
	}
	if (fork() == 0) {
		sleep(1);
		execl("./fork", "./fork", NULL);
		_exit(1);
	}
	while (1)
		system("pidof fork | wc");
	return 0;
}
---------- fileio.c end ----------

---------- fork.c start ----------
#include <unistd.h>
#include <signal.h>

int main(int argc, char *argv[])
{
	int i;
	signal(SIGCHLD, SIG_IGN);
	while (1) {
		sleep(5);
		for (i = 0; i < 2000; i++) {
			if (fork() == 0) {
				sleep(3);
				_exit(0);
			}
		}
	}
}
---------- fork.c end ----------

This reproducer also showed that once the OOM killer is invoked, subsequent
OOM killers tend to follow shortly afterwards because the file cache does not
decrease.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 12:08     ` Tetsuo Handa
@ 2015-12-28 14:13       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-28 14:13 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
>
> Since I cannot establish workload that caused December 24's natural OOM
> killers, I used the following stressor for generating similar situation.
>

I have come to feel that I am observing a different problem, one which is
currently hidden behind the "too small to fail" memory-allocation rule. That
is, tasks requesting order > 0 pages continuously lose the competition when
tasks requesting order = 0 pages dominate, because reclaimed pages are stolen
by the order = 0 requests before they can be combined into order > 0 pages
(or maybe order > 0 pages are immediately split back into order = 0 pages by
the order = 0 requests).

Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations retry implicitly
unless the allocating task is chosen by the OOM killer. Therefore, even if a
task requesting order = 2 pages loses the competition to tasks requesting
order = 0 pages, the order = 2 allocation request is implicitly retried and
the OOM killer is not invoked (though there is the problem that tasks
requesting order > 0 allocations will stall for as long as the order = 0
requests dominate).

But this patchset introduces a limit of 16 retries. Thus, if a task requesting
order = 2 pages loses the competition 16 times to tasks requesting order = 0
pages, it invokes the OOM killer. To avoid the OOM killer, we need to make
sure that pages reclaimed for order > 0 allocations are not stolen by tasks
requesting order = 0 allocations.

Is my feeling plausible?
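
To illustrate what I mean, here is a toy user-space model of that scenario
(purely illustrative; the numbers and the "everything gets stolen" pattern are
my assumptions, not kernel code):

---------- retrymodel.c start ----------
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* the cap this patchset introduces */

int main(void)
{
	int free_pages = 0;
	int retries;

	for (retries = 0; retries <= MAX_RECLAIM_RETRIES; retries++) {
		free_pages += 3;	/* reclaim made some progress... */
		free_pages -= 3;	/* ...but order-0 requests stole it all */
		if (free_pages >= 4) {	/* enough pages left for an order-2 block */
			printf("order-2 allocation would succeed\n");
			return 0;
		}
	}
	printf("retry limit reached: the OOM killer would be invoked\n");
	return 1;
}
---------- retrymodel.c end ----------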

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-24 12:41   ` Tetsuo Handa
@ 2015-12-29 16:27     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-29 16:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 24-12-15 21:41:19, Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers reasonable? Too weak against fragmentation?

I will have a look at the oom report more closely early next week (I am
still in holiday mode), but it would be good to compare how the same load
behaves with the original implementation. It would also be interesting to
see how stable the results are (is there any variability across multiple
runs?).

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 12:08     ` Tetsuo Handa
@ 2015-12-29 16:32       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-29 16:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
> 
> Well, current patch invokes OOM killers when more than 75% of memory is used
> for file cache (active_file: + inactive_file:). I think this is a surprising
> thing for administrators and we want to retry more harder (but not forever,
> please).

Here again, it would be good to see a comparison between the original and
the new behavior. 75% of memory in page cache is certainly unexpected, but
those pages might be pinned for other reasons and therefore unreclaimable and
basically IO bound. This is hard to optimize for without causing undesirable
side effects for other loads. I will have a look at the oom reports later,
but having a comparison would be a great start.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-29 16:32       ` Michal Hocko
@ 2015-12-30 15:05         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-30 15:05 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > Do you think these OOM killers reasonable? Too weak against fragmentation?
> > 
> > Well, current patch invokes OOM killers when more than 75% of memory is used
> > for file cache (active_file: + inactive_file:). I think this is a surprising
> > thing for administrators and we want to retry more harder (but not forever,
> > please).
> 
> Here again, it would be good to see what is the comparision between
> the original and the new behavior. 75% of a page cache is certainly
> unexpected but those pages might be pinned for other reasons and so
> unreclaimable and basically IO bound. This is hard to optimize for
> without causing any undesirable side effects for other loads. I will
> have a look at the oom reports later but having a comparision would be
> a great start.

Prior to the "mm, oom: rework oom detection" patch (the original), this
stressor never invoked the OOM killer. After this patch (the new), this
stressor easily invokes the OOM killer. In both the original and the new
case, active_file: + inactive_file: occupies nearly 75% of memory. I think we
lost the invisible retry logic for order > 0 allocation requests.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-30 15:05         ` Tetsuo Handa
@ 2016-01-02 15:47           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-02 15:47 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > > Do you think these OOM killers reasonable? Too weak against fragmentation?
> > > 
> > > Well, current patch invokes OOM killers when more than 75% of memory is used
> > > for file cache (active_file: + inactive_file:). I think this is a surprising
> > > thing for administrators and we want to retry more harder (but not forever,
> > > please).
> > 
> > Here again, it would be good to see what is the comparision between
> > the original and the new behavior. 75% of a page cache is certainly
> > unexpected but those pages might be pinned for other reasons and so
> > unreclaimable and basically IO bound. This is hard to optimize for
> > without causing any undesirable side effects for other loads. I will
> > have a look at the oom reports later but having a comparision would be
> > a great start.
> 
> Prior to "mm, oom: rework oom detection" patch (the original), this stressor
> never invoked the OOM killer. After this patch (the new), this stressor easily
> invokes the OOM killer. Both the original and the new case, active_file: +
> inactive_file: occupies nearly 75%. I think we lost invisible retry logic for
> order > 0 allocation requests.
> 

I retested with the debug printk() patch below.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d70a80..e433504 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3014,7 +3014,7 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress,
+		     unsigned long did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3024,8 +3024,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
 	 */
-	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+	if (no_progress_loops > MAX_RECLAIM_RETRIES) {
+		printk(KERN_INFO "Reached MAX_RECLAIM_RETRIES.\n");
 		return false;
+	}
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
@@ -3086,6 +3088,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 
 			return true;
 		}
+		printk(KERN_INFO "zone=%s reclaimable=%lu available=%lu no_progress_loops=%u did_some_progress=%lu\n",
+		       zone->name, reclaimable, available, no_progress_loops, did_some_progress);
 	}
 
 	return false;
@@ -3273,7 +3277,7 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
----------

The output showed that __zone_watermark_ok() returning false on both the
DMA32 and DMA zones is what triggers the OOM killer invocation. Direct
reclaim is constantly reclaiming some pages, but I guess the freelists for
2 <= order < MAX_ORDER are empty. That trigger was introduced by commit
97a16fc82a7c5b0c ("mm, page_alloc: only enforce watermarks for order-0
allocations"), and the "mm, oom: rework oom detection" patch hits it.
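
To make the trigger concrete, here is a simplified user-space sketch of that
order-aware watermark check as I understand it (a model based on the commit
description, not the kernel code; the numbers mimic one of the DMA32 buddy
list dumps quoted earlier, where only the order-0 list holds anything):

---------- wmarkmodel.c start ----------
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 11

/* Order-0 only needs the total free pages above the watermark; order > 0
 * additionally needs at least one free block of that order or larger. */
static bool watermark_ok_model(long free_pages, long min_wmark,
			       const long free_blocks[MAX_ORDER], int order)
{
	int o;

	if (free_pages <= min_wmark)
		return false;		/* not enough free memory at all */
	if (order == 0)
		return true;
	for (o = order; o < MAX_ORDER; o++)
		if (free_blocks[o])
			return true;	/* a large enough block exists */
	return false;			/* free memory, but too fragmented */
}

int main(void)
{
	/* roughly "3794*4kB 0*8kB 0*16kB ..." with min:5564kB (= 1391 pages) */
	long free_blocks[MAX_ORDER] = { 3794 };
	long free_pages = 3794;

	printf("order-0 ok: %d\n", watermark_ok_model(free_pages, 1391, free_blocks, 0));
	printf("order-2 ok: %d\n", watermark_ok_model(free_pages, 1391, free_blocks, 2));
	return 0;
}
---------- wmarkmodel.c end ----------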

----------
[  154.547143] zone=DMA32 reclaimable=323478 available=325894 no_progress_loops=0 did_some_progress=58
[  154.551119] zone=DMA32 reclaimable=323153 available=325770 no_progress_loops=0 did_some_progress=58
[  154.571983] zone=DMA32 reclaimable=319582 available=322161 no_progress_loops=0 did_some_progress=56
[  154.576121] zone=DMA32 reclaimable=319647 available=322016 no_progress_loops=0 did_some_progress=56
[  154.583523] zone=DMA32 reclaimable=319467 available=321801 no_progress_loops=0 did_some_progress=55
[  154.593948] zone=DMA32 reclaimable=317400 available=320988 no_progress_loops=0 did_some_progress=56
[  154.730880] zone=DMA32 reclaimable=312385 available=313952 no_progress_loops=0 did_some_progress=48
[  154.733226] zone=DMA32 reclaimable=312337 available=313919 no_progress_loops=0 did_some_progress=48
[  154.737270] zone=DMA32 reclaimable=312417 available=313871 no_progress_loops=0 did_some_progress=48
[  154.739569] zone=DMA32 reclaimable=312369 available=313844 no_progress_loops=0 did_some_progress=48
[  154.743195] zone=DMA32 reclaimable=312385 available=313790 no_progress_loops=0 did_some_progress=48
[  154.745534] zone=DMA32 reclaimable=312365 available=313813 no_progress_loops=0 did_some_progress=48
[  154.748431] zone=DMA32 reclaimable=312272 available=313728 no_progress_loops=0 did_some_progress=48
[  154.750973] zone=DMA32 reclaimable=312273 available=313760 no_progress_loops=0 did_some_progress=48
[  154.753503] zone=DMA32 reclaimable=312289 available=313958 no_progress_loops=0 did_some_progress=48
[  154.753584] zone=DMA32 reclaimable=312241 available=313958 no_progress_loops=0 did_some_progress=48
[  154.753660] zone=DMA32 reclaimable=312193 available=313958 no_progress_loops=0 did_some_progress=48
[  154.781574] zone=DMA32 reclaimable=312147 available=314095 no_progress_loops=0 did_some_progress=48
[  154.784281] zone=DMA32 reclaimable=311539 available=314015 no_progress_loops=0 did_some_progress=49
[  154.786639] zone=DMA32 reclaimable=311498 available=314040 no_progress_loops=0 did_some_progress=49
[  154.788761] zone=DMA32 reclaimable=311432 available=314040 no_progress_loops=0 did_some_progress=49
[  154.791047] zone=DMA32 reclaimable=311366 available=314040 no_progress_loops=0 did_some_progress=49
[  154.793388] zone=DMA32 reclaimable=311300 available=314040 no_progress_loops=0 did_some_progress=49
[  154.795802] zone=DMA32 reclaimable=311153 available=314006 no_progress_loops=0 did_some_progress=49
[  154.804685] zone=DMA32 reclaimable=309950 available=313140 no_progress_loops=0 did_some_progress=49
[  154.807039] zone=DMA32 reclaimable=309867 available=313138 no_progress_loops=0 did_some_progress=49
[  154.809440] zone=DMA32 reclaimable=309761 available=313080 no_progress_loops=0 did_some_progress=49
[  154.811583] zone=DMA32 reclaimable=309735 available=313120 no_progress_loops=0 did_some_progress=49
[  154.814090] zone=DMA32 reclaimable=309561 available=313068 no_progress_loops=0 did_some_progress=49
[  154.817381] zone=DMA32 reclaimable=309463 available=313030 no_progress_loops=0 did_some_progress=49
[  154.824387] zone=DMA32 reclaimable=309414 available=313030 no_progress_loops=0 did_some_progress=49
[  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
[  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
[  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[  154.841167] fork cpuset=/ mems_allowed=0
[  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
[  154.844308] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  154.846654]  0000000000000000 0000000045061c6b ffff88007a5dbb00 ffffffff81398b83
[  154.848559]  0000000000000000 ffff88007a5dbba0 ffffffff811bc81c 0000000000000206
[  154.850488]  ffffffff818104b0 ffff88007a5dbb40 ffffffff810bdd79 0000000000000206
[  154.852386] Call Trace:
[  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
[  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
[  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
[  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
[  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
[  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
[  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
[  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
[  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
[  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
[  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
[  154.875357]  [<ffffffff8106d441>] copy_process.part.31+0x131/0x1b40
[  154.877845]  [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[  154.880397]  [<ffffffff8106f01b>] _do_fork+0xdb/0x5d0
[  154.882259]  [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[  154.884722]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  154.887201]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  154.889666]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[  154.891519]  [<ffffffff8106f594>] SyS_clone+0x14/0x20
[  154.893059]  [<ffffffff816feeb2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  154.894859] Mem-Info:
[  154.895851] active_anon:31807 inactive_anon:2093 isolated_anon:0
[  154.895851]  active_file:242656 inactive_file:67266 isolated_file:0
[  154.895851]  unevictable:0 dirty:8 writeback:0 unstable:0
[  154.895851]  slab_reclaimable:15100 slab_unreclaimable:20839
[  154.895851]  mapped:1681 shmem:2162 pagetables:18491 bounce:0
[  154.895851]  free:4243 free_pcp:343 free_cma:0
[  154.905459] Node 0 DMA free:6908kB min:44kB low:52kB high:64kB active_anon:3408kB inactive_anon:120kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:64kB shmem:124kB slab_reclaimable:872kB slab_unreclaimable:3032kB kernel_stack:176kB pagetables:328kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  154.916097] lowmem_reserve[]: 0 1714 1714 1714
[  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB active_anon:121688kB inactive_anon:8252kB active_file:970620kB inactive_file:269060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758944kB mlocked:0kB dirty:32kB writeback:0kB mapped:6660kB shmem:8524kB slab_reclaimable:59528kB slab_unreclaimable:80460kB kernel_stack:47312kB pagetables:70972kB unstable:0kB bounce:0kB free_pcp:1356kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  154.929908] lowmem_reserve[]: 0 0 0 0
[  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
[  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
[  154.941617] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  154.944167] 312171 total pagecache pages
[  154.945926] 0 pages in swap cache
[  154.947521] Swap cache stats: add 0, delete 0, find 0/0
[  154.949436] Free swap  = 0kB
[  154.950920] Total swap = 0kB
[  154.952531] 524157 pages RAM
[  154.954063] 0 pages HighMem/MovableOnly
[  154.955785] 80445 pages reserved
[  154.957362] 0 pages hwpoisoned
----------
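
To illustrate why __zone_watermark_ok() can fail for order-2 here even though
plenty of order-0 memory is free, here is a minimal userspace sketch of the
two-step check that commit 97a16fc82a7c5b0c left in place, as I understand it.
The structure, function name and per-order counts are simplified and partly made
up for illustration (the order >= 2 counts are set to zero to match the guess
above); this is not the real kernel code.

----------
/* Simplified illustration of the post-97a16fc82a7c5b0c watermark check. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 11

struct fake_zone {
	long free_pages;		/* total free pages (all orders) */
	long min_wmark;			/* min watermark, in pages */
	long lowmem_reserve;		/* reserve for this classzone */
	long nr_free[MAX_ORDER];	/* free blocks per order */
};

static bool watermark_ok(struct fake_zone *z, unsigned int order)
{
	long free = z->free_pages - ((1L << order) - 1);
	unsigned int o;

	/* Step 1: the order-0 watermark check on the free page count. */
	if (free <= z->min_wmark + z->lowmem_reserve)
		return false;
	if (!order)
		return true;

	/* Step 2: a high-order request additionally needs at least one
	 * free block of order >= the requested order. */
	for (o = order; o < MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

int main(void)
{
	/* Roughly the DMA32 numbers from the report: ~18MB free, but assume
	 * (per the guess above) nothing left on the order >= 2 freelists. */
	struct fake_zone dma32 = {
		.free_pages = 4499, .min_wmark = 1293, .lowmem_reserve = 0,
		.nr_free = { 1113, 1400, 0 },
	};

	printf("order-0: %s\n", watermark_ok(&dma32, 0) ? "ok" : "fail");
	printf("order-2: %s\n", watermark_ok(&dma32, 2) ? "ok" : "fail");
	return 0;
}
----------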

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 14:13       ` Tetsuo Handa
@ 2016-01-06 12:44         ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-01-06 12:44 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

On 12/28/2015 03:13 PM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Tetsuo Handa wrote:
>> > I got OOM killers while running heavy disk I/O (extracting kernel source,
>> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
>> > Do you think these OOM killers reasonable? Too weak against fragmentation?
>>
>> Since I cannot establish workload that caused December 24's natural OOM
>> killers, I used the following stressor for generating similar situation.
>>
> 
> I came to feel that I am observing a different problem which is currently
> hidden behind the "too small to fail" memory-allocation rule. That is, tasks
> requesting order > 0 pages are continuously losing the competition when
> tasks requesting order = 0 pages dominate, for reclaimed pages are stolen
> by tasks requesting order = 0 pages before reclaimed pages are combined to
> order > 0 pages (or maybe order > 0 pages are immediately split into
> order = 0 pages due to tasks requesting order = 0 pages).

Hm, I would expect that as long as there are some reserves left that your
reproducer cannot grab, there are some free pages left, and the allocator should
thus preserve the order-2 pages that form, since order-0 allocations will take
existing order-0 pages before splitting higher orders. Compaction should also be
able to successfully assemble order-2 pages without racing allocators, thanks to
per-cpu caching (but I'd have to check).
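
To make that assumption explicit, here is a minimal sketch of the "smallest
sufficient block first" behaviour being relied on (the real code is
__rmqueue_smallest()/expand(); the per-order counts below are just the DMA32
freelist counts from Tetsuo's report, everything else is illustrative):

----------
#include <stdio.h>

#define MAX_ORDER 11

/* Free blocks per order, e.g. from the DMA32 line of the OOM report. */
static long nr_free[MAX_ORDER] = { 1113, 1400, 116, 15, 1 };

/* Serve a request of @order; returns the order of the block actually taken. */
static int take_block(unsigned int order)
{
	for (unsigned int o = order; o < MAX_ORDER; o++) {
		if (!nr_free[o])
			continue;
		nr_free[o]--;
		/* Split the block down to @order; each freed buddy goes back
		 * on the next lower freelist. */
		for (unsigned int cur = o; cur > order; cur--)
			nr_free[cur - 1]++;
		return (int)o;
	}
	return -1;	/* nothing large enough is free */
}

int main(void)
{
	/* An order-0 request is served from the order-0 list and leaves the
	 * order >= 2 blocks alone... */
	printf("order-0 request took an order-%d block\n", take_block(0));
	/* ...while an order-2 request takes an order >= 2 block directly. */
	printf("order-2 request took an order-%d block\n", take_block(2));
	return 0;
}
----------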

So I think the problem is not the higher-order pages themselves, but that order-2
needs 4 pages and thus has to pass a slightly higher watermark, putting it at a
disadvantage to order-0 allocations. I would therefore expect the order-2 pages
to be there, but not available for allocation due to watermarks.

> Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations implicitly retry
> unless chosen by the OOM killer. Therefore, even if tasks requesting
> order = 2 pages lost the competition when there are tasks requesting
> order = 0 pages, the order = 2 allocation request is implicitly retried
> and therefore the OOM killer is not invoked (though there is a problem that
> tasks requesting order > 0 allocation will stall as long as tasks requesting
> order = 0 pages dominate).
> 
> But this patchset introduced a limit of 16 retries. Thus, if tasks requesting
> order = 2 pages lost the competition for 16 times due to tasks requesting
> order = 0 pages, tasks requesting order = 2 pages invoke the OOM killer.
> To avoid the OOM killer, we need to make sure that pages reclaimed for
> order > 0 allocations will not be stolen by tasks requesting order = 0
> allocations.
> 
> Is my feeling plausible?
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 14:13       ` Tetsuo Handa
@ 2016-01-08 12:37         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-08 12:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Mon 28-12-15 23:13:31, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > Do you think these OOM killers reasonable? Too weak against fragmentation?
> >
> > Since I cannot establish workload that caused December 24's natural OOM
> > killers, I used the following stressor for generating similar situation.
> >
> 
> I came to feel that I am observing a different problem which is currently
> hidden behind the "too small to fail" memory-allocation rule. That is, tasks
> requesting order > 0 pages are continuously losing the competition when
> tasks requesting order = 0 pages dominate, for reclaimed pages are stolen
> by tasks requesting order = 0 pages before reclaimed pages are combined to
> order > 0 pages (or maybe order > 0 pages are immediately split into
> order = 0 pages due to tasks requesting order = 0 pages).
> 
> Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations implicitly retry
> unless chosen by the OOM killer. Therefore, even if tasks requesting
> order = 2 pages lost the competition when there are tasks requesting
> order = 0 pages, the order = 2 allocation request is implicitly retried
> and therefore the OOM killer is not invoked (though there is a problem that
> tasks requesting order > 0 allocation will stall as long as tasks requesting
> order = 0 pages dominate).

Yes, this is possible and nothing new. High-order allocations (even small
orders) are never free and are more expensive than order-0. I have seen the
OOM killer strike while there were megs of free memory on a larger machine,
just because of high fragmentation.

> But this patchset introduced a limit of 16 retries.

We retry 16 times _only_ if the reclaim hasn't made _any_ progress, which
means it hasn't reclaimed a single page. We can still fail due to the
watermark check for the required order, but I think this is correct and
desirable behavior because there is no guarantee that lower-order pages will
get coalesced after more retries. The primary point of this rework is to make
the whole thing more deterministic.
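
In other words, the accounting looks roughly like the sketch below (a minimal
userspace rendition of the retry accounting as described above, with a made-up
fake_reclaim() stand-in; the real loop lives in __alloc_pages_slowpath() and
also rechecks the watermarks against the reclaimable estimate before retrying):

----------
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

/* Stand-in for direct reclaim: some progress early on, then nothing. */
static unsigned long fake_reclaim(int round)
{
	return round < 5 ? 50 : 0;
}

int main(void)
{
	int no_progress_loops = 0;

	for (int round = 0; ; round++) {
		unsigned long progress = fake_reclaim(round);

		if (progress)
			no_progress_loops = 0;	/* any progress resets the counter */
		else
			no_progress_loops++;

		if (no_progress_loops > MAX_RECLAIM_RETRIES) {
			printf("round %d: no progress %d times in a row -> OOM path\n",
			       round, no_progress_loops);
			break;
		}
	}
	return 0;
}
----------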

So we can see some OOM reports for high orders (<COSTLY) which would have
survived before just because we retried so many times that we ended up
allocating that single high-order page, but that was pure luck and
non-deterministic behavior. That being said, I agree we might end up doing
some more tuning for non-costly high-order allocations, but it should be
bounded as well and based on failures in some reasonable workloads. I haven't
gotten to the OOM reports you have posted yet but I definitely plan to check
them soon.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2015-12-15 18:19   ` Michal Hocko
@ 2016-01-14 22:58     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-14 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Tue, 15 Dec 2015, Michal Hocko wrote:

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 457181844b6e..738ae2206635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>  
>  /* linux/mm/vmscan.c */
> +extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e267faad4649..f77e283fb8c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
>  	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
>  }
>  
> +/*
> + * Maximum number of reclaim retries without any progress before OOM killer
> + * is consider as the only way to move forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
> + * Checks whether it makes sense to retry the reclaim to make a forward progress
> + * for the given allocation request.
> + * The reclaim feedback represented by did_some_progress (any progress during
> + * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
> + * pages) and no_progress_loops (number of reclaim rounds without any progress
> + * in a row) is considered as well as the reclaimable pages on the applicable
> + * zone list (with a backoff mechanism which is a function of no_progress_loops).
> + *
> + * Returns true if a retry is viable or false to enter the oom path.
> + */
> +static inline bool
> +should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> +		     struct alloc_context *ac, int alloc_flags,
> +		     bool did_some_progress, unsigned long pages_reclaimed,
> +		     int no_progress_loops)
> +{
> +	struct zone *zone;
> +	struct zoneref *z;
> +
> +	/*
> +	 * Make sure we converge to OOM if we cannot make any progress
> +	 * several times in the row.
> +	 */
> +	if (no_progress_loops > MAX_RECLAIM_RETRIES)
> +		return false;
> +
> +	/* Do not retry high order allocations unless they are __GFP_REPEAT */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER) {
> +		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
> +			return false;
> +
> +		if (did_some_progress)
> +			return true;
> +	}
> +
> +	/*
> +	 * Keep reclaiming pages while there is a chance this will lead somewhere.
> +	 * If none of the target zones can satisfy our allocation request even
> +	 * if all reclaimable pages are considered then we are screwed and have
> +	 * to go OOM.
> +	 */
> +	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
> +		unsigned long available;
> +
> +		available = zone_reclaimable_pages(zone);
> +		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
> +		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +		/*
> +		 * Would the allocation succeed if we reclaimed the whole available?
> +		 */
> +		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +				ac->high_zoneidx, alloc_flags, available)) {
> +			/* Wait for some write requests to complete then retry */
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +			return true;
> +		}
> +	}

Tetsuo's log of an early oom in this thread shows that this check is 
wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
with only ZONE_DMA and ZONE_DMA32:

	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50

and the watermarks:

	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
	lowmem_reserve[]: 0 1714 1714 1714
	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
	lowmem_reserve[]: 0 0 0 0

and the scary thing is that this triggers when no_progress_loops == 0, so 
this is the first time trying the allocation after progress has been made.

Watermarks clearly indicate that memory is available; the problem is 
fragmentation for the order-2 allocation.  This is not a situation we want to 
immediately call the oom killer to solve, since we have no guarantee it is 
going to free contiguous memory (in fact it wouldn't be used at all for 
PAGE_ALLOC_COSTLY_ORDER).

There is order-2 memory available however:

	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB

The failure for ZONE_DMA makes sense given the lowmem_reserve ratio; it is 
oom for this allocation.  ZONE_DMA32 is not, however.

I'm wondering if this has to do with the z->nr_reserved_highatomic 
estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
to 1%, or 20806kB.  That failure would make sense if free is 17996kB.
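
Spelling out the arithmetic behind that guess (this assumes my reading of the
4.4-era __zone_watermark_ok(), which subtracts z->nr_reserved_highatomic from
the free pages for requests without ALLOC_HARDER; the snippet is purely
illustrative):

----------
#include <stdio.h>

int main(void)
{
	/* Numbers from the DMA32 zone in Tetsuo's report. */
	long present_kb = 2080640, free_kb = 17996, min_kb = 5172;
	long reserve_cap_kb = present_kb / 100;		/* 1% cap: ~20806kB */

	/* If nr_reserved_highatomic sat at its cap, the watermark check
	 * would effectively see free - reserve. */
	long effective_kb = free_kb - reserve_cap_kb;
	printf("effective free %ldkB vs min %ldkB -> %s\n",
	       effective_kb, min_kb, effective_kb > min_kb ? "ok" : "fail");
	return 0;
}
----------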

Tetsuo, would it be possible to try your workload with just this match and 
also show z->nr_reserved_highatomic?

This patch would need to at least have knowledge of the heuristics used by 
__zone_watermark_ok() since it's making an inference on reclaimability 
based on numbers that include pageblocks that are reserved from usage.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-14 22:58     ` David Rientjes
@ 2016-01-16  1:07       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-16  1:07 UTC (permalink / raw)
  To: rientjes, mhocko
  Cc: akpm, torvalds, hannes, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel, mhocko

David Rientjes wrote:
> Tetsuo's log of an early oom in this thread shows that this check is 
> wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
> with only ZONE_DMA and ZONE_DMA32:
> 
> 	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> 	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> 
> and the watermarks:
> 
> 	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
> 	lowmem_reserve[]: 0 1714 1714 1714
> 	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
> 	lowmem_reserve[]: 0 0 0 0
> 
> and the scary thing is that this triggers when no_progress_loops == 0, so 
> this is the first time trying the allocation after progress has been made.
> 
> Watermarks clearly indicate that memory is available, the problem is 
> fragmentation for the order-2 allocation.  This is not a situation where 
> we want to immediately call the oom killer to solve since we have no 
> guarantee it is going to free contiguous memory (in fact it wouldn't be 
> used at all for PAGE_ALLOC_COSTLY_ORDER).
> 
> There is order-2 memory available however:
> 
> 	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> 
> The failure for ZONE_DMA makes sense for the lowmem_reserve ratio, it's 
> oom for this allocation.  ZONE_DMA32 is not, however.
> 
> I'm wondering if this has to do with the z->nr_reserved_highatomic 
> estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
> to 1%, or 20806kB.  That failure would make sense if free is 17996kB.
> 
> Tetsuo, would it be possible to try your workload with just this match and 
> also show z->nr_reserved_highatomic?

I don't know what "try your workload with just this match" is asking for, but
zone->nr_reserved_highatomic is always 0.

----------
[  178.058803] zone=DMA32 reclaimable=367474 available=369923 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  178.061350] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  178.132174] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:3256kB inactive_anon:172kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:56kB shmem:180kB slab_reclaimable:2056kB slab_unreclaimable:1096kB kernel_stack:192kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  178.145589] Node 0 DMA32 free:11532kB min:5564kB low:6952kB high:8344kB active_anon:133896kB inactive_anon:8204kB active_file:1001828kB inactive_file:462944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:8kB writeback:0kB mapped:8572kB shmem:8468kB slab_reclaimable:57136kB slab_unreclaimable:86380kB kernel_stack:50080kB pagetables:83600kB unstable:0kB bounce:0kB free_pcp:1268kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[  198.457718] zone=DMA32 reclaimable=381991 available=386237 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  198.460111] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  198.507204] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:3088kB inactive_anon:172kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:180kB slab_reclaimable:976kB slab_unreclaimable:1468kB kernel_stack:672kB pagetables:336kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  198.507209] Node 0 DMA32 free:19992kB min:5564kB low:6952kB high:8344kB active_anon:104176kB inactive_anon:8204kB active_file:905320kB inactive_file:617264kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021064kB mlocked:0kB dirty:176kB writeback:0kB mapped:12772kB shmem:8468kB slab_reclaimable:60372kB slab_unreclaimable:77856kB kernel_stack:44144kB pagetables:69180kB unstable:0kB bounce:0kB free_pcp:1104kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  198.647075] zone=DMA32 reclaimable=374429 available=378945 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  198.647076] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  198.652177] Node 0 DMA free:7928kB min:40kB low:48kB high:60kB active_anon:588kB inactive_anon:172kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:88kB shmem:180kB slab_reclaimable:1008kB slab_unreclaimable:2576kB kernel_stack:1840kB pagetables:408kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  198.652182] Node 0 DMA32 free:17608kB min:5564kB low:6952kB high:8344kB active_anon:89528kB inactive_anon:8204kB active_file:1025084kB inactive_file:472512kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021064kB mlocked:0kB dirty:176kB writeback:0kB mapped:12848kB shmem:8468kB slab_reclaimable:60372kB slab_unreclaimable:86628kB kernel_stack:50880kB pagetables:82336kB unstable:0kB bounce:0kB free_pcp:236kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  207.045450] zone=DMA32 reclaimable=386923 available=392299 no_progress_loops=0 did_some_progress=38 nr_reserved_highatomic=0
[  207.045451] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=38 nr_reserved_highatomic=0
[  207.050241] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:732kB inactive_anon:336kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:140kB shmem:436kB slab_reclaimable:456kB slab_unreclaimable:3536kB kernel_stack:1584kB pagetables:188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  207.050246] Node 0 DMA32 free:20092kB min:5564kB low:6952kB high:8344kB active_anon:91600kB inactive_anon:18620kB active_file:921896kB inactive_file:626544kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:964kB writeback:0kB mapped:17016kB shmem:24584kB slab_reclaimable:51908kB slab_unreclaimable:72792kB kernel_stack:40832kB pagetables:67396kB unstable:0kB bounce:0kB free_pcp:472kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  221.034713] zone=DMA32 reclaimable=389283 available=393245 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0
[  221.037103] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0
[  221.105952] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:416kB inactive_anon:304kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:132kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:3156kB kernel_stack:2352kB pagetables:212kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  221.119016] Node 0 DMA32 free:7220kB min:5564kB low:6952kB high:8344kB active_anon:74480kB inactive_anon:23544kB active_file:946560kB inactive_file:618900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:1056kB writeback:0kB mapped:14760kB shmem:32768kB slab_reclaimable:51328kB slab_unreclaimable:75692kB kernel_stack:42960kB pagetables:66732kB unstable:0kB bounce:0kB free_pcp:196kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:248 all_unreclaimable? no
[  224.072875] zone=DMA32 reclaimable=397667 available=401058 no_progress_loops=0 did_some_progress=56 nr_reserved_highatomic=0
[  224.075212] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=56 nr_reserved_highatomic=0
[  224.133813] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:664kB inactive_anon:296kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:3760kB kernel_stack:1136kB pagetables:376kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  224.145691] Node 0 DMA32 free:12160kB min:5564kB low:6952kB high:8344kB active_anon:69352kB inactive_anon:23140kB active_file:1191992kB inactive_file:399408kB unevictable:0kB isolated(anon):0kB isolated(file):104kB present:2080640kB managed:2021064kB mlocked:0kB dirty:844kB writeback:0kB mapped:4916kB shmem:32768kB slab_reclaimable:51288kB slab_unreclaimable:68392kB kernel_stack:38560kB pagetables:61820kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  234.291285] zone=DMA32 reclaimable=403563 available=407626 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0
[  234.293557] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0
[  234.357091] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:312kB inactive_anon:296kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:144kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:2596kB kernel_stack:2992kB pagetables:204kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  234.370106] Node 0 DMA32 free:6804kB min:5564kB low:6952kB high:8344kB active_anon:77364kB inactive_anon:23140kB active_file:1168356kB inactive_file:454384kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:2021064kB mlocked:0kB dirty:0kB writeback:0kB mapped:11884kB shmem:32768kB slab_reclaimable:51292kB slab_unreclaimable:61492kB kernel_stack:32016kB pagetables:49248kB unstable:0kB bounce:0kB free_pcp:760kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:696 all_unreclaimable? no
[  246.183836] zone=DMA32 reclaimable=405496 available=410200 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  246.186069] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  246.246157] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:1144kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:2404kB kernel_stack:1392kB pagetables:660kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  246.260159] Node 0 DMA32 free:11564kB min:5564kB low:6952kB high:8344kB active_anon:74360kB inactive_anon:23036kB active_file:1173248kB inactive_file:456000kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:732kB writeback:0kB mapped:14812kB shmem:32768kB slab_reclaimable:51292kB slab_unreclaimable:59884kB kernel_stack:31824kB pagetables:47960kB unstable:0kB bounce:0kB free_pcp:136kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  258.994846] zone=DMA32 reclaimable=403441 available=407544 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  258.997488] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  259.055818] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:848kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:2692kB kernel_stack:1872kB pagetables:476kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  259.067950] Node 0 DMA32 free:29136kB min:5564kB low:6952kB high:8344kB active_anon:71476kB inactive_anon:23032kB active_file:1129276kB inactive_file:485324kB unevictable:0kB isolated(anon):0kB isolated(file):112kB present:2080640kB managed:2021064kB mlocked:0kB dirty:0kB writeback:0kB mapped:14340kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:61680kB kernel_stack:34704kB pagetables:44856kB unstable:0kB bounce:0kB free_pcp:1996kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  271.392099] zone=DMA32 reclaimable=399774 available=406049 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  271.394646] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  271.459049] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:832kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:2824kB kernel_stack:2320kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  271.472413] Node 0 DMA32 free:21848kB min:5564kB low:6952kB high:8344kB active_anon:77144kB inactive_anon:23032kB active_file:1148420kB inactive_file:462308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:664kB writeback:0kB mapped:14700kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:61672kB kernel_stack:32064kB pagetables:50888kB unstable:0kB bounce:0kB free_pcp:848kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  274.428858] zone=DMA32 reclaimable=404186 available=408756 no_progress_loops=0 did_some_progress=52 nr_reserved_highatomic=0
[  274.431146] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=52 nr_reserved_highatomic=0
[  274.487864] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:600kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:3504kB kernel_stack:1120kB pagetables:532kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  274.499779] Node 0 DMA32 free:17040kB min:5564kB low:6952kB high:8344kB active_anon:60480kB inactive_anon:23032kB active_file:1277956kB inactive_file:339528kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021064kB mlocked:0kB dirty:664kB writeback:0kB mapped:5912kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:64216kB kernel_stack:37520kB pagetables:52096kB unstable:0kB bounce:0kB free_pcp:308kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160116.txt.xz .

> 
> This patch would need to at least have knowledge of the heuristics used by 
> __zone_watermark_ok() since it's making an inference on reclaimability 
> based on numbers that include pageblocks that are reserved from usage.
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-16  1:07       ` Tetsuo Handa
@ 2016-01-19 22:48         ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-19 22:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel, mhocko

On Sat, 16 Jan 2016, Tetsuo Handa wrote:

> > Tetsuo's log of an early oom in this thread shows that this check is 
> > wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
> > with only ZONE_DMA and ZONE_DMA32:
> > 
> > 	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > 	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > 
> > and the watermarks:
> > 
> > 	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
> > 	lowmem_reserve[]: 0 1714 1714 1714
> > 	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
> > 	lowmem_reserve[]: 0 0 0 0
> > 
> > and the scary thing is that this triggers when no_progress_loops == 0, so 
> > this is the first time trying the allocation after progress has been made.
> > 
> > Watermarks clearly indicate that memory is available, the problem is 
> > fragmentation for the order-2 allocation.  This is not a situation where 
> > we want to immediately call the oom killer to solve since we have no 
> > guarantee it is going to free contiguous memory (in fact it wouldn't be 
> > used at all for PAGE_ALLOC_COSTLY_ORDER).
> > 
> > There is order-2 memory available however:
> > 
> > 	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> > 
> > The failure for ZONE_DMA makes sense for the lowmem_reserve ratio, it's 
> > oom for this allocation.  ZONE_DMA32 is not, however.
> > 
> > I'm wondering if this has to do with the z->nr_reserved_highatomic 
> > estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
> > to 1%, or 20806kB.  That failure would make sense if free is 17996kB.
> > 
> > Tetsuo, would it be possible to try your workload with just this match and 
> > also show z->nr_reserved_highatomic?
> 
> I don't know what "try your workload with just this match" expects, but
> zone->nr_reserved_highatomic is always 0.
> 

My point about z->nr_reserved_highatomic still stands: pageblocks may
be reserved from allocation, so __zone_watermark_ok() may fail for this
patch's calculation of "available", which would cause a premature oom
condition.  It may not have caused a problem on your specific workload,
however.
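
For reference, the two checks involved look roughly like this (a
simplified sketch of the 4.4-era __zone_watermark_ok(), not the exact
mainline code; ALLOC_HIGH, CMA and a few other details are omitted):

----------
/*
 * Simplified sketch: the two independent ways __zone_watermark_ok()
 * can report failure for a high-order request.
 */
static bool zone_watermark_ok_sketch(struct zone *z, unsigned int order,
				     unsigned long mark, int classzone_idx,
				     int alloc_flags, long free_pages)
{
	long min = mark;
	int o, mt;

	/*
	 * Requests without ALLOC_HARDER cannot dip into the highatomic
	 * reserve, so it is subtracted up front; this is how reserved
	 * pageblocks shrink the usable part of "available".
	 */
	if (!(alloc_flags & ALLOC_HARDER))
		free_pages -= z->nr_reserved_highatomic;

	/* 1) the order-0 watermark plus lowmem_reserve for this class */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;

	if (!order)
		return true;

	/* 2) at least one free block of a usable migratetype >= order */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (!list_empty(&area->free_list[mt]))
				return true;
	}
	return false;
}
----------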

Are you able to precisely identify why __zone_watermark_ok() is failing 
and triggering the oom in the log you posted January 3?

[  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
[  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
// here //
[  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[  154.841167] fork cpuset=/ mems_allowed=0
[  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
...
[  154.852386] Call Trace:
[  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
[  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
[  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
[  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
[  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
[  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
[  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
[  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
[  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
[  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
[  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
// and also here, if we didn't serialize the oom killer //

I think that would help in fixing the issue you reported.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-19 22:48         ` David Rientjes
@ 2016-01-20 11:13           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-20 11:13 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel, mhocko

David Rientjes wrote:
> Are you able to precisely identify why __zone_watermark_ok() is failing 
> and triggering the oom in the log you posted January 3?
> 
> [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> // here //
> [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> [  154.841167] fork cpuset=/ mems_allowed=0
> [  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
> ...
> [  154.852386] Call Trace:
> [  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
> [  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
> [  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
> [  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
> [  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
> [  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
> [  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
> [  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
> [  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
> [  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
> [  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
> // and also here, if we didn't serialize the oom killer //
> 
> I think that would help in fixing the issue you reported.
> 
Does "why __zone_watermark_ok() is failing" mean "which 'return false;' statement
in __zone_watermark_ok() I'm hitting on my specific workload"? Then, answer is
the former for DMA zone and the latter for DMA32 zone.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d70a80..dd36f01 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2390,7 +2390,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  */
 static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
-			long free_pages)
+				long free_pages, bool *no_free)
 {
 	long min = mark;
 	int o;
@@ -2423,6 +2423,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 	 * are not met, then a high-order request also cannot go ahead
 	 * even if a suitable page happened to be free.
 	 */
+	*no_free = false;
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 
@@ -2453,26 +2454,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		}
 #endif
 	}
+	*no_free = true;
 	return false;
 }
 
 bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
+	bool unused;
+
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
-					zone_page_state(z, NR_FREE_PAGES));
+				   zone_page_state(z, NR_FREE_PAGES), &unused);
 }
 
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx)
 {
+	bool unused;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
 		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
-								free_pages);
+				   free_pages, &unused);
 }
 
 #ifdef CONFIG_NUMA
@@ -3014,7 +3019,7 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress,
+		     unsigned long did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3024,8 +3029,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
 	 */
-	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+	if (no_progress_loops > MAX_RECLAIM_RETRIES) {
+		printk(KERN_INFO "Reached MAX_RECLAIM_RETRIES.\n");
 		return false;
+	}
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
@@ -3039,6 +3046,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 			ac->high_zoneidx, ac->nodemask) {
+		bool no_free;
 		unsigned long available;
 		unsigned long reclaimable;
 
@@ -3052,7 +3060,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 * available?
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac->high_zoneidx, alloc_flags, available)) {
+					ac->high_zoneidx, alloc_flags, available, &no_free)) {
 			unsigned long writeback;
 			unsigned long dirty;
 
@@ -3086,6 +3094,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 
 			return true;
 		}
+		printk(KERN_INFO "zone=%s reclaimable=%lu available=%lu no_progress_loops=%u did_some_progress=%lu nr_reserved_highatomic=%lu no_free=%u\n",
+		       zone->name, reclaimable, available, no_progress_loops, did_some_progress, zone->nr_reserved_highatomic, no_free);
 	}
 
 	return false;
@@ -3273,7 +3283,7 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160120.txt.xz .
----------
[  141.987548] zone=DMA32 reclaimable=367085 available=371232 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
[  141.990091] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
[  141.997360] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  142.055908] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:3208kB inactive_anon:188kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:60kB shmem:188kB slab_reclaimable:2792kB slab_unreclaimable:360kB kernel_stack:224kB pagetables:260kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  142.066690] lowmem_reserve[]: 0 1970 1970 1970

[  142.914557] zone=DMA32 reclaimable=345975 available=348821 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=1
[  142.914558] zone=DMA reclaimable=2 available=1980 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=0
[  142.921113] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  153.615466] zone=DMA32 reclaimable=385567 available=389678 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=1
[  153.615467] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=0
[  153.620507] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  153.658621] zone=DMA32 reclaimable=384064 available=388833 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=1
[  153.658623] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=0
[  153.663401] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  159.614894] zone=DMA32 reclaimable=356635 available=361925 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
[  159.614895] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
[  159.622374] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  164.781516] zone=DMA32 reclaimable=393457 available=397561 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=1
[  164.781518] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=0
[  164.786560] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  171.006952] zone=DMA32 reclaimable=405821 available=410137 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=1
[  171.006954] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=0
[  171.010690] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  171.030121] zone=DMA32 reclaimable=405016 available=409801 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=1
[  171.030123] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=0
[  171.033530] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  184.631660] zone=DMA32 reclaimable=356652 available=359338 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
[  184.634207] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
[  184.642800] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  190.499877] zone=DMA32 reclaimable=382152 available=384996 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
[  190.499878] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
[  190.504901] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  196.146728] zone=DMA32 reclaimable=371941 available=374605 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0 no_free=1
[  196.146730] zone=DMA reclaimable=1 available=1982 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0 no_free=0
[  196.152546] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  201.837825] zone=DMA32 reclaimable=364569 available=370359 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0 no_free=1
[  201.837826] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0 no_free=0
[  201.844879] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  212.862325] zone=DMA32 reclaimable=381542 available=387785 no_progress_loops=0 did_some_progress=39 nr_reserved_highatomic=0 no_free=1
[  212.862327] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=39 nr_reserved_highatomic=0 no_free=0
[  212.866857] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  212.866914] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:440kB inactive_anon:196kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:8kB writeback:0kB mapped:0kB shmem:280kB slab_reclaimable:480kB slab_unreclaimable:3856kB kernel_stack:1776kB pagetables:240kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  212.866915] lowmem_reserve[]: 0 1970 1970 1970
----------

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-02 15:47           ` Tetsuo Handa
@ 2016-01-20 12:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-20 12:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Sun 03-01-16 00:47:30, Tetsuo Handa wrote:
[...]
> The output showed that __zone_watermark_ok() returning false on both DMA32 and DMA
> zones is the trigger of the OOM killer invocation. Direct reclaim is constantly
> reclaiming some pages, but I guess freelist for 2 <= order < MAX_ORDER are empty.

Yes and this is to be expected. Direct reclaim doesn't guarantee any
progress for high order allocations. We might be reclaiming pages which
cannot be coalesced.
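
To make that concrete, here is a toy illustration (not kernel code): an
order-2 block needs four physically contiguous, naturally aligned free
pages, and the buddy allocator only merges a page with its aligned
buddy, so reclaiming scattered order-0 pages does not necessarily
produce one.

----------
#include <stdbool.h>

/* toy model: the buddy of a pfn at a given order differs in one bit */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

/*
 * page_is_free() stands in for the real free-list lookup.  A single
 * in-use page inside the aligned 4-page group prevents an order-2
 * block from forming, regardless of how much reclaim freed elsewhere.
 */
static bool can_form_order2(unsigned long pfn,
			    bool (*page_is_free)(unsigned long pfn))
{
	unsigned long base = pfn & ~3UL;	/* align down to 4 pages */
	int i;

	for (i = 0; i < 4; i++)
		if (!page_is_free(base + i))
			return false;
	return true;
}
----------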

> That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> patch hits the trigger.
[....]
> [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> [  154.841167] fork cpuset=/ mems_allowed=0
[...]
> [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
[...]
> [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB

It is really strange that __zone_watermark_ok claimed DMA32 unusable
here. The target of 312734 should easily pass the wmark check for the
particular order, and there are 116*16kB 15*32kB 1*64kB blocks "usable"
for our request because GFP_KERNEL can use both Unmovable and Movable
blocks. So it makes sense to wait for more order-0 allocations to pass
the basic (NR_FREE_PAGES) watermark and continue with this particular
allocation request.
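
For reference, after commit 97a16fc82a7c5b0c the check boils down to roughly
the following (a simplified sketch of __zone_watermark_ok with the
ALLOC_HIGH/ALLOC_HARDER and CMA details trimmed, so treat it as an
approximation rather than the exact code):

	/* normal requests cannot use the highatomic reserves */
	free_pages -= z->nr_reserved_highatomic;

	/* order-0 watermark, including the lowmem reserve for this request */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;

	if (!order)
		return true;

	/* high-order: need at least one free block of a usable migratetype */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];
		int mt;

		if (!area->nr_free)
			continue;

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (!list_empty(&area->free_list[mt]))
				return true;
	}
	/* no block of the requested order or larger -> zone deemed unusable */
	return false;

Since the free list dump above shows 16kB and larger blocks on both Unmovable
and Movable lists, the loop should have returned true, which is what makes
the reported failure surprising.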

The nr_reserved_highatomic reserve might have been high enough to matter
but then you see in [1] the reserve being 0. So this doesn't make much
sense to me. I will dig into it some more.

[1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-20 11:13           ` Tetsuo Handa
@ 2016-01-20 13:13             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-20 13:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Wed 20-01-16 20:13:32, Tetsuo Handa wrote:
[...]
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160120.txt.xz .

> [  141.987548] zone=DMA32 reclaimable=367085 available=371232 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1

Ok, so we really do not have _any_ pages on the order 2+ free lists and
that is why __zone_watermark_ok failed.

> [  141.990091] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0

DMA zone is not even interesting because it is fully protected by the
lowmem reserves.
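
(Concretely, the order-0 part of the watermark check adds the lowmem reserve
on top of the mark, roughly

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;

so with a lowmem_reserve[] of about 1970 pages for this kind of request, as
in the DMA zone report earlier in the thread, essentially all of the ~1982
available DMA pages are off limits to GFP_KERNEL anyway.)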

> [  141.997360] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  142.086897] Node 0 DMA32: 1796*4kB (M) 763*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13288kB

And indeed we still do not have any order-2+ available. OOM seems
reasonable.

> [  142.914557] zone=DMA32 reclaimable=345975 available=348821 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=1
> [  142.914558] zone=DMA reclaimable=2 available=1980 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=0
> [  142.921113] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  142.921192] Node 0 DMA32: 1794*4kB (UME) 464*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB

Ditto

> [  153.615466] zone=DMA32 reclaimable=385567 available=389678 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=1
> [  153.615467] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=0
> [  153.620507] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  153.620582] Node 0 DMA32: 1241*4kB (UME) 1280*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15204kB

Ditto

> [  153.658621] zone=DMA32 reclaimable=384064 available=388833 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=1
> [  153.658623] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=0
> [  153.663401] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  153.663480] Node 0 DMA32: 554*4kB (UME) 2148*8kB (UM) 3*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19448kB

Now we have __zone_watermark_ok claiming no order-2+ blocks available
but the oom report a little bit later sees 3 such blocks. This suggests
that it is just a matter of timing: children exit and free their kernel
stacks, which are order-2 allocations, in between the two checks.

> [  159.614894] zone=DMA32 reclaimable=356635 available=361925 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
> [  159.614895] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
> [  159.622374] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  159.622451] Node 0 DMA32: 2141*4kB (UM) 1435*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20044kB

Again no high order pages.

> [  164.781516] zone=DMA32 reclaimable=393457 available=397561 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=1
> [  164.781518] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=0
> [  164.786560] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  164.786643] Node 0 DMA32: 2961*4kB (UME) 432*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15300kB

Ditto

> [  184.631660] zone=DMA32 reclaimable=356652 available=359338 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
> [  184.634207] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
> [  184.642800] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  184.728695] Node 0 DMA32: 3144*4kB (UME) 971*8kB (UME) 43*16kB (UM) 3*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21128kB

Again we have order >= 2 pages available here after the allocator has
seen none earlier, and the pattern repeats later on. So I would say
that in this particular load it is the timing which plays the decisive
role. I am not sure we can tune for such a load because any difference
in the timing would result in a different behavior and basically break
such a tuning.

The current heuristic is based on the assumption that retrying high
order allocations only makes sense if they are hidden behind the min
watermark and the currently reclaimable pages would get us above that
watermark. We cannot assume that the order-0 reclaimable pages will form
the required high order blocks because there is no such guarantee. I
think such a heuristic makes sense because we have already passed both
direct reclaim and compaction by the time we check for the retry, so
the chances of getting the required block from reclaim are not that high.
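
To make that concrete, the retry check introduced by "mm, oom: rework oom
detection" boils down to roughly the following (a simplified sketch; the
nodemask iteration, the final backoff details and other bookkeeping are
trimmed, so take it as an approximation of the patch rather than the exact
code):

	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	for_each_zone_zonelist(zone, z, ac->zonelist, ac->high_zoneidx) {
		unsigned long available;

		/* what could still be reclaimed for this request ... */
		available = zone_reclaimable_pages(zone);
		/* ... scaled down the longer we fail to make progress ... */
		available -= DIV_ROUND_UP(no_progress_loops * available,
					  MAX_RECLAIM_RETRIES);
		/* ... on top of what is free right now */
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac->high_zoneidx, alloc_flags,
					available))
			return true;	/* still hope, keep retrying reclaim */
	}
	return false;			/* no zone can make it, go OOM */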

So I am not really sure what to do here now. On one hand the previous
heuristic would probably happen to work better here because we would be
looping in the allocator, exiting processes would reset the counter and
keep the retries going, and sooner or later the fork would get lucky,
see its order-2 block and continue. We could starve in this state for a
basically unbounded amount of time though, which is exactly what I would
like to get rid of. I guess we might want to give a few retry attempts
to all order > 0 requests. Let me think about it some more.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-20 12:24             ` Michal Hocko
@ 2016-01-27 23:18               ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-27 23:18 UTC (permalink / raw)
  To: Michal Hocko, Joonsoo Kim
  Cc: Tetsuo Handa, Andrew Morton, torvalds, hannes, mgorman, hillf.zj,
	Kamezawa Hiroyuki, linux-mm, linux-kernel

On Wed, 20 Jan 2016, Michal Hocko wrote:

> > That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> > enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> > patch hits the trigger.
> [....]
> > [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> > [  154.841167] fork cpuset=/ mems_allowed=0
> [...]
> > [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
> [...]
> > [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> > [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> 
> It is really strange that __zone_watermark_ok claimed DMA32 unusable
> here. With the target of 312734 which should easily pass the wmark
> check for the particular order and there are 116*16kB 15*32kB 1*64kB
> blocks "usable" for our request because GFP_KERNEL can use both
> Unmovable and Movable blocks. So it makes sense to wait for more order-0
> allocations to pass the basic (NR_FREE_MEMORY) watermark and continue
> with this particular allocation request.
> 
> The nr_reserved_highatomic might be too high to matter but then you see
> [1] the reserve being 0. So this doesn't make much sense to me. I will
> dig into it some more.
> 
> [1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp

There's another issue in the use of zone_reclaimable_pages().  I think 
should_reclaim_retry() using zone_page_state_snapshot() is appropriate, 
as I indicated before, but notice that zone_reclaimable_pages() only uses 
zone_page_state().  This means that the heuristic is based on some 
up-to-date members and some stale members.  If we are relying on 
NR_ISOLATED_* to be accurate in zone_reclaimable_pages(), for example, 
then it may take up to 1s for the per-cpu counters to be folded back in, 
and we may quickly exhaust the retry counter in should_reclaim_retry() 
before that happens.
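
For context, the difference between the two accessors is roughly the
following (a sketch modeled on include/linux/vmstat.h of that era, with the
CONFIG_SMP ifdefs dropped, so treat it as an approximation):

	/* may be ~1s stale: per-cpu deltas are only folded in by vmstat_update */
	static inline unsigned long zone_page_state(struct zone *zone,
						    enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);

		return x < 0 ? 0 : x;
	}

	/* precise but more expensive: walk all cpus and add their pending deltas */
	static inline unsigned long zone_page_state_snapshot(struct zone *zone,
							     enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
		int cpu;

		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

		return x < 0 ? 0 : x;
	}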

This is the same issue that Joonsoo reported with the use of 
zone_page_state(NR_ISOLATED_*) in the too_many_isolated() loops of reclaim 
and compaction.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2015-12-15 18:19 ` Michal Hocko
@ 2016-01-28 20:40   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom has been doing a get_page_from_freelist attempt
with the ALLOC_WMARK_HIGH target before going to out_of_memory and
invoking the oom killer. There are two reasons for this, as explained
by Andrea:
"
: the reason for the high wmark is to reduce the likelihood of livelocks
: and be sure to invoke the OOM killer, if we're still under pressure
: and reclaim just failed. The high wmark is used to be sure the failure
: of reclaim isn't going to be ignored. If using the min wmark like
: you propose there's risk of livelock or anyway of delayed OOM killer
: invocation.
:
: The reason for doing one last wmark check (regardless of the wmark
: used) before invoking the oom killer, was just to be sure another OOM
: killer invocation hasn't already freed a ton of memory while we were
: stuck in reclaim. A lot of free memory generated by the OOM killer,
: won't make a parallel reclaim more likely to succeed, it just creates
: free memory, but reclaim only succeeds when it finds "freeable" memory
: and it makes progress in converting it to free memory. So for the
: purpose of this last check, the high wmark would work fine as lots of
: free memory would have been generated in such case.
"

This is no longer a concern after "mm, oom: rework oom detection"
because should_reclaim_retry performs the watermark check right before
__alloc_pages_may_oom is invoked. Remove the last-moment allocation
request as it just makes the code more confusing and doesn't really
serve any purpose: a success is basically impossible, otherwise
should_reclaim_retry would have forced the reclaim to retry instead. So
this is merely a code cleanup rather than a functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 268de1654128..f82941c0ac4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2743,16 +2743,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure.
-	 */
-	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
-		goto out;
-
 	if (!(gfp_mask & __GFP_NOFAIL)) {
 		/* Coredumps can quickly deplete all memory reserves */
 		if (current->flags & PF_DUMPCORE)
-- 
2.7.0.rc3

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-27 23:18               ` David Rientjes
@ 2016-01-28 21:19                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 21:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Tetsuo Handa, Andrew Morton, torvalds, hannes,
	mgorman, hillf.zj, Kamezawa Hiroyuki, linux-mm, linux-kernel

On Wed 27-01-16 15:18:11, David Rientjes wrote:
> On Wed, 20 Jan 2016, Michal Hocko wrote:
> 
> > > That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> > > enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> > > patch hits the trigger.
> > [....]
> > > [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > > [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > > [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> > > [  154.841167] fork cpuset=/ mems_allowed=0
> > [...]
> > > [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
> > [...]
> > > [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> > > [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> > 
> > It is really strange that __zone_watermark_ok claimed DMA32 unusable
> > here. With the target of 312734 which should easily pass the wmark
> > check for the particular order and there are 116*16kB 15*32kB 1*64kB
> > blocks "usable" for our request because GFP_KERNEL can use both
> > Unmovable and Movable blocks. So it makes sense to wait for more order-0
> > allocations to pass the basic (NR_FREE_MEMORY) watermark and continue
> > with this particular allocation request.
> > 
> > The nr_reserved_highatomic might be too high to matter but then you see
> > [1] the reserve being 0. So this doesn't make much sense to me. I will
> > dig into it some more.
> > 
> > [1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp
> 
> There's another issue in the use of zone_reclaimable_pages().  I think 
> should_reclaim_retry() using zone_page_state_snapshot() is appropriate, 
> as I indicated before, but notice that zone_reclaimable_pages() only uses 
> zone_page_state().  It means that the heuristic is based on some 
> up-to-date members and some stale members.  If we are relying on 
> NR_ISOLATED_* to be accurate, for example, in zone_reclaimable_pages(), 
> then it may take up to 1s for that to actually occur and may quickly 
> exhaust the retry counter in should_reclaim_retry() before that happens.

You are right. I will post a patch to fix that.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2015-12-15 18:19 ` Michal Hocko
@ 2016-01-28 21:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

zone_reclaimable_pages is used in should_reclaim_retry which uses it to
calculate the target for the watermark check. This means that precise
numbers are important for the correct decision. zone_reclaimable_pages
uses zone_page_state which can contain stale data with per-cpu diffs
not synced yet (the last vmstat_update might have run 1s in the past).

Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
of the current callers is in a hot path where getting the precise value
(which involves per-cpu iteration) would cause an unreasonable overhead.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 489212252cd6..9145e3f89eab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -196,21 +196,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE) +
-	     zone_page_state(zone, NR_ISOLATED_FILE);
+	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
+	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
+	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON) +
-		      zone_page_state(zone, NR_ISOLATED_ANON);
+		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
+		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
+		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
 
 	return nr;
 }
 
 bool zone_reclaimable(struct zone *zone)
 {
-	return zone_page_state(zone, NR_PAGES_SCANNED) <
+	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
 		zone_reclaimable_pages(zone) * 6;
 }
 
-- 
2.7.0.rc3

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 20:40   ` Michal Hocko
@ 2016-01-28 21:36     ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2016-01-28 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, Jan 28, 2016 at 09:40:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_may_oom has been doing get_page_from_freelist with
> ALLOC_WMARK_HIGH target before going out_of_memory and invoking the oom
> killer. This has two reasons as explained by Andrea:
> "
> : the reason for the high wmark is to reduce the likelihood of livelocks
> : and be sure to invoke the OOM killer, if we're still under pressure
> : and reclaim just failed. The high wmark is used to be sure the failure
> : of reclaim isn't going to be ignored. If using the min wmark like
> : you propose there's risk of livelock or anyway of delayed OOM killer
> : invocation.
> :
> : The reason for doing one last wmark check (regardless of the wmark
> : used) before invoking the oom killer, was just to be sure another OOM
> : killer invocation hasn't already freed a ton of memory while we were
> : stuck in reclaim. A lot of free memory generated by the OOM killer,
> : won't make a parallel reclaim more likely to succeed, it just creates
> : free memory, but reclaim only succeeds when it finds "freeable" memory
> : and it makes progress in converting it to free memory. So for the
> : purpose of this last check, the high wmark would work fine as lots of
> : free memory would have been generated in such case.
> "
> 
> This is no longer a concern after "mm, oom: rework oom detection"
> because should_reclaim_retry performs the water mark check right before
> __alloc_pages_may_oom is invoked. Remove the last moment allocation
> request as it just makes the code more confusing and doesn't really
> serve any purpose because a success is basically impossible otherwise
> should_reclaim_retry would force the reclaim to retry. So this is
> merely a code cleanup rather than a functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

The check has to happen while holding the OOM lock, otherwise we'll
end up killing much more than necessary when there are many racing
allocations.

Please drop this patch.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 21:36     ` Johannes Weiner
@ 2016-01-28 23:19       ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-28 23:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, 28 Jan 2016, Johannes Weiner wrote:

> The check has to happen while holding the OOM lock, otherwise we'll
> end up killing much more than necessary when there are many racing
> allocations.
> 

Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
acquired.

The situation is still somewhat fragile, but I think it's tangential to 
this patch series.  If the ALLOC_WMARK_HIGH allocation fails because an 
oom victim hasn't freed its memory yet, and the TIF_MEMDIE thread then 
isn't visible during the oom killer's tasklist scan because it has 
already exited, we still end up killing more than we should.  The 
likelihood of this happening grows with the length of the tasklist.

Perhaps we should try testing watermarks after a victim has been selected 
and immediately before killing?  (Aside: we actually carry an internal 
patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
victim because we have been hit with this before in the memcg path.)

I would think that a successful retry with ALLOC_WMARK_HIGH would indicate 
enough free memory to deem that we aren't going to immediately reenter an 
oom condition, so the deferred killing would be a waste of time.

The downside is how sloppy this would be because it's blurring the line 
between oom killer and page allocator.  We'd need the oom killer to return 
the selected victim to the page allocator, try the allocation, and then 
call oom_kill_process() if necessary.
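
Purely as an illustration of that flow (every helper below except
get_page_from_freelist is a hypothetical name, not an existing interface):

	/* hypothetical sketch only: select first, re-check memory, kill last */
	victim = oom_select_victim(oc);			/* hypothetical helper */
	if (!victim)
		return false;

	/* one more attempt; selection and any parallel kills took some time */
	page = get_page_from_freelist(gfp_mask, order,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET, ac);
	if (page) {
		oom_put_victim(victim);			/* hypothetical helper */
		goto got_pg;	/* enough memory showed up, skip the kill */
	}

	oom_kill_victim(oc, victim);			/* hypothetical helper */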

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-28 23:20     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-28 23:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, 28 Jan 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:19       ` David Rientjes
@ 2016-01-28 23:51         ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2016-01-28 23:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.
> 
> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

https://lkml.org/lkml/2015/3/25/40

We could have out_of_memory() wait until the number of outstanding OOM
victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
the lock until its kill has been finalized:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 914451a..4dc5b9d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
 		 */
-		schedule_timeout_killable(1);
+		if (!test_thread_flag(TIF_MEMDIE))
+			wait_event_timeout(oom_victims_wait,
+					   !atomic_read(&oom_victims), HZ);
 	}
 	return true;
 }

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-28 23:51         ` Johannes Weiner
  0 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2016-01-28 23:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.
> 
> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

https://lkml.org/lkml/2015/3/25/40

We could have out_of_memory() wait until the number of outstanding OOM
victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
the lock until its kill has been finalized:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 914451a..4dc5b9d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
 		 */
-		schedule_timeout_killable(1);
+		if (!test_thread_flag(TIF_MEMDIE))
+			wait_event_timeout(oom_victims_wait,
+					   !atomic_read(&oom_victims), HZ);
 	}
 	return true;
 }

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-29  3:41     ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-01-29  3:41 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 489212252cd6..9145e3f89eab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -196,21 +196,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>  	unsigned long nr;
> 
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE) +
> -	     zone_page_state(zone, NR_ISOLATED_FILE);
> +	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> 
>  	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON) +
> -		      zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> 
>  	return nr;
>  }
> 
>  bool zone_reclaimable(struct zone *zone)
>  {
> -	return zone_page_state(zone, NR_PAGES_SCANNED) <
> +	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
>  		zone_reclaimable_pages(zone) * 6;
>  }
> 
> --
> 2.7.0.rc3
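
For reference, a rough sketch of what the _snapshot variant does
differently from plain zone_page_state(): it also folds in the per-cpu
deltas that have not yet been synced into the global counter, which is
why it is more precise but more expensive. This is only modeled on the
vmstat helpers of kernels from this era, not a verbatim copy, and the
field names are approximate:

static unsigned long zone_page_state_snapshot_sketch(struct zone *zone,
						enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
	int cpu;

	/* fold in the per-cpu diffs that vmstat_update has not synced yet */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}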

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
@ 2016-01-29  3:41     ` Hillf Danton
  0 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-01-29  3:41 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 489212252cd6..9145e3f89eab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -196,21 +196,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>  	unsigned long nr;
> 
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE) +
> -	     zone_page_state(zone, NR_ISOLATED_FILE);
> +	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> 
>  	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON) +
> -		      zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> 
>  	return nr;
>  }
> 
>  bool zone_reclaimable(struct zone *zone)
>  {
> -	return zone_page_state(zone, NR_PAGES_SCANNED) <
> +	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
>  		zone_reclaimable_pages(zone) * 6;
>  }
> 
> --
> 2.7.0.rc3

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-29 10:35     ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:35 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 

I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
was forgotten. Anyway,

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
@ 2016-01-29 10:35     ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:35 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 

I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
was forgotten. Anyway,

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:51         ` Johannes Weiner
@ 2016-01-29 10:39           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:39 UTC (permalink / raw)
  To: mhocko, hannes, rientjes
  Cc: akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu, linux-mm,
	linux-kernel, mhocko

Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> >
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > >
> >
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been
> > acquired.
> >
> > The situation is still somewhat fragile, however, but I think it's
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE
> > thread isn't visible during the oom killer's tasklist scan because it has
> > exited, we still end up killing more than we should.  The likelihood of
> > this happening grows with the length of the tasklist.
> >
> > Perhaps we should try testing watermarks after a victim has been selected
> > and immediately before killing?  (Aside: we actually carry an internal
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a
> > victim because we have been hit with this before in the memcg path.)

Yes. Moving the final test to after an OOM victim has been selected can
reduce the possibility of killing more OOM victims than we need. But
unfortunately, it is likely that memory becomes available (i.e.
get_page_from_freelist() succeeds) while dump_header() is printing the OOM
messages, because printk() is a slow operation compared to selecting a
victim. This happens much later than the moment the victim cleared
TIF_MEMDIE.

We can avoid killing more OOM victims than we need if we move the final
test to after the OOM messages have been printed, but then we cannot avoid
printing OOM messages even when we do not kill a victim. Maybe this is not
a problem if we do

  pr_err("But did not kill any process ...")

instead of

  do_send_sig_info(SIGKILL);
  mark_oom_victim();
  pr_err("Killed process %d (%s) ...")

when the final test succeeds.
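
A hypothetical sketch of that flow, just to make the ordering concrete
(this is not an actual patch; last_moment_alloc_would_succeed() is a
made-up placeholder for a final ALLOC_WMARK_HIGH watermark test or a
mem_cgroup_margin() check, and the helper signatures only approximate
the ones in mm/oom_kill.c):

static void oom_kill_selected_victim(struct oom_control *oc,
				     struct task_struct *victim)
{
	/* slow: many printk() calls, memory may show up meanwhile */
	dump_header(oc, victim);

	/* hypothetical final check, done after the messages were printed */
	if (last_moment_alloc_would_succeed(oc)) {
		pr_err("Out of memory, but did not kill any process\n");
		return;
	}

	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	mark_oom_victim(victim);
	pr_err("Killed process %d (%s)\n",
	       task_pid_nr(victim), victim->comm);
}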

> >
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory
> > to deem that we aren't going to immediately reenter an oom condition so
> > the deferred killing is a waste of time.
> >
> > The downside is how sloppy this would be because it's blurring the line
> > between oom killer and page allocator.  We'd need the oom killer to return
> > the selected victim to the page allocator, try the allocation, and then
> > call oom_kill_process() if necessary.

I assumed that Michal wants to preserve the boundary between the OOM killer
and the page allocator. Therefore, I proposed a patch
( http://lkml.kernel.org/r/201512291559.HGA46749.VFOFSOHLMtFJQO@I-love.SAKURA.ne.jp )
which tries to manage it without returning a victim and without depending on
TIF_MEMDIE or oom_victims.

>
> https://lkml.org/lkml/2015/3/25/40
>
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }
>

oom_victims dropping to 0 does not mean that memory became available
(i.e. that get_page_from_freelist() will succeed). I think this patch
needs some more effort to reduce the possibility of killing more OOM
victims than we need.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-29 10:39           ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:39 UTC (permalink / raw)
  To: mhocko, hannes, rientjes
  Cc: akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu, linux-mm,
	linux-kernel, mhocko

Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> >
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > >
> >
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been
> > acquired.
> >
> > The situation is still somewhat fragile, however, but I think it's
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE
> > thread isn't visible during the oom killer's tasklist scan because it has
> > exited, we still end up killing more than we should.  The likelihood of
> > this happening grows with the length of the tasklist.
> >
> > Perhaps we should try testing watermarks after a victim has been selected
> > and immediately before killing?  (Aside: we actually carry an internal
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a
> > victim because we have been hit with this before in the memcg path.)

Yes. Moving the final test to after an OOM victim has been selected can
reduce the possibility of killing more OOM victims than we need. But
unfortunately, it is likely that memory becomes available (i.e.
get_page_from_freelist() succeeds) while dump_header() is printing the OOM
messages, because printk() is a slow operation compared to selecting a
victim. This happens much later than the moment the victim cleared
TIF_MEMDIE.

We can avoid killing more OOM victims than we need if we move the final
test to after the OOM messages have been printed, but then we cannot avoid
printing OOM messages even when we do not kill a victim. Maybe this is not
a problem if we do

  pr_err("But did not kill any process ...")

instead of

  do_send_sig_info(SIGKILL);
  mark_oom_victim();
  pr_err("Killed process %d (%s) ...")

when the final test succeeds.

> >
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory
> > to deem that we aren't going to immediately reenter an oom condition so
> > the deferred killing is a waste of time.
> >
> > The downside is how sloppy this would be because it's blurring the line
> > between oom killer and page allocator.  We'd need the oom killer to return
> > the selected victim to the page allocator, try the allocation, and then
> > call oom_kill_process() if necessary.

I assumed that Michal wants to preserve the boundary between the OOM killer
and the page allocator. Therefore, I proposed a patch
( http://lkml.kernel.org/r/201512291559.HGA46749.VFOFSOHLMtFJQO@I-love.SAKURA.ne.jp )
which tries to manage it without returning a victim and without depending on
TIF_MEMDIE or oom_victims.

>
> https://lkml.org/lkml/2015/3/25/40
>
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }
>

oom_victims dropping to 0 does not mean that memory became available
(i.e. that get_page_from_freelist() will succeed). I think this patch
needs some more effort to reduce the possibility of killing more OOM
victims than we need.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-29 10:35     ` Tetsuo Handa
@ 2016-01-29 15:17       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > calculate the target for the watermark check. This means that precise
> > numbers are important for the correct decision. zone_reclaimable_pages
> > uses zone_page_state which can contain stale data with per-cpu diffs
> > not synced yet (the last vmstat_update might have run 1s in the past).
> > 
> > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > of the current callers is in a hot path where getting the precise value
> > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > 
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmscan.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> 
> I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> was forgotten. Anyway,

OK, that explains why this sounded so familiar. Sorry, I completely
forgot about it.

> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Can I change it to your Signed-off-by?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
@ 2016-01-29 15:17       ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > calculate the target for the watermark check. This means that precise
> > numbers are important for the correct decision. zone_reclaimable_pages
> > uses zone_page_state which can contain stale data with per-cpu diffs
> > not synced yet (the last vmstat_update might have run 1s in the past).
> > 
> > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > of the current callers is in a hot path where getting the precise value
> > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > 
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmscan.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> 
> I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> was forgotten. Anyway,

OK, that explains why this sounded so familiar. Sorry, I completely
forgot about it.

> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Can I change it to your Signed-off-by?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:19       ` David Rientjes
@ 2016-01-29 15:23         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 15:19:08, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.

Yes, exactly the point I made in the original thread which brought up the
ALLOC_WMARK_HIGH question in the first place. The race window after the
last attempt is much larger than the one between the last wmark check and
the attempt.

> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

Yes the layer violation is definitely not nice.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-29 15:23         ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 15:19:08, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.

Yes, exactly the point I made in the original thread which brought up the
ALLOC_WMARK_HIGH question in the first place. The race window after the
last attempt is much larger than the one between the last wmark check and
the attempt.

> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

Yes the layer violation is definitely not nice.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 21:36     ` Johannes Weiner
@ 2016-01-29 15:24       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 16:36:34, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 09:40:03PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom has been doing get_page_from_freelist with
> > ALLOC_WMARK_HIGH target before going out_of_memory and invoking the oom
> > killer. This has two reasons as explained by Andrea:
> > "
> > : the reason for the high wmark is to reduce the likelihood of livelocks
> > : and be sure to invoke the OOM killer, if we're still under pressure
> > : and reclaim just failed. The high wmark is used to be sure the failure
> > : of reclaim isn't going to be ignored. If using the min wmark like
> > : you propose there's risk of livelock or anyway of delayed OOM killer
> > : invocation.
> > :
> > : The reason for doing one last wmark check (regardless of the wmark
> > : used) before invoking the oom killer, was just to be sure another OOM
> > : killer invocation hasn't already freed a ton of memory while we were
> > : stuck in reclaim. A lot of free memory generated by the OOM killer,
> > : won't make a parallel reclaim more likely to succeed, it just creates
> > : free memory, but reclaim only succeeds when it finds "freeable" memory
> > : and it makes progress in converting it to free memory. So for the
> > : purpose of this last check, the high wmark would work fine as lots of
> > : free memory would have been generated in such case.
> > "
> > 
> > This is no longer a concern after "mm, oom: rework oom detection"
> > because should_reclaim_retry performs the water mark check right before
> > __alloc_pages_may_oom is invoked. Remove the last moment allocation
> > request as it just makes the code more confusing and doesn't really
> > serve any purpose because a success is basically impossible otherwise
> > should_reclaim_retry would force the reclaim to retry. So this is
> > merely a code cleanup rather than a functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> The check has to happen while holding the OOM lock, otherwise we'll
> end up killing much more than necessary when there are many racing
> allocations.

My testing shows that this doesn't trigger even during oom flood
testing. So I am not really convinced it does anything useful.

> Please drop this patch.

Sure I do not insist...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-29 15:24       ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 16:36:34, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 09:40:03PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom has been doing get_page_from_freelist with
> > ALLOC_WMARK_HIGH target before going out_of_memory and invoking the oom
> > killer. This has two reasons as explained by Andrea:
> > "
> > : the reason for the high wmark is to reduce the likelihood of livelocks
> > : and be sure to invoke the OOM killer, if we're still under pressure
> > : and reclaim just failed. The high wmark is used to be sure the failure
> > : of reclaim isn't going to be ignored. If using the min wmark like
> > : you propose there's risk of livelock or anyway of delayed OOM killer
> > : invocation.
> > :
> > : The reason for doing one last wmark check (regardless of the wmark
> > : used) before invoking the oom killer, was just to be sure another OOM
> > : killer invocation hasn't already freed a ton of memory while we were
> > : stuck in reclaim. A lot of free memory generated by the OOM killer,
> > : won't make a parallel reclaim more likely to succeed, it just creates
> > : free memory, but reclaim only succeeds when it finds "freeable" memory
> > : and it makes progress in converting it to free memory. So for the
> > : purpose of this last check, the high wmark would work fine as lots of
> > : free memory would have been generated in such case.
> > "
> > 
> > This is no longer a concern after "mm, oom: rework oom detection"
> > because should_reclaim_retry performs the water mark check right before
> > __alloc_pages_may_oom is invoked. Remove the last moment allocation
> > request as it just makes the code more confusing and doesn't really
> > serve any purpose because a success is basically impossible otherwise
> > should_reclaim_retry would force the reclaim to retry. So this is
> > merely a code cleanup rather than a functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> The check has to happen while holding the OOM lock, otherwise we'll
> end up killing much more than necessary when there are many racing
> allocations.

My testing shows that this doesn't trigger even during oom flood
testing. So I am not really convinced it does anything useful.

> Please drop this patch.

Sure I do not insist...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:51         ` Johannes Weiner
@ 2016-01-29 15:32           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 18:51:10, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> > 
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > > 
> > 
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> > acquired.
> > 
> > The situation is still somewhat fragile, however, but I think it's 
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> > thread isn't visible during the oom killer's tasklist scan because it has 
> > exited, we still end up killing more than we should.  The likelihood of 
> > this happening grows with the length of the tasklist.
> > 
> > Perhaps we should try testing watermarks after a victim has been selected 
> > and immediately before killing?  (Aside: we actually carry an internal 
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> > victim because we have been hit with this before in the memcg path.)
> > 
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> > to deem that we aren't going to immediately reenter an oom condition so 
> > the deferred killing is a waste of time.
> > 
> > The downside is how sloppy this would be because it's blurring the line 
> > between oom killer and page allocator.  We'd need the oom killer to return 
> > the selected victim to the page allocator, try the allocation, and then 
> > call oom_kill_process() if necessary.
> 
> https://lkml.org/lkml/2015/3/25/40
> 
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }

Yes this makes sense to me
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-29 15:32           ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 18:51:10, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> > 
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > > 
> > 
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> > acquired.
> > 
> > The situation is still somewhat fragile, however, but I think it's 
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> > thread isn't visible during the oom killer's tasklist scan because it has 
> > exited, we still end up killing more than we should.  The likelihood of 
> > this happening grows with the length of the tasklist.
> > 
> > Perhaps we should try testing watermarks after a victim has been selected 
> > and immediately before killing?  (Aside: we actually carry an internal 
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> > victim because we have been hit with this before in the memcg path.)
> > 
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> > to deem that we aren't going to immediately reenter an oom condition so 
> > the deferred killing is a waste of time.
> > 
> > The downside is how sloppy this would be because it's blurring the line 
> > between oom killer and page allocator.  We'd need the oom killer to return 
> > the selected victim to the page allocator, try the allocation, and then 
> > call oom_kill_process() if necessary.
> 
> https://lkml.org/lkml/2015/3/25/40
> 
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }

Yes this makes sense to me
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-29 15:17       ` Michal Hocko
@ 2016-01-29 21:30         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 21:30 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > > calculate the target for the watermark check. This means that precise
> > > numbers are important for the correct decision. zone_reclaimable_pages
> > > uses zone_page_state which can contain stale data with per-cpu diffs
> > > not synced yet (the last vmstat_update might have run 1s in the past).
> > > 
> > > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > > of the current callers is in a hot path where getting the precise value
> > > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > > 
> > > Suggested-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmscan.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > 
> > 
> > I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> > was forgotten. Anyway,
> 
> OK, that explains why this sounded so familiar. Sorry, I completely
> forgot about it.
> 
> > Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Can I change it to your Signed-off-by?

No problem.

> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
@ 2016-01-29 21:30         ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 21:30 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > > calculate the target for the watermark check. This means that precise
> > > numbers are important for the correct decision. zone_reclaimable_pages
> > > uses zone_page_state which can contain stale data with per-cpu diffs
> > > not synced yet (the last vmstat_update might have run 1s in the past).
> > > 
> > > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > > of the current callers is in a hot path where getting the precise value
> > > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > > 
> > > Suggested-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmscan.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > 
> > 
> > I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> > was forgotten. Anyway,
> 
> OK, that explains why this sounded so familiar. Sorry, I completely
> forgot about it.
> 
> > Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Can I change it to your Signed-off-by?

No problem.

> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-29 15:32           ` Michal Hocko
@ 2016-01-30 12:18             ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-30 12:18 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: rientjes, akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > https://lkml.org/lkml/2015/3/25/40
> > 
> > We could have out_of_memory() wait until the number of outstanding OOM
> > victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> > the lock until its kill has been finalized:
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 914451a..4dc5b9d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
> >  		 * Give the killed process a good chance to exit before trying
> >  		 * to allocate memory again.
> >  		 */
> > -		schedule_timeout_killable(1);
> > +		if (!test_thread_flag(TIF_MEMDIE))
> > +			wait_event_timeout(oom_victims_wait,
> > +					   !atomic_read(&oom_victims), HZ);
> >  	}
> >  	return true;
> >  }
> 
> Yes this makes sense to me

I think schedule_timeout_killable(1) was used for handling cases
where the current thread did not get TIF_MEMDIE but got SIGKILL because
it shares the victim's memory. If the current thread is blocking the
TIF_MEMDIE thread, this can become a needless delay.

Also, I don't know whether using wait_event_*() helps with the problem
that schedule_timeout_killable(1) can sleep for many minutes with
oom_lock held when there are a lot of tasks. Details are explained
in my proposed patch.
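
For reference, a simplified sketch of the allocator side, loosely
following __alloc_pages_may_oom() of this era (error handling and most
parameters omitted). It shows why any sleep placed at the end of
out_of_memory() happens with oom_lock held, while every contender that
loses the trylock only does a short back-off:

static struct page *may_oom_sketch(struct oom_control *oc)
{
	if (!mutex_trylock(&oom_lock)) {
		/* somebody else is already handling the OOM situation */
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

	/* a last-chance ALLOC_WMARK_HIGH allocation attempt would go here */

	out_of_memory(oc);	/* any sleep at its tail holds oom_lock */

	mutex_unlock(&oom_lock);
	return NULL;		/* caller retries the allocation */
}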

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
@ 2016-01-30 12:18             ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-30 12:18 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: rientjes, akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > https://lkml.org/lkml/2015/3/25/40
> > 
> > We could have out_of_memory() wait until the number of outstanding OOM
> > victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> > the lock until its kill has been finalized:
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 914451a..4dc5b9d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
> >  		 * Give the killed process a good chance to exit before trying
> >  		 * to allocate memory again.
> >  		 */
> > -		schedule_timeout_killable(1);
> > +		if (!test_thread_flag(TIF_MEMDIE))
> > +			wait_event_timeout(oom_victims_wait,
> > +					   !atomic_read(&oom_victims), HZ);
> >  	}
> >  	return true;
> >  }
> 
> Yes this makes sense to me

I think schedule_timeout_killable(1) was used for handling cases
where the current thread did not get TIF_MEMDIE but got SIGKILL because
it shares the victim's memory. If the current thread is blocking the
TIF_MEMDIE thread, this can become a needless delay.

Also, I don't know whether using wait_event_*() helps with the problem
that schedule_timeout_killable(1) can sleep for many minutes with
oom_lock held when there are a lot of tasks. Details are explained
in my proposed patch.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2016-02-03 13:27   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-03 13:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

Hi,
this thread went mostly quiet. Are all the main concerns clarified?
Are there any new concerns? Are there any objections to targeting
this for the next merge window?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-03 13:27   ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-03 13:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

Hi,
this thread went mostly quiet. Are all the main concerns clarified?
Are there any new concerns? Are there any objections to targeting
this for the next merge window?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 13:27   ` Michal Hocko
@ 2016-02-03 22:58     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-02-03 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed, 3 Feb 2016, Michal Hocko wrote:

> Hi,
> this thread went mostly quiet. Are all the main concerns clarified?
> Are there any new concerns? Are there any objections to targeting
> this for the next merge window?

Did we ever figure out what was causing the oom killer to be called much 
earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and
http://marc.info/?l=linux-kernel&m=145130454913757 ?  I'd like to take a 
look at the patch(es) that fixed it.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-03 22:58     ` David Rientjes
  0 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-02-03 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed, 3 Feb 2016, Michal Hocko wrote:

> Hi,
> this thread went mostly quiet. Are all the main concerns clarified?
> Are there any new concerns? Are there any objections to targeting
> this for the next merge window?

Did we ever figure out what was causing the oom killer to be called much 
earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and
http://marc.info/?l=linux-kernel&m=145130454913757 ?  I'd like to take a 
look at the patch(es) that fixed it.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 22:58     ` David Rientjes
@ 2016-02-04 12:57       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 12:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 03-02-16 14:58:06, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > Hi,
> > this thread went mostly quiet. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Did we ever figure out what was causing the oom killer to be called much 
> earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and

From the OOM report:
[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB

> http://marc.info/?l=linux-kernel&m=145130454913757 ?

[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB

There are cases where order-2 has some pages, but I have commented on
that here [1].

> I'd like to take a look at the patch(es) that fixed it.

I am not sure we can fix these pathological loads where we hit
higher-order depletion and there is a chance that one of the thousands
of tasks terminates in an unpredictable way which happens to race with
the OOM killer. As I've pointed out in [1], once the watermark check for
the higher-order allocation fails for the given order, we cannot rely
on the reclaimable pages ever reconstructing the required order. The
current zone_reclaimable approach just happens to work for this
particular load because NR_PAGES_SCANNED gets reset too often, with
non-deterministic behavior as a side effect.

[1] http://lkml.kernel.org/r/20160120131355.GE14187@dhcp22.suse.cz
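
To make that concrete, a minimal sketch approximating the reworked retry
heuristic (not the exact patch): reclaim is retried only while the
watermark for the requested order could still be met assuming every
reclaimable page were freed, and for order-2 and above that check can
keep failing no matter how many order-0 pages reclaim produces:

static bool reclaim_retry_sketch(struct zone *zone, unsigned int order)
{
	unsigned long available;

	available = zone_page_state_snapshot(zone, NR_FREE_PAGES) +
		    zone_reclaimable_pages(zone);

	/* would the requested order pass the min watermark even then? */
	return __zone_watermark_ok(zone, order, min_wmark_pages(zone),
				   0 /* classzone_idx */, 0 /* alloc_flags */,
				   available);
}
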
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-04 12:57       ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 12:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 03-02-16 14:58:06, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > Hi,
> > this thread went mostly quiet. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Did we ever figure out what was causing the oom killer to be called much 
> earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and

From the OOM report:
[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB

> http://marc.info/?l=linux-kernel&m=145130454913757 ?

[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB

There are cases where order-2 has some pages but I have commented on
that here [1]

> I'd like to take a look at the patch(es) that fixed it.

I am not sure we can fix these pathological loads where we hit the
higher order depletion and there is a chance that one of the thousands
tasks terminates in an unpredictable way which happens to race with the
OOM killer. As I've pointed out in [1] once the watermark check for the
higher order allocation fails for the given order then we cannot rely
on the reclaimable pages ever construct the required order. The current
zone_reclaimable approach just happens to work for this particular load
because the NR_PAGES_SCANNED gets reseted too often with a side effect
of an undeterministic behavior.

[1] http://lkml.kernel.org/r/20160120131355.GE14187@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 12:57       ` Michal Hocko
@ 2016-02-04 13:10         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-04 13:10 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, torvalds, hannes, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Michal Hocko wrote:
> I am not sure we can fix these pathological loads where we hit the
> higher order depletion and there is a chance that one of the thousands
> tasks terminates in an unpredictable way which happens to race with the
> OOM killer.

When I hit this problem on Dec 24th, I wasn't running thousands of tasks.
I think there were fewer than one hundred tasks in the system and only
a few of them were running. Not a pathological load at all.

I'm running thousands of tasks in the reproducer only to increase the
probability of hitting the problem.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:10         ` Tetsuo Handa
@ 2016-02-04 13:39           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 13:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I am not sure we can fix these pathological loads where we hit the
> > higher order depletion and there is a chance that one of the thousands
> > tasks terminates in an unpredictable way which happens to race with the
> > OOM killer.
> 
> When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> I think there were less than one hundred tasks in the system and only
> a few tasks were running. Not a pathological load at all.

But as the OOM report clearly stated, there were no > order-1 pages
available in that particular case. And that happened after direct
reclaim and compaction had already been invoked.

As I've mentioned in the referenced email, we can try to do multiple
retries, e.g. not give up on the higher order requests until we hit
the maximum number of retries, but I consider that quite ugly to be
honest. I think that a proper communication with compaction is a more
appropriate way to go long term. E.g. I find it interesting that
try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
and treats it as any other high order request.

Something like the following:
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..1ae5b7da821b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
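
The longer term "proper communication with compaction" mentioned above is not
implemented here. Purely as a hypothetical sketch (the helper below is made up
for illustration; only compaction_suitable() and PAGE_ALLOC_COSTLY_ORDER are
existing kernel symbols), such a check could look something like:

static bool compaction_retry_worthwhile(struct zone *zone, unsigned int order,
					int alloc_flags, int classzone_idx)
{
	/* !costly orders are important enough to always keep retrying */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;

	/*
	 * COMPACT_SKIPPED means compaction does not even have enough free
	 * pages to work with, so further retries are pointless.
	 */
	return compaction_suitable(zone, order, alloc_flags,
				   classzone_idx) != COMPACT_SKIPPED;
}
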
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:39           ` Michal Hocko
@ 2016-02-04 14:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 14:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 04-02-16 14:39:05, Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > tasks terminates in an unpredictable way which happens to race with the
> > > OOM killer.
> > 
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
> 
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
> 
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treat is as any other high order request.
> 
> Something like the following:

Here it is again with the patch description added. Please note I haven't
tested this yet, so this is more of an RFC than something I am really
convinced about. I can live with it because the number of retries is nicely
bounded, but it feels too hackish because it makes the decision rather
blindly. I will talk to Vlastimil and Mel about whether they see some way to
communicate the compaction state in a reasonable fashion. But I guess this is
something that can come up later. What do you think?
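
For context, the should_reclaim_retry() logic that the hunk below extends
looks roughly like this in the series (condensed and paraphrased; the exact
code and signatures in the posted patches differ in detail):

#define MAX_RECLAIM_RETRIES 16

static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
		     struct alloc_context *ac, int alloc_flags,
		     bool did_some_progress, int no_progress_loops)
{
	struct zone *zone;
	struct zoneref *z;

	/* hard cap on the number of no-progress retries */
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;

	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		unsigned long available = zone_reclaimable_pages(zone);

		/* back off the estimate as we keep failing to progress */
		available -= DIV_ROUND_UP(no_progress_loops * available,
					  MAX_RECLAIM_RETRIES);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

		/* would the request fit if all of that were reclaimed? */
		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac->high_zoneidx, alloc_flags,
					available))
			return true;
	}

	/* the hunk below inserts the !costly special case here */
	return false;
}
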
---
From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 4 Feb 2016 14:56:59 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) triggers OOM too
early - e.g. after the first reclaim/compaction round. Such a system would
have to be highly fragmented and the OOM killer is just a matter of time,
but let's stick to our MAX_RECLAIM_RETRIES for high-order, !costly requests
to make sure we do not fail prematurely.

This also means that we do not reset no_progress_loops in
__alloc_pages_slowpath for high order allocations, to guarantee a bounded
number of retries.

Long term it would be much better to communicate with compaction and
retry only if compaction considers it meaningful.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f05aca36469b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:39           ` Michal Hocko
@ 2016-02-07  4:09             ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-07  4:09 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > tasks terminates in an unpredictable way which happens to race with the
> > > OOM killer.
> > 
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
> 
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
> 
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treat is as any other high order request.
> 

FYI, I again hit an unexpected OOM-killer invocation during genxref on the
linux-4.5-rc2 source. I think the current patchset is too fragile to merge.
----------------------------------------
[ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 3101.629148] smbd cpuset=/ mems_allowed=0
[ 3101.630332] CPU: 1 PID: 3941 Comm: smbd Not tainted 4.5.0-rc2-next-20160205 #293
[ 3101.632335] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3101.634567]  0000000000000286 000000005784a8f9 ffff88007c47bad0 ffffffff8139abbd
[ 3101.636533]  0000000000000000 ffff88007c47bd00 ffff88007c47bb70 ffffffff811bdc6c
[ 3101.638381]  0000000000000206 ffffffff81810b30 ffff88007c47bb10 ffffffff810be079
[ 3101.640215] Call Trace:
[ 3101.641169]  [<ffffffff8139abbd>] dump_stack+0x85/0xc8
[ 3101.642560]  [<ffffffff811bdc6c>] dump_header+0x5b/0x3b0
[ 3101.643983]  [<ffffffff810be079>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3101.645616]  [<ffffffff810be14d>] ? trace_hardirqs_on+0xd/0x10
[ 3101.647081]  [<ffffffff81143fb6>] oom_kill_process+0x366/0x550
[ 3101.648631]  [<ffffffff811443df>] out_of_memory+0x1ef/0x5a0
[ 3101.650081]  [<ffffffff8114449d>] ? out_of_memory+0x2ad/0x5a0
[ 3101.651624]  [<ffffffff81149d0d>] __alloc_pages_nodemask+0xbad/0xd90
[ 3101.653207]  [<ffffffff8114a0ac>] alloc_kmem_pages_node+0x4c/0xc0
[ 3101.654767]  [<ffffffff8106d5c1>] copy_process.part.31+0x131/0x1b40
[ 3101.656381]  [<ffffffff8111d9ea>] ? __audit_syscall_entry+0xaa/0xf0
[ 3101.657952]  [<ffffffff810e8119>] ? current_kernel_time64+0xa9/0xc0
[ 3101.659492]  [<ffffffff8106f19b>] _do_fork+0xdb/0x5d0
[ 3101.660814]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 3101.662305]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 3101.663988]  [<ffffffff81703d2c>] ? return_from_SYSCALL_64+0x2d/0x7a
[ 3101.665572]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[ 3101.667067]  [<ffffffff8106f714>] SyS_clone+0x14/0x20
[ 3101.668510]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[ 3101.669931]  [<ffffffff81703cff>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3101.671642] Mem-Info:
[ 3101.672612] active_anon:46842 inactive_anon:2094 isolated_anon:0
 active_file:108974 inactive_file:131350 isolated_file:0
 unevictable:0 dirty:1174 writeback:0 unstable:0
 slab_reclaimable:107536 slab_unreclaimable:14287
 mapped:4199 shmem:2166 pagetables:1524 bounce:0
 free:6260 free_pcp:31 free_cma:0
[ 3101.681294] Node 0 DMA free:6884kB min:44kB low:52kB high:64kB active_anon:3488kB inactive_anon:100kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:100kB slab_reclaimable:3852kB slab_unreclaimable:444kB kernel_stack:80kB pagetables:112kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3101.691319] lowmem_reserve[]: 0 1714 1714 1714
[ 3101.692847] Node 0 DMA32 free:18156kB min:5172kB low:6464kB high:7756kB active_anon:183880kB inactive_anon:8276kB active_file:435896kB inactive_file:525396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759480kB mlocked:0kB dirty:4696kB writeback:0kB mapped:16792kB shmem:8564kB slab_reclaimable:426292kB slab_unreclaimable:56704kB kernel_stack:3328kB pagetables:5984kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3101.704239] lowmem_reserve[]: 0 0 0 0
[ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
[ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
[ 3101.713857] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3101.716332] 242517 total pagecache pages
[ 3101.717878] 0 pages in swap cache
[ 3101.719332] Swap cache stats: add 0, delete 0, find 0/0
[ 3101.721577] Free swap  = 0kB
[ 3101.722980] Total swap = 0kB
[ 3101.724364] 524157 pages RAM
[ 3101.725697] 0 pages HighMem/MovableOnly
[ 3101.727165] 80311 pages reserved
[ 3101.728482] 0 pages hwpoisoned
[ 3101.729754] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 3101.732071] [  492]     0   492     9206      975      20       4        0             0 systemd-journal
[ 3101.734357] [  520]     0   520    10479      631      22       3        0         -1000 systemd-udevd
[ 3101.737036] [  527]     0   527    12805      682      24       3        0         -1000 auditd
[ 3101.739505] [ 1174]     0  1174     4830      556      14       3        0             0 irqbalance
[ 3101.741876] [ 1180]    81  1180     6672      604      20       3        0          -900 dbus-daemon
[ 3101.744728] [ 1817]     0  1817    56009      880      40       4        0             0 rsyslogd
[ 3101.747164] [ 1818]     0  1818     1096      349       8       3        0             0 rngd
[ 3101.749788] [ 1820]     0  1820    52575     1074      56       3        0             0 abrtd
[ 3101.752135] [ 1821]     0  1821    80901     5160      80       4        0             0 firewalld
[ 3101.754532] [ 1823]     0  1823     6602      681      20       3        0             0 systemd-logind
[ 3101.757342] [ 1825]    70  1825     6999      458      20       3        0             0 avahi-daemon
[ 3101.759784] [ 1827]     0  1827    51995      986      55       3        0             0 abrt-watch-log
[ 3101.762465] [ 1838]     0  1838    31586      647      21       3        0             0 crond
[ 3101.764797] [ 1946]    70  1946     6999       58      19       3        0             0 avahi-daemon
[ 3101.767262] [ 2043]     0  2043    65187      858      43       3        0             0 vmtoolsd
[ 3101.769665] [ 2618]     0  2618    27631     3112      53       3        0             0 dhclient
[ 3101.772203] [ 2622]   999  2622   130827     2570      56       3        0             0 polkitd
[ 3101.774645] [ 2704]     0  2704   138263     3351      91       4        0             0 tuned
[ 3101.777114] [ 2709]     0  2709    20640      773      45       3        0         -1000 sshd
[ 3101.779428] [ 2711]     0  2711     7328      551      19       3        0             0 xinetd
[ 3101.782016] [ 3883]     0  3883    22785      827      45       3        0             0 master
[ 3101.784576] [ 3884]    89  3884    22811      924      46       4        0             0 pickup
[ 3101.786898] [ 3885]    89  3885    22828      886      44       3        0             0 qmgr
[ 3101.789287] [ 3916]     0  3916    23203      736      50       3        0             0 login
[ 3101.791666] [ 3927]     0  3927    27511      381      13       3        0             0 agetty
[ 3101.794116] [ 3930]     0  3930    79392     1063     105       3        0             0 nmbd
[ 3101.796387] [ 3941]     0  3941    96485     1544     138       4        0             0 smbd
[ 3101.798602] [ 3944]     0  3944    96485     1290     131       4        0             0 smbd
[ 3101.800783] [ 7471]     0  7471    28886      732      15       3        0             0 bash
[ 3101.803013] [ 7580]     0  7580     2380      613      10       3        0             0 makelxr.sh
[ 3101.805147] [ 7786]     0  7786    27511      395      10       3        0             0 agetty
[ 3101.807198] [ 8139]     0  8139    35888      974      72       3        0             0 sshd
[ 3101.809255] [ 8144]     0  8144    28896      761      15       4        0             0 bash
[ 3101.811335] [15286]     0 15286    38294    30474      81       3        0             0 genxref
[ 3101.813512] Out of memory: Kill process 15286 (genxref) score 66 or sacrifice child
[ 3101.815659] Killed process 15286 (genxref) total-vm:153176kB, anon-rss:117092kB, file-rss:4804kB, shmem-rss:0kB
----------------------------------------

> Something like the following:
Yes, I do think we need something like it.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-07  4:09             ` Tetsuo Handa
@ 2016-02-15 20:06               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-15 20:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
[...]
> FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> I think current patchset is too fragile to merge.
> ----------------------------------------
> [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> [ 3101.629148] smbd cpuset=/ mems_allowed=0
[...]
> [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB

How come this is an unexpected OOM? There is clearly no order-2+ page
available for the allocation request.
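
To spell out the arithmetic: the DMA32 dump shows only 4kB and 8kB blocks
free, and for order > 0 the watermark check requires at least one free block
of the requested order or larger. Roughly (simplified and paraphrased, not
quoted from the tree; z is the zone being checked), the tail of
__zone_watermark_ok() does:

	unsigned int o;
	int mt;

	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];

		/* no free block of this order at all */
		if (!area->nr_free)
			continue;

		/* is there a block in a migratetype we are allowed to use? */
		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			if (!list_empty(&area->free_list[mt]))
				return true;
	}
	return false;

so no amount of free order-0 or order-1 memory can satisfy an order-2 check.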

> > Something like the following:
> Yes, I do think we need something like it.

Was the patch applied?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-15 20:06               ` Michal Hocko
@ 2016-02-16 13:10                 ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-16 13:10 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
> [...]
> > FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> > I think current patchset is too fragile to merge.
> > ----------------------------------------
> > [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> > [ 3101.629148] smbd cpuset=/ mems_allowed=0
> [...]
> > [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> > [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
> 
> How come this is an unexpected OOM? There is clearly no order-2+ page
> available for the allocation request.

I used "unexpected" because there were only 35 userspace processes and
genxref was the only process which did a lot of memory allocation
(modulo kernel threads woken by file I/O) and most memory is reclaimable.

> 
> > > Something like the following:
> > Yes, I do think we need something like it.
> 
> Was the patch applied?

No for above result.

A result with the patch (20160204142400.GC14425@dhcp22.suse.cz) applied on
top of today's linux-next is shown below. It seems that the protection is
not enough.

----------
[  118.584571] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  118.586684] fork cpuset=/ mems_allowed=0
[  118.588254] CPU: 2 PID: 9565 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  118.589795] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  118.591941]  0000000000000286 0000000085a9ed62 ffff88007b3d3ad0 ffffffff8139e82d
[  118.593616]  0000000000000000 ffff88007b3d3d00 ffff88007b3d3b70 ffffffff811bedec
[  118.595273]  0000000000000206 ffffffff81810b70 ffff88007b3d3b10 ffffffff810be8f9
[  118.596970] Call Trace:
[  118.597634]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  118.598787]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  118.599979]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  118.601421]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  118.602713]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  118.604882]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  118.606940]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  118.608275]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  118.609698]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  118.611166]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  118.612589]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  118.614203]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  118.615689]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  118.617151]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  118.618391]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  118.619875]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  118.621642]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  118.622920]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  118.624262]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  118.625661]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  118.626959]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  118.628340]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  118.630002] Mem-Info:
[  118.630853] active_anon:27270 inactive_anon:2094 isolated_anon:0
[  118.630853]  active_file:253575 inactive_file:89021 isolated_file:22
[  118.630853]  unevictable:0 dirty:0 writeback:0 unstable:0
[  118.630853]  slab_reclaimable:14202 slab_unreclaimable:13906
[  118.630853]  mapped:1622 shmem:2162 pagetables:10587 bounce:0
[  118.630853]  free:5328 free_pcp:356 free_cma:0
[  118.639774] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3280kB inactive_anon:156kB active_file:684kB inactive_file:2292kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:420kB shmem:164kB slab_reclaimable:564kB slab_unreclaimable:800kB kernel_stack:256kB pagetables:200kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.650132] lowmem_reserve[]: 0 1714 1714 1714
[  118.651763] Node 0 DMA32 free:14256kB min:5172kB low:6464kB high:7756kB active_anon:105924kB inactive_anon:8220kB active_file:1026268kB inactive_file:340844kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:1759460kB mlocked:0kB dirty:0kB writeback:0kB mapped:6436kB shmem:8484kB slab_reclaimable:56740kB slab_unreclaimable:54824kB kernel_stack:28112kB pagetables:42148kB unstable:0kB bounce:0kB free_pcp:1440kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.663101] lowmem_reserve[]: 0 0 0 0
[  118.664704] Node 0 DMA: 83*4kB (ME) 51*8kB (UME) 9*16kB (UME) 2*32kB (UM) 1*64kB (M) 4*128kB (UME) 5*256kB (UME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6900kB
[  118.670166] Node 0 DMA32: 2327*4kB (ME) 621*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14292kB
[  118.673742] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  118.676297] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  118.678610] 344508 total pagecache pages
[  118.680163] 0 pages in swap cache
[  118.681567] Swap cache stats: add 0, delete 0, find 0/0
[  118.681567] Free swap  = 0kB
[  118.681568] Total swap = 0kB
[  118.681625] 524157 pages RAM
[  118.681625] 0 pages HighMem/MovableOnly
[  118.681625] 80316 pages reserved
[  118.681626] 0 pages hwpoisoned

[  120.117093] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  120.117097] fork cpuset=/ mems_allowed=0
[  120.117099] CPU: 0 PID: 9566 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  120.117100] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  120.117102]  0000000000000286 00000000be6c9129 ffff880035dabad0 ffffffff8139e82d
[  120.117103]  0000000000000000 ffff880035dabd00 ffff880035dabb70 ffffffff811bedec
[  120.117104]  0000000000000206 ffffffff81810b70 ffff880035dabb10 ffffffff810be8f9
[  120.117104] Call Trace:
[  120.117111]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  120.117113]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  120.117116]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  120.117117]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  120.117119]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  120.117121]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  120.117122]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  120.117123]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  120.117124]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  120.117125]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  120.117128]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  120.117130]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  120.117132]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  120.117133]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  120.117136]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  120.117137]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  120.117139]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  120.117142]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  120.117143]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  120.117144]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  120.117145]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  120.117147]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  120.117147] Mem-Info:
[  120.117150] active_anon:30895 inactive_anon:2094 isolated_anon:0
[  120.117150]  active_file:183306 inactive_file:118692 isolated_file:18
[  120.117150]  unevictable:0 dirty:47 writeback:0 unstable:0
[  120.117150]  slab_reclaimable:14405 slab_unreclaimable:22372
[  120.117150]  mapped:3101 shmem:2162 pagetables:20154 bounce:0
[  120.117150]  free:7231 free_pcp:108 free_cma:0
[  120.117154] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:1172kB inactive_anon:156kB active_file:684kB inactive_file:1356kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:420kB shmem:164kB slab_reclaimable:564kB slab_unreclaimable:2244kB kernel_stack:1376kB pagetables:436kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  120.117156] lowmem_reserve[]: 0 1714 1714 1714
[  120.117172] Node 0 DMA32 free:22020kB min:5172kB low:6464kB high:7756kB active_anon:122408kB inactive_anon:8220kB active_file:732540kB inactive_file:473412kB unevictable:0kB isolated(anon):0kB isolated(file):72kB present:2080640kB managed:1759460kB mlocked:0kB dirty:188kB writeback:0kB mapped:11984kB shmem:8484kB slab_reclaimable:57056kB slab_unreclaimable:87244kB kernel_stack:52048kB pagetables:80180kB unstable:0kB bounce:0kB free_pcp:432kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  120.117230] lowmem_reserve[]: 0 0 0 0
[  120.117238] Node 0 DMA: 46*4kB (UME) 82*8kB (ME) 37*16kB (UME) 13*32kB (M) 3*64kB (UM) 2*128kB (ME) 2*256kB (ME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6904kB
[  120.117242] Node 0 DMA32: 709*4kB (UME) 2374*8kB (UME) 0*16kB 10*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22148kB
[  120.117244] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  120.117244] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  120.117245] 304244 total pagecache pages
[  120.117246] 0 pages in swap cache
[  120.117246] Swap cache stats: add 0, delete 0, find 0/0
[  120.117247] Free swap  = 0kB
[  120.117247] Total swap = 0kB
[  120.117248] 524157 pages RAM
[  120.117248] 0 pages HighMem/MovableOnly
[  120.117248] 80316 pages reserved
[  120.117249] 0 pages hwpoisoned

[  126.034913] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  126.034918] fork cpuset=/ mems_allowed=0
[  126.034920] CPU: 2 PID: 9566 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  126.034921] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  126.034923]  0000000000000286 00000000be6c9129 ffff880035dabad0 ffffffff8139e82d
[  126.034925]  0000000000000000 ffff880035dabd00 ffff880035dabb70 ffffffff811bedec
[  126.034926]  0000000000000206 ffffffff81810b70 ffff880035dabb10 ffffffff810be8f9
[  126.034926] Call Trace:
[  126.034932]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  126.034935]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  126.034938]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  126.034939]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  126.034941]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  126.034943]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  126.034944]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  126.034945]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  126.034947]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  126.034948]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  126.034950]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  126.034952]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  126.034954]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  126.034956]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  126.034958]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  126.034959]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  126.034961]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  126.034965]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  126.034965]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  126.034967]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  126.034968]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  126.034969]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  126.034970] Mem-Info:
[  126.034973] active_anon:27060 inactive_anon:2093 isolated_anon:0
[  126.034973]  active_file:206123 inactive_file:85224 isolated_file:32
[  126.034973]  unevictable:0 dirty:47 writeback:0 unstable:0
[  126.034973]  slab_reclaimable:13214 slab_unreclaimable:26604
[  126.034973]  mapped:2421 shmem:2161 pagetables:24889 bounce:0
[  126.034973]  free:4649 free_pcp:30 free_cma:0
[  126.034986] Node 0 DMA free:6924kB min:44kB low:52kB high:64kB active_anon:1156kB inactive_anon:156kB active_file:728kB inactive_file:1060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:368kB shmem:164kB slab_reclaimable:468kB slab_unreclaimable:2496kB kernel_stack:832kB pagetables:704kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  126.034988] lowmem_reserve[]: 0 1714 1714 1714
[  126.034992] Node 0 DMA32 free:11672kB min:5172kB low:6464kB high:7756kB active_anon:107084kB inactive_anon:8216kB active_file:823764kB inactive_file:339836kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1759460kB mlocked:0kB dirty:188kB writeback:0kB mapped:9316kB shmem:8480kB slab_reclaimable:52388kB slab_unreclaimable:103920kB kernel_stack:66016kB pagetables:98852kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  126.034993] lowmem_reserve[]: 0 0 0 0
[  126.035000] Node 0 DMA: 70*4kB (UME) 16*8kB (UME) 59*16kB (UME) 34*32kB (ME) 14*64kB (UME) 2*128kB (UE) 1*256kB (E) 2*512kB (M) 2*1024kB (ME) 0*2048kB 0*4096kB = 6920kB
[  126.035005] Node 0 DMA32: 2372*4kB (UME) 290*8kB (UM) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11856kB
[  126.035006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  126.035006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  126.035007] 293674 total pagecache pages
[  126.035008] 0 pages in swap cache
[  126.035008] Swap cache stats: add 0, delete 0, find 0/0
[  126.035009] Free swap  = 0kB
[  126.035009] Total swap = 0kB
[  126.035010] 524157 pages RAM
[  126.035010] 0 pages HighMem/MovableOnly
[  126.035010] 80316 pages reserved
[  126.035011] 0 pages hwpoisoned
----------

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-16 13:10                 ` Tetsuo Handa
@ 2016-02-16 15:19                   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-16 15:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Tue 16-02-16 22:10:01, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
> > [...]
> > > FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> > > I think current patchset is too fragile to merge.
> > > ----------------------------------------
> > > [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> > > [ 3101.629148] smbd cpuset=/ mems_allowed=0
> > [...]
> > > [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> > > [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
> > 
> > How come this is an unexpected OOM? There is clearly no order-2+ page
> > available for the allocation request.
> 
> I used "unexpected" because there were only 35 userspace processes and
> genxref was the only process which did a lot of memory allocation
> (modulo kernel threads woken by file I/O) and most memory is reclaimable.

The memory is reclaimable, but that doesn't mean an order-2 page block
will be formed even if all of it gets reclaimed. The memory is simply
too fragmented. That is why I think the OOM makes sense.

> > > > Something like the following:
> > > Yes, I do think we need something like it.
> > 
> > Was the patch applied?
> 
> No for above result.
> 
> A result with the patch (20160204142400.GC14425@dhcp22.suse.cz) applied on
> today's linux-next is shown below. It seems that protection is not enough.
> 
> ----------
> [  118.584571] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  118.664704] Node 0 DMA: 83*4kB (ME) 51*8kB (UME) 9*16kB (UME) 2*32kB (UM) 1*64kB (M) 4*128kB (UME) 5*256kB (UME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6900kB
> [  118.670166] Node 0 DMA32: 2327*4kB (ME) 621*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14292kB
[...]
> [  120.117093] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  120.117238] Node 0 DMA: 46*4kB (UME) 82*8kB (ME) 37*16kB (UME) 13*32kB (M) 3*64kB (UM) 2*128kB (ME) 2*256kB (ME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6904kB
> [  120.117242] Node 0 DMA32: 709*4kB (UME) 2374*8kB (UME) 0*16kB 10*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22148kB
[...]
> [  126.034913] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  126.035000] Node 0 DMA: 70*4kB (UME) 16*8kB (UME) 59*16kB (UME) 34*32kB (ME) 14*64kB (UME) 2*128kB (UE) 1*256kB (E) 2*512kB (M) 2*1024kB (ME) 0*2048kB 0*4096kB = 6920kB
> [  126.035005] Node 0 DMA32: 2372*4kB (UME) 290*8kB (UM) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11856kB

As you can see, in all cases we had order-2 requests and no order-2+
free blocks even after all the retries. I think the OOM is appropriate
at that point. We could have tried N+1 times, but we have to draw a
line at some point. The reason why we do not have any high-order block
available is a completely different question IMO. Maybe compaction just
gets deferred and doesn't do anything. This would be interesting to
investigate further, of course. Anyway, my point is that going OOM with
the current level of fragmentation is simply the only choice.
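
If anyone wants to poke at the compaction side, the simplest check I
can think of is to watch the per-order free lists around a forced
compaction run (this assumes CONFIG_COMPACTION and needs root; nothing
below is specific to this particular report):

  cat /proc/buddyinfo                    # per-zone, per-order free block counts
  echo 1 > /proc/sys/vm/compact_memory   # force compaction on all zones
  cat /proc/buddyinfo                    # did any order-2+ blocks show up?
  grep compact /proc/vmstat              # compaction stall/fail/success counters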
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 13:27   ` Michal Hocko
@ 2016-02-25  3:47     ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-02-25  3:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On Wed, 3 Feb 2016, Michal Hocko wrote:
> Hi,
> > this thread went mostly quiet. Are all the main concerns clarified?
> Are there any new concerns? Are there any objections to targeting
> this for the next merge window?

Sorry to say at this late date, but I do have one concern: hopefully
you can tweak something somewhere, or point me to some tunable that
I can adjust (I've not studied the patches, sorry).

This rework makes it impossible to run my tmpfs swapping loads:
they're soon OOM-killed when they ran forever before, so swapping
does not get the exercise on mmotm that it used to.  (But I'm not
so arrogant as to expect you to optimize for my load!)

Maybe it's just that I'm using tmpfs, and there's code that's conscious
of file and anon, but doesn't cope properly with the awkward shmem case.

(Of course, tmpfs is and always has been a problem for OOM-killing,
given that it takes up memory, but none is freed by killing processes:
but although that is a tiresome problem, it's not what either of us is
attacking here.)

Taking many of the irrelevancies out of my load, here's something you
could try, first on v4.5-rc5 and then on mmotm.

Boot with mem=1G (or boot your usual way, and do something to occupy
most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
way to gobble up most of the memory, though it's not how I've done it).

Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
kernel source tree into a tmpfs: size=2G is more than enough.
make defconfig there, then make -j20.
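
Spelled out, it amounts to roughly this (the mount point is arbitrary,
and I'm assuming the source tree is already unpacked and swap is on):

  mount -t tmpfs -o size=2G tmpfs /mnt/tmp
  cp -a linux-4.5-rc5 /mnt/tmp/
  cd /mnt/tmp/linux-4.5-rc5
  make defconfig
  make -j20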

On a v4.5-rc5 kernel that builds fine; on mmotm it is soon OOM-killed.

Except that you'll probably need to fiddle around with that j20,
it's true for my laptop but not for my workstation.  j20 just happens
to be what I've had there for years, that I now see breaking down
(I can lower to j6 to proceed, perhaps could go a bit higher,
but it still doesn't exercise swap very much).

This OOM detection rework significantly lowers the number of jobs
which can be run in parallel without being OOM-killed.  Which would
be welcome if it were choosing to abort in place of thrashing, but
the system was far from thrashing: j20 took a few seconds more than
j6, and even j30 didn't take 50% longer.

(I have /proc/sys/vm/swappiness 100, if that matters.)

I hope there's an easy answer to this: thanks!
Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
@ 2016-02-25  6:48       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25  6:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky,
	Sergey Senozhatsky

Hello,

On (02/24/16 19:47), Hugh Dickins wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> > Hi,
> > this thread went mostly quiet. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Sorry to say at this late date, but I do have one concern: hopefully
> you can tweak something somewhere, or point me to some tunable that
> I can adjust (I've not studied the patches, sorry).
> 
> This rework makes it impossible to run my tmpfs swapping loads:
> they're soon OOM-killed when they ran forever before, so swapping
> does not get the exercise on mmotm that it used to.  (But I'm not
> so arrogant as to expect you to optimize for my load!)
> 
> Maybe it's just that I'm using tmpfs, and there's code that's conscious
> of file and anon, but doesn't cope properly with the awkward shmem case.
> 
> (Of course, tmpfs is and always has been a problem for OOM-killing,
> given that it takes up memory, but none is freed by killing processes:
> but although that is a tiresome problem, it's not what either of us is
> attacking here.)
> 
> Taking many of the irrelevancies out of my load, here's something you
> could try, first on v4.5-rc5 and then on mmotm.
> 

FWIW,

I have recently noticed the same change while testing zram-zsmalloc.
next/mmots kernels are much more likely to OOM-kill apps now, and,
unlike before, I don't see a lot of shrinker->zsmalloc->zs_shrinker_scan()
calls or swapouts; the kernel just oom-kills Xorg, etc.
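
One easy way to check for both, assuming ftrace is built in, debugfs is
mounted in the usual place and you are root, is something like:

  cd /sys/kernel/debug/tracing
  echo zs_shrinker_scan > set_ftrace_filter   # trace only this function
  echo function > current_tracer
  echo 1 > tracing_on
  # run the zram test, then:
  cat trace | tail                            # any zs_shrinker_scan hits?
  grep -E 'pswp(in|out)' /proc/vmstat         # swap-out counters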

The test script just creates a zram device (ext4 fs, lzo compression)
and fills it with some data, nothing special.
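
For completeness, a minimal sketch of such a script (the 2G size, the
mount point and the amount of data are just placeholders):

  modprobe zram num_devices=1
  echo lzo > /sys/block/zram0/comp_algorithm   # set before disksize
  echo 2G > /sys/block/zram0/disksize
  mkfs.ext4 -q /dev/zram0
  mkdir -p /mnt/zram
  mount /dev/zram0 /mnt/zram
  dd if=/dev/urandom of=/mnt/zram/fill bs=1M count=1024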


OOM example:

[ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2392.663187] Call Trace:
[ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2392.663230] Mem-Info:
[ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
                active_file:161111 inactive_file:320022 isolated_file:0
                unevictable:0 dirty:51 writeback:0 unstable:0
                slab_reclaimable:80335 slab_unreclaimable:5920
                mapped:30115 shmem:29235 pagetables:2589 bounce:0
                free:10949 free_pcp:189 free_cma:0
[ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
[ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
[ 2392.663249] lowmem_reserve[]: 0 0 824 824
[ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 2392.663257] lowmem_reserve[]: 0 0 0 0
[ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
[ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
[ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2392.663303] 510384 total pagecache pages
[ 2392.663305] 31 pages in swap cache
[ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
[ 2392.663307] Free swap  = 8388268kB
[ 2392.663308] Total swap = 8388604kB
[ 2392.663308] 1032092 pages RAM
[ 2392.663309] 0 pages HighMem/MovableOnly
[ 2392.663310] 40110 pages reserved
[ 2392.663311] 0 pages hwpoisoned
[ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
[ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
[ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
[ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
[ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
[ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
[ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
[ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
[ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB


[ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2400.152481] Call Trace:
[ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
[ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2400.152527] Mem-Info:
[ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
                active_file:160072 inactive_file:275086 isolated_file:0
                unevictable:0 dirty:49 writeback:0 unstable:0
                slab_reclaimable:54096 slab_unreclaimable:5978
                mapped:13650 shmem:29234 pagetables:2058 bounce:0
                free:13017 free_pcp:134 free_cma:0
[ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
[ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
[ 2400.152546] lowmem_reserve[]: 0 0 824 824
[ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
[ 2400.152555] lowmem_reserve[]: 0 0 0 0
[ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
[ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
[ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2400.152593] 464295 total pagecache pages
[ 2400.152594] 31 pages in swap cache
[ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
[ 2400.152596] Free swap  = 8388268kB
[ 2400.152597] Total swap = 8388604kB
[ 2400.152598] 1032092 pages RAM
[ 2400.152599] 0 pages HighMem/MovableOnly
[ 2400.152600] 40110 pages reserved
[ 2400.152600] 0 pages hwpoisoned
[ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
[ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
[ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
[ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
[ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
[ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
[ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
[ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
[ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB




$ free
              total        used        free      shared  buff/cache   available
Mem:        3967928     1563132      310548      116936     2094248     2207584
Swap:       8388604         332     8388272


	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  6:48       ` Sergey Senozhatsky
  0 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25  6:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky,
	Sergey Senozhatsky

Hello,

On (02/24/16 19:47), Hugh Dickins wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> > Hi,
> > this thread went mostly quite. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Sorry to say at this late date, but I do have one concern: hopefully
> you can tweak something somewhere, or point me to some tunable that
> I can adjust (I've not studied the patches, sorry).
> 
> This rework makes it impossible to run my tmpfs swapping loads:
> they're soon OOM-killed when they ran forever before, so swapping
> does not get the exercise on mmotm that it used to.  (But I'm not
> so arrogant as to expect you to optimize for my load!)
> 
> Maybe it's just that I'm using tmpfs, and there's code that's conscious
> of file and anon, but doesn't cope properly with the awkward shmem case.
> 
> (Of course, tmpfs is and always has been a problem for OOM-killing,
> given that it takes up memory, but none is freed by killing processes:
> but although that is a tiresome problem, it's not what either of us is
> attacking here.)
> 
> Taking many of the irrelevancies out of my load, here's something you
> could try, first on v4.5-rc5 and then on mmotm.
> 

FWIW,

I have recently noticed the same change while testing zram-zsmalloc: linux-next/mmots
kernels are much more likely to OOM-kill apps now. Unlike before, I don't see a lot
of shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts; the kernel just
oom-kills Xorg, etc.

The test script just creates a zram device (ext4 fs, lzo compression) and fills
it with some data, nothing special.
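
Roughly, the setup amounts to something like the following (just a sketch,
not the actual zram-test.sh; the device size, mount point and the amount of
data written are made-up values):

  modprobe zram num_devices=1
  echo lzo > /sys/block/zram0/comp_algorithm
  echo 1G > /sys/block/zram0/disksize
  mkfs.ext4 /dev/zram0
  mkdir -p /mnt/zram && mount /dev/zram0 /mnt/zram
  # fill the fs with some data (the amount is arbitrary)
  dd if=/dev/urandom of=/mnt/zram/fill bs=1M count=800
  umount /mnt/zram
  echo 1 > /sys/block/zram0/reset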


OOM example:

[ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2392.663187] Call Trace:
[ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2392.663230] Mem-Info:
[ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
                active_file:161111 inactive_file:320022 isolated_file:0
                unevictable:0 dirty:51 writeback:0 unstable:0
                slab_reclaimable:80335 slab_unreclaimable:5920
                mapped:30115 shmem:29235 pagetables:2589 bounce:0
                free:10949 free_pcp:189 free_cma:0
[ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
[ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
[ 2392.663249] lowmem_reserve[]: 0 0 824 824
[ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 2392.663257] lowmem_reserve[]: 0 0 0 0
[ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
[ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
[ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2392.663303] 510384 total pagecache pages
[ 2392.663305] 31 pages in swap cache
[ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
[ 2392.663307] Free swap  = 8388268kB
[ 2392.663308] Total swap = 8388604kB
[ 2392.663308] 1032092 pages RAM
[ 2392.663309] 0 pages HighMem/MovableOnly
[ 2392.663310] 40110 pages reserved
[ 2392.663311] 0 pages hwpoisoned
[ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
[ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
[ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
[ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
[ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
[ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
[ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
[ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
[ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB


[ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2400.152481] Call Trace:
[ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
[ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2400.152527] Mem-Info:
[ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
                active_file:160072 inactive_file:275086 isolated_file:0
                unevictable:0 dirty:49 writeback:0 unstable:0
                slab_reclaimable:54096 slab_unreclaimable:5978
                mapped:13650 shmem:29234 pagetables:2058 bounce:0
                free:13017 free_pcp:134 free_cma:0
[ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
[ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
[ 2400.152546] lowmem_reserve[]: 0 0 824 824
[ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
[ 2400.152555] lowmem_reserve[]: 0 0 0 0
[ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
[ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
[ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2400.152593] 464295 total pagecache pages
[ 2400.152594] 31 pages in swap cache
[ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
[ 2400.152596] Free swap  = 8388268kB
[ 2400.152597] Total swap = 8388604kB
[ 2400.152598] 1032092 pages RAM
[ 2400.152599] 0 pages HighMem/MovableOnly
[ 2400.152600] 40110 pages reserved
[ 2400.152600] 0 pages hwpoisoned
[ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
[ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
[ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
[ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
[ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
[ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
[ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
[ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
[ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB




$ free
              total        used        free      shared  buff/cache   available
Mem:        3967928     1563132      310548      116936     2094248     2207584
Swap:       8388604         332     8388272


	-ss


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  6:48       ` Sergey Senozhatsky
@ 2016-02-25  9:17         ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:17 UTC (permalink / raw)
  To: 'Sergey Senozhatsky', 'Hugh Dickins'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

> 
> On (02/24/16 19:47), Hugh Dickins wrote:
> > On Wed, 3 Feb 2016, Michal Hocko wrote:
> > > Hi,
> > > this thread went mostly quite. Are all the main concerns clarified?
> > > Are there any new concerns? Are there any objections to targeting
> > > this for the next merge window?
> >
> > Sorry to say at this late date, but I do have one concern: hopefully
> > you can tweak something somewhere, or point me to some tunable that
> > I can adjust (I've not studied the patches, sorry).
> >
> > This rework makes it impossible to run my tmpfs swapping loads:
> > they're soon OOM-killed when they ran forever before, so swapping
> > does not get the exercise on mmotm that it used to.  (But I'm not
> > so arrogant as to expect you to optimize for my load!)
> >
> > Maybe it's just that I'm using tmpfs, and there's code that's conscious
> > of file and anon, but doesn't cope properly with the awkward shmem case.
> >
> > (Of course, tmpfs is and always has been a problem for OOM-killing,
> > given that it takes up memory, but none is freed by killing processes:
> > but although that is a tiresome problem, it's not what either of us is
> > attacking here.)
> >
> > Taking many of the irrelevancies out of my load, here's something you
> > could try, first on v4.5-rc5 and then on mmotm.
> >
> 
> FWIW,
> 
> I have recently noticed the same change while testing zram-zsmalloc. next/mmots
> are much more likely to OOM-kill apps now. and, unlike before, I don't see a lot
> of shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts, the kernel just
> oom-kills Xorg, etc.
> 
> the test script just creates a zram device (ext4 fs, lzo compression) and fills
> it with some data, nothing special.
> 
> 
> OOM example:
> 
> [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2392.663187] Call Trace:
> [ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2392.663230] Mem-Info:
> [ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
>                 active_file:161111 inactive_file:320022 isolated_file:0
>                 unevictable:0 dirty:51 writeback:0 unstable:0
>                 slab_reclaimable:80335 slab_unreclaimable:5920
>                 mapped:30115 shmem:29235 pagetables:2589 bounce:0
>                 free:10949 free_pcp:189 free_cma:0
> [ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
> [ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB
> active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB
> slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
> [ 2392.663249] lowmem_reserve[]: 0 0 824 824
> [ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB
> active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB
> slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
> [ 2392.663257] lowmem_reserve[]: 0 0 0 0
> [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 23260kB
> [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 6060kB
> [ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2392.663303] 510384 total pagecache pages
> [ 2392.663305] 31 pages in swap cache
> [ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
> [ 2392.663307] Free swap  = 8388268kB
> [ 2392.663308] Total swap = 8388604kB
> [ 2392.663308] 1032092 pages RAM
> [ 2392.663309] 0 pages HighMem/MovableOnly
> [ 2392.663310] 40110 pages reserved
> [ 2392.663311] 0 pages hwpoisoned
> [ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
> [ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
> [ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
> [ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
> [ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
> [ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
> [ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
> [ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB
> 
> 
> [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2400.152481] Call Trace:
> [ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
> [ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2400.152527] Mem-Info:
> [ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
>                 active_file:160072 inactive_file:275086 isolated_file:0
>                 unevictable:0 dirty:49 writeback:0 unstable:0
>                 slab_reclaimable:54096 slab_unreclaimable:5978
>                 mapped:13650 shmem:29234 pagetables:2058 bounce:0
>                 free:13017 free_pcp:134 free_cma:0
> [ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
> [ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB
> active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB
> slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
> [ 2400.152546] lowmem_reserve[]: 0 0 824 824
> [ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB
> active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB
> slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
> [ 2400.152555] lowmem_reserve[]: 0 0 0 0
> [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 31780kB
> [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 5708kB
> [ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2400.152593] 464295 total pagecache pages
> [ 2400.152594] 31 pages in swap cache
> [ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
> [ 2400.152596] Free swap  = 8388268kB
> [ 2400.152597] Total swap = 8388604kB
> [ 2400.152598] 1032092 pages RAM
> [ 2400.152599] 0 pages HighMem/MovableOnly
> [ 2400.152600] 40110 pages reserved
> [ 2400.152600] 0 pages hwpoisoned
> [ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
> [ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
> [ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
> [ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
> [ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
> [ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
> [ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
> [ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
> 
> 
> 
> 
> $ free
>               total        used        free      shared  buff/cache   available
> Mem:        3967928     1563132      310548      116936     2094248     2207584
> Swap:       8388604         332     8388272
> 
Hi Sergey,

Thanks for your info.

Can you please schedule a run with the diff below, in which non-costly
allocations (order <= PAGE_ALLOC_COSTLY_ORDER) are allowed to burn more
CPU cycles before giving up and invoking the OOM killer?

thanks
Hillf

--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		no_progress_loops /= 2;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--
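
The intended effect of the hunk, as a sketch (not the actual mm/page_alloc.c
code; it assumes MAX_RECLAIM_RETRIES is 16 and PAGE_ALLOC_COSTLY_ORDER is 3,
as in this series): halving no_progress_loops before the convergence check
roughly doubles the number of no-progress reclaim rounds a non-costly
allocation gets before we fall back to the OOM killer.

	#define MAX_RECLAIM_RETRIES	16
	#define PAGE_ALLOC_COSTLY_ORDER	3

	/* returns non-zero when the allocator should give up and go OOM */
	static int converged_to_oom(unsigned int order, int no_progress_loops)
	{
		if (order <= PAGE_ALLOC_COSTLY_ORDER)
			no_progress_loops /= 2;		/* the hunk above */

		/*
		 * An order-2 fork-time allocation like the ones in the
		 * reports above now needs roughly 32 rounds without any
		 * reclaim progress instead of roughly 16 before this fires.
		 */
		return no_progress_loops > MAX_RECLAIM_RETRIES;
	}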

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  9:17         ` Hillf Danton
  0 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:17 UTC (permalink / raw)
  To: 'Sergey Senozhatsky', 'Hugh Dickins'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

> 
> On (02/24/16 19:47), Hugh Dickins wrote:
> > On Wed, 3 Feb 2016, Michal Hocko wrote:
> > > Hi,
> > > this thread went mostly quite. Are all the main concerns clarified?
> > > Are there any new concerns? Are there any objections to targeting
> > > this for the next merge window?
> >
> > Sorry to say at this late date, but I do have one concern: hopefully
> > you can tweak something somewhere, or point me to some tunable that
> > I can adjust (I've not studied the patches, sorry).
> >
> > This rework makes it impossible to run my tmpfs swapping loads:
> > they're soon OOM-killed when they ran forever before, so swapping
> > does not get the exercise on mmotm that it used to.  (But I'm not
> > so arrogant as to expect you to optimize for my load!)
> >
> > Maybe it's just that I'm using tmpfs, and there's code that's conscious
> > of file and anon, but doesn't cope properly with the awkward shmem case.
> >
> > (Of course, tmpfs is and always has been a problem for OOM-killing,
> > given that it takes up memory, but none is freed by killing processes:
> > but although that is a tiresome problem, it's not what either of us is
> > attacking here.)
> >
> > Taking many of the irrelevancies out of my load, here's something you
> > could try, first on v4.5-rc5 and then on mmotm.
> >
> 
> FWIW,
> 
> I have recently noticed the same change while testing zram-zsmalloc. next/mmots
> are much more likely to OOM-kill apps now. and, unlike before, I don't see a lot
> of shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts, the kernel just
> oom-kills Xorg, etc.
> 
> the test script just creates a zram device (ext4 fs, lzo compression) and fills
> it with some data, nothing special.
> 
> 
> OOM example:
> 
> [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2392.663187] Call Trace:
> [ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2392.663230] Mem-Info:
> [ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
>                 active_file:161111 inactive_file:320022 isolated_file:0
>                 unevictable:0 dirty:51 writeback:0 unstable:0
>                 slab_reclaimable:80335 slab_unreclaimable:5920
>                 mapped:30115 shmem:29235 pagetables:2589 bounce:0
>                 free:10949 free_pcp:189 free_cma:0
> [ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
> [ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB
> active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB
> slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
> [ 2392.663249] lowmem_reserve[]: 0 0 824 824
> [ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB
> active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB
> slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
> [ 2392.663257] lowmem_reserve[]: 0 0 0 0
> [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 23260kB
> [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 6060kB
> [ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2392.663303] 510384 total pagecache pages
> [ 2392.663305] 31 pages in swap cache
> [ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
> [ 2392.663307] Free swap  = 8388268kB
> [ 2392.663308] Total swap = 8388604kB
> [ 2392.663308] 1032092 pages RAM
> [ 2392.663309] 0 pages HighMem/MovableOnly
> [ 2392.663310] 40110 pages reserved
> [ 2392.663311] 0 pages hwpoisoned
> [ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
> [ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
> [ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
> [ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
> [ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
> [ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
> [ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
> [ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB
> 
> 
> [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2400.152481] Call Trace:
> [ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
> [ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2400.152527] Mem-Info:
> [ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
>                 active_file:160072 inactive_file:275086 isolated_file:0
>                 unevictable:0 dirty:49 writeback:0 unstable:0
>                 slab_reclaimable:54096 slab_unreclaimable:5978
>                 mapped:13650 shmem:29234 pagetables:2058 bounce:0
>                 free:13017 free_pcp:134 free_cma:0
> [ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
> [ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB
> active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB
> slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
> [ 2400.152546] lowmem_reserve[]: 0 0 824 824
> [ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB
> active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB
> slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
> [ 2400.152555] lowmem_reserve[]: 0 0 0 0
> [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 31780kB
> [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 5708kB
> [ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2400.152593] 464295 total pagecache pages
> [ 2400.152594] 31 pages in swap cache
> [ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
> [ 2400.152596] Free swap  = 8388268kB
> [ 2400.152597] Total swap = 8388604kB
> [ 2400.152598] 1032092 pages RAM
> [ 2400.152599] 0 pages HighMem/MovableOnly
> [ 2400.152600] 40110 pages reserved
> [ 2400.152600] 0 pages hwpoisoned
> [ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
> [ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
> [ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
> [ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
> [ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
> [ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
> [ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
> [ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> 
> 
> 
> $ free
>               total        used        free      shared  buff/cache   available
> Mem:        3967928     1563132      310548      116936     2094248     2207584
> Swap:       8388604         332     8388272
> 
Hi Sergey

Thanks for your info.

Can you please schedule a run with the diff attached, in which
non-expensive allocators are allowed to burn more CPU cycles?

thanks
Hillf

--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		no_progress_loops /= 2;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
@ 2016-02-25  9:23       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
[...]
> Boot with mem=1G (or boot your usual way, and do something to occupy
> most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> way to gobble up most of the memory, though it's not how I've done it).
> 
> Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> kernel source tree into a tmpfs: size=2G is more than enough.
> make defconfig there, then make -j20.
> 
> On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> 
> Except that you'll probably need to fiddle around with that j20,
> it's true for my laptop but not for my workstation.  j20 just happens
> to be what I've had there for years, that I now see breaking down
> (I can lower to j6 to proceed, perhaps could go a bit higher,
> but it still doesn't exercise swap very much).
> 
> This OOM detection rework significantly lowers the number of jobs
> which can be run in parallel without being OOM-killed. 

This all smells like a premature OOM because of a high-order allocation
(order-2 for fork) which Tetsuo has seen already. Sergey Senozhatsky is
reporting order-2 OOMs as well. It is true that what we have in
mmotm right now is quite fragile if all order-N and higher free blocks
are completely depleted. That was the case for both Tetsuo and Sergey.
I have tried to mitigate this at least to some degree by
http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
with the full changelog) but I haven't heard back whether it helped,
so I haven't posted the official patch yet.

I also suspect that something is not quite right with compaction and
that it gives up too early even though we have quite a lot of reclaimable
pages. I do not have any numbers for that because I haven't had a load to
reproduce this problem yet. I will try your setup and see what I can do
about that. It would be great if you could give the patch below a try
and see if it helps.
---
>From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 4 Feb 2016 14:56:59 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) triggers
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and the OOM killer is just a
matter of time, but let's stick to our MAX_RECLAIM_RETRIES for high-order,
non-costly requests to make sure we do not fail prematurely.

This also means that we do not reset no_progress_loops in
__alloc_pages_slowpath for high-order allocations, to guarantee a bounded
number of retries.

Long term it would be much better to communicate with compaction
and retry only if compaction considers it meaningful.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f05aca36469b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs
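
A self-contained model of how the two hunks above combine, with
MAX_RECLAIM_RETRIES assumed to be 16; this is a sketch, not the actual
kernel code. The first hunk keeps retrying !costly high-order requests
when the order-aware watermark check fails, while the second hunk stops
resetting no_progress_loops for them, so the MAX_RECLAIM_RETRIES bound
still guarantees convergence to OOM:

#include <stdbool.h>

#define MAX_RECLAIM_RETRIES	16	/* assumed value, for illustration only */
#define PAGE_ALLOC_COSTLY_ORDER	3	/* as in the kernel */

/* retry decision, as modified by the first hunk */
static bool should_reclaim_retry_model(unsigned int order, int no_progress_loops,
				       bool watermark_ok)
{
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;			/* bounded: converge to OOM */
	if (watermark_ok)
		return true;
	/* first hunk: do all the retries for !costly high-order requests */
	return order && order <= PAGE_ALLOC_COSTLY_ORDER;
}

/* counter update in __alloc_pages_slowpath, as modified by the second hunk */
static int next_no_progress_loops(unsigned int order, bool did_some_progress,
				  int no_progress_loops)
{
	/* second hunk: only order-0 progress resets the counter */
	if (did_some_progress && !order)
		return 0;
	return no_progress_loops + 1;
}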

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  9:23       ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
[...]
> Boot with mem=1G (or boot your usual way, and do something to occupy
> most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> way to gobble up most of the memory, though it's not how I've done it).
> 
> Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> kernel source tree into a tmpfs: size=2G is more than enough.
> make defconfig there, then make -j20.
> 
> On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> 
> Except that you'll probably need to fiddle around with that j20,
> it's true for my laptop but not for my workstation.  j20 just happens
> to be what I've had there for years, that I now see breaking down
> (I can lower to j6 to proceed, perhaps could go a bit higher,
> but it still doesn't exercise swap very much).
> 
> This OOM detection rework significantly lowers the number of jobs
> which can be run in parallel without being OOM-killed. 

This all smells like a premature OOM because of a high-order allocation
(order-2 for fork) which Tetsuo has seen already. Sergey Senozhatsky is
reporting order-2 OOMs as well. It is true that what we have in
mmotm right now is quite fragile if all order-N and higher free blocks
are completely depleted. That was the case for both Tetsuo and Sergey.
I have tried to mitigate this at least to some degree by
http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
with the full changelog) but I haven't heard back whether it helped,
so I haven't posted the official patch yet.

I also suspect that something is not quite right with compaction and
that it gives up too early even though we have quite a lot of reclaimable
pages. I do not have any numbers for that because I haven't had a load to
reproduce this problem yet. I will try your setup and see what I can do
about that. It would be great if you could give the patch below a try
and see if it helps.
---

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:17         ` Hillf Danton
@ 2016-02-25  9:27           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:27 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

On Thu 25-02-16 17:17:45, Hillf Danton wrote:
[...]
> > OOM example:
> > 
> > [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,  oom_score_adj=0
[...]
> > [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
> > [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB

[...]
> > [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> > [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)  2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
> > [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =  5708kB
[...]
> Thanks for your info.
> 
> Can you please schedule a run for the diff attached, in which 
> non-expensive allocators are allowed to burn more CPU cycles.

I do not think your patch will help. As you can see, both OOMs were for
order-2 and there simply are no order-2+ free blocks usable for the
allocation request, so the watermark check will fail for all eligible
zones and no_progress_loops is simply ignored. This is what I've tried
to address by the patch I have just posted as a reply to Hugh's email:
http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
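
To illustrate, a self-contained model, not the kernel implementation: when
no zone has any free block of the requested order or higher, the order-aware
check fails everywhere and the retry decision is made before no_progress_loops
is ever consulted (MAX_ORDER assumed to be 11):

#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER	11	/* assumed, as on x86 */

struct zone_model {
	/* free blocks per order, as in the buddy free lists */
	unsigned long free_blocks[MAX_ORDER];
};

static bool zone_has_order(const struct zone_model *z, unsigned int order)
{
	for (unsigned int o = order; o < MAX_ORDER; o++)
		if (z->free_blocks[o])
			return true;
	return false;
}

static bool should_retry_model(const struct zone_model *zones, size_t nr,
			       unsigned int order, int no_progress_loops)
{
	(void)no_progress_loops;	/* irrelevant once every zone fails below */

	for (size_t i = 0; i < nr; i++)
		if (zone_has_order(&zones[i], order))
			return true;	/* some order-2+ block exists: retry */
	return false;			/* all zones fail: go OOM */
}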

> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
> @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
>  	struct zone *zone;
>  	struct zoneref *z;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		no_progress_loops /= 2;
>  	/*
>  	 * Make sure we converge to OOM if we cannot make any progress
>  	 * several times in the row.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  9:27           ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:27 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

On Thu 25-02-16 17:17:45, Hillf Danton wrote:
[...]
> > OOM example:
> > 
> > [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,  oom_score_adj=0
[...]
> > [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
> > [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB

[...]
> > [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> > [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)  2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
> > [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =  5708kB
[...]
> Thanks for your info.
> 
> Can you please schedule a run for the diff attached, in which 
> non-expensive allocators are allowed to burn more CPU cycles.

I do not think your patch will help. As you can see, both OOMs were for
order-2 and there simply are no order-2+ free blocks usable for the
allocation request, so the watermark check will fail for all eligible
zones and no_progress_loops is simply ignored. This is what I've tried
to address by the patch I have just posted as a reply to Hugh's email:
http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz

> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
> @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
>  	struct zone *zone;
>  	struct zoneref *z;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		no_progress_loops /= 2;
>  	/*
>  	 * Make sure we converge to OOM if we cannot make any progress
>  	 * several times in the row.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:27           ` Michal Hocko
@ 2016-02-25  9:48             ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:48 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

> >
> > Can you please schedule a run for the diff attached, in which
> > non-expensive allocators are allowed to burn more CPU cycles.
> 
> I do not think your patch will help. As you can see, both OOMs were for
> order-2 and there simply are no order-2+ free blocks usable for the
> allocation request so the watermark check will fail for all eligible
> zones and no_progress_loops is simply ignored. This is what I've tried
> to address by patch I have just posted as a reply to Hugh's email
> http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> 
Hm, Mr. Swap can tell us more.

Hillf

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  9:48             ` Hillf Danton
  0 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:48 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

> >
> > Can you please schedule a run for the diff attached, in which
> > non-expensive allocators are allowed to burn more CPU cycles.
> 
> I do not think your patch will help. As you can see, both OOMs were for
> order-2 and there simply are no order-2+ free blocks usable for the
> allocation request so the watermark check will fail for all eligible
> zones and no_progress_loops is simply ignored. This is what I've tried
> to address by patch I have just posted as a reply to Hugh's email
> http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> 
Hm, Mr. Swap can tell us more.

Hillf


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:48             ` Hillf Danton
@ 2016-02-25 11:02               ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25 11:02 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Michal Hocko', 'Sergey Senozhatsky',
	'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On (02/25/16 17:48), Hillf Danton wrote:
> > > Can you please schedule a run for the diff attached, in which
> > > non-expensive allocators are allowed to burn more CPU cycles.
> > 
> > I do not think your patch will help. As you can see, both OOMs were for
> > order-2 and there simply are no order-2+ free blocks usable for the
> > allocation request so the watermark check will fail for all eligible
> > zones and no_progress_loops is simply ignored. This is what I've tried
> > to address by patch I have just posted as a reply to Hugh's email
> > http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> > 
> Hm, Mr. Swap can tell us more.


Hi,

After *preliminary testing*, both patches seem to work; at least I don't
see OOM kills and there are some swapouts.

Michal Hocko's
              total        used        free      shared  buff/cache   available
Mem:        3836880     2458020       35992      115984     1342868     1181484
Swap:       8388604        2008     8386596

              total        used        free      shared  buff/cache   available
Mem:        3836880     2459516       39616      115880     1337748     1180156
Swap:       8388604        2052     8386552

              total        used        free      shared  buff/cache   available
Mem:        3836880     2460584       33944      115880     1342352     1179004
Swap:       8388604        2132     8386472
...




Hillf Danton's
              total        used        free      shared  buff/cache   available
Mem:        3836880     1661000      554236      116448     1621644     1978872
Swap:       8388604        1548     8387056

              total        used        free      shared  buff/cache   available
Mem:        3836880     1660500      554740      116448     1621640     1979376
Swap:       8388604        1548     8387056

...


I'll do more tests tomorrow.


	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25 11:02               ` Sergey Senozhatsky
  0 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25 11:02 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Michal Hocko', 'Sergey Senozhatsky',
	'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On (02/25/16 17:48), Hillf Danton wrote:
> > > Can you please schedule a run for the diff attached, in which
> > > non-expensive allocators are allowed to burn more CPU cycles.
> > 
> > I do not think your patch will help. As you can see, both OOMs were for
> > order-2 and there simply are no order-2+ free blocks usable for the
> > allocation request so the watermark check will fail for all eligible
> > zones and no_progress_loops is simply ignored. This is what I've tried
> > to address by patch I have just posted as a reply to Hugh's email
> > http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> > 
> Hm, Mr. Swap can tell us more.


Hi,

After *preliminary testing*, both patches seem to work; at least I don't
see OOM kills and there are some swapouts.

Michal Hocko's
              total        used        free      shared  buff/cache   available
Mem:        3836880     2458020       35992      115984     1342868     1181484
Swap:       8388604        2008     8386596

              total        used        free      shared  buff/cache   available
Mem:        3836880     2459516       39616      115880     1337748     1180156
Swap:       8388604        2052     8386552

              total        used        free      shared  buff/cache   available
Mem:        3836880     2460584       33944      115880     1342352     1179004
Swap:       8388604        2132     8386472
...




Hillf Danton's
              total        used        free      shared  buff/cache   available
Mem:        3836880     1661000      554236      116448     1621644     1978872
Swap:       8388604        1548     8387056

              total        used        free      shared  buff/cache   available
Mem:        3836880     1660500      554740      116448     1621640     1979376
Swap:       8388604        1548     8387056

...


I'll do more tests tomorrow.


	-ss


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:23       ` Michal Hocko
@ 2016-02-26  6:32         ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-02-26  6:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Thu, 25 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> > 
> > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> > 
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > 
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation.  j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
> > 
> > This OOM detection rework significantly lowers the number of jobs
> > which can be run in parallel without being OOM-killed. 
> 
> This all smells like pre mature OOM because of a high order allocation
> (order-2 for fork) which Tetuo has seen already. Sergey Senozhatsky is

You're absolutely right, and I'm ashamed not to have noticed that, nor
your comments and patch earlier in this thread, before bothering you.
Order 2 they are.

> reporting order-2 OOMs as well. It is true that what we have in the
> mmomt right now is quite fragile if all order-N+ are completely
> depleted. That was the case for both Tetsuo and Sergey. I have tried to
> mitigate this at least to some degree by
> http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
> with the full changelog) but I haven't heard back whether it helped
> so I haven't posted the official patch yet.
> 
> I also suspect that something is not quite right with the compaction and
> it gives up too early even though we have quite a lot reclaimable pages.
> I do not have any numbers for that because I didn't have a load to
> reproduce this problem yet. I will try your setup and see what I can do

Thanks a lot.

> about that. It would be great if you could give the patch below a try
> and see if it helps.
> ---
> From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 4 Feb 2016 14:56:59 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and the OOM killer is just a
> matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> order and not costly requests to make sure we do not fail prematurely.
> 
> This also means that we do not reset no_progress_loops at the
> __alloc_pages_slowpath for high order allocations to guarantee a bounded
> number of retries.
> 
> Longterm it would be much better to communicate with the compaction
> and retry only if the compaction considers it meaningfull.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

It didn't really help, I'm afraid: it reduces the actual number of OOM
kills which occur before the job is terminated, but doesn't stop the
job from being terminated very soon.

I also tried Hillf's patch (separately), but as you expected,
it didn't seem to make any difference.

(I haven't tried on the PowerMac G5 yet, since that's busy with
other testing; but expect that to tell the same story.)

Hugh

> ---
>  mm/page_alloc.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f05aca36469b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		}
>  	}
>  
> +	/*
> +	 * OK, so the watermak check has failed. Make sure we do all the
> +	 * retries for !costly high order requests and hope that multiple
> +	 * runs of compaction will generate some high order ones for us.
> +	 *
> +	 * XXX: ideally we should teach the compaction to try _really_ hard
> +	 * if we are in the retry path - something like priority 0 for the
> +	 * reclaim
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto noretry;
>  
>  	/*
> -	 * Costly allocations might have made a progress but this doesn't mean
> -	 * their order will become available due to high fragmentation so do
> -	 * not reset the no progress counter for them
> +	 * High order allocations might have made a progress but this doesn't
> +	 * mean their order will become available due to high fragmentation so
> +	 * do not reset the no progress counter for them
>  	 */
> -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> +	if (did_some_progress && !order)
>  		no_progress_loops = 0;
>  	else
>  		no_progress_loops++;
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-26  6:32         ` Hugh Dickins
  0 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-02-26  6:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Thu, 25 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> > 
> > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> > 
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > 
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation.  j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
> > 
> > This OOM detection rework significantly lowers the number of jobs
> > which can be run in parallel without being OOM-killed. 
> 
> This all smells like pre mature OOM because of a high order allocation
> (order-2 for fork) which Tetuo has seen already. Sergey Senozhatsky is

You're absolutely right, and I'm ashamed not to have noticed that, nor
your comments and patch earlier in this thread, before bothering you.
Order 2 they are.

> reporting order-2 OOMs as well. It is true that what we have in the
> mmomt right now is quite fragile if all order-N+ are completely
> depleted. That was the case for both Tetsuo and Sergey. I have tried to
> mitigate this at least to some degree by
> http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
> with the full changelog) but I haven't heard back whether it helped
> so I haven't posted the official patch yet.
> 
> I also suspect that something is not quite right with the compaction and
> it gives up too early even though we have quite a lot reclaimable pages.
> I do not have any numbers for that because I didn't have a load to
> reproduce this problem yet. I will try your setup and see what I can do

Thanks a lot.

> about that. It would be great if you could give the patch below a try
> and see if it helps.
> ---
> From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 4 Feb 2016 14:56:59 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and the OOM killer is just a
> matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> order and not costly requests to make sure we do not fail prematurely.
> 
> This also means that we do not reset no_progress_loops at the
> __alloc_pages_slowpath for high order allocations to guarantee a bounded
> number of retries.
> 
> Longterm it would be much better to communicate with the compaction
> and retry only if the compaction considers it meaningfull.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

It didn't really help, I'm afraid: it reduces the actual number of OOM
kills which occur before the job is terminated, but doesn't stop the
job from being terminated very soon.

I also tried Hillf's patch (separately), but as you expected,
it didn't seem to make any difference.

(I haven't tried on the PowerMac G5 yet, since that's busy with
other testing; but expect that to tell the same story.)

Hugh

> ---
>  mm/page_alloc.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f05aca36469b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		}
>  	}
>  
> +	/*
> +	 * OK, so the watermark check has failed. Make sure we do all the
> +	 * retries for !costly high order requests and hope that multiple
> +	 * runs of compaction will generate some high order ones for us.
> +	 *
> +	 * XXX: ideally we should teach the compaction to try _really_ hard
> +	 * if we are in the retry path - something like priority 0 for the
> +	 * reclaim
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto noretry;
>  
>  	/*
> -	 * Costly allocations might have made a progress but this doesn't mean
> -	 * their order will become available due to high fragmentation so do
> -	 * not reset the no progress counter for them
> +	 * High order allocations might have made a progress but this doesn't
> +	 * mean their order will become available due to high fragmentation so
> +	 * do not reset the no progress counter for them
>  	 */
> -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> +	if (did_some_progress && !order)
>  		no_progress_loops = 0;
>  	else
>  		no_progress_loops++;
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  6:32         ` Hugh Dickins
@ 2016-02-26  7:54           ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-26  7:54 UTC (permalink / raw)
  To: 'Hugh Dickins', 'Michal Hocko'
  Cc: 'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

> 
> It didn't really help, I'm afraid: it reduces the actual number of OOM
> kills which occur before the job is terminated, but doesn't stop the
> job from being terminated very soon.
> 
> I also tried Hillf's patch (separately) too, but as you expected,
> it didn't seem to make any difference.
> 
Perhaps non-costly effectively means NOFAIL, as shown by folding the two
patches into one. Does that make any sense?

thanks
Hillf
--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  7:54           ` Hillf Danton
@ 2016-02-26  9:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26  9:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On Fri 26-02-16 15:54:19, Hillf Danton wrote:
> > 
> > It didn't really help, I'm afraid: it reduces the actual number of OOM
> > kills which occur before the job is terminated, but doesn't stop the
> > job from being terminated very soon.
> > 
> > I also tried Hillf's patch (separately) too, but as you expected,
> > it didn't seem to make any difference.
> > 
> Perhaps non-costly effectively means NOFAIL, as shown by folding the two

nofail only means that the page allocator doesn't return NULL.
The OOM killer is still not put aside...
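
A minimal sketch of that distinction (illustration only; the helper names
below are made up and this is not the real allocator code):

	/*
	 * Illustration only: __GFP_NOFAIL changes what the caller sees
	 * (this loop never gives up and never returns NULL), but it does
	 * not prevent the OOM killer from being invoked while we retry.
	 */
	do {
		page = reclaim_compact_and_alloc(gfp_mask, order);
		if (!page)
			consider_oom_kill(gfp_mask, order); /* may still kill */
	} while (!page);
	return page;	/* never NULL for __GFP_NOFAIL callers */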

> patches into one. Does that make any sense?
> 
> thanks
> Hillf
> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
>  	struct zone *zone;
>  	struct zoneref *z;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;

This is defeating the whole purpose of the rework - to behave
deterministically. You have just disabled the oom killer completely.
This is not the way to go

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  6:32         ` Hugh Dickins
@ 2016-02-26  9:33           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26  9:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

On Thu 25-02-16 22:32:54, Hugh Dickins wrote:
> On Thu, 25 Feb 2016, Michal Hocko wrote:
[...]
> > From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 4 Feb 2016 14:56:59 +0100
> > Subject: [PATCH] mm, oom: protect !costly allocations some more
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and the OOM killer is just a
> > matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> > order and not costly requests to make sure we do not fail prematurely.
> > 
> > This also means that we do not reset no_progress_loops at the
> > __alloc_pages_slowpath for high order allocations to guarantee a bounded
> > number of retries.
> > 
> > Longterm it would be much better to communicate with the compaction
> > and retry only if the compaction considers it meaningful.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> It didn't really help, I'm afraid: it reduces the actual number of OOM
> kills which occur before the job is terminated, but doesn't stop the
> job from being terminated very soon.

Yeah this is not a magic bullet. I am happy to hear that the patch
actually helped to reduce the number of OOM kills, though, because that is
what it aims to do. I also believe that this supports (at least partially) my
suspicion that it is compaction which doesn't try hard enough.
order-0 reclaim, even when done repeatedly, doesn't have a great
chance to form higher order pages, especially when there is a lot of
migratable memory. I have already talked about this with Vlastimil and
he said that compaction can indeed back off too early because it doesn't
care about !costly requests much at all. We will have a look into this
more next week.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  9:24             ` Michal Hocko
@ 2016-02-26 10:27               ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-26 10:27 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

>> 
> > --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> > +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> > @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return true;
> 
> This is defeating the whole purpose of the rework - to behave
> deterministically. You have just disabled the oom killer completely.
> This is not the way to go
> 
Then in another direction, below is what I can do.

thanks
Hillf
--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Fri Feb 26 18:14:59 2016
@@ -3366,8 +3366,11 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress > 0, no_progress_loops)) {
+		/* Burn more cycles if any zone seems to satisfy our request */
+		no_progress_loops /= 2;
 		goto retry;
+	}
 
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
--

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26 10:27               ` Hillf Danton
@ 2016-02-26 13:49                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26 13:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On Fri 26-02-16 18:27:16, Hillf Danton wrote:
> >> 
> > > --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> > > +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> > > @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
> > >  	struct zone *zone;
> > >  	struct zoneref *z;
> > >
> > > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > 
> > This is defeating the whole purpose of the rework - to behave
> > deterministically. You have just disabled the oom killer completely.
> > This is not the way to go
> > 
> Then in another direction, below is what I can do.
> 
> thanks
> Hillf
> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Fri Feb 26 18:14:59 2016
> @@ -3366,8 +3366,11 @@ retry:
>  		no_progress_loops++;
>  
>  	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> -				 did_some_progress > 0, no_progress_loops))
> +				 did_some_progress > 0, no_progress_loops)) {
> +		/* Burn more cycles if any zone seems to satisfy our request */
> +		no_progress_loops /= 2;

No, I do not think this makes any sense. If we need more retry loops
then we can do it by increasing MAX_RECLAIM_RETRIES.
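
For reference, the retry budget in the series boils down to a constant
check in should_reclaim_retry(); a rough sketch (illustrative, not a
quoted hunk, assuming the value of 16 used by the series):

	#define MAX_RECLAIM_RETRIES 16

	/*
	 * Give up and fall through to the OOM path once we have looped
	 * this many times without making any progress.
	 */
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;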

>  		goto retry;
> +	}
>  
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
                       ` (2 preceding siblings ...)
  (?)
@ 2016-02-29 20:35     ` Michal Hocko
  2016-03-01  7:29         ` Hugh Dickins
  -1 siblings, 1 reply; 299+ messages in thread
From: Michal Hocko @ 2016-02-29 20:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1845 bytes --]

On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
[...]
> Boot with mem=1G (or boot your usual way, and do something to occupy
> most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> way to gobble up most of the memory, though it's not how I've done it).
> 
> Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> kernel source tree into a tmpfs: size=2G is more than enough.
> make defconfig there, then make -j20.
> 
> On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> 
> Except that you'll probably need to fiddle around with that j20,
> it's true for my laptop but not for my workstation.  j20 just happens
> to be what I've had there for years, that I now see breaking down
> (I can lower to j6 to proceed, perhaps could go a bit higher,
> but it still doesn't exercise swap very much).

I have tried to reproduce and failed in a virtual on my laptop. I
will try with another host with more CPUs (because my laptop has only
two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
(16, 10 no difference really). I was also collecting vmstat in the
background. The compilation takes ages but the behavior seems consistent
and stable.

If I try 900M for huge pages then I get OOMs but this happens with the
mmotm without my oom rework patch set as well.

It would be great if you could retry and collect /proc/vmstat data
around the OOM time to see what compaction did? (I was using the
attached little program to reduce interference during OOM: no forks, the
code locked in and the resulting file preallocated - e.g.
read_vmstat vmstat.log 1s 10M - and interrupt it by ctrl+c after the OOM
hits.)

Thanks!
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: read_vmstat.c --]
[-- Type: text/x-csrc, Size: 5025 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <time.h>

/*
 * A simple /proc/vmstat collector into a file. It tries hard to guarantee
 * that the content will get into the output file even under a strong memory
 * pressure.
 *
 * Usage
 * ./read_vmstat output_file timeout output_size
 *
 * output_file can be either a non-existing file or - for stdout
 * timeout - time period between two snapshots. s - seconds, ms - milliseconds
 * 	     and m - minutes suffix is allowed
 * output_size - size of the output file. The file is preallocated and pre-filled.
 *
 * If the output reaches the end of the file it will start over overwriting the oldest
 * data. Each snapshot is enclosed by header and footer.
 * =S timestamp
 * [...]
 * E=
 *
 * Please note that your ulimit has to be sufficient to allow mlocking the code +
 * all the buffers.
 *
 * This comes under GPL v2
 * Copyright: Michal Hocko <mhocko@suse.cz> 2015 
 */

#define NS_PER_MS (1000*1000)
#define NS_PER_SEC (1000*NS_PER_MS)

int open_file(const char *str)
{
	int fd;

	fd = open(str, O_CREAT|O_EXCL|O_RDWR, 0755);
	if (fd == -1) {
		perror("open output file");
		return -1;	/* caller (main) checks for -1 */
	}

	return fd;
}

int read_timeout(const char *str, struct timespec *timeout)
{
	char *end;
	unsigned long val;

	val = strtoul(str, &end, 10);
	if (val == ULONG_MAX) {
		perror("Invalid number");
		return 1;
	}
	switch(*end) {
		case 's':
			timeout->tv_sec = val;
			break;
		case 'm':
			/* ms vs minute*/
			if (*(end+1) == 's') {
				timeout->tv_sec = (val * NS_PER_MS) / NS_PER_SEC;
				timeout->tv_nsec = (val * NS_PER_MS) % NS_PER_SEC;
			} else {
				timeout->tv_sec = val*60;
			}
			break;
		default:
			fprintf(stderr, "Unknown number %s\n", str);
			return 1;
	}

	return 0;
}

size_t read_size(const char *str)
{
	char *end;
	size_t val = strtoul(str, &end, 10);

	switch (*end) {
		case 'K':
			val <<= 10;
			break;
		case 'M':
			val <<= 20;
			break;
		case 'G':
			val <<= 30;
			break;
	}

	return val;
}

size_t dump_str(char *buffer, size_t buffer_size, size_t pos, const char *in, size_t size)
{
	size_t i;
	for (i = 0; i < size; i++) {
		buffer[pos] = in[i];
		pos = (pos + 1) % buffer_size;
	}

	return pos;
}

/* buffer == NULL -> stdout */
int __collect_logs(const struct timespec *timeout, char *buffer, size_t buffer_size)
{
	char buff[4096]; /* dump to the file automatically */
	time_t before, after;
	int in_fd = open("/proc/vmstat", O_RDONLY);
	size_t out_pos = 0;
	size_t in_pos = 0;
	size_t size = 0;
	int estimate = 0;

	if (in_fd == -1) {
		perror("open vmstat:");
		return 1;
	}

	/* lock everything in */
	if (mlockall(MCL_CURRENT) == -1) {
		perror("mlockall. Continuing anyway");
	}

	while (1) {
		before = time(NULL);

		size = snprintf(buff, sizeof(buff), "=S %lu\n", before);
		lseek(in_fd, 0, SEEK_SET);
		size += read(in_fd, buff + size, sizeof(buff) - size);
		size += snprintf(buff + size, sizeof(buff) - size, "E=\n");
		if (buffer && !estimate) {
			printf("Estimated %zu entries fit into the buffer\n", buffer_size/size);
			estimate = 1;
		}

		/* Dump to stdout */
		if (!buffer) {
			printf("%s", buff);
		} else {
			size_t pos;
			pos = dump_str(buffer, buffer_size, out_pos, buff, size);
			if (pos < out_pos)
				fprintf(stderr, "%lu: Buffer wrapped\n", before);
			out_pos = pos;
		}

		after = time(NULL);

		if (after - before > 2) {
			fprintf(stderr, "%ld: Snapshotting took %ld seconds!!!\n", (long)before, (long)(after - before));
		}
		if (nanosleep(timeout, NULL) == -1)
			if (errno == EINTR)
				return 0;
		/* kick in the flushing */
		if (buffer)
			msync(buffer, buffer_size, MS_ASYNC);
	}
}

int collect_logs(int fd, const struct timespec *timeout, size_t buffer_size)
{
	char *buffer = NULL;	/* matches the char * expected by __collect_logs() */

	if (fd != -1) {
		if (ftruncate(fd, buffer_size) == -1) {
			perror("ftruncate");
			return 1;
		}

		if (fallocate(fd, 0, 0, buffer_size) && errno != EOPNOTSUPP) {
			perror("fallocate");
			return 1;
		}

		/* commit it to the disk */
		sync();

		buffer = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_POPULATE, fd, 0);
		if (buffer == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	return __collect_logs(timeout, buffer, buffer_size);
}

int main(int argc, char **argv)
{
	struct timespec timeout = {.tv_sec = 1};
	int fd = -1;
	size_t buffer_size = 10UL<<20;

	if (argc > 1) {
		/* output file */
		if (strcmp(argv[1], "-")) {
			fd = open_file(argv[1]);
			if (fd == -1)
				return 1;
		}

		/* timeout */
		if (argc > 2) {
			if (read_timeout(argv[2], &timeout))
				return 1;

			/* buffer size */
			if (argc > 3) {
				buffer_size = read_size(argv[3]);
				if (buffer_size == -1UL)
					return 1;
			}
		}
	}
	printf("file:%s timeout:%lu.%lus buffer_size:%zu\n",
			(fd == -1)? "stdout" : argv[1],
			timeout.tv_sec, timeout.tv_nsec / NS_PER_MS,
			buffer_size);

	return collect_logs(fd, &timeout, buffer_size);
}

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:23       ` Michal Hocko
@ 2016-02-29 21:02         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-29 21:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

Andrew,
could you queue this one as well, please? This is more a band-aid than a
real solution, which I will be working on as soon as I am able to
reproduce the issue, but the patch should help to some degree at least.

On Thu 25-02-16 10:23:15, Michal Hocko wrote:
> From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 4 Feb 2016 14:56:59 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and the OOM killer is just a
> matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> order and not costly requests to make sure we do not fail prematurely.
> 
> This also means that we do not reset no_progress_loops at the
> __alloc_pages_slowpath for high order allocations to guarantee a bounded
> number of retries.
> 
> Longterm it would be much better to communicate with the compaction
> and retry only if the compaction considers it meaningful.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f05aca36469b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		}
>  	}
>  
> +	/*
> +	 * OK, so the watermark check has failed. Make sure we do all the
> +	 * retries for !costly high order requests and hope that multiple
> +	 * runs of compaction will generate some high order ones for us.
> +	 *
> +	 * XXX: ideally we should teach the compaction to try _really_ hard
> +	 * if we are in the retry path - something like priority 0 for the
> +	 * reclaim
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto noretry;
>  
>  	/*
> -	 * Costly allocations might have made a progress but this doesn't mean
> -	 * their order will become available due to high fragmentation so do
> -	 * not reset the no progress counter for them
> +	 * High order allocations might have made a progress but this doesn't
> +	 * mean their order will become available due to high fragmentation so
> +	 * do not reset the no progress counter for them
>  	 */
> -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> +	if (did_some_progress && !order)
>  		no_progress_loops = 0;
>  	else
>  		no_progress_loops++;
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-29 20:35     ` [PATCH 0/3] OOM detection rework v4 Michal Hocko
@ 2016-03-01  7:29         ` Hugh Dickins
  0 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-01  7:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML

On Mon, 29 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> > 
> > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> > 
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > 
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation.  j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
> 
> I have tried to reproduce and failed in a virtual on my laptop. I
> will try with another host with more CPUs (because my laptop has only
> two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> (16, 10 no difference really). I was also collecting vmstat in the
> background. The compilation takes ages but the behavior seems consistent
> and stable.

Thanks a lot for giving it a go.

I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
are less than 2M big??  But probably that's just a misunderstanding
or typo somewhere.

Ignoring that, you're successfully doing a make -j20 defconfig build
in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
at all surprised that it takes ages, but I am very surprised that it
does not OOM.  I suppose by rights it ought not to OOM, the built
tree occupies only a little more than 1G, so you do have enough swap;
but I wouldn't get anywhere near that myself without OOMing - I give
myself 1G of RAM (well, minus whatever the booted system takes up)
to do that build in, four times your RAM, yet in my case it OOMs.

That source tree alone occupies more than 700M, so just copying it
into your tmpfs would take a long time.  I'd expect a build in 224M
RAM plus 2G of swap to take so long, that I'd be very grateful to be
OOM killed, even if there is technically enough space.  Unless
perhaps it's some superfast swap that you have?

I was only suggesting to allocate hugetlb pages, if you preferred
not to reboot with artificially reduced RAM.  Not an issue if you're
booting VMs.

It's true that my testing has been done on the physical machines,
no virtualization involved: I expect that accounts for some difference
between us, but as much difference as we're seeing?  That's strange.

> 
> If I try 900M for huge pages then I get OOMs but this happens with the
> mmotm without my oom rework patch set as well.

Right, not at all surprising.

> 
> It would be great if you could retry and collect /proc/vmstat data
> around the OOM time to see what compaction did? (I was using the
> attached little program to reduce interference during OOM: no forks, the
> code locked in and the resulting file preallocated - e.g.
> read_vmstat vmstat.log 1s 10M - and interrupt it by ctrl+c after the OOM
> hits.)
> 
> Thanks!

I'll give it a try, thanks, but not tonight.

Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01  7:29         ` Hugh Dickins
@ 2016-03-01 13:38           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 13:38 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > > 
> > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > > 
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > 
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation.  j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> > 
> > I have tried to reproduce and failed in a virtual on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
> 
> Thanks a lot for giving it a go.
> 
> I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big??  But probably that's just a misunderstanding
> or typo somewhere.

A typo. 445 was from 900M test which I was doing while writing the
email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -j20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM.  I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time. 

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M    /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     941904      66556          0       5092     806760
-/+ buffers/cache:     130052     878408
Swap:      2097148      42648    2054500
$ du -sh linux-3.7-rc5/
799M    linux-3.7-rc5/

Sorry about that but this is what my other tests were using and I forgot
to check. Now let's try the same with the current linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M    /mnt/tmpfs/linux-4.5-rc6/
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     962976      45484          0       7236     820064
-/+ buffers/cache:     135676     872784
Swap:      2097148         16    2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]

real    8m36.621s
user    14m1.642s
sys     2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space.  Unless
> perhaps it's some superfast swap that you have?

the swap partition is a standard qcow image stored on my SSD disk. So
I guess the IO should be quite fast. This smells like a potential
contributor because my reclaim seems to be much faster and that should
lead to a more efficient reclaim (in the scanned/reclaimed sense).
I realize I might be boring already when blaming compaction but let me
try again ;)
$ grep compact /proc/vmstat 
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done the direct compaction only 128 times during
that test. This doesn't sound much to me
$ grep allocstall /proc/vmstat
allocstall 1061

we entered the direct reclaim much more but most of the load will be
order-0 so this might be still ok. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
 	switch (compact_result) {
 	case COMPACT_DEFERRED:
 		*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
             gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
             gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages
so I have interrupted that after one hour (there was still no OOM). The
trace log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat 
compact_stall 190

so the compaction was still ignored more than actually invoked for
!costly allocations:
sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
    190 2 1
    122 2 3
     59 2 4

#define COMPACT_SKIPPED         1               
#define COMPACT_PARTIAL         3
#define COMPACT_COMPLETE        4

that means that compaction is not even tried in half of the cases! This
doesn't sound right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
then we simply rely on the order-0 reclaim to automagically form higher
order blocks. This might indeed work when we retry many times but I guess
this is not a good approach. It leads to excessive reclaim and the stall
for allocation can be really large.

One of the suspicious places is __compaction_suitable which does an order-0
watermark check (increased by 2<<order). I have put another trace_printk
there and it clearly pointed out this was the case.
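
The exact debugging hunk is not shown here, but roughly it amounted to
something like the following (illustrative sketch, not the exact lines used):

	watermark += (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags)) {
		/* the order-0 watermark raised by 2<<order was not met */
		trace_printk("order:%d order-0 watermark not met -> COMPACT_SKIPPED\n",
			     order);
		return COMPACT_SKIPPED;
	}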

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real    8m46.626s
user    14m15.823s
sys     2m45.471s

the time increased but I haven't checked how stable the result is. 

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

this is worse because we have scanned more pages for migration but the
overall success rate was much smaller and the direct reclaim was invoked
more. I do not have a good theory for that and will play with this some
more. Maybe other changes are needed deeper in the compaction code.

I will play with this some more but I would be really interested to hear
whether this helped Hugh with his setup. Vlastimil, Joonsoo, does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM.  Not an issue if you're
> booting VMs.

Ohh, I see.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-03-01 13:38           ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 13:38 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > > 
> > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > > 
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > 
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation.  j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> > 
> > I have tried to reproduce and failed in a virtual on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
> 
> Thanks a lot for giving it a go.
> 
> I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big??  But probably that's just a misunderstanding
> or typo somewhere.

A typo. 445 was from 900M test which I was doing while writing the
email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -j20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM.  I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time. 

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M    /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     941904      66556          0       5092     806760
-/+ buffers/cache:     130052     878408
Swap:      2097148      42648    2054500
$ du -sh linux-3.7-rc5/
799M    linux-3.7-rc5/

Sorry about that, but this is what my other tests were using and I forgot
to check. Now let's try the same with the current Linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M    /mnt/tmpfs/linux-4.5-rc6/
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     962976      45484          0       7236     820064
-/+ buffers/cache:     135676     872784
Swap:      2097148         16    2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function 'acpi_data_prop_read':
drivers/acpi/property.c:745:8: warning: 'obj' may be used uninitialized in this function [-Wmaybe-uninitialized]

real    8m36.621s
user    14m1.642s
sys     2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space.  Unless
> perhaps it's some superfast swap that you have?

The swap partition is a standard qcow image stored on my SSD, so I guess
the IO should be quite fast. This smells like a potential contributor
because my reclaim seems to be much faster, which should lead to a more
efficient reclaim (in the scanned/reclaimed sense).
I realize I might be getting boring by blaming compaction again, but let
me try once more ;)
$ grep compact /proc/vmstat 
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done direct compaction only 128 times during
that test. That doesn't sound like much to me:
$ grep allocstall /proc/vmstat
allocstall 1061

We entered direct reclaim much more often, but most of the load will be
order-0, so this might still be OK. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
 	switch (compact_result) {
 	case COMPACT_DEFERRED:
 		*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
             gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
             gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages
so I interrupted it after one hour (there was still no OOM). The
trace log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat 
compact_stall 190

so compaction was still skipped more often than actually invoked for
!costly allocations:
sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
    190 2 1
    122 2 3
     59 2 4

#define COMPACT_SKIPPED         1               
#define COMPACT_PARTIAL         3
#define COMPACT_COMPLETE        4

That means that compaction is not even tried in half of the cases! This
doesn't sound right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
then we simply rely on the order-0 reclaim to automagically form higher
order blocks. This might indeed work when we retry many times, but I do
not think it is a good approach. It leads to excessive reclaim and the
stall for the allocation can be really large.

One of the suspicious places is __compaction_suitable which does an order-0
watermark check (increased by 2<<order). I have put another trace_printk
there and it clearly confirmed this was the case.
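
For reference, the check in question looks roughly like this (paraphrased
from mm/compaction.c of that era, so the exact form may differ slightly):

	/* watermark starts out as low_wmark_pages(zone) here */
	watermark += (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;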

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real    8m46.626s
user    14m15.823s
sys     2m45.471s

the time increased but I haven't checked how stable the result is. 

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

This is worse because we have scanned more pages for migration but the
overall success rate was much lower and direct reclaim was invoked more
often. I do not have a good theory for that yet and will play with this
some more. Maybe other changes are needed deeper in the compaction code.

I will play with this some more, but I would be really interested to hear
whether this helped Hugh with his setup. Vlastimil, Joonsoo, does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM.  Not an issue if you're
> booting VMs.

Ohh, I see.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-01 14:40             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 14:40 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On Tue 01-03-16 14:38:46, Michal Hocko wrote:
[...]
> the time increased but I haven't checked how stable the result is. 

And those results vary a lot (even when executed from a fresh boot)
as per my further testing. Sure, it might be related to the virtual
environment, but I do not think this particular test should be used for
performance regression comparisons.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-01 18:14             ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-01 18:14 UTC (permalink / raw)
  To: Michal Hocko, Hugh Dickins, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On 03/01/2016 02:38 PM, Michal Hocko wrote:
> $ grep compact /proc/vmstat
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
>
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
>
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   						mode, contended_compaction);
>   	current->flags &= ~PF_MEMALLOC;
>
> +	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
>   	switch (compact_result) {
>   	case COMPACT_DEFERRED:
>   		*deferred_compaction = true;
>
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
>               gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>               gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
>
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
>
> $ grep compact_stall /proc/vmstat
> compact_stall 190
>
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
>      190 2 1
>      122 2 3
>       59 2 4
>
> #define COMPACT_SKIPPED         1
> #define COMPACT_PARTIAL         3
> #define COMPACT_COMPLETE        4
>
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
>
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.

Yes, compaction has historically been quite careful to avoid making low memory
conditions worse, and to avoid doing work if it doesn't look like it can
ultimately satisfy the allocation (so not having enough base pages means that
compacting them is considered pointless). This aspect of preventing
non-zero-order OOMs is somewhat unexpected :)

> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
>   								alloc_flags))
>   		return COMPACT_PARTIAL;
>
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return COMPACT_CONTINUE;
> +
>   	/*
>   	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
>   	 * This is because during migration, copies of pages need to be
>
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
>
> real    8m46.626s
> user    14m15.823s
> sys     2m45.471s
>
> the time increased but I haven't checked how stable the result is.
>
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1
>
> $ grep allocstall /proc/vmstat
> allocstall 1665
>
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.

I was under the impression that similar checks to compaction_suitable() were
also done in compact_finished(), to stop compacting if memory got low due to
parallel activity. But I guess that was a patch from Joonsoo that didn't get merged.

My only other theory so far is that watermark checks fail in
__isolate_free_page() when we want to grab page(s) as migration targets. I would
suggest enabling all the compaction tracepoints and the migration tracepoint.
Looking at the trace should get us further, faster, than going one trace_printk()
per attempt.

Once we learn all the relevant places/checks, we can think about how to
communicate to them that this compaction attempt is "important" and should
continue as long as possible even in low-memory conditions. Maybe not just a
costly-order check; we also have alloc_flags, or we could add something to
compact_control, etc.
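
As a purely illustrative sketch (not an actual patch; the field name and the
plumbing are made up here), such a hint could be carried in compact_control:

	struct compact_control {
		/* ... existing fields ... */
		bool important;	/* hypothetical: !costly retry, do not bail out early */
	};

	/* the various low-memory bail-outs could then do something like: */
	if (cc->important)
		return COMPACT_CONTINUE;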

> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?
>
>> I was only suggesting to allocate hugetlb pages, if you preferred
>> not to reboot with artificially reduced RAM.  Not an issue if you're
>> booting VMs.
>
> Ohh, I see.
>
>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-29 21:02         ` Michal Hocko
@ 2016-03-02  2:19           ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> Andrew,
> could you queue this one as well, please? This is more a band aid than a
> real solution which I will be working on as soon as I am able to
> reproduce the issue but the patch should help to some degree at least.

I'm not sure that this is the way to go. See below.

> 
> On Thu 25-02-16 10:23:15, Michal Hocko wrote:
> > From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 4 Feb 2016 14:56:59 +0100
> > Subject: [PATCH] mm, oom: protect !costly allocations some more
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and the OOM killer is just a
> > matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> > order and not costly requests to make sure we do not fail prematurely.
> > 
> > This also means that we do not reset no_progress_loops at the
> > __alloc_pages_slowpath for high order allocations to guarantee a bounded
> > number of retries.
> > 
> > Longterm it would be much better to communicate with the compaction
> > and retry only if the compaction considers it meaningful.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 269a04f20927..f05aca36469b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * OK, so the watermak check has failed. Make sure we do all the
> > +	 * retries for !costly high order requests and hope that multiple
> > +	 * runs of compaction will generate some high order ones for us.
> > +	 *
> > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > +	 * if we are in the retry path - something like priority 0 for the
> > +	 * reclaim
> > +	 */
> > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return true;
> > +
> >  	return false;

This does not seem like a proper fix. Checking the watermark with a high
order also means checking whether a high-order page exists or not, and
that isn't what we want here. So the following fix is needed.

The 'if (order)' check isn't strictly needed; it is only there to clarify
the meaning of this fix. You can remove it.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894..8c80375 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                return false;
 
+       /* To check whether compaction is available or not */
+       if (order)
+               order = 0;
+
        /*
         * Keep reclaiming pages while there is a chance this will lead
         * somewhere.  If none of the target zones can satisfy our allocation

> >  }
> >  
> > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto noretry;
> >  
> >  	/*
> > -	 * Costly allocations might have made a progress but this doesn't mean
> > -	 * their order will become available due to high fragmentation so do
> > -	 * not reset the no progress counter for them
> > +	 * High order allocations might have made a progress but this doesn't
> > +	 * mean their order will become available due to high fragmentation so
> > +	 * do not reset the no progress counter for them
> >  	 */
> > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > +	if (did_some_progress && !order)
> >  		no_progress_loops = 0;
> >  	else
> >  		no_progress_loops++;

This unconditionally increases no_progress_loops for high order
allocations, so after 16 iterations it will fail. If compaction isn't
enabled in Kconfig, 16 reclaim attempts may not be sufficient to form a
high order page. Should we consider this case as well?
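
For context, the 16 above comes from the retry bound introduced by the
series, roughly (paraphrasing patch #1, so details may differ slightly):

	/*
	 * Maximum number of reclaim retries without progress before the
	 * allocator gives up and declares OOM.
	 */
	#define MAX_RECLAIM_RETRIES 16

	/* in should_reclaim_retry(): */
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;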

Thanks.

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-02  2:28             ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space.  Unless
> > perhaps it's some superfast swap that you have?
> 
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).

Hmm... this looks like one potential culprit. If a page is under
writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
In that case the page acts as a pinned page and prevents compaction.
It'd be worth checking whether changing 'migration_mode = MIGRATE_SYNC' at
'no_progress_loops > XXX' helps in this situation.
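
A minimal sketch of that experiment (hypothetical; XXX stays a placeholder
threshold, and the exact spot in __alloc_pages_slowpath() may differ):

	/* before retrying __alloc_pages_direct_compact() */
	if (no_progress_loops > XXX)
		migration_mode = MIGRATE_SYNC;
	else
		migration_mode = MIGRATE_SYNC_LIGHT;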

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 18:14             ` Vlastimil Babka
@ 2016-03-02  2:55               ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> On 03/01/2016 02:38 PM, Michal Hocko wrote:
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 113983
> >compact_free_scanned 1433503
> >compact_isolated 134307
> >compact_stall 128
> >compact_fail 26
> >compact_success 102
> >compact_kcompatd_wake 0
> >
> >So the whole load has done the direct compaction only 128 times during
> >that test. This doesn't sound much to me
> >$ grep allocstall /proc/vmstat
> >allocstall 1061
> >
> >we entered the direct reclaim much more but most of the load will be
> >order-0 so this might be still ok. So I've tried the following:
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 1993894b4219..107d444afdb1 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  						mode, contended_compaction);
> >  	current->flags &= ~PF_MEMALLOC;
> >
> >+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> >+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> >+
> >  	switch (compact_result) {
> >  	case COMPACT_DEFERRED:
> >  		*deferred_compaction = true;
> >
> >And the result was:
> >$ cat /debug/tracing/trace_pipe | tee ~/trace.log
> >              gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >              gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >
> >this shows that order-2 memory pressure is not overly high in my
> >setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> >
> >So I went back to 800M of hugetlb pages and tried again. It took ages
> >so I have interrupted that after one hour (there was still no OOM). The
> >trace log is quite interesting regardless:
> >$ wc -l ~/trace.log
> >371 /root/trace.log
> >
> >$ grep compact_stall /proc/vmstat
> >compact_stall 190
> >
> >so the compaction was still ignored more than actually invoked for
> >!costly allocations:
> >sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> >     190 2 1
> >     122 2 3
> >      59 2 4
> >
> >#define COMPACT_SKIPPED         1
> >#define COMPACT_PARTIAL         3
> >#define COMPACT_COMPLETE        4
> >
> >that means that compaction is even not tried in half cases! This
> >doesn't sounds right to me, especially when we are talking about
> ><= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> >then we simply rely on the order-0 reclaim to automagically form higher
> >blocks. This might indeed work when we retry many times but I guess this
> >is not a good approach. It leads to a excessive reclaim and the stall
> >for allocation can be really large.
> >
> >One of the suspicious places is __compaction_suitable which does order-0
> >watermark check (increased by 2<<order). I have put another trace_printk
> >there and it clearly pointed out this was the case.
> 
> Yes, compaction is historically quite careful to avoid making low
> memory conditions worse, and to prevent work if it doesn't look like
> it can ultimately succeed the allocation (so having not enough base
> pages means that compacting them is considered pointless). This
> aspect of preventing non-zero-order OOMs is somewhat unexpected :)

It's better not to assume that compaction will succeed all the time.
Compaction has some limitations, so it sometimes fails.
For example, in a lowmem situation it only scans small parts of memory,
and if that part is fragmented by non-movable pages, compaction will fail.
And compaction will defer requests up to 64 times at maximum if successive
compaction failures have happened before.
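
The deferral logic behind that 64 is roughly the following (paraphrased from
mm/compaction.c of that era):

	#define COMPACT_MAX_DEFER_SHIFT 6	/* at most 1 << 6 == 64 skipped attempts */

	/* returns true if compaction should be skipped this time */
	bool compaction_deferred(struct zone *zone, int order)
	{
		unsigned long defer_limit = 1UL << zone->compact_defer_shift;

		if (order < zone->compact_order_failed)
			return false;

		if (++zone->compact_considered > defer_limit)
			zone->compact_considered = defer_limit;

		return zone->compact_considered < defer_limit;
	}

	/*
	 * defer_compaction() doubles the limit on each failure, capped
	 * at COMPACT_MAX_DEFER_SHIFT.
	 */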

Depending on compaction heavily is the right direction to go, but I think
it is not ready for that yet. More reclaim would relieve the problem.

I tried to fix this situation but not yet finished.

http://thread.gmane.org/gmane.linux.kernel.mm/142364
https://lkml.org/lkml/2015/8/23/182


> >So I have tried the following:
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index 4d99e1f5055c..7364e48cf69a 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> >  								alloc_flags))
> >  		return COMPACT_PARTIAL;
> >
> >+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> >+		return COMPACT_CONTINUE;
> >+
> >  	/*
> >  	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
> >  	 * This is because during migration, copies of pages need to be
> >
> >and retried the same test (without huge pages):
> >$ time make -j20 > /dev/null
> >
> >real    8m46.626s
> >user    14m15.823s
> >sys     2m45.471s
> >
> >the time increased but I haven't checked how stable the result is.
> >
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 139822
> >compact_free_scanned 1661642
> >compact_isolated 139407
> >compact_stall 129
> >compact_fail 58
> >compact_success 71
> >compact_kcompatd_wake 1
> >
> >$ grep allocstall /proc/vmstat
> >allocstall 1665
> >
> >this is worse because we have scanned more pages for migration but the
> >overall success rate was much smaller and the direct reclaim was invoked
> >more. I do not have a good theory for that and will play with this some
> >more. Maybe other changes are needed deeper in the compaction code.
> 
> I was under impression that similar checks to compaction_suitable()
> were done also in compact_finished(), to stop compacting if memory
> got low due to parallel activity. But I guess it was a patch from
> Joonsoo that didn't get merged.
> 
> My only other theory so far is that watermark checks fail in
> __isolate_free_page() when we want to grab page(s) as migration
> targets. I would suggest enabling all compaction tracepoint and the
> migration tracepoint. Looking at the trace could hopefully help
> faster than going one trace_printk() per attempt.

Agreed. It's the best thing to do now.

Thanks.

> 
> Once we learn all the relevant places/checks, we can think about how
> to communicate to them that this compaction attempt is "important"
> and should continue as long as possible even in low-memory
> conditions. Maybe not just a costly order check, but we also have
> alloc_flags or could add something to compact_control, etc.
> 
> >I will play with this some more but I would be really interested to hear
> >whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> >even make sense to you?
> >
> >>I was only suggesting to allocate hugetlb pages, if you preferred
> >>not to reboot with artificially reduced RAM.  Not an issue if you're
> >>booting VMs.
> >
> >Ohh, I see.
> >
> >
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:19           ` Joonsoo Kim
@ 2016-03-02  9:50             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02  9:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
[...]
> > > +	/*
> > > +	 * OK, so the watermark check has failed. Make sure we do all the
> > > +	 * retries for !costly high order requests and hope that multiple
> > > +	 * runs of compaction will generate some high order ones for us.
> > > +	 *
> > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > +	 * if we are in the retry path - something like priority 0 for the
> > > +	 * reclaim
> > > +	 */
> > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > > +
> > >  	return false;
> 
> This seems not a proper fix. Checking watermark with high order has
> another meaning that there is high order page or not. This isn't
> what we want here.

Why not? Why should we retry the reclaim if we do not have a >= order page
available? Reclaim itself doesn't guarantee that any of the freed pages will
form the requested order. The ordering on the LRU lists is pretty much
random wrt. pfn ordering. On the other hand, if we have a page available
which is just hidden by the watermarks, then it makes perfect sense to retry
and free even order-0 pages.
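
For reference, the high-order part of the watermark check being discussed is
roughly this (paraphrased and simplified from __zone_watermark_ok() of that
era; the real code also looks at the per-migratetype free lists):

	/* order > 0: check that at least one suitable >= order page is free */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];

		if (!area->nr_free)
			continue;
		/* a free page of at least the requested order exists */
		return true;
	}
	return false;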

> So, following fix is needed.

> 'if (order)' check isn't needed. It is used to clarify the meaning of
> this fix. You can remove it.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..8c80375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 return false;
>  
> +       /* To check whether compaction is available or not */
> +       if (order)
> +               order = 0;
> +

This would enforce the order-0 watermark check, which is IMHO not correct
as per the above.

>         /*
>          * Keep reclaiming pages while there is a chance this will lead
>          * somewhere.  If none of the target zones can satisfy our allocation
> 
> > >  }
> > >  
> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  		goto noretry;
> > >  
> > >  	/*
> > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > -	 * their order will become available due to high fragmentation so do
> > > -	 * not reset the no progress counter for them
> > > +	 * High order allocations might have made a progress but this doesn't
> > > +	 * mean their order will become available due to high fragmentation so
> > > +	 * do not reset the no progress counter for them
> > >  	 */
> > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +	if (did_some_progress && !order)
> > >  		no_progress_loops = 0;
> > >  	else
> > >  		no_progress_loops++;
> 
> This unconditionally increases no_progress_loops for high order
> allocation, so, after 16 iterations, it will fail. If compaction isn't
> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> to make high order page. Should we consider this case also?

How many retries would help? I do not think any number will work
reliably. Configurations without compaction enabled are asking for
problems by definition IMHO. Relying on order-0 reclaim for high order
allocations simply cannot work.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-03-02  9:50             ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02  9:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
[...]
> > > +	/*
> > > +	 * OK, so the watermak check has failed. Make sure we do all the
> > > +	 * retries for !costly high order requests and hope that multiple
> > > +	 * runs of compaction will generate some high order ones for us.
> > > +	 *
> > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > +	 * if we are in the retry path - something like priority 0 for the
> > > +	 * reclaim
> > > +	 */
> > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > > +
> > >  	return false;
> 
> This seems not a proper fix. Checking watermark with high order has
> another meaning that there is high order page or not. This isn't
> what we want here.

Why not? Why should we retry the reclaim if we do not have >=order page
available? Reclaim itself doesn't guarantee any of the freed pages will
form the requested order. The ordering on the LRU lists is pretty much
random wrt. pfn ordering. On the other hand if we have a page available
which is just hidden by watermarks then it makes perfect sense to retry
and free even order-0 pages.

> So, following fix is needed.

> 'if (order)' check isn't needed. It is used to clarify the meaning of
> this fix. You can remove it.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..8c80375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 return false;
>  
> +       /* To check whether compaction is available or not */
> +       if (order)
> +               order = 0;
> +

This would enforce the order 0 wmark check which is IMHO not correct as
per above.

>         /*
>          * Keep reclaiming pages while there is a chance this will lead
>          * somewhere.  If none of the target zones can satisfy our allocation
> 
> > >  }
> > >  
> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  		goto noretry;
> > >  
> > >  	/*
> > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > -	 * their order will become available due to high fragmentation so do
> > > -	 * not reset the no progress counter for them
> > > +	 * High order allocations might have made a progress but this doesn't
> > > +	 * mean their order will become available due to high fragmentation so
> > > +	 * do not reset the no progress counter for them
> > >  	 */
> > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +	if (did_some_progress && !order)
> > >  		no_progress_loops = 0;
> > >  	else
> > >  		no_progress_loops++;
> 
> This unconditionally increases no_progress_loops for high order
> allocation, so, after 16 iterations, it will fail. If compaction isn't
> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> to make high order page. Should we consider this case also?

How many retries would help? I do not think any number will work
reliably. Configurations without compaction enabled are asking for
problems by definition IMHO. Relying on order-0 reclaim for high order
allocations simply cannot work.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 18:14             ` Vlastimil Babka
  (?)
  (?)
@ 2016-03-02 12:24             ` Michal Hocko
  2016-03-02 13:00               ` Michal Hocko
  2016-03-02 13:22                 ` Vlastimil Babka
  -1 siblings, 2 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
> On 03/01/2016 02:38 PM, Michal Hocko wrote:
[...]
> >that means that compaction is even not tried in half cases! This
> >doesn't sounds right to me, especially when we are talking about
> ><= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> >then we simply rely on the order-0 reclaim to automagically form higher
> >blocks. This might indeed work when we retry many times but I guess this
> >is not a good approach. It leads to a excessive reclaim and the stall
> >for allocation can be really large.
> >
> >One of the suspicious places is __compaction_suitable which does order-0
> >watermark check (increased by 2<<order). I have put another trace_printk
> >there and it clearly pointed out this was the case.
> 
> Yes, compaction is historically quite careful to avoid making low memory
> conditions worse, and to prevent work if it doesn't look like it can
> ultimately succeed the allocation (so having not enough base pages means
> that compacting them is considered pointless).

The compaction is running in PF_MEMALLOC context so it shouldn't fail
the allocation. Moreover, the additional memory is only temporary, needed
until the migration finishes. Or am I missing something?
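
Just to illustrate what I mean, this is the rough shape of the call site
(simplified from __alloc_pages_direct_compact() in this kernel; argument
details trimmed):

	/* direct compaction runs with PF_MEMALLOC set ... */
	current->flags |= PF_MEMALLOC;
	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags,
					      ac, mode, contended_compaction);
	current->flags &= ~PF_MEMALLOC;

	/*
	 * ... so the pages compaction grabs as migration targets may dip
	 * below the watermarks, and that consumption is paid back as soon
	 * as the migrated source pages are freed.
	 */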

> This aspect of preventing non-zero-order OOMs is somewhat unexpected
> :)

I hope we can do something about it then...
 
[...]
> >this is worse because we have scanned more pages for migration but the
> >overall success rate was much smaller and the direct reclaim was invoked
> >more. I do not have a good theory for that and will play with this some
> >more. Maybe other changes are needed deeper in the compaction code.
> 
> I was under impression that similar checks to compaction_suitable() were
> done also in compact_finished(), to stop compacting if memory got low due to
> parallel activity. But I guess it was a patch from Joonsoo that didn't get
> merged.
> 
> My only other theory so far is that watermark checks fail in
> __isolate_free_page() when we want to grab page(s) as migration targets.

yes this certainly contributes to the problem and triggered in my case a
lot:
$ grep __isolate_free_page trace.log | wc -l
181
$ grep __alloc_pages_direct_compact: trace.log | wc -l
7

> I would suggest enabling all compaction tracepoint and the migration
> tracepoint. Looking at the trace could hopefully help faster than
> going one trace_printk() per attempt.

OK, here we go with both watermarks checks removed and hopefully all the
compaction related tracepoints enabled:
echo 1 > /debug/tracing/events/compaction/enable
echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

this was without the hugetlb handicap. See the trace log and vmstat
after the run attached.

Thanks
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: vmstat.log --]
[-- Type: text/plain, Size: 2333 bytes --]

nr_free_pages 151306
nr_alloc_batch 123
nr_inactive_anon 12815
nr_active_anon 44507
nr_inactive_file 1160
nr_active_file 5910
nr_unevictable 0
nr_mlock 0
nr_anon_pages 232
nr_mapped 1025
nr_file_pages 64246
nr_dirty 2
nr_writeback 0
nr_slab_reclaimable 12344
nr_slab_unreclaimable 21129
nr_page_table_pages 260
nr_kernel_stack 90
nr_unstable 0
nr_bounce 0
nr_vmscan_write 362270
nr_vmscan_immediate_reclaim 43
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 54592
nr_dirtied 5363
nr_written 364001
nr_pages_scanned 0
workingset_refault 16574
workingset_activate 9062
workingset_nodereclaim 640
nr_anon_transparent_hugepages 0
nr_free_cma 0
nr_dirty_threshold 31188
nr_dirty_background_threshold 15594
pgpgin 564127
pgpgout 1457932
pswpin 85569
pswpout 362180
pgalloc_dma 226916
pgalloc_dma32 21472873
pgalloc_normal 0
pgalloc_movable 0
pgfree 22057596
pgactivate 174766
pgdeactivate 919764
pgfault 23950701
pgmajfault 31819
pglazyfreed 0
pgrefill_dma 15589
pgrefill_dma32 999305
pgrefill_normal 0
pgrefill_movable 0
pgsteal_kswapd_dma 5339
pgsteal_kswapd_dma32 322951
pgsteal_kswapd_normal 0
pgsteal_kswapd_movable 0
pgsteal_direct_dma 334
pgsteal_direct_dma32 71877
pgsteal_direct_normal 0
pgsteal_direct_movable 0
pgscan_kswapd_dma 11213
pgscan_kswapd_dma32 653096
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 670
pgscan_direct_dma32 137488
pgscan_direct_normal 0
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 1920
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 351
kswapd_high_wmark_hit_quickly 13
pageoutrun 458
allocstall 1376
pgrotated 360480
drop_pagecache 0
drop_slab 0
pgmigrate_success 204875
pgmigrate_fail 169
compact_migrate_scanned 343087
compact_free_scanned 3597902
compact_isolated 412234
compact_stall 163
compact_fail 99
compact_success 64
compact_kcompatd_wake 2
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 1089
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1561
unevictable_pgs_mlocked 1561
unevictable_pgs_munlocked 1561
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 152
thp_fault_fallback 39
thp_collapse_alloc 69
thp_collapse_alloc_failed 11
thp_split_page 1
thp_split_page_failed 0
thp_deferred_split_page 212
thp_split_pmd 10
thp_zero_page_alloc 2
thp_zero_page_alloc_failed 1

[-- Attachment #3: trace.log.gz --]
[-- Type: application/gzip, Size: 472143 bytes --]

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:55               ` Joonsoo Kim
@ 2016-03-02 12:37                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
[...]
> > Yes, compaction is historically quite careful to avoid making low
> > memory conditions worse, and to prevent work if it doesn't look like
> > it can ultimately succeed the allocation (so having not enough base
> > pages means that compacting them is considered pointless). This
> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
> 
> It's better not to assume that compaction would succeed all the times.
> Compaction has some limitations so it sometimes fails.
> For example, in lowmem situation, it only scans small parts of memory
> and if that part is fragmented by non-movable page, compaction would fail.
> And, compaction would defer requests 64 times at maximum if successive
> compaction failure happens before.
> 
> Depending on compaction heavily is right direction to go but I think
> that it's not ready for now. More reclaim would relieve problem.

I really fail to see why. The reclaimable memory can be migrated as
well, no? Relying on the order-0 reclaim makes only sense to get over
wmarks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:28             ` Joonsoo Kim
@ 2016-03-02 12:39               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hugh Dickins, Vlastimil Babka, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 02-03-16 11:28:46, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > > I'd expect a build in 224M
> > > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > > OOM killed, even if there is technically enough space.  Unless
> > > perhaps it's some superfast swap that you have?
> > 
> > the swap partition is a standard qcow image stored on my SSD disk. So
> > I guess the IO should be quite fast. This smells like a potential
> > contributor because my reclaim seems to be much faster and that should
> > lead to a more efficient reclaim (in the scanned/reclaimed sense).
> 
> Hmm... This looks like one of potential culprit. If page is in
> writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
> In this case, this page works as pinned page and prevent compaction.
> It'd be better to check that changing 'migration_mode = MIGRATE_SYNC' at
> 'no_progress_loops > XXX' will help in this situation.

Would it make sense to use MIGRATE_SYNC for !costly allocations by
default?
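
For reference, the modes in question (from include/linux/migrate_mode.h,
comments paraphrased). MIGRATE_SYNC_LIGHT intentionally does not wait for
writeback, which is why a page under writeback acts as pinned for
compaction:

enum migrate_mode {
	MIGRATE_ASYNC,		/* never block */
	MIGRATE_SYNC_LIGHT,	/* may block, but not on writeback */
	MIGRATE_SYNC,		/* may block and wait for writeback to finish */
};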

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:24             ` Michal Hocko
@ 2016-03-02 13:00               ` Michal Hocko
  2016-03-02 13:22                 ` Vlastimil Babka
  1 sibling, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 13:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 3955 bytes --]

On Wed 02-03-16 13:24:10, Michal Hocko wrote:
> On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
[...]
> > I would suggest enabling all compaction tracepoint and the migration
> > tracepoint. Looking at the trace could hopefully help faster than
> > going one trace_printk() per attempt.
> 
> OK, here we go with both watermarks checks removed and hopefully all the
> compaction related tracepoints enabled:
> echo 1 > /debug/tracing/events/compaction/enable
> echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable
> 
> this was without the hugetlb handicap. See the trace log and vmstat
> after the run attached.

Just for reference, the above was with:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..50954a9a4433 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2245,7 +2245,6 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
-	unsigned long watermark;
 	struct zone *zone;
 	int mt;
 
@@ -2254,14 +2253,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
 
-	if (!is_migrate_isolate(mt)) {
-		/* Obey watermarks as if the page was being allocated */
-		watermark = low_wmark_pages(zone) + (1 << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
-			return 0;
-
+	if (!is_migrate_isolate(mt))
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
-	}
 
 	/* Remove page from free list */
 	list_del(&page->lru);

And I re-ran the same test with the clean mmotm tree; the results are
attached.

As we can see, there was less scanning on dma32 in both direct and kswapd
reclaim.
$ grep direct vmstat.*
vmstat.mmotm.log:pgsteal_direct_dma 420
vmstat.mmotm.log:pgsteal_direct_dma32 71234
vmstat.mmotm.log:pgsteal_direct_normal 0
vmstat.mmotm.log:pgsteal_direct_movable 0
vmstat.mmotm.log:pgscan_direct_dma 990
vmstat.mmotm.log:pgscan_direct_dma32 144376
vmstat.mmotm.log:pgscan_direct_normal 0
vmstat.mmotm.log:pgscan_direct_movable 0
vmstat.mmotm.log:pgscan_direct_throttle 0
vmstat.updated.log:pgsteal_direct_dma 334
vmstat.updated.log:pgsteal_direct_dma32 71877
vmstat.updated.log:pgsteal_direct_normal 0
vmstat.updated.log:pgsteal_direct_movable 0
vmstat.updated.log:pgscan_direct_dma 670
vmstat.updated.log:pgscan_direct_dma32 137488
vmstat.updated.log:pgscan_direct_normal 0
vmstat.updated.log:pgscan_direct_movable 0
vmstat.updated.log:pgscan_direct_throttle 0
$ grep kswapd vmstat.*
vmstat.mmotm.log:pgsteal_kswapd_dma 5602
vmstat.mmotm.log:pgsteal_kswapd_dma32 332336
vmstat.mmotm.log:pgsteal_kswapd_normal 0
vmstat.mmotm.log:pgsteal_kswapd_movable 0
vmstat.mmotm.log:pgscan_kswapd_dma 12187
vmstat.mmotm.log:pgscan_kswapd_dma32 679667
vmstat.mmotm.log:pgscan_kswapd_normal 0
vmstat.mmotm.log:pgscan_kswapd_movable 0
vmstat.mmotm.log:kswapd_inodesteal 0
vmstat.mmotm.log:kswapd_low_wmark_hit_quickly 339
vmstat.mmotm.log:kswapd_high_wmark_hit_quickly 10
vmstat.updated.log:pgsteal_kswapd_dma 5339
vmstat.updated.log:pgsteal_kswapd_dma32 322951
vmstat.updated.log:pgsteal_kswapd_normal 0
vmstat.updated.log:pgsteal_kswapd_movable 0
vmstat.updated.log:pgscan_kswapd_dma 11213
vmstat.updated.log:pgscan_kswapd_dma32 653096
vmstat.updated.log:pgscan_kswapd_normal 0
vmstat.updated.log:pgscan_kswapd_movable 0
vmstat.updated.log:kswapd_inodesteal 0
vmstat.updated.log:kswapd_low_wmark_hit_quickly 351
vmstat.updated.log:kswapd_high_wmark_hit_quickly 13
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: trace.mmotm.log.gz --]
[-- Type: application/gzip, Size: 512968 bytes --]

[-- Attachment #3: vmstat.mmotm.log --]
[-- Type: text/plain, Size: 2335 bytes --]

nr_free_pages 149226
nr_alloc_batch 114
nr_inactive_anon 13962
nr_active_anon 46754
nr_inactive_file 634
nr_active_file 5010
nr_unevictable 0
nr_mlock 0
nr_anon_pages 219
nr_mapped 793
nr_file_pages 66233
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 12355
nr_slab_unreclaimable 21208
nr_page_table_pages 320
nr_kernel_stack 92
nr_unstable 0
nr_bounce 0
nr_vmscan_write 358705
nr_vmscan_immediate_reclaim 111
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 58505
nr_dirtied 5516
nr_written 360677
nr_pages_scanned 0
workingset_refault 17291
workingset_activate 11908
workingset_nodereclaim 644
nr_anon_transparent_hugepages 0
nr_free_cma 0
nr_dirty_threshold 30487
nr_dirty_background_threshold 15243
pgpgin 525267
pgpgout 1444464
pswpin 75386
pswpout 358705
pgalloc_dma 241466
pgalloc_dma32 21491760
pgalloc_normal 0
pgalloc_movable 0
pgfree 22110844
pgactivate 204005
pgdeactivate 1033621
pgfault 23929641
pgmajfault 27748
pglazyfreed 0
pgrefill_dma 18759
pgrefill_dma32 1122090
pgrefill_normal 0
pgrefill_movable 0
pgsteal_kswapd_dma 5602
pgsteal_kswapd_dma32 332336
pgsteal_kswapd_normal 0
pgsteal_kswapd_movable 0
pgsteal_direct_dma 420
pgsteal_direct_dma32 71234
pgsteal_direct_normal 0
pgsteal_direct_movable 0
pgscan_kswapd_dma 12187
pgscan_kswapd_dma32 679667
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 990
pgscan_direct_dma32 144376
pgscan_direct_normal 0
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 2052
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 339
kswapd_high_wmark_hit_quickly 10
pageoutrun 448
allocstall 1376
pgrotated 357091
drop_pagecache 0
drop_slab 0
pgmigrate_success 227102
pgmigrate_fail 142
compact_migrate_scanned 374515
compact_free_scanned 4000566
compact_isolated 456131
compact_stall 133
compact_fail 73
compact_success 60
compact_kcompatd_wake 0
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 1087
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1530
unevictable_pgs_mlocked 1530
unevictable_pgs_munlocked 1529
unevictable_pgs_cleared 1
unevictable_pgs_stranded 0
thp_fault_alloc 164
thp_fault_fallback 26
thp_collapse_alloc 159
thp_collapse_alloc_failed 11
thp_split_page 0
thp_split_page_failed 0
thp_deferred_split_page 309
thp_split_pmd 7
thp_zero_page_alloc 3
thp_zero_page_alloc_failed 0

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:24             ` Michal Hocko
@ 2016-03-02 13:22                 ` Vlastimil Babka
  2016-03-02 13:22                 ` Vlastimil Babka
  1 sibling, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-02 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On 03/02/2016 01:24 PM, Michal Hocko wrote:
> On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
>>
>> I was under impression that similar checks to compaction_suitable() were
>> done also in compact_finished(), to stop compacting if memory got low due to
>> parallel activity. But I guess it was a patch from Joonsoo that didn't get
>> merged.
>>
>> My only other theory so far is that watermark checks fail in
>> __isolate_free_page() when we want to grab page(s) as migration targets.
>
> yes this certainly contributes to the problem and triggered in my case a
> lot:
> $ grep __isolate_free_page trace.log | wc -l
> 181
> $ grep __alloc_pages_direct_compact: trace.log | wc -l
> 7
>
>> I would suggest enabling all compaction tracepoint and the migration
>> tracepoint. Looking at the trace could hopefully help faster than
>> going one trace_printk() per attempt.
>
> OK, here we go with both watermarks checks removed and hopefully all the
> compaction related tracepoints enabled:
> echo 1 > /debug/tracing/events/compaction/enable
> echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

The trace shows only 4 direct compaction attempts with order=2. The rest 
is order=9, i.e. THP, which has little chance of success under such 
pressure, hence those failures and deferrals. The few order=2 attempts 
all appear successful (defer_reset is called).

So it seems your system is mostly fine with just reclaim, there's 
little need for order-2 compaction, and that's also why you can't 
reproduce the OOMs. So I'm afraid we'll learn nothing here, and it looks 
like Hugh will have to try those watermark check adjustments/removals 
and/or provide the same kind of trace.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  9:50             ` Michal Hocko
@ 2016-03-02 13:32               ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
>> > > + /*
>> > > +  * OK, so the watermak check has failed. Make sure we do all the
>> > > +  * retries for !costly high order requests and hope that multiple
>> > > +  * runs of compaction will generate some high order ones for us.
>> > > +  *
>> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> > > +  * if we are in the retry path - something like priority 0 for the
>> > > +  * reclaim
>> > > +  */
>> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > +         return true;
>> > > +
>> > >   return false;
>>
>> This seems not a proper fix. Checking watermark with high order has
>> another meaning that there is high order page or not. This isn't
>> what we want here.
>
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.

If we have >= order page available, we would not reach here. We would
just allocate it.

And, should_reclaim_retry() is not just for reclaim. It is also for
retrying compaction.

That watermark check is to check further reclaim/compaction
is meaningful. And, for high order case, if there is enough freepage,
compaction could make high order page even if there is no high order
page now.

Adding freeable memory and checking watermark with it doesn't help
in this case because number of high order page isn't changed with it.

I just did quick review to your patches so maybe I am wrong.
Am I missing something?

>> So, following fix is needed.
>
>> 'if (order)' check isn't needed. It is used to clarify the meaning of
>> this fix. You can remove it.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 1993894..8c80375 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>>                 return false;
>>
>> +       /* To check whether compaction is available or not */
>> +       if (order)
>> +               order = 0;
>> +
>
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
>
>>         /*
>>          * Keep reclaiming pages while there is a chance this will lead
>>          * somewhere.  If none of the target zones can satisfy our allocation
>>
>> > >  }
>> > >
>> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > >           goto noretry;
>> > >
>> > >   /*
>> > > -  * Costly allocations might have made a progress but this doesn't mean
>> > > -  * their order will become available due to high fragmentation so do
>> > > -  * not reset the no progress counter for them
>> > > +  * High order allocations might have made a progress but this doesn't
>> > > +  * mean their order will become available due to high fragmentation so
>> > > +  * do not reset the no progress counter for them
>> > >    */
>> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > + if (did_some_progress && !order)
>> > >           no_progress_loops = 0;
>> > >   else
>> > >           no_progress_loops++;
>>
>> This unconditionally increases no_progress_loops for high order
>> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> to make high order page. Should we consider this case also?
>
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

At least, reset no_progress_loops when did_some_progress. High
order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
as order 0. And, reclaim something would increase probability of
compaction success. Why do we limit retry as 16 times with no
evidence of potential impossibility of making high order page?

And, 16 retry looks not good to me because compaction could defer
actual doing up to 64 times.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 13:32               ` Joonsoo Kim
@ 2016-03-02 14:06                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 14:06 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> > [...]
> >> > > + /*
> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
> >> > > +  * retries for !costly high order requests and hope that multiple
> >> > > +  * runs of compaction will generate some high order ones for us.
> >> > > +  *
> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> >> > > +  * if we are in the retry path - something like priority 0 for the
> >> > > +  * reclaim
> >> > > +  */
> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > +         return true;
> >> > > +
> >> > >   return false;
> >>
> >> This seems not a proper fix. Checking watermark with high order has
> >> another meaning that there is high order page or not. This isn't
> >> what we want here.
> >
> > Why not? Why should we retry the reclaim if we do not have >=order page
> > available? Reclaim itself doesn't guarantee any of the freed pages will
> > form the requested order. The ordering on the LRU lists is pretty much
> > random wrt. pfn ordering. On the other hand if we have a page available
> > which is just hidden by watermarks then it makes perfect sense to retry
> > and free even order-0 pages.
> 
> If we have >= order page available, we would not reach here. We would
> just allocate it.

not really, we can still be under the low watermark. Note that the
target for the should_reclaim_retry watermark check includes also the
reclaimable memory.
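
Roughly, trimmed from should_reclaim_retry() as posted (so take the details
with a grain of salt):

	unsigned long available = zone_reclaimable_pages(zone);

	/* back off the target the longer we go without making progress */
	available -= DIV_ROUND_UP(no_progress_loops * available,
				  MAX_RECLAIM_RETRIES);
	available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

	/*
	 * We may well be under the low watermark right now and still pass
	 * this check, because "available" also counts what reclaim could
	 * still free for us.
	 */
	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
				ac->classzone_idx, alloc_flags, available))
		return true;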
 
> And, should_reclaim_retry() is not just for reclaim. It is also for
> retrying compaction.
> 
> That watermark check is to check further reclaim/compaction
> is meaningful. And, for high order case, if there is enough freepage,
> compaction could make high order page even if there is no high order
> page now.
> 
> Adding freeable memory and checking watermark with it doesn't help
> in this case because number of high order page isn't changed with it.
> 
> I just did quick review to your patches so maybe I am wrong.
> Am I missing something?

The core idea behind should_reclaim_retry is to check whether the
reclaiming all the pages would help to get over the watermark and there
is at least one >= order page. Then it really makes sense to retry. As
the compaction has already been performed before this is called we should
have created some high order pages already. The decay guarantees that we
eventually trigger the OOM killer after some attempts.

If the compaction can backoff and ignore our requests then we are
screwed of course and that should be addressed imho at the compaction
layer. Maybe we can tell the compaction to try harder but I would like
to understand why this shouldn't be a default behavior for !costly
orders.
 
[...]
> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > >           goto noretry;
> >> > >
> >> > >   /*
> >> > > -  * Costly allocations might have made a progress but this doesn't mean
> >> > > -  * their order will become available due to high fragmentation so do
> >> > > -  * not reset the no progress counter for them
> >> > > +  * High order allocations might have made a progress but this doesn't
> >> > > +  * mean their order will become available due to high fragmentation so
> >> > > +  * do not reset the no progress counter for them
> >> > >    */
> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > + if (did_some_progress && !order)
> >> > >           no_progress_loops = 0;
> >> > >   else
> >> > >           no_progress_loops++;
> >>
> >> This unconditionally increases no_progress_loops for high order
> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> >> to make high order page. Should we consider this case also?
> >
> > How many retries would help? I do not think any number will work
> > reliably. Configurations without compaction enabled are asking for
> > problems by definition IMHO. Relying on order-0 reclaim for high order
> > allocations simply cannot work.
> 
> At least, reset no_progress_loops when did_some_progress. High
> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> as order 0. And, reclaim something would increase probability of
> compaction success.

This is something I still do not understand. Why would reclaiming
random order-0 pages help compaction? Could you clarify this please?

> Why do we limit retry as 16 times with no evidence of potential
> impossibility of making high order page?

If we tried to compact 16 times without any progress then this sounds
like sufficient evidence to me. Well, this number is somewhat arbitrary,
but the main point is to limit it to _some_ number; if we can show that
a larger value would work better then we can update it of course.
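
For completeness, the limit we are talking about (roughly as in the series,
simplified):

#define MAX_RECLAIM_RETRIES 16

	/* in should_reclaim_retry() */
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;	/* give up, let the OOM path take over */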

> And, 16 retry looks not good to me because compaction could defer
> actual doing up to 64 times.

OK, this is something that needs to be handled in a better way. The
primary question would be why to defer the compaction for <=
PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
why it makes sense for the best effort mode of operation but !costly
orders should be trying much harder as they are nofail, no?
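
For reference, this is the deferral logic being referred to (simplified from
mm/compaction.c; compact_defer_shift grows with every failure up to
COMPACT_MAX_DEFER_SHIFT, which is where the "up to 64 times" comes from):

bool compaction_deferred(struct zone *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	if (order < zone->compact_order_failed)
		return false;

	/* avoid possible overflow */
	if (++zone->compact_considered > defer_limit)
		zone->compact_considered = defer_limit;

	/* still inside the skip window: this attempt is deferred */
	return zone->compact_considered < defer_limit;
}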

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:37                 ` Michal Hocko
@ 2016-03-02 14:06                   ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Vlastimil Babka, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	Linux Memory Management List, LKML

2016-03-02 21:37 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
>> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> [...]
>> > Yes, compaction is historically quite careful to avoid making low
>> > memory conditions worse, and to prevent work if it doesn't look like
>> > it can ultimately succeed the allocation (so having not enough base
>> > pages means that compacting them is considered pointless). This
>> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
>>
>> It's better not to assume that compaction would succeed all the times.
>> Compaction has some limitations so it sometimes fails.
>> For example, in lowmem situation, it only scans small parts of memory
>> and if that part is fragmented by non-movable page, compaction would fail.
>> And, compaction would defer requests 64 times at maximum if successive
>> compaction failure happens before.
>>
>> Depending on compaction heavily is right direction to go but I think
>> that it's not ready for now. More reclaim would relieve problem.
>
> I really fail to see why. The reclaimable memory can be migrated as
> well, no? Relying on the order-0 reclaim makes only sense to get over
> wmarks.

Attached link on previous reply mentioned limitation of current compaction
implementation. Briefly speaking, It would not scan all range of memory
due to algorithm limitation so even if there is reclaimable memory that
can be also migrated, compaction could fail.

There is no such limitation on reclaim and that's why I think that compaction
is not ready for now.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 14:06                 ` Michal Hocko
@ 2016-03-02 14:34                   ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> > [...]
>> >> > > + /*
>> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
>> >> > > +  * retries for !costly high order requests and hope that multiple
>> >> > > +  * runs of compaction will generate some high order ones for us.
>> >> > > +  *
>> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> >> > > +  * if we are in the retry path - something like priority 0 for the
>> >> > > +  * reclaim
>> >> > > +  */
>> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > +         return true;
>> >> > > +
>> >> > >   return false;
>> >>
>> >> This seems not a proper fix. Checking watermark with high order has
>> >> another meaning that there is high order page or not. This isn't
>> >> what we want here.
>> >
>> > Why not? Why should we retry the reclaim if we do not have >=order page
>> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> > form the requested order. The ordering on the LRU lists is pretty much
>> > random wrt. pfn ordering. On the other hand if we have a page available
>> > which is just hidden by watermarks then it makes perfect sense to retry
>> > and free even order-0 pages.
>>
>> If we have >= order page available, we would not reach here. We would
>> just allocate it.
>
> not really, we can still be under the low watermark. Note that the

you mean min watermark?

> target for the should_reclaim_retry watermark check includes also the
> reclaimable memory.

I guess that the usual case of high order allocation failure has enough free pages.

>> And, should_reclaim_retry() is not just for reclaim. It is also for
>> retrying compaction.
>>
>> That watermark check is to check further reclaim/compaction
>> is meaningful. And, for high order case, if there is enough freepage,
>> compaction could make high order page even if there is no high order
>> page now.
>>
>> Adding freeable memory and checking watermark with it doesn't help
>> in this case because number of high order page isn't changed with it.
>>
>> I just did quick review to your patches so maybe I am wrong.
>> Am I missing something?
>
> The core idea behind should_reclaim_retry is to check whether the
> reclaiming all the pages would help to get over the watermark and there
> is at least one >= order page. Then it really makes sense to retry. As

How can you judge that reclaiming all the pages would help, and that
there is at least one >= order page?

> the compaction has already been performed before this is called we should
> have created some high order pages already. The decay guarantees that we

Not really. Compaction could fail.

> eventually trigger the OOM killer after some attempts.

Yep.

> If the compaction can backoff and ignore our requests then we are
> screwed of course and that should be addressed imho at the compaction
> layer. Maybe we can tell the compaction to try harder but I would like
> to understand why this shouldn't be a default behavior for !costly
> orders.

Yes, I agree that.

> [...]
>> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >> > >           goto noretry;
>> >> > >
>> >> > >   /*
>> >> > > -  * Costly allocations might have made a progress but this doesn't mean
>> >> > > -  * their order will become available due to high fragmentation so do
>> >> > > -  * not reset the no progress counter for them
>> >> > > +  * High order allocations might have made a progress but this doesn't
>> >> > > +  * mean their order will become available due to high fragmentation so
>> >> > > +  * do not reset the no progress counter for them
>> >> > >    */
>> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > + if (did_some_progress && !order)
>> >> > >           no_progress_loops = 0;
>> >> > >   else
>> >> > >           no_progress_loops++;
>> >>
>> >> This unconditionally increases no_progress_loops for high order
>> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> >> to make high order page. Should we consider this case also?
>> >
>> > How many retries would help? I do not think any number will work
>> > reliably. Configurations without compaction enabled are asking for
>> > problems by definition IMHO. Relying on order-0 reclaim for high order
>> > allocations simply cannot work.
>>
>> At least, reset no_progress_loops when did_some_progress. High
>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> as order 0. And, reclaim something would increase probability of
>> compaction success.
>
> This is something I still do not understand. Why would reclaiming
> random order-0 pages help compaction? Could you clarify this please?

I can only give the short version here; please check the link I posted in
another reply. Compaction can scan a wider range of memory if we have
more free pages; this is a limitation of its algorithm. So, reclaiming
random order-0 pages does help compaction.
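To make that limitation concrete, here is a small standalone toy model
(an illustration only, not mm/compaction.c; it assumes every in-use page
the migrate scanner visits is movable and needs exactly one free target):

#include <stdio.h>

/*
 * zone[] uses 1 for a free page, 0 for an in-use page.  The migrate
 * scanner walks up from the zone start, the free scanner walks down
 * from the zone end, and compaction stops once they meet.  Return how
 * many pages the migrate scanner got to look at before that happened.
 */
static unsigned long pages_scanned_for_migration(const char *zone,
						 unsigned long nr_pages)
{
	unsigned long migrate = 0;		/* walks up from the start */
	unsigned long free = nr_pages;		/* walks down from the end */

	while (migrate < free) {
		migrate++;			/* scan one migration candidate */

		/* free scanner: walk down until it finds a free target */
		while (free > migrate && !zone[free - 1])
			free--;
		if (free <= migrate)
			break;			/* scanners met: compaction is done */
		free--;				/* consume the free target */
	}
	return migrate;
}

int main(void)
{
	/* same zone size, different amounts of free pages */
	char sparse[16] = { 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,1,1 };
	char dense[16]  = { 0,1,0,1,0,1,0,1, 0,1,0,1,0,1,1,1 };

	printf("few free pages:  migrate scanner covered %lu of 16 pages\n",
	       pages_scanned_for_migration(sparse, 16));
	printf("more free pages: migrate scanner covered %lu of 16 pages\n",
	       pages_scanned_for_migration(dense, 16));
	return 0;
}

With few free pages the free scanner has to dig deep toward the zone
start, so the scanners meet early and most of the zone is never even
considered for migration; with more free pages the covered range grows.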

>> Why do we limit retry as 16 times with no evidence of potential
>> impossibility of making high order page?
>
> If we tried to compact 16 times without any progress then this sounds
> like a sufficient evidence to me. Well, this number is somehow arbitrary
> but the main point is to limit it to _some_ number, if we can show that
> a larger value would work better then we can update it of course.

My argument is about your band aid patch. My point is: why is the retry
counter for order-0 reset when there is some progress, while the retry
counter for orders up to costly isn't reset even when there is some
progress?

And, 16 retries doesn't look good to me because compaction could defer
doing the actual work up to 64 times.
>
> OK, this is something that needs to be handled in a better way. The
> primary question would be why to defer the compaction for <=
> PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
> why it makes sense it for the best effort mode of operation but !costly
> orders should be trying much harder as they are nofail, no?

Makes sense.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  9:50             ` Michal Hocko
@ 2016-03-02 15:01               ` Minchan Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Minchan Kim @ 2016-03-02 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Sergey Senozhatsky

On Wed, Mar 02, 2016 at 10:50:56AM +0100, Michal Hocko wrote:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> > On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
> > > > +	/*
> > > > +	 * OK, so the watermak check has failed. Make sure we do all the
> > > > +	 * retries for !costly high order requests and hope that multiple
> > > > +	 * runs of compaction will generate some high order ones for us.
> > > > +	 *
> > > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > > +	 * if we are in the retry path - something like priority 0 for the
> > > > +	 * reclaim
> > > > +	 */
> > > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > +		return true;
> > > > +
> > > >  	return false;
> > 
> > This seems not a proper fix. Checking watermark with high order has
> > another meaning that there is high order page or not. This isn't
> > what we want here.
> 
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.
> 
> > So, following fix is needed.
> 
> > 'if (order)' check isn't needed. It is used to clarify the meaning of
> > this fix. You can remove it.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1993894..8c80375 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >                 return false;
> >  
> > +       /* To check whether compaction is available or not */
> > +       if (order)
> > +               order = 0;
> > +
> 
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
> 
> >         /*
> >          * Keep reclaiming pages while there is a chance this will lead
> >          * somewhere.  If none of the target zones can satisfy our allocation
> > 
> > > >  }
> > > >  
> > > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >  		goto noretry;
> > > >  
> > > >  	/*
> > > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > > -	 * their order will become available due to high fragmentation so do
> > > > -	 * not reset the no progress counter for them
> > > > +	 * High order allocations might have made a progress but this doesn't
> > > > +	 * mean their order will become available due to high fragmentation so
> > > > +	 * do not reset the no progress counter for them
> > > >  	 */
> > > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > +	if (did_some_progress && !order)
> > > >  		no_progress_loops = 0;
> > > >  	else
> > > >  		no_progress_loops++;
> > 
> > This unconditionally increases no_progress_loops for high order
> > allocation, so, after 16 iterations, it will fail. If compaction isn't
> > enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> > to make high order page. Should we consider this case also?
> 
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

I left the compaction code alone for a long time, so a super hero might
have made it perfect by now, but I don't think that dream has come true
yet. I believe any algorithm has drawbacks, so we end up relying on a
fallback approach for the cases where compaction doesn't work correctly.

My suggestion is to reintroduce *lumpy reclaim* and kick it in only when
compaction has given up for some reason. That would be better than
relying on some random number of reclaim retries.
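For readers who don't remember it, here is a rough standalone sketch of
the lumpy reclaim idea (illustrative only; try_to_reclaim_pfn is a
made-up stub, not a kernel function, and the real implementation lived
in mm/vmscan.c before it was removed):

#include <stdbool.h>

/* made-up stub standing in for "reclaim this page if possible" */
static bool try_to_reclaim_pfn(unsigned long pfn)
{
	(void)pfn;
	return true;
}

/*
 * Once reclaim has picked a victim page from the LRU, also go after the
 * other pages in the naturally aligned order-sized block around it, so
 * that a successful pass directly yields one high order free block
 * instead of scattered order-0 pages.
 */
static unsigned long reclaim_lumpy(unsigned long victim_pfn, unsigned int order)
{
	unsigned long start = victim_pfn & ~((1UL << order) - 1);
	unsigned long pfn, freed = 0;

	for (pfn = start; pfn < start + (1UL << order); pfn++)
		if (try_to_reclaim_pfn(pfn))
			freed++;

	return freed;	/* equals 1 << order only if the whole block came free */
}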

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 14:34                   ` Joonsoo Kim
@ 2016-03-03  9:26                     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-03  9:26 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> 2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> >> > [...]
> >> >> > > + /*
> >> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
> >> >> > > +  * retries for !costly high order requests and hope that multiple
> >> >> > > +  * runs of compaction will generate some high order ones for us.
> >> >> > > +  *
> >> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> >> >> > > +  * if we are in the retry path - something like priority 0 for the
> >> >> > > +  * reclaim
> >> >> > > +  */
> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> >> > > +         return true;
> >> >> > > +
> >> >> > >   return false;
> >> >>
> >> >> This seems not a proper fix. Checking watermark with high order has
> >> >> another meaning that there is high order page or not. This isn't
> >> >> what we want here.
> >> >
> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> >> > form the requested order. The ordering on the LRU lists is pretty much
> >> > random wrt. pfn ordering. On the other hand if we have a page available
> >> > which is just hidden by watermarks then it makes perfect sense to retry
> >> > and free even order-0 pages.
> >>
> >> If we have >= order page available, we would not reach here. We would
> >> just allocate it.
> >
> > not really, we can still be under the low watermark. Note that the
> 
> you mean min watermark?

ohh, right...
 
> > target for the should_reclaim_retry watermark check includes also the
> > reclaimable memory.
> 
> I guess that usual case for high order allocation failure has enough freepage.

Not sure I understand what you mean here, but I wouldn't be surprised if
a high order allocation failed even with enough free pages. And that is
exactly why I am claiming that reclaiming more pages is no free ticket to
high order pages.

[...]
> >> I just did quick review to your patches so maybe I am wrong.
> >> Am I missing something?
> >
> > The core idea behind should_reclaim_retry is to check whether the
> > reclaiming all the pages would help to get over the watermark and there
> > is at least one >= order page. Then it really makes sense to retry. As
> 
> How you can judge that reclaiming all the pages would help to check
> there is at least one >= order page?

Again, not sure I understand you here. __zone_watermark_ok checks both
the wmark and the availability of a page of sufficient order. While the
increased free_pages (which includes reclaimable pages as well) will tell
us whether we have a chance to get over the min wmark, the order check
will tell us whether we have something to allocate from once we reach the
min wmark.
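To make that concrete, here is a minimal standalone sketch of the
heuristic as described (the names, the snapshot structure and the
omission of the per-order watermark scaling are simplifying assumptions;
this is not the kernel's __zone_watermark_ok):

#include <stdbool.h>

#define MAX_ORDER 11

struct zone_snapshot {
	unsigned long free_pages;
	unsigned long reclaimable_pages;
	unsigned long min_wmark;
	unsigned long nr_free[MAX_ORDER];	/* free blocks of each order */
};

/*
 * Retrying reclaim only makes sense if (a) reclaiming everything
 * reclaimable could lift us over the min watermark and (b) a free block
 * of at least the requested order would then be there to allocate from.
 */
static bool worth_retrying(const struct zone_snapshot *z, unsigned int order)
{
	unsigned long available = z->free_pages + z->reclaimable_pages;

	if (available <= z->min_wmark)
		return false;		/* even full reclaim cannot reach the wmark */

	if (!order)
		return true;		/* order-0: any free page will do */

	for (unsigned int o = order; o < MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;	/* a suitably large block exists */

	return false;			/* fragmentation: nothing of this order */
}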
 
> > the compaction has already was performed before this is called we should
> > have created some high order pages already. The decay guarantees that we
> 
> Not really. Compaction could fail.

Yes it could have failed. But what is the point to retry endlessly then?

[...]
> >> At least, reset no_progress_loops when did_some_progress. High
> >> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >> as order 0. And, reclaim something would increase probability of
> >> compaction success.
> >
> > This is something I still do not understand. Why would reclaiming
> > random order-0 pages help compaction? Could you clarify this please?
> 
> I just can tell simple version. Please check the link from me on another reply.
> Compaction could scan more range of memory if we have more freepage.
> This is due to algorithm limitation. Anyway, so, reclaiming random
> order-0 pages helps compaction.

I will have a look at that code but this just doesn't make any sense.
The compaction should be reshuffling pages, this shouldn't be a function
of free memory.

> >> Why do we limit retry as 16 times with no evidence of potential
> >> impossibility of making high order page?
> >
> > If we tried to compact 16 times without any progress then this sounds
> > like a sufficient evidence to me. Well, this number is somehow arbitrary
> > but the main point is to limit it to _some_ number, if we can show that
> > a larger value would work better then we can update it of course.
> 
> My arguing is for your band aid patch.
> My point is that why retry count for order-0 is reset if there is some progress,
> but, retry counter for order up to costly isn't reset even if there is
> some progress

Because we know that order-0 requests have a chance to proceed if we keep
reclaiming order-0 pages, while this is not true for order > 0. If we
reset no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
then we would be back to the zone_reclaimable heuristic. Why? Because
order-0 reclaim progress would keep !costly requests in the reclaim loop
while compaction still might not make any progress. So we either have to
fail when __zone_watermark_ok fails for the requested order (which turned
out to be too easy to trigger) or have a fixed number of retries
regardless of the watermark check result. We cannot relax both unless we
have other measures in place.
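A toy model of that retry accounting may help (assumptions: the limit of
16 stands in for MAX_RECLAIM_RETRIES from the series, and the "!costly
orders get all their retries" rule mirrors the band aid patch quoted
above; this is not the real slow path code):

#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3
#define MAX_RECLAIM_RETRIES	16

/* returns true if the allocation should loop back and try reclaim again */
static bool keep_retrying(unsigned int order, bool did_some_progress,
			  bool wmark_check_passed, int *no_progress_loops)
{
	/* only order-0 progress resets the counter */
	if (did_some_progress && order == 0)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	if (*no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;			/* give up and head for the OOM path */

	if (wmark_check_passed)
		return true;			/* reclaim still has a chance to help */

	/* !costly high order requests get the full retry budget regardless */
	return order && order <= PAGE_ALLOC_COSTLY_ORDER;
}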

Sure we can be more intelligent and reset the counter if the
feedback from compaction is optimistic and we are making some
progress. This would be less hackish and the XXX comment points into
that direction. For now I would like this to catch most loads reasonably
and build better heuristics on top. I would like to do as much as
possible to close the obvious regressions but I guess we have to expect
there will be cases where the OOM fires and hasn't before and vice
versa.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
                             ` (3 preceding siblings ...)
  (?)
@ 2016-03-03  9:54           ` Hugh Dickins
  2016-03-03 12:32               ` Michal Hocko
                               ` (2 more replies)
  -1 siblings, 3 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-03  9:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Joonsoo Kim, Andrew Morton,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

[-- Attachment #1: Type: TEXT/PLAIN, Size: 12804 bytes --]

On Tue, 1 Mar 2016, Michal Hocko wrote:
> [Adding Vlastimil and Joonsoo for compaction related things - this was a
> large thread but the more interesting part starts with
> http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]
> 
> On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > [...]
> > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > way to gobble up most of the memory, though it's not how I've done it).
> > > > 
> > > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > make defconfig there, then make -j20.
> > > > 
> > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > > 
> > > > Except that you'll probably need to fiddle around with that j20,
> > > > it's true for my laptop but not for my workstation.  j20 just happens
> > > > to be what I've had there for years, that I now see breaking down
> > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > but it still doesn't exercise swap very much).
> > > 
> > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > will try with another host with more CPUs (because my laptop has only
> > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap

I've found that the number of CPUs makes quite a difference - I have 4.

And another difference between us may be in our configs: on this laptop
I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
the size of each shmem_inode (and those of course are not swappable).

I found that I could avoid the OOM if I ran the "make -j20" on a
kernel without all those debug options, and booted with nr_cpus=2.
And currently I'm booting the kernel with the debug options in,
but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).

Maybe in the OOM rework, threads are cancelling each other's progress
more destructively, where before they co-operated to some extent?

(All that is on the laptop.  The G5 is still busy full-time bisecting
a powerpc issue: I know it was OOMing with the rework, but I have not
verified the effect of nr_cpus on it.  My x86 workstation has not been
OOMing with the rework - I think that means that I've not been exerting
as much memory pressure on it as I'd thought, that it copes with the load
better, and would only show the difference if I loaded it more heavily.)

> > > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > > (16, 10 no difference really). I was also collecting vmstat in the
> > > background. The compilation takes ages but the behavior seems consistent
> > > and stable.
> > 
> > Thanks a lot for giving it a go.
> > 
> > I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> > are less than 2M big??  But probably that's just a misunderstanding
> > or typo somewhere.
> 
> A typo. 445 was from 900M test which I was doing while writing the
> email. Sorry about the confusion.

That makes more sense!  Though I'm still amazed that you got anywhere,
taking so much of the usable memory out.

> 
> > Ignoring that, you're successfully doing a make -20 defconfig build
> > in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> > at all surprised that it takes ages, but I am very surprised that it
> > does not OOM.  I suppose by rights it ought not to OOM, the built
> > tree occupies only a little more than 1G, so you do have enough swap;
> > but I wouldn't get anywhere near that myself without OOMing - I give
> > myself 1G of RAM (well, minus whatever the booted system takes up)
> > to do that build in, four times your RAM, yet in my case it OOMs.
> >
> > That source tree alone occupies more than 700M, so just copying it
> > into your tmpfs would take a long time. 
> 
> OK, I just found out that I was cheating a bit. I was building
> linux-3.7-rc5.tar.bz2 which is smaller:
> $ du -sh /mnt/tmpfs/linux-3.7-rc5/
> 537M    /mnt/tmpfs/linux-3.7-rc5/

Right, I have a habit like that too; but my habitual testing still
uses the 2.6.24 source tree, which is rather too old to ask others
to reproduce with - but we both find that the kernel source tree
keeps growing, and prefer to stick with something of a fixed size.

> 
> and after the defconfig build:
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1008460     941904      66556          0       5092     806760
> -/+ buffers/cache:     130052     878408
> Swap:      2097148      42648    2054500
> $ du -sh linux-3.7-rc5/
> 799M    linux-3.7-rc5/
> 
> Sorry about that but this is what my other tests were using and I forgot
> to check. Now let's try the same with the current linus tree:
> host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
> $ du -sh /mnt/tmpfs/linux-4.5-rc6/
> 707M    /mnt/tmpfs/linux-4.5-rc6/
> $ free
>              total       used       free     shared    buffers     cached
> Mem:       1008460     962976      45484          0       7236     820064

I guess we have different versions of "free": mine shows Shmem as shared,
but yours appears to be an older version, just showing 0.

> -/+ buffers/cache:     135676     872784
> Swap:      2097148         16    2097132
> $ time make -j20 > /dev/null
> drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
> drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> 
> real    8m36.621s
> user    14m1.642s
> sys     2m45.238s
> 
> so I wasn't cheating all that much...
> 
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space.  Unless
> > perhaps it's some superfast swap that you have?
> 
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).
> I realize I might be boring already when blaming compaction but let me
> try again ;)
> $ grep compact /proc/vmstat 
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
> 
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
> 
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  						mode, contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
>  
> +	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
>  	switch (compact_result) {
>  	case COMPACT_DEFERRED:
>  		*deferred_compaction = true;
> 
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
>              gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>              gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> 
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> 
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
> 
> $ grep compact_stall /proc/vmstat 
> compact_stall 190
> 
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
>     190 2 1
>     122 2 3
>      59 2 4
> 
> #define COMPACT_SKIPPED         1               
> #define COMPACT_PARTIAL         3
> #define COMPACT_COMPLETE        4
> 
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
> 
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.
> 
> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
>  								alloc_flags))
>  		return COMPACT_PARTIAL;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return COMPACT_CONTINUE;
> +

I gave that a try just now, but it didn't help me: OOMed much sooner,
after doing half as much work.  (FWIW, I have been including your other
patch, the "Andrew, could you queue this one as well, please" patch.)

I do agree that compaction appears to have closed down when we OOM:
taking that along with my nr_cpus remark (and the make -jNumber),
are parallel compactions interfering with each other destructively,
in a way that they did not before the rework?

>  	/*
>  	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
>  	 * This is because during migration, copies of pages need to be
> 
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
> 
> real    8m46.626s
> user    14m15.823s
> sys     2m45.471s
> 
> the time increased but I haven't checked how stable the result is. 

But I didn't investigate its stability either, may have judged against
it too soon.

> 
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1

I have not seen any compact_kcompatd_wakes at all:
perhaps we're too busy compacting directly.

(Vlastimil, there's a "c" missing from that name, it should be
"compact_kcompactd_wake" - though "compact_daemon_wake" might be nicer.)

> 
> $ grep allocstall /proc/vmstat
> allocstall 1665
> 
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.
> 
> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?

It didn't help me; but I do suspect you're right to be worrying about
the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.

> 
> > I was only suggesting to allocate hugetlb pages, if you preferred
> > not to reboot with artificially reduced RAM.  Not an issue if you're
> > booting VMs.
> 
> Ohh, I see.

I've attached vmstats.xz, output from your read_vmstat proggy;
together with oom.xz, the dmesg for the OOM in question.

I hacked out_of_memory() to count_vm_event(BALLOON_DEFLATE),
that being a count that's always 0 for me: so when you see
"balloon_deflate 1" towards the end, that's where the OOM
kill came in, and shortly after I Ctrl-C'ed.

I hope you can get more out of it than I have - thanks!

Hugh

[-- Attachment #2: Type: APPLICATION/x-xz, Size: 54112 bytes --]

[-- Attachment #3: Type: APPLICATION/x-xz, Size: 3512 bytes --]

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03  9:26                     ` Michal Hocko
@ 2016-03-03 10:29                       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-03 10:29 UTC (permalink / raw)
  To: mhocko, js1304
  Cc: iamjoonsoo.kim, akpm, hughd, torvalds, hannes, mgorman, rientjes,
	hillf.zj, kamezawa.hiroyu, linux-mm, linux-kernel,
	sergey.senozhatsky.work

Michal Hocko wrote:
> Sure we can be more intelligent and reset the counter if the
> feedback from compaction is optimistic and we are making some
> progress. This would be less hackish and the XXX comment points into
> that direction. For now I would like this to catch most loads reasonably
> and build better heuristics on top. I would like to do as much as
> possible to close the obvious regressions but I guess we have to expect
> there will be cases where the OOM fires and hasn't before and vice
> versa.

Aren't you forgetting that some people use panic_on_oom > 0 which means that
premature OOM killer invocation is fatal for them?
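For context, the behaviour Tetsuo is pointing at boils down to this (a
simplified model of the documented vm.panic_on_oom semantics, not the
mm/oom_kill.c code):

/* OOM scope: global, or constrained by memcg/cpuset/mempolicy */
enum oom_scope { OOM_GLOBAL, OOM_CONSTRAINED };

static int panic_on_oom;	/* mirrors /proc/sys/vm/panic_on_oom: 0, 1 or 2 */

static int oom_should_panic(enum oom_scope scope)
{
	if (panic_on_oom == 2)
		return 1;	/* always panic, even for constrained OOMs */
	if (panic_on_oom == 1 && scope == OOM_GLOBAL)
		return 1;	/* panic unless the OOM is constrained */
	return 0;		/* otherwise kill a task and carry on */
}

With the sysctl set, a premature OOM detection does not just kill one
task, it takes the whole machine down.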

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03  9:54           ` Hugh Dickins
@ 2016-03-03 12:32               ` Michal Hocko
  2016-03-04  7:53               ` Joonsoo Kim
  2016-03-04 12:28               ` Michal Hocko
  2 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-03 12:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
[...]
> > So I have tried the following:
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 4d99e1f5055c..7364e48cf69a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> >  								alloc_flags))
> >  		return COMPACT_PARTIAL;
> >  
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return COMPACT_CONTINUE;
> > +
> 
> I gave that a try just now, but it didn't help me: OOMed much sooner,
> after doing half as much work. 

I do not have an explanation for why it would cause an OOM sooner, but
this turned out to be incomplete. There is another watermark check deeper
in the compaction path. Could you try the one from
http://lkml.kernel.org/r/20160302130022.GG26686@dhcp22.suse.cz

I will try to find a machine with more CPUs and try to reproduce this in
the meantime.

I will also have a look at the data you have collected.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03  9:26                     ` Michal Hocko
@ 2016-03-03 14:10                       ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-03 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-03 18:26 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
>> 2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> >> > [...]
>> >> >> > > + /*
>> >> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
>> >> >> > > +  * retries for !costly high order requests and hope that multiple
>> >> >> > > +  * runs of compaction will generate some high order ones for us.
>> >> >> > > +  *
>> >> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> >> >> > > +  * if we are in the retry path - something like priority 0 for the
>> >> >> > > +  * reclaim
>> >> >> > > +  */
>> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> >> > > +         return true;
>> >> >> > > +
>> >> >> > >   return false;
>> >> >>
>> >> >> This seems not a proper fix. Checking watermark with high order has
>> >> >> another meaning that there is high order page or not. This isn't
>> >> >> what we want here.
>> >> >
>> >> > Why not? Why should we retry the reclaim if we do not have >=order page
>> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> >> > form the requested order. The ordering on the LRU lists is pretty much
>> >> > random wrt. pfn ordering. On the other hand if we have a page available
>> >> > which is just hidden by watermarks then it makes perfect sense to retry
>> >> > and free even order-0 pages.
>> >>
>> >> If we have >= order page available, we would not reach here. We would
>> >> just allocate it.
>> >
>> > not really, we can still be under the low watermark. Note that the
>>
>> you mean min watermark?
>
> ohh, right...
>
>> > target for the should_reclaim_retry watermark check includes also the
>> > reclaimable memory.
>>
>> I guess that usual case for high order allocation failure has enough freepage.
>
> Not sure I understand you mean here but I wouldn't be surprised if high
> order failed even with enough free pages. And that is exactly why I am
> claiming that reclaiming more pages is no free ticket to high order
> pages.

I didn't say that it's a free ticket. The OOM kill is the most expensive
ticket we have. Why do you want to kill something? It also doesn't
guarantee that high order pages will be made; it is just another way of
reclaiming memory. What is the difference between plain reclaim and an
OOM kill? Why do we use the OOM kill in this case?

> [...]
>> >> I just did quick review to your patches so maybe I am wrong.
>> >> Am I missing something?
>> >
>> > The core idea behind should_reclaim_retry is to check whether the
>> > reclaiming all the pages would help to get over the watermark and there
>> > is at least one >= order page. Then it really makes sense to retry. As
>>
>> How you can judge that reclaiming all the pages would help to check
>> there is at least one >= order page?
>
> Again, not sure I understand you here. __zone_watermark_ok checks both
> wmark and an available page of the sufficient order. While increased
> free_pages (which includes reclaimable pages as well) will tell us
> whether we have a chance to get over the min wmark, the order check will
> tell us we have something to allocate from after we reach the min wmark.

Again, your assumption differs from mine. My assumption is that high
order allocation failures happen due to fragmentation rather than low
free memory. In that case, there is no high order page. Even if you could
reclaim 1TB and add it to the free page counter, the high order page
counter would not change and the watermark check would still fail. So the
high order allocation would not go through the retry logic. Is that what
you want?

>> > the compaction has already was performed before this is called we should
>> > have created some high order pages already. The decay guarantees that we
>>
>> Not really. Compaction could fail.
>
> Yes it could have failed. But what is the point to retry endlessly then?

I didn't say we should retry endlessly.

> [...]
>> >> At least, reset no_progress_loops when did_some_progress. High
>> >> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> >> as order 0. And, reclaim something would increase probability of
>> >> compaction success.
>> >
>> > This is something I still do not understand. Why would reclaiming
>> > random order-0 pages help compaction? Could you clarify this please?
>>
>> I just can tell simple version. Please check the link from me on another reply.
>> Compaction could scan more range of memory if we have more freepage.
>> This is due to algorithm limitation. Anyway, so, reclaiming random
>> order-0 pages helps compaction.
>
> I will have a look at that code but this just doesn't make any sense.
> The compaction should be reshuffling pages, this shouldn't be a function
> of free memory.

Please refer to the link I mentioned before. There is a reason why more
free memory helps compaction succeed. Compaction doesn't work like
random reshuffling; it follows an algorithm that tries to reduce overall
system fragmentation, and that algorithm has limitations.
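
Roughly, compaction stops as soon as its two scanners meet (simplified
from compact_finished() in mm/compaction.c):

	/*
	 * The migration scanner walks up from the zone start, the free
	 * scanner walks down from the zone end; once they cross, this
	 * compaction run is over no matter what was left unscanned.
	 */
	if (cc->free_pfn <= cc->migrate_pfn)
		return COMPACT_COMPLETE;

With more free pages the free scanner does not have to walk far from the
end of the zone, so the migration scanner can cover a larger part of the
zone before the two cross.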

>> >> Why do we limit retry as 16 times with no evidence of potential
>> >> impossibility of making high order page?
>> >
>> > If we tried to compact 16 times without any progress then this sounds
>> > like a sufficient evidence to me. Well, this number is somehow arbitrary
>> > but the main point is to limit it to _some_ number, if we can show that
>> > a larger value would work better then we can update it of course.
>>
>> My arguing is for your band aid patch.
>> My point is that why retry count for order-0 is reset if there is some progress,
>> but, retry counter for order up to costly isn't reset even if there is
>> some progress
>
> Because we know that order-0 requests have chance to proceed if we keep
> reclaiming order-0 pages while this is not true for order > 0. If we did
> reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> then we would be back to the zone_reclaimable heuristic. Why? Because
> order-0 reclaim progress will keep !costly in the reclaim loop while
> compaction still might not make any progress. So we either have to fail
> when __zone_watermark_ok fails for the order (which turned out to be
> too easy to trigger) or have the fixed amount of retries regardless the
> watermark check result. We cannot relax both unless we have other
> measures in place.

As mentioned before, an OOM kill doesn't guarantee to produce a
high-order page either. Reclaiming as much memory as possible makes more
sense to me. The timing of the OOM kill for order-0 is reasonable
because there are not enough freeable pages. But it's not reasonable to
kill something while there is still plenty of reclaimable memory, which
is what your current implementation does.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 14:10                       ` Joonsoo Kim
@ 2016-03-03 15:25                         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-03 15:25 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> 2016-03-03 18:26 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> >> 2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> >> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> >> >> > [...]
> >> >> >> > > + /*
> >> >> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
> >> >> >> > > +  * retries for !costly high order requests and hope that multiple
> >> >> >> > > +  * runs of compaction will generate some high order ones for us.
> >> >> >> > > +  *
> >> >> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> >> >> >> > > +  * if we are in the retry path - something like priority 0 for the
> >> >> >> > > +  * reclaim
> >> >> >> > > +  */
> >> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> >> >> > > +         return true;
> >> >> >> > > +
> >> >> >> > >   return false;
> >> >> >>
> >> >> >> This seems not a proper fix. Checking watermark with high order has
> >> >> >> another meaning that there is high order page or not. This isn't
> >> >> >> what we want here.
> >> >> >
> >> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> >> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> >> >> > form the requested order. The ordering on the LRU lists is pretty much
> >> >> > random wrt. pfn ordering. On the other hand if we have a page available
> >> >> > which is just hidden by watermarks then it makes perfect sense to retry
> >> >> > and free even order-0 pages.
> >> >>
> >> >> If we have >= order page available, we would not reach here. We would
> >> >> just allocate it.
> >> >
> >> > not really, we can still be under the low watermark. Note that the
> >>
> >> you mean min watermark?
> >
> > ohh, right...
> >
> >> > target for the should_reclaim_retry watermark check includes also the
> >> > reclaimable memory.
> >>
> >> I guess that usual case for high order allocation failure has enough freepage.
> >
> > Not sure I understand you mean here but I wouldn't be surprised if high
> > order failed even with enough free pages. And that is exactly why I am
> > claiming that reclaiming more pages is no free ticket to high order
> > pages.
> 
> I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> that we have. Why do you want to kill something?

Because all the attempts so far have failed and we should not retry
endlessly. With the band-aid we know we will retry MAX_RECLAIM_RETRIES
times at most, so compaction gets that many attempts to resolve the
situation, along with the same number of reclaim rounds to help it get
over the watermarks.
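
Roughly the following shape in the retry path (a sketch of the intended
behaviour, not the literal hunk, and the real function signature
differs):

	/*
	 * Sketch: order-0 progress resets the counter only for order-0
	 * requests, so a !costly high-order allocation gets at most
	 * MAX_RECLAIM_RETRIES reclaim+compaction rounds before we give
	 * up and consider the OOM killer.
	 */
	if (did_some_progress && !order)
		no_progress_loops = 0;
	else
		no_progress_loops++;

	if (no_progress_loops <= MAX_RECLAIM_RETRIES &&
	    should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress, no_progress_loops))
		goto retry;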

> It also doesn't guarantee to make high order pages. It is just another
> way of reclaiming memory. What is the difference between plain reclaim
> and OOM kill? Why do we use OOM kill in this case?

What is our alternative, other than to keep looping endlessly?

> > [...]
> >> >> I just did quick review to your patches so maybe I am wrong.
> >> >> Am I missing something?
> >> >
> >> > The core idea behind should_reclaim_retry is to check whether the
> >> > reclaiming all the pages would help to get over the watermark and there
> >> > is at least one >= order page. Then it really makes sense to retry. As
> >>
> >> How you can judge that reclaiming all the pages would help to check
> >> there is at least one >= order page?
> >
> > Again, not sure I understand you here. __zone_watermark_ok checks both
> > wmark and an available page of the sufficient order. While increased
> > free_pages (which includes reclaimable pages as well) will tell us
> > whether we have a chance to get over the min wmark, the order check will
> > tell us we have something to allocate from after we reach the min wmark.
> 
> Again, your assumption would be different with mine. My assumption is that
> high order allocation problem happens due to fragmentation rather than
> low free memory. In this case, there is no high order page. Even if you can
> reclaim 1TB and add this counter to freepage counter, high order page
> counter will not be changed and watermark check would fail. So, high order
> allocation will not go through retry logic. This is what you want?

I really want to base the decision on something measurable rather than
on good hope. That is what the whole zone_reclaimable() discussion is
about. I understand your concern that compaction doesn't guarantee
anything, but I am quite convinced that we really need an upper bound on
retries (unlike now, where zone_reclaimable is basically unbounded as
long as order-0 reclaim makes some progress). Which bound is best is
harder to tell, of course.
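
For reference, this is roughly the heuristic we are replacing
(mm/vmscan.c, quoted from memory):

	bool zone_reclaimable(struct zone *zone)
	{
		return zone_page_state(zone, NR_PAGES_SCANNED) <
			zone_reclaimable_pages(zone) * 6;
	}

and NR_PAGES_SCANNED gets reset whenever pages are freed back to the
zone, so as long as order-0 reclaim frees something the loop never
terminates on its own.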

[...]
> >> My arguing is for your band aid patch.
> >> My point is that why retry count for order-0 is reset if there is some progress,
> >> but, retry counter for order up to costly isn't reset even if there is
> >> some progress
> >
> > Because we know that order-0 requests have chance to proceed if we keep
> > reclaiming order-0 pages while this is not true for order > 0. If we did
> > reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> > then we would be back to the zone_reclaimable heuristic. Why? Because
> > order-0 reclaim progress will keep !costly in the reclaim loop while
> > compaction still might not make any progress. So we either have to fail
> > when __zone_watermark_ok fails for the order (which turned out to be
> > too easy to trigger) or have the fixed amount of retries regardless the
> > watermark check result. We cannot relax both unless we have other
> > measures in place.
> 
> As mentioned before, OOM kill also doesn't guarantee to make high order page.

Yes, of course; apart from the kernel stack, which is a high-order
allocation, there is no guarantee.

> Reclaim more memory as much as possible makes more sense to me.

But then we are back to square one: how much, and how do we decide when
it makes sense to give up? Do you have any suggestions on what the
criteria should be? Is there any feedback mechanism from compaction
which would tell us to keep retrying, something like did_some_progress
from the order-0 reclaim? Is either deferred_compaction or
contended_compaction usable? Or is there any per-zone flag we can check
and prefer over the wmark order check?

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 14:10                       ` Joonsoo Kim
@ 2016-03-03 15:50                         ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-03 15:50 UTC (permalink / raw)
  To: Joonsoo Kim, Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
> 
>> [...]
>>>>> At least, reset no_progress_loops when did_some_progress. High
>>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>>>>> as order 0. And, reclaim something would increase probability of
>>>>> compaction success.
>>>>
>>>> This is something I still do not understand. Why would reclaiming
>>>> random order-0 pages help compaction? Could you clarify this please?
>>>
>>> I just can tell simple version. Please check the link from me on another reply.
>>> Compaction could scan more range of memory if we have more freepage.
>>> This is due to algorithm limitation. Anyway, so, reclaiming random
>>> order-0 pages helps compaction.
>>
>> I will have a look at that code but this just doesn't make any sense.
>> The compaction should be reshuffling pages, this shouldn't be a function
>> of free memory.
> 
> Please refer the link I mentioned before. There is a reason why more free
> memory would help compaction success. Compaction doesn't work
> like as random reshuffling. It has an algorithm to reduce system overall
> fragmentation so there is limitation.

I proposed another way to get better results from direct compaction -
don't scan for free pages but get them directly from freelists:

https://lkml.org/lkml/2015/12/3/60

But your redesign would be useful too, for kcompactd/khugepaged to keep
overall fragmentation low.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 15:50                         ` Vlastimil Babka
@ 2016-03-03 16:26                           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-03 16:26 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Joonsoo Kim, Andrew Morton, Hugh Dickins,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	Linux Memory Management List, LKML, Sergey Senozhatsky

On Thu 03-03-16 16:50:16, Vlastimil Babka wrote:
> On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
> > 
> >> [...]
> >>>>> At least, reset no_progress_loops when did_some_progress. High
> >>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >>>>> as order 0. And, reclaim something would increase probability of
> >>>>> compaction success.
> >>>>
> >>>> This is something I still do not understand. Why would reclaiming
> >>>> random order-0 pages help compaction? Could you clarify this please?
> >>>
> >>> I just can tell simple version. Please check the link from me on another reply.
> >>> Compaction could scan more range of memory if we have more freepage.
> >>> This is due to algorithm limitation. Anyway, so, reclaiming random
> >>> order-0 pages helps compaction.
> >>
> >> I will have a look at that code but this just doesn't make any sense.
> >> The compaction should be reshuffling pages, this shouldn't be a function
> >> of free memory.
> > 
> > Please refer the link I mentioned before. There is a reason why more free
> > memory would help compaction success. Compaction doesn't work
> > like as random reshuffling. It has an algorithm to reduce system overall
> > fragmentation so there is limitation.
> 
> I proposed another way to get better results from direct compaction -
> don't scan for free pages but get them directly from freelists:
> 
> https://lkml.org/lkml/2015/12/3/60

Yes, this makes perfect sense to me (with my limited experience in this
area, so I might be missing some obvious problems it would introduce).
Direct compaction for !costly orders is something we should try to
satisfy immediately. I would just object that this shouldn't be reduced
to ASYNC compaction requests only. The SYNC* modes are an even more
desperate call for the page (at least that is my understanding) and we
should treat them appropriately.
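
For reference, the modes I mean (include/linux/migrate_mode.h, comments
paraphrased):

	enum migrate_mode {
		MIGRATE_ASYNC,		/* never block */
		MIGRATE_SYNC_LIGHT,	/* may block, but not on writeback */
		MIGRATE_SYNC,		/* may block and wait for writeback */
	};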

> But your redesign would be useful too for kcompactd/khugepaged keeping
> overall fragmentation low.

kcompactd can handle those and should focus on the long-term goals.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 12:32               ` Michal Hocko
@ 2016-03-03 20:57                 ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-03 20:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Joonsoo Kim, Andrew Morton,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu, 3 Mar 2016, Michal Hocko wrote:
> On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> > On Tue, 1 Mar 2016, Michal Hocko wrote:
> [...]
> > > So I have tried the following:
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 4d99e1f5055c..7364e48cf69a 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> > >  								alloc_flags))
> > >  		return COMPACT_PARTIAL;
> > >  
> > > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return COMPACT_CONTINUE;
> > > +
> > 
> > I gave that a try just now, but it didn't help me: OOMed much sooner,
> > after doing half as much work. 

I think I exaggerated: sooner, but not _much_ sooner; and I cannot
see now what I based that estimate of "half as much work" on.

> 
> I do not have an explanation why it would cause oom sooner but this
> turned out to be incomplete. There is another watermark check deeper in the
> compaction path. Could you try the one from
> http://lkml.kernel.org/r/20160302130022.GG26686@dhcp22.suse.cz

I've now added that in: it corrects the "sooner", but does not make
any difference to the fact of OOMing for me.

Hugh

> 
> I will try to find a machine with more CPUs and try to reproduce this in
> the mean time.
> 
> I will also have a look at the data you have collected.
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 15:25                         ` Michal Hocko
@ 2016-03-04  5:23                           ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-04  5:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, Linux Memory Management List, LKML,
	Sergey Senozhatsky

On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > 2016-03-03 18:26 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > > On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> > >> 2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > >> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> > >> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > >> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> > >> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> > >> >> > [...]
> > >> >> >> > > + /*
> > >> >> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
> > >> >> >> > > +  * retries for !costly high order requests and hope that multiple
> > >> >> >> > > +  * runs of compaction will generate some high order ones for us.
> > >> >> >> > > +  *
> > >> >> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> > >> >> >> > > +  * if we are in the retry path - something like priority 0 for the
> > >> >> >> > > +  * reclaim
> > >> >> >> > > +  */
> > >> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > >> >> >> > > +         return true;
> > >> >> >> > > +
> > >> >> >> > >   return false;
> > >> >> >>
> > >> >> >> This seems not a proper fix. Checking watermark with high order has
> > >> >> >> another meaning that there is high order page or not. This isn't
> > >> >> >> what we want here.
> > >> >> >
> > >> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> > >> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> > >> >> > form the requested order. The ordering on the LRU lists is pretty much
> > >> >> > random wrt. pfn ordering. On the other hand if we have a page available
> > >> >> > which is just hidden by watermarks then it makes perfect sense to retry
> > >> >> > and free even order-0 pages.
> > >> >>
> > >> >> If we have >= order page available, we would not reach here. We would
> > >> >> just allocate it.
> > >> >
> > >> > not really, we can still be under the low watermark. Note that the
> > >>
> > >> you mean min watermark?
> > >
> > > ohh, right...
> > >
> > >> > target for the should_reclaim_retry watermark check includes also the
> > >> > reclaimable memory.
> > >>
> > >> I guess that usual case for high order allocation failure has enough freepage.
> > >
> > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > order failed even with enough free pages. And that is exactly why I am
> > > claiming that reclaiming more pages is no free ticket to high order
> > > pages.
> > 
> > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > that we have. Why do you want to kill something?
> 
> Because all the attempts so far have failed and we should rather not
> retry endlessly. With the band-aid we know we will retry
> MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> resolve the situation along with the same amount of reclaim rounds to
> help and get over watermarks.
> 
> > It also doesn't guarantee to make high order pages. It is just another
> > way of reclaiming memory. What is the difference between plain reclaim
> > and OOM kill? Why do we use OOM kill in this case?
> 
> What is our alternative other than keep looping endlessly?

Loop as long as free memory or the estimated available memory (free +
reclaimable) keeps increasing. That means we made some progress, and
these numbers cannot grow forever because both reclaimable memory and
total memory are finite. You can reset no_progress_loops = 0 whenever
one of those metrics increases compared to the previous attempt (a rough
sketch follows below).
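
Something like the following, just as an illustration (not a real patch;
last_available would have to be tracked across retries and the helpers
used here are only what I remember them to be):

	/*
	 * Illustration only: keep retrying while the estimated available
	 * memory still grows between reclaim attempts.
	 */
	unsigned long available = zone_page_state_snapshot(zone, NR_FREE_PAGES) +
				  zone_reclaimable_pages(zone);

	if (available > last_available) {
		last_available = available;
		no_progress_loops = 0;		/* still making headway */
	} else {
		no_progress_loops++;		/* nothing grew since last round */
	}

	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		return false;			/* give up, consider OOM */
	return true;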

With this bound, we can do our best to try to solve this unpleasant
situation before OOM.

Looping unconditionally 16 times and then OOM killing really doesn't
make any sense, because it doesn't mean we have already done our best.
The OOM killer should not be invoked prematurely, and AFAIK that is one
of the goals of your patches.

If the above suggestion doesn't make sense to you, please try to find
another way rather than suggesting a work-around that could cause a
premature OOM in the high-order allocation case.

Thanks.

> 
> > > [...]
> > >> >> I just did quick review to your patches so maybe I am wrong.
> > >> >> Am I missing something?
> > >> >
> > >> > The core idea behind should_reclaim_retry is to check whether the
> > >> > reclaiming all the pages would help to get over the watermark and there
> > >> > is at least one >= order page. Then it really makes sense to retry. As
> > >>
> > >> How you can judge that reclaiming all the pages would help to check
> > >> there is at least one >= order page?
> > >
> > > Again, not sure I understand you here. __zone_watermark_ok checks both
> > > wmark and an available page of the sufficient order. While increased
> > > free_pages (which includes reclaimable pages as well) will tell us
> > > whether we have a chance to get over the min wmark, the order check will
> > > tell us we have something to allocate from after we reach the min wmark.
> > 
> > Again, your assumption would be different with mine. My assumption is that
> > high order allocation problem happens due to fragmentation rather than
> > low free memory. In this case, there is no high order page. Even if you can
> > reclaim 1TB and add this counter to freepage counter, high order page
> > counter will not be changed and watermark check would fail. So, high order
> > allocation will not go through retry logic. This is what you want?
> 
> I really want to base the decision on something measurable rather
> than a good hope. This is what all the zone_reclaimable() is about. I
> understand your concerns that compaction doesn't guarantee anything but
> I am quite convinced that we really need an upper bound for retries
> (unlike now when zone_reclaimable is basically unbounded assuming
> order-0 reclaim makes some progress). What is the best bound is harder
> to tell, of course.
> 
> [...]
> > >> My arguing is for your band aid patch.
> > >> My point is that why retry count for order-0 is reset if there is some progress,
> > >> but, retry counter for order up to costly isn't reset even if there is
> > >> some progress
> > >
> > > Because we know that order-0 requests have chance to proceed if we keep
> > > reclaiming order-0 pages while this is not true for order > 0. If we did
> > > reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> > > then we would be back to the zone_reclaimable heuristic. Why? Because
> > > order-0 reclaim progress will keep !costly in the reclaim loop while
> > > compaction still might not make any progress. So we either have to fail
> > > when __zone_watermark_ok fails for the order (which turned out to be
> > > too easy to trigger) or have the fixed amount of retries regardless the
> > > watermark check result. We cannot relax both unless we have other
> > > measures in place.
> > 
> > As mentioned before, OOM kill also doesn't guarantee to make high order page.
> 
> Yes, of course, apart from the kernel stack which is high order there is
> no guarantee.
> 
> > Reclaim more memory as much as possible makes more sense to me.
> 
> But then we are back to square one. How much and how to decide when it
> makes sense to give up. Do you have any suggestions on what should be
> the criteria? Is there any feedback mechanism from the compaction which
> would tell us to keep retrying? Something like did_some_progress from
> the order-0 reclaim? Is any of deferred_compaction resp.
> contended_compaction usable? Or is there any per-zone flag we can check
> and prefer over wmark order check?
> 
> Thanks
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 15:50                         ` Vlastimil Babka
@ 2016-03-04  7:10                           ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-04  7:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Thu, Mar 03, 2016 at 04:50:16PM +0100, Vlastimil Babka wrote:
> On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
> > 
> >> [...]
> >>>>> At least, reset no_progress_loops when did_some_progress. High
> >>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >>>>> as order 0. And, reclaim something would increase probability of
> >>>>> compaction success.
> >>>>
> >>>> This is something I still do not understand. Why would reclaiming
> >>>> random order-0 pages help compaction? Could you clarify this please?
> >>>
> >>> I just can tell simple version. Please check the link from me on another reply.
> >>> Compaction could scan more range of memory if we have more freepage.
> >>> This is due to algorithm limitation. Anyway, so, reclaiming random
> >>> order-0 pages helps compaction.
> >>
> >> I will have a look at that code but this just doesn't make any sense.
> >> The compaction should be reshuffling pages, this shouldn't be a function
> >> of free memory.
> > 
> > Please refer the link I mentioned before. There is a reason why more free
> > memory would help compaction success. Compaction doesn't work
> > like as random reshuffling. It has an algorithm to reduce system overall
> > fragmentation so there is limitation.
> 
> I proposed another way to get better results from direct compaction -
> don't scan for free pages but get them directly from freelists:
> 
> https://lkml.org/lkml/2015/12/3/60
> 

I think the major problem with this approach is that there is no way to
prevent other compacting threads running in parallel from taking a
freepage from the targeted aligned block. So, if there are parallel
compaction requestors, they would disturb each other. However, it would
not be a problem for orders up to PAGE_ALLOC_COSTLY_ORDER, which finish
quickly enough.

In fact, for quick allocation, migration scanner is also unnecessary.
There would be a lot of pageblock we cannot do migration. Scanning
all of them in this situation is unnecessary and costly. Moreover, scanning
only half of zone due to limitation of compaction algorithm also looks
not good. Instead, we can get base page on lru list and migrate
neighborhood pages. I named this idea as "lumpy compaction" but didn't
try it. If we only focus on quick allocation, this would be a better way.
Any thought?
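
For illustration only, a very rough and untested sketch of what I have
in mind -- the helpers marked as hypothetical do not exist in the
kernel, this is just to show the shape of the idea:

/*
 * "Lumpy compaction" sketch: take a base page from the LRU and try to
 * migrate every used page inside its order-aligned block so that the
 * whole block becomes free for the requested order. No migration or
 * free scanner involved.
 */
static struct page *lumpy_compact_one(struct zone *zone, unsigned int order)
{
	struct page *base = take_one_page_from_lru(zone);  /* hypothetical */
	unsigned long pfn, start_pfn;
	LIST_HEAD(migratepages);

	if (!base)
		return NULL;

	start_pfn = page_to_pfn(base) & ~((1UL << order) - 1);
	for (pfn = start_pfn; pfn < start_pfn + (1UL << order); pfn++) {
		struct page *page = pfn_to_page(pfn);

		if (PageBuddy(page))
			continue;	/* already free */
		/* hypothetical: isolate if movable, bail out otherwise */
		if (!isolate_if_movable(page, &migratepages))
			goto putback;
	}

	/* hypothetical wrapper around migrate_pages() */
	if (migrate_list_away(&migratepages, zone))
		goto putback;

	return pfn_to_page(start_pfn);	/* the aligned block should be free now */

putback:
	putback_movable_pages(&migratepages);
	return NULL;
}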

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03 20:57                 ` Hugh Dickins
@ 2016-03-04  7:41                   ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-04  7:41 UTC (permalink / raw)
  To: Hugh Dickins, Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML

On 03/03/2016 09:57 PM, Hugh Dickins wrote:
> 
>>
>> I do not have an explanation why it would cause oom sooner but this
>> turned out to be incomplete. There is another wmaark check deeper in the
>> compaction path. Could you try the one from
>> http://lkml.kernel.org/r/20160302130022.GG26686@dhcp22.suse.cz
> 
> I've now added that in: it corrects the "sooner", but does not make
> any difference to the fact of OOMing for me.

Could you try producing a trace with
echo 1 > /debug/tracing/events/compaction/enable
echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable
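
and then, assuming debugfs is mounted at /debug as in the commands
above, capture the events while reproducing with something like

cat /debug/tracing/trace_pipe > ~/compaction-trace.log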

Hopefully it will hint at what's wrong with:
compact_migrate_scanned 424920
compact_free_scanned 9278408
compact_isolated 469472
compact_stall 377
compact_fail 297
compact_success 80
compact_kcompatd_wake 0

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03  9:54           ` Hugh Dickins
@ 2016-03-04  7:53               ` Joonsoo Kim
  2016-03-04  7:53               ` Joonsoo Kim
  2016-03-04 12:28               ` Michal Hocko
  2 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-04  7:53 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Vlastimil Babka, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu, Mar 03, 2016 at 01:54:43AM -0800, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
> > [Adding Vlastimil and Joonsoo for compaction related things - this was a
> > large thread but the more interesting part starts with
> > http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]
> > 
> > On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > > [...]
> > > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > > way to gobble up most of the memory, though it's not how I've done it).
> > > > > 
> > > > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > > make defconfig there, then make -j20.
> > > > > 
> > > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > > > 
> > > > > Except that you'll probably need to fiddle around with that j20,
> > > > > it's true for my laptop but not for my workstation.  j20 just happens
> > > > > to be what I've had there for years, that I now see breaking down
> > > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > > but it still doesn't exercise swap very much).
> > > > 
> > > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > > will try with another host with more CPUs (because my laptop has only
> > > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> 
> I've found that the number of CPUs makes quite a difference - I have 4.
> 
> And another difference between us may be in our configs: on this laptop
> I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
> PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
> the size of each shmem_inode (and those of course are not swappable).
> 
> I found that I could avoid the OOM if I ran the "make -j20" on a
> kernel without all those debug options, and booted with nr_cpus=2.
> And currently I'm booting the kernel with the debug options in,
> but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).
> 
> Maybe in the OOM rework, threads are cancelling each other's progress
> more destructively, where before they co-operated to some extent?
> 
> (All that is on the laptop.  The G5 is still busy full-time bisecting
> a powerpc issue: I know it was OOMing with the rework, but I have not
> verified the effect of nr_cpus on it.  My x86 workstation has not been
> OOMing with the rework - I think that means that I've not been exerting
> as much memory pressure on it as I'd thought, that it copes with the load
> better, and would only show the difference if I loaded it more heavily.)
> 
> > > > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > > > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > > > (16, 10 no difference really). I was also collecting vmstat in the
> > > > background. The compilation takes ages but the behavior seems consistent
> > > > and stable.
> > > 
> > > Thanks a lot for giving it a go.
> > > 
> > > I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> > > are less than 2M big??  But probably that's just a misunderstanding
> > > or typo somewhere.
> > 
> > A typo. 445 was from 900M test which I was doing while writing the
> > email. Sorry about the confusion.
> 
> That makes more sense!  Though I'm still amazed that you got anywhere,
> taking so much of the usable memory out.
> 
> > 
> > > Ignoring that, you're successfully doing a make -20 defconfig build
> > > in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> > > at all surprised that it takes ages, but I am very surprised that it
> > > does not OOM.  I suppose by rights it ought not to OOM, the built
> > > tree occupies only a little more than 1G, so you do have enough swap;
> > > but I wouldn't get anywhere near that myself without OOMing - I give
> > > myself 1G of RAM (well, minus whatever the booted system takes up)
> > > to do that build in, four times your RAM, yet in my case it OOMs.
> > >
> > > That source tree alone occupies more than 700M, so just copying it
> > > into your tmpfs would take a long time. 
> > 
> > OK, I just found out that I was cheating a bit. I was building
> > linux-3.7-rc5.tar.bz2 which is smaller:
> > $ du -sh /mnt/tmpfs/linux-3.7-rc5/
> > 537M    /mnt/tmpfs/linux-3.7-rc5/
> 
> Right, I have a habit like that too; but my habitual testing still
> uses the 2.6.24 source tree, which is rather too old to ask others
> to reproduce with - but we both find that the kernel source tree
> keeps growing, and prefer to stick with something of a fixed size.
> 
> > 
> > and after the defconfig build:
> > $ free
> >              total       used       free     shared    buffers     cached
> > Mem:       1008460     941904      66556          0       5092     806760
> > -/+ buffers/cache:     130052     878408
> > Swap:      2097148      42648    2054500
> > $ du -sh linux-3.7-rc5/
> > 799M    linux-3.7-rc5/
> > 
> > Sorry about that but this is what my other tests were using and I forgot
> > to check. Now let's try the same with the current linus tree:
> > host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
> > $ du -sh /mnt/tmpfs/linux-4.5-rc6/
> > 707M    /mnt/tmpfs/linux-4.5-rc6/
> > $ free
> >              total       used       free     shared    buffers     cached
> > Mem:       1008460     962976      45484          0       7236     820064
> 
> I guess we have different versions of "free": mine shows Shmem as shared,
> but yours appears to be an older version, just showing 0.
> 
> > -/+ buffers/cache:     135676     872784
> > Swap:      2097148         16    2097132
> > $ time make -j20 > /dev/null
> > drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
> > drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> > 
> > real    8m36.621s
> > user    14m1.642s
> > sys     2m45.238s
> > 
> > so I wasn't cheating all that much...
> > 
> > > I'd expect a build in 224M
> > > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > > OOM killed, even if there is technically enough space.  Unless
> > > perhaps it's some superfast swap that you have?
> > 
> > the swap partition is a standard qcow image stored on my SSD disk. So
> > I guess the IO should be quite fast. This smells like a potential
> > contributor because my reclaim seems to be much faster and that should
> > lead to a more efficient reclaim (in the scanned/reclaimed sense).
> > I realize I might be boring already when blaming compaction but let me
> > try again ;)
> > $ grep compact /proc/vmstat 
> > compact_migrate_scanned 113983
> > compact_free_scanned 1433503
> > compact_isolated 134307
> > compact_stall 128
> > compact_fail 26
> > compact_success 102
> > compact_kcompatd_wake 0
> > 
> > So the whole load has done the direct compaction only 128 times during
> > that test. This doesn't sound much to me
> > $ grep allocstall /proc/vmstat
> > allocstall 1061
> > 
> > we entered the direct reclaim much more but most of the load will be
> > order-0 so this might be still ok. So I've tried the following:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1993894b4219..107d444afdb1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  						mode, contended_compaction);
> >  	current->flags &= ~PF_MEMALLOC;
> >  
> > +	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> > +
> >  	switch (compact_result) {
> >  	case COMPACT_DEFERRED:
> >  		*deferred_compaction = true;
> > 
> > And the result was:
> > $ cat /debug/tracing/trace_pipe | tee ~/trace.log
> >              gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >              gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> > 
> > this shows that order-2 memory pressure is not overly high in my
> > setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> > 
> > So I went back to 800M of hugetlb pages and tried again. It took ages
> > so I have interrupted that after one hour (there was still no OOM). The
> > trace log is quite interesting regardless:
> > $ wc -l ~/trace.log
> > 371 /root/trace.log
> > 
> > $ grep compact_stall /proc/vmstat 
> > compact_stall 190
> > 
> > so the compaction was still ignored more than actually invoked for
> > !costly allocations:
> > sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
> >     190 2 1
> >     122 2 3
> >      59 2 4
> > 
> > #define COMPACT_SKIPPED         1               
> > #define COMPACT_PARTIAL         3
> > #define COMPACT_COMPLETE        4
> > 
> > that means that compaction is even not tried in half cases! This
> > doesn't sounds right to me, especially when we are talking about
> > <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> > then we simply rely on the order-0 reclaim to automagically form higher
> > blocks. This might indeed work when we retry many times but I guess this
> > is not a good approach. It leads to a excessive reclaim and the stall
> > for allocation can be really large.
> > 
> > One of the suspicious places is __compaction_suitable which does order-0
> > watermark check (increased by 2<<order). I have put another trace_printk
> > there and it clearly pointed out this was the case.
> > 
> > So I have tried the following:
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 4d99e1f5055c..7364e48cf69a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> >  								alloc_flags))
> >  		return COMPACT_PARTIAL;
> >  
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return COMPACT_CONTINUE;
> > +
> 
> I gave that a try just now, but it didn't help me: OOMed much sooner,
> after doing half as much work.  (FWIW, I have been including your other
> patch, the "Andrew, could you queue this one as well, please" patch.)
> 
> I do agree that compaction appears to have closed down when we OOM:
> taking that along with my nr_cpus remark (and the make -jNumber),
> are parallel compactions interfering with each other destructively,
> in a way that they did not before the rework?
> 
> >  	/*
> >  	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
> >  	 * This is because during migration, copies of pages need to be
> > 
> > and retried the same test (without huge pages):
> > $ time make -j20 > /dev/null
> > 
> > real    8m46.626s
> > user    14m15.823s
> > sys     2m45.471s
> > 
> > the time increased but I haven't checked how stable the result is. 
> 
> But I didn't investigate its stability either, may have judged against
> it too soon.
> 
> > 
> > $ grep compact /proc/vmstat
> > compact_migrate_scanned 139822
> > compact_free_scanned 1661642
> > compact_isolated 139407
> > compact_stall 129
> > compact_fail 58
> > compact_success 71
> > compact_kcompatd_wake 1
> 
> I have not seen any compact_kcompatd_wakes at all:
> perhaps we're too busy compacting directly.
> 
> (Vlastimil, there's a "c" missing from that name, it should be
> "compact_kcompactd_wake" - though "compact_daemon_wake" might be nicer.)
> 
> > 
> > $ grep allocstall /proc/vmstat
> > allocstall 1665
> > 
> > this is worse because we have scanned more pages for migration but the
> > overall success rate was much smaller and the direct reclaim was invoked
> > more. I do not have a good theory for that and will play with this some
> > more. Maybe other changes are needed deeper in the compaction code.
> > 
> > I will play with this some more but I would be really interested to hear
> > whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> > even make sense to you?
> 
> It didn't help me; but I do suspect you're right to be worrying about
> the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.
> 
> > 
> > > I was only suggesting to allocate hugetlb pages, if you preferred
> > > not to reboot with artificially reduced RAM.  Not an issue if you're
> > > booting VMs.
> > 
> > Ohh, I see.
> 
> I've attached vmstats.xz, output from your read_vmstat proggy;
> together with oom.xz, the dmesg for the OOM in question.

Hello, Hugh.

I can guess the following from your vmstat; it could be wrong, so
please take it with a grain of salt. :)

Before OOM happens,

pgmigrate_success 230007
pgmigrate_fail 94
compact_migrate_scanned 422734
compact_free_scanned 9277915
compact_isolated 469308
compact_stall 370
compact_fail 291
compact_success 79
...
balloon_deflate 0

After OOM happens,

pgmigrate_success 230007                                                                              
pgmigrate_fail 94                                                                                     
compact_migrate_scanned 424920                                                                        
compact_free_scanned 9278408                                                                          
compact_isolated 469472                                                                               
compact_stall 377                                                                                     
compact_fail 297                                                                                      
compact_success 80  
...
balloon_deflate 1

This shows that we tried compaction (compact_stall increases). The
increased compact_isolated tells us that we isolated something for
migration. But pgmigrate_xxx hasn't changed, which means we didn't do
any actual migration. That can happen when we can't find a free page.
compact_free_scanned changed only a little, so it seems there are many
pageblocks with the skip bit set, and in that case compaction skips
almost the whole range. The skip bit gets reset once we retry enough
times to reach the reset threshold. How about testing with
MAX_RECLAIM_RETRIES set to 128 or something larger, to see whether
that makes any difference?
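
Something along these lines, just as a test hack -- assuming the define
this series adds to mm/page_alloc.c, whose default is 16 -- i.e.
changing

#define MAX_RECLAIM_RETRIES 16

to

#define MAX_RECLAIM_RETRIES 128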

Thanks.

> 
> I hacked out_of_memory() to count_vm_event(BALLOON_DEFLATE),
> that being a count that's always 0 for me: so when you see
> "balloon_deflate 1" towards the end, that's where the OOM
> kill came in, and shortly after I Ctrl-C'ed.
> 
> I hope you can get more out of it than I have - thanks!
> 
> Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-03  9:54           ` Hugh Dickins
@ 2016-03-04 12:28               ` Michal Hocko
  2016-03-04  7:53               ` Joonsoo Kim
  2016-03-04 12:28               ` Michal Hocko
  2 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-04 12:28 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Vlastimil Babka, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
> > [Adding Vlastimil and Joonsoo for compaction related things - this was a
> > large thread but the more interesting part starts with
> > http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]
> > 
> > On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > > [...]
> > > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > > way to gobble up most of the memory, though it's not how I've done it).
> > > > > 
> > > > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > > make defconfig there, then make -j20.
> > > > > 
> > > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > > > 
> > > > > Except that you'll probably need to fiddle around with that j20,
> > > > > it's true for my laptop but not for my workstation.  j20 just happens
> > > > > to be what I've had there for years, that I now see breaking down
> > > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > > but it still doesn't exercise swap very much).
> > > > 
> > > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > > will try with another host with more CPUs (because my laptop has only
> > > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> 
> I've found that the number of CPUs makes quite a difference - I have 4.
> 
> And another difference between us may be in our configs: on this laptop
> I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
> PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
> the size of each shmem_inode (and those of course are not swappable).

I had everything but PROVE_LOCKING. Enabling this option doesn't change
anything (except for the overall runtime, which is of course longer) in
my 2-CPU setup, though.

All of the following is with clean mmotm (mmotm-2016-02-24-16-18)
without any additional change.  I have moved my kvm setup to a larger
machine. The storage is standard spinning rust, and I've made sure that
the swap is not cached on the host and that the swap IO is done
directly, by using
-drive file=swap-2G.qcow,if=ide,index=2,cache=none

retested with 4CPUs and make -j20
real    8m42.263s
user    20m52.838s
sys     8m8.805s

with 16CPU and make -j20
real    3m34.806s
user    20m25.245s
sys     8m39.366s

and the same with -j60 which actually triggered the OOM
$ grep "invoked oom-killer:" oomrework.qcow_serial.log
[10064.286799] cc1 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[...]
[10064.394172] DMA32 free:3764kB min:3796kB low:4776kB high:5756kB active_anon:394184kB inactive_anon:394168kB active_file:1836kB inactive_file:2156kB unevictable:0kB isolated(anon):148kB isolated(file):0kB present:1032060kB managed:987556kB mlocked:0kB dirty:0kB writeback:96kB mapped:1308kB shmem:6704kB slab_reclaimable:51356kB slab_unreclaimable:100532kB kernel_stack:7328kB pagetables:15944kB unstable:0kB bounce:0kB free_pcp:1796kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:63244 all_unreclaimable? yes
[...]
[10560.926971] cc1 invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[...]
[10561.007362] DMA32 free:4800kB min:3796kB low:4776kB high:5756kB active_anon:393112kB inactive_anon:393508kB active_file:1560kB inactive_file:1428kB unevictable:0kB isolated(anon):2452kB isolated(file):212kB present:1032060kB managed:987556kB mlocked:0kB dirty:0kB writeback:564kB mapped:2552kB shmem:7664kB slab_reclaimable:51352kB slab_unreclaimable:100396kB kernel_stack:7392kB pagetables:16196kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1172 all_unreclaimable? no

but those are simple order-0 OOMs, so this cannot be compaction
related.  The second OOM is probably racing with an exiting task,
because we are over the low wmark. This would suggest we have exhausted
all the attempts without making progress.

This was all after a fresh boot, so then I stayed with 16 CPUs and did
make -j20 > /dev/null
make clean

in a loop and left it running overnight. This should randomize the swap
IO and also stands a better chance of long-term fragmentation.
It survived 300 iterations.
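
(For reference, the loop was essentially

while true; do make -j20 > /dev/null; make clean; done

run from the tmpfs source tree.)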

I really have no idea what the difference from your setup might be. So
I've tried testing linux-next (next-20160226) just to make sure that
this is not something specific to the mmotm git tree (which I maintain).

> I found that I could avoid the OOM if I ran the "make -j20" on a
> kernel without all those debug options, and booted with nr_cpus=2.
> And currently I'm booting the kernel with the debug options in,
> but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).
> 
> Maybe in the OOM rework, threads are cancelling each other's progress
> more destructively, where before they co-operated to some extent?
> 
> (All that is on the laptop.  The G5 is still busy full-time bisecting
> a powerpc issue: I know it was OOMing with the rework, but I have not
> verified the effect of nr_cpus on it.  My x86 workstation has not been
> OOMing with the rework - I think that means that I've not been exerting
> as much memory pressure on it as I'd thought, that it copes with the load
> better, and would only show the difference if I loaded it more heavily.)

I am currently testing with the swap backed by sshfs (with -o direct_io),
which should emulate really slow storage. But still no OOM; I only
managed to hit
INFO: task khugepaged:246 blocked for more than 120 seconds.
in the IO path:
[  480.422500]  [<ffffffff812b0c9b>] get_request+0x440/0x55e
[  480.423444]  [<ffffffff81081148>] ? wait_woken+0x72/0x72
[  480.424447]  [<ffffffff812b3071>] blk_queue_bio+0x16d/0x302
[  480.425566]  [<ffffffff812b1607>] generic_make_request+0xc0/0x15e
[  480.426642]  [<ffffffff812b17ae>] submit_bio+0x109/0x114
[  480.427704]  [<ffffffff81147101>] __swap_writepage+0x1ea/0x1f9
[  480.430364]  [<ffffffff81149346>] ? page_swapcount+0x45/0x4c
[  480.432718]  [<ffffffff815a8aed>] ? _raw_spin_unlock+0x31/0x44
[  480.433722]  [<ffffffff81149346>] ? page_swapcount+0x45/0x4c
[  480.434697]  [<ffffffff8114714a>] swap_writepage+0x3a/0x3e
[  480.435718]  [<ffffffff81122bbe>] shmem_writepage+0x37b/0x3d1
[  480.436757]  [<ffffffff8111dbe8>] shrink_page_list+0x49c/0xd88
 
[...]
> > I will play with this some more but I would be really interested to hear
> > whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> > even make sense to you?
> 
> It didn't help me; but I do suspect you're right to be worrying about
> the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.
> 
> > 
> > > I was only suggesting to allocate hugetlb pages, if you preferred
> > > not to reboot with artificially reduced RAM.  Not an issue if you're
> > > booting VMs.
> > 
> > Ohh, I see.
> 
> I've attached vmstats.xz, output from your read_vmstat proggy;
> together with oom.xz, the dmesg for the OOM in question.

[  796.225322] sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
[  796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB high:5948kB active_anon:588776kB inactive_anon:188816kB active_file:20432kB inactive_file:6928kB unevictable:12268kB isolated(anon):128kB isolated(file):8kB present:1046128kB managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB mapped:35556kB shmem:12684kB slab_reclaimable:55628kB slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM) 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8820kB
[...]

This is really interesting because there are some order-2+ pages
available. Even more striking is that free is way above the high
watermark. This would suggest that declaring OOM must have raced with an
exiting task. That is not that unexpected, because the gcc processes are
quite short-lived and `make' spawns new ones as soon as the last one
terminates. This race is not new, and we cannot do much better without
moving the wmark check closer to the actual do_send_sig_info. This is
not the main problem though. The fact that you are able to trigger this
consistently is what bothers me.
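
(For reference, from the buddy listing above the order-2 and larger
blocks amount to 167*16kB + 19*32kB + 3*64kB = 3472kB, i.e. 189 blocks
that could have served the order-2 request.)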
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-04  5:23                           ` Joonsoo Kim
@ 2016-03-04 15:15                             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-04 15:15 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, Linux Memory Management List, LKML,
	Sergey Senozhatsky

On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
> On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> > On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > > 2016-03-03 18:26 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
[...]
> > > >> I guess that usual case for high order allocation failure has enough freepage.
> > > >
> > > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > > order failed even with enough free pages. And that is exactly why I am
> > > > claiming that reclaiming more pages is no free ticket to high order
> > > > pages.
> > > 
> > > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > > that we have. Why do you want to kill something?
> > 
> > Because all the attempts so far have failed and we should rather not
> > retry endlessly. With the band-aid we know we will retry
> > MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> > resolve the situation along with the same amount of reclaim rounds to
> > help and get over watermarks.
> > 
> > > It also doesn't guarantee to make high order pages. It is just another
> > > way of reclaiming memory. What is the difference between plain reclaim
> > > and OOM kill? Why do we use OOM kill in this case?
> > 
> > What is our alternative other than keep looping endlessly?
> 
> Loop as long as free memory or estimated available memory (free +
> reclaimable) increases. This means that we did some progress. And,
> they will not grow forever because we have just limited reclaimable
> memory and limited memory. You can reset no_progress_loops = 0 when
> those metric increases than before.

Hmm, why is this any better than taking the feedback from the reclaim
(did_some_progress)?
 
> With this bound, we can do our best to try to solve this unpleasant
> situation before OOM.
> 
> Unconditional 16 looping and then OOM kill really doesn't make any
> sense, because it doesn't mean that we already do our best.

16 is not really that important. We can change that if that doesn't
sound sufficient. But please note that each reclaim round means
that we have scanned all eligible LRUs to find and reclaim something
and asked direct compaction to prepare a high order page.
This sounds like "do our best" to me.
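
To make the bound concrete, here is a minimal standalone sketch of the
retry scheme (illustration only, not the code in the series; the
reclaim/compaction helper is a stub that never makes progress):

#include <stdbool.h>
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16

/*
 * Stub standing in for one full round: scan the eligible LRUs, reclaim
 * what we can and ask direct compaction for a high-order page.  Pretend
 * it never succeeds so that the bound is actually reached.
 */
static bool reclaim_and_compact_round(void)
{
	return false;
}

static void alloc_slowpath(void)
{
	int no_progress_loops = 0;

	while (no_progress_loops < MAX_RECLAIM_RETRIES) {
		if (reclaim_and_compact_round()) {
			no_progress_loops = 0;	/* progress resets the counter */
			continue;
		}
		no_progress_loops++;
	}
	puts("no progress in 16 consecutive rounds: fall back to the OOM killer");
}

int main(void)
{
	alloc_slowpath();
	return 0;
}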

Now it seems that we need more changes, at least in the compaction area,
because the code doesn't seem to fit the nature of !costly allocation
requests. I am also not satisfied with the fixed MAX_RECLAIM_RETRIES for
high order pages; I would much rather see some feedback mechanism which
could be measured and evaluated in some way, but is this really necessary
for the initial version?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-04 15:15                             ` Michal Hocko
@ 2016-03-04 17:39                               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-04 17:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, Linux Memory Management List, LKML,
	Sergey Senozhatsky

On Fri 04-03-16 16:15:58, Michal Hocko wrote:
> On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
[...]
> > Unconditional 16 looping and then OOM kill really doesn't make any
> > sense, because it doesn't mean that we already do our best.
> 
> 16 is not really that important. We can change that if that doesn't
> sound sufficient. But please note that each reclaim round means
> that we have scanned all eligible LRUs to find and reclaim something

this should read "scanned potentially all eligible LRUs..."

> and asked direct compaction to prepare a high order page.
> This sounds like "do our best" to me.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-04 15:15                             ` Michal Hocko
@ 2016-03-07  5:23                               ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-07  5:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, Linux Memory Management List, LKML,
	Sergey Senozhatsky

On Fri, Mar 04, 2016 at 04:15:58PM +0100, Michal Hocko wrote:
> On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
> > On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> > > On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > > > 2016-03-03 18:26 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> [...]
> > > > >> I guess that usual case for high order allocation failure has enough freepage.
> > > > >
> > > > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > > > order failed even with enough free pages. And that is exactly why I am
> > > > > claiming that reclaiming more pages is no free ticket to high order
> > > > > pages.
> > > > 
> > > > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > > > that we have. Why do you want to kill something?
> > > 
> > > Because all the attempts so far have failed and we should rather not
> > > retry endlessly. With the band-aid we know we will retry
> > > MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> > > resolve the situation along with the same amount of reclaim rounds to
> > > help and get over watermarks.
> > > 
> > > > It also doesn't guarantee to make high order pages. It is just another
> > > > way of reclaiming memory. What is the difference between plain reclaim
> > > > and OOM kill? Why do we use OOM kill in this case?
> > > 
> > > What is our alternative other than keep looping endlessly?
> > 
> > Loop as long as free memory or estimated available memory (free +
> > reclaimable) increases. This means that we did some progress. And,
> > they will not grow forever because we have just limited reclaimable
> > memory and limited memory. You can reset no_progress_loops = 0 when
> > those metric increases than before.
> 
> Hmm, why is this any better than taking the feedback from the reclaim
> (did_some_progress)?

My suggestion could only be applied to the high-order case. In this case,
free pages and reclaimable pages are already sufficient, and a parallel
free-page consumer would re-generate reclaimable pages endlessly, so a
positive did_some_progress would be returned endlessly. We need to stop
retrying at some point, so we need some metric that ensures a finite
number of retries in any case.
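
Just to illustrate the kind of bound I mean, a standalone sketch (not a
real patch; the estimator is a stub with invented numbers): retry only
while the free + reclaimable estimate still grows, so the number of
retries stays finite even if did_some_progress keeps being positive.

#include <stdio.h>

#define MAX_NO_PROGRESS 16

/*
 * Stub: estimated available memory (free + reclaimable) per round.  The
 * numbers are invented; the point is only that the estimate is bounded
 * and eventually stops growing.
 */
static unsigned long estimate_available_pages(int round)
{
	static const unsigned long samples[] = { 1000, 1200, 1300, 1300 };

	return samples[round < 3 ? round : 3];
}

int main(void)
{
	unsigned long best = 0;
	int no_progress_loops = 0, round = 0;

	while (no_progress_loops < MAX_NO_PROGRESS) {
		unsigned long avail = estimate_available_pages(round++);

		if (avail > best) {
			best = avail;		/* forward progress: reset */
			no_progress_loops = 0;
		} else {
			no_progress_loops++;	/* the estimate no longer grows */
		}
	}
	printf("gave up after %d rounds, best estimate %lu pages\n", round, best);
	return 0;
}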

>  
> > With this bound, we can do our best to try to solve this unpleasant
> > situation before OOM.
> > 
> > Unconditional 16 looping and then OOM kill really doesn't make any
> > sense, because it doesn't mean that we already do our best.
> 
> 16 is not really that important. We can change that if that doesn't
> sound sufficient. But please note that each reclaim round means
> that we have scanned all eligible LRUs to find and reclaim something
> and asked direct compaction to prepare a high order page.
> This sounds like "do our best" to me.

AFAIK, each reclaim round doesn't reclaim all reclaimable pages; it has
a limit on how much it reclaims per round. That doesn't look like our
best to me, and N retries only multiply that limit N times, which still
doesn't look like our best and will lead to premature OOM kills.

> Now it seems that we need more changes, at least in the compaction area,
> because the code doesn't seem to fit the nature of !costly allocation
> requests. I am also not satisfied with the fixed MAX_RECLAIM_RETRIES for
> high order pages; I would much rather see some feedback mechanism which
> could be measured and evaluated in some way, but is this really necessary
> for the initial version?

I don't know. My analysis is just based on my guess and background
knowledge, not on a practical usecase, so I'm not sure whether it is
necessary for the initial version or not. It's up to you.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-02-29 21:02         ` Michal Hocko
@ 2016-03-07 16:08           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-07 16:08 UTC (permalink / raw)
  To: Hugh Dickins, Sergey Senozhatsky, Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Joonsoo Kim, Vlastimil Babka

On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> Andrew,
> could you queue this one as well, please? This is more a band aid than a
> real solution which I will be working on as soon as I am able to
> reproduce the issue but the patch should help to some degree at least.

Joonsoo wasn't very happy about this approach, so let me try a different
way. What do you think about the following? Hugh, Sergey, does it help
for your load? I have tested it with Hugh's load and there was no
major difference from the previous testing, so at least nothing has blown
up, though I am not able to reproduce the issue here.

Other changes in the compaction area are still needed, but I would like
not to depend on them right now.
---
From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 7 Mar 2016 15:30:37 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any pages of the requested or a higher
order available, even if we pass the watermark check for order-0. This is
done because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where a high-order request (e.g.
the order-2 allocation required for the stack during fork) triggers
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee that
further reclaim/compaction attempts would help, but at least make sure
that compaction was active before we go OOM: keep retrying, even if
should_reclaim_retry tells us to OOM, as long as the last compaction
round was either inactive (deferred, skipped or bailed out early due to
contention) or told us to continue.

Additionally, define COMPACT_NONE, which reflects the case where
compaction is completely disabled.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h |  2 ++
 mm/page_alloc.c            | 41 ++++++++++++++++++++++++-----------------
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf64cc7..a4cec4a03f7d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+/* compaction disabled */
+#define COMPACT_NONE		-1
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* compaction didn't start as it was deferred due to past failures */
 #define COMPACT_DEFERRED	0
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f89e3cbfdf90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2819,28 +2819,22 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		unsigned long *compact_result)
 {
-	unsigned long compact_result;
 	struct page *page;
 
-	if (!order)
+	if (!order) {
+		*compact_result = COMPACT_NONE;
 		return NULL;
+	}
 
 	current->flags |= PF_MEMALLOC;
-	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
-	switch (compact_result) {
-	case COMPACT_DEFERRED:
-		*deferred_compaction = true;
-		/* fall-through */
-	case COMPACT_SKIPPED:
+	if (*compact_result <= COMPACT_SKIPPED)
 		return NULL;
-	default:
-		break;
-	}
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2875,8 +2869,9 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		unsigned long *compact_result)
 {
+	*compact_result = COMPACT_NONE;
 	return NULL;
 }
 #endif /* CONFIG_COMPACTION */
@@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int alloc_flags;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
-	bool deferred_compaction = false;
+	unsigned long compact_result;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
 	int no_progress_loops = 0;
 
@@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
 					&contended_compaction,
-					&deferred_compaction);
+					&compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * to heavily disrupt the system, so we fail the allocation
 		 * instead of entering direct reclaim.
 		 */
-		if (deferred_compaction)
+		if (compact_result == COMPACT_DEFERRED)
 			goto nopage;
 
 		/*
@@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
+	/*
+	 * !costly allocations are really important and we have to make sure
+	 * the compaction wasn't deferred or didn't bail out early due to locks
+	 * contention before we go OOM.
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		if (compact_result <= COMPACT_CONTINUE)
+			goto retry;
+		if (contended_compaction > COMPACT_CONTENDED_NONE)
+			goto retry;
+	}
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
 					    ac, migration_mode,
 					    &contended_compaction,
-					    &deferred_compaction);
+					    &compact_result);
 	if (page)
 		goto got_pg;
 nopage:
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-07 16:08           ` Michal Hocko
@ 2016-03-08  3:51             ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-03-08  3:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim,
	Vlastimil Babka

Hello Michal,

On (03/07/16 17:08), Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
> 
> Joonsoo wasn't very happy about this approach, so let me try a different
> way. What do you think about the following? Hugh, Sergey, does it help
> for your load? I have tested it with Hugh's load and there was no
> major difference from the previous testing, so at least nothing has blown
> up, though I am not able to reproduce the issue here.

(next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")

seems it's significantly less likely to oom-kill now, but I still can see
something like this

[  501.942745] coretemp-sensor invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  501.942796] CPU: 3 PID: 409 Comm: coretemp-sensor Not tainted 4.5.0-rc6-next-20160307-dbg-00015-g8a56edd-dirty #250
[  501.942801]  0000000000000000 ffff88013114fb88 ffffffff812364e9 0000000000000000
[  501.942804]  ffff88013114fd28 ffff88013114fbf8 ffffffff8113b11c ffff88013114fba8
[  501.942807]  ffffffff810835c1 ffff88013114fbc8 0000000000000206 ffffffff81a46de0
[  501.942808] Call Trace:
[  501.942813]  [<ffffffff812364e9>] dump_stack+0x67/0x90
[  501.942817]  [<ffffffff8113b11c>] dump_header.isra.5+0x54/0x359
[  501.942820]  [<ffffffff810835c1>] ? trace_hardirqs_on+0xd/0xf
[  501.942823]  [<ffffffff810f97c2>] oom_kill_process+0x89/0x503
[  501.942825]  [<ffffffff810f9ffe>] out_of_memory+0x372/0x38d
[  501.942827]  [<ffffffff810fe5ae>] __alloc_pages_nodemask+0x9b6/0xa92
[  501.942830]  [<ffffffff810fe882>] alloc_kmem_pages_node+0x1b/0x1d
[  501.942833]  [<ffffffff81041f86>] copy_process.part.9+0xfe/0x17f4
[  501.942835]  [<ffffffff810858f6>] ? lock_acquire+0x10f/0x1a3
[  501.942837]  [<ffffffff8104380f>] _do_fork+0xbd/0x5da
[  501.942838]  [<ffffffff81083598>] ? trace_hardirqs_on_caller+0x16c/0x188
[  501.942842]  [<ffffffff81001a79>] ? do_syscall_64+0x18/0xe6
[  501.942844]  [<ffffffff81043db2>] SyS_clone+0x19/0x1b
[  501.942845]  [<ffffffff81001abb>] do_syscall_64+0x5a/0xe6
[  501.942848]  [<ffffffff8151245a>] entry_SYSCALL64_slow_path+0x25/0x25
[  501.942850] Mem-Info:
[  501.942853] active_anon:151312 inactive_anon:54791 isolated_anon:0
                active_file:31213 inactive_file:302048 isolated_file:0
                unevictable:0 dirty:44 writeback:221 unstable:0
                slab_reclaimable:43570 slab_unreclaimable:5651
                mapped:16660 shmem:29495 pagetables:2542 bounce:0
                free:10884 free_pcp:214 free_cma:0
[  501.942859] DMA free:14896kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:96kB inactive_file:104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:0kB slab_reclaimable:28kB slab_unreclaimable:108kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  501.942862] lowmem_reserve[]: 0 3031 3855 3855
[  501.942867] DMA32 free:23664kB min:6232kB low:9332kB high:12432kB active_anon:516228kB inactive_anon:129136kB active_file:96508kB inactive_file:954780kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107512kB mlocked:0kB dirty:136kB writeback:440kB mapped:51816kB shmem:91488kB slab_reclaimable:129856kB slab_unreclaimable:13876kB kernel_stack:2160kB pagetables:7888kB unstable:0kB bounce:0kB free_pcp:724kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[  501.942870] lowmem_reserve[]: 0 0 824 824
[  501.942876] Normal free:4784kB min:1696kB low:2540kB high:3384kB active_anon:89020kB inactive_anon:90028kB active_file:28248kB inactive_file:253308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:40kB writeback:444kB mapped:14700kB shmem:26492kB slab_reclaimable:44396kB slab_unreclaimable:8620kB kernel_stack:1328kB pagetables:2280kB unstable:0kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:60 all_unreclaimable? no
[  501.942879] lowmem_reserve[]: 0 0 0 0
[  501.942902] DMA: 6*4kB (UME) 3*8kB (M) 2*16kB (UM) 3*32kB (ME) 2*64kB (ME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 14896kB
[  501.942912] DMA32: 564*4kB (UME) 2700*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23856kB
[  501.942921] Normal: 959*4kB (ME) 128*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4860kB
[  501.942922] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  501.942923] 362670 total pagecache pages
[  501.942924] 0 pages in swap cache
[  501.942926] Swap cache stats: add 150, delete 150, find 0/0
[  501.942926] Free swap  = 8388504kB
[  501.942927] Total swap = 8388604kB
[  501.942928] 1032092 pages RAM
[  501.942928] 0 pages HighMem/MovableOnly
[  501.942929] 40111 pages reserved
[  501.942930] 0 pages hwpoisoned
[  501.942930] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  501.942935] [  162]     0   162    15823     1065      35       3        0             0 systemd-journal
[  501.942991] [  186]     0   186     8586     1054      19       3        0         -1000 systemd-udevd
[  501.942993] [  287]     0   287     3557      651      12       3        0             0 crond
[  501.942995] [  288]    81   288     8159      775      20       3        0          -900 dbus-daemon
[  501.942997] [  289]     0   289     3843      518      13       3        0             0 systemd-logind
[  501.942999] [  294]     0   294    22455      856      47       3        0             0 login
[  501.943001] [  302]  1000   302     8481     1029      20       3        0             0 systemd
[  501.943003] [  304]  1000   304    24212      438      47       3        0             0 (sd-pam)
[  501.943005] [  309]  1000   309     4431     1123      14       3        0             0 bash
[  501.943007] [  316]  1000   316     3712      764      13       3        0             0 startx
[  501.943009] [  338]  1000   338     3976      255      14       3        0             0 xinit
[  501.943012] [  339]  1000   339    44397    11311      90       3        0             0 Xorg
[  501.943014] [  341]  1000   341    39703     4045      78       3        0             0 openbox
[  501.943016] [  352]  1000   352    43465     2997      86       4        0             0 tint2
[  501.943018] [  355]  1000   355    33962     4351      57       3        0             0 urxvt
[  501.943020] [  356]  1000   356     4466     1155      13       3        0             0 bash
[  501.943022] [  359]  1000   359     4433     1116      13       3        0             0 bash
[  501.943024] [  364]  1000   364    49365     6236      62       3        0             0 urxvt
[  501.943026] [  365]  1000   365     4433     1093      15       3        0             0 bash
[  501.943028] [  368]  1000   368     5203      745      15       3        0             0 tmux
[  501.943030] [  370]  1000   370     6336     1374      17       3        0             0 tmux
[  501.943046] [  371]  1000   371     4433     1100      14       3        0             0 bash
[  501.943049] [  378]  1000   378     4433     1115      13       3        0             0 bash
[  501.943051] [  381]  1000   381     5203      763      16       3        0             0 tmux
[  501.943053] [  382]  1000   382     4433     1089      15       3        0             0 bash
[  501.943055] [  389]  1000   389     4433     1078      15       3        0             0 bash
[  501.943057] [  392]  1000   392     4433     1078      15       3        0             0 bash
[  501.943058] [  395]  1000   395     4433     1090      14       3        0             0 bash
[  501.943060] [  398]  1000   398     4433     1111      14       3        0             0 bash
[  501.943062] [  401]  1000   401    10126     1010      25       3        0             0 top
[  501.943064] [  403]  1000   403     4433     1129      14       3        0             0 bash
[  501.943066] [  409]  1000   409     3740      786      13       3        0             0 coretemp-sensor
[  501.943069] [  443]  1000   443    25873     3141      51       3        0             0 urxvt
[  501.943071] [  444]  1000   444     4433     1110      13       3        0             0 bash
[  501.943073] [  447]  1000   447    68144    55547     138       3        0             0 mutt
[  501.943075] [  450]  1000   450    29966     3825      51       3        0             0 urxvt
[  501.943077] [  451]  1000   451     4433     1117      14       3        0             0 bash
[  501.943079] [  456]  1000   456    29967     3793      53       3        0             0 urxvt
[  501.943081] [  457]  1000   457     4433     1085      14       3        0             0 bash
[  501.943083] [  462]  1000   462    29967     3845      51       4        0             0 urxvt
[  501.943085] [  463]  1000   463     4433     1093      14       3        0             0 bash
[  501.943087] [  468]  1000   468    29967     3793      50       3        0             0 urxvt
[  501.943089] [  469]  1000   469     4433     1086      15       3        0             0 bash
[  501.943091] [  493]  1000   493    52976     6416      69       3        0             0 urxvt
[  501.943093] [  494]  1000   494     4433     1106      14       3        0             0 bash
[  501.943095] [  499]  1000   499    29966     3792      54       3        0             0 urxvt
[  501.943097] [  500]  1000   500     4433     1078      14       3        0             0 bash
[  501.943099] [  525]     0   525    17802     1108      38       3        0             0 sudo
[  501.943101] [  528]     0   528   186583      768     207       4        0             0 journalctl
[  501.943103] [  550]  1000   550    42144     9259      66       4        0             0 urxvt
[  501.943105] [  551]  1000   551     4433     1067      14       4        0             0 bash
[  501.943107] [  557]  1000   557    11115      768      27       3        0             0 su
[  501.943109] [  579]     0   579     4462     1148      13       3        0             0 bash
[  501.943111] [  963]  1000   963     4433     1075      14       3        0             0 bash
[  501.943113] [  981]  1000   981     4433     1114      13       3        0             0 bash
[  501.943115] [  993]  1000   993     4432     1118      14       3        0             0 bash
[  501.943117] [ 1062]  1000  1062     5203      734      15       3        0             0 tmux
[  501.943119] [ 1063]  1000  1063    13805    10479      32       3        0             0 bash
[  501.943121] [ 1145]  1000  1145     4466     1144      14       3        0             0 bash
[  501.943123] [ 4331]  1000  4331   287422    64040     429       4        0             0 firefox
[  501.943125] [ 4440]  1000  4440     8132      761      20       3        0             0 dbus-daemon
[  501.943127] [ 4470]  1000  4470    83823      934      31       4        0             0 at-spi-bus-laun
[  501.943129] [17875]  1000 17875     7549     1926      20       3        0             0 vim
[  501.943131] [27066]  1000 27066     4432     1120      15       3        0             0 bash
[  501.943133] [27073]  1000 27073     4432     1071      13       3        0             0 bash
[  501.943135] [27079]  1000 27079     4432     1077      15       3        0             0 bash
[  501.943137] [27085]  1000 27085     4432     1080      14       3        0             0 bash
[  501.943139] [27091]  1000 27091     4432     1091      14       3        0             0 bash
[  501.943141] [27097]  1000 27097     4432     1096      15       3        0             0 bash
[  501.943143] [ 1235]     0  1235     3745      809      11       3        0             0 zram-test.sh
[  501.943145] [ 2316]  1000  2316     1759      166       9       3        0             0 sleep
[  501.943147] [ 2323]     0  2323     3302     1946      12       3        0             0 dd
[  501.943148] Out of memory: Kill process 4331 (firefox) score 20 or sacrifice child
[  501.943352] Killed process 4331 (firefox) total-vm:1149688kB, anon-rss:207844kB, file-rss:48172kB, shmem-rss:516kB

	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08  3:51             ` Sergey Senozhatsky
@ 2016-03-08  9:08               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08  9:08 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim, Vlastimil Babka

On Tue 08-03-16 12:51:04, Sergey Senozhatsky wrote:
> Hello Michal,
> 
> On (03/07/16 17:08), Michal Hocko wrote:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > Andrew,
> > > could you queue this one as well, please? This is more a band aid than a
> > > real solution which I will be working on as soon as I am able to
> > > reproduce the issue but the patch should help to some degree at least.
> > 
> > Joonsoo wasn't very happy about this approach, so let me try a different
> > way. What do you think about the following? Hugh, Sergey, does it help
> > for your load? I have tested it with Hugh's load and there was no
> > major difference from the previous testing, so at least nothing has blown
> > up, though I am not able to reproduce the issue here.
> 
> (next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")
> 
> seems it's significantly less likely to oom-kill now, but I still can see
> something like this

Thanks for the testing, this is highly appreciated. If you are able to
reproduce this, then collecting the compaction-related tracepoints might
be really helpful.

> [  501.942745] coretemp-sensor invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  501.942853] active_anon:151312 inactive_anon:54791 isolated_anon:0
>                 active_file:31213 inactive_file:302048 isolated_file:0
>                 unevictable:0 dirty:44 writeback:221 unstable:0
>                 slab_reclaimable:43570 slab_unreclaimable:5651
>                 mapped:16660 shmem:29495 pagetables:2542 bounce:0
>                 free:10884 free_pcp:214 free_cma:0
[...]
> [  501.942867] DMA32 free:23664kB min:6232kB low:9332kB high:12432kB active_anon:516228kB inactive_anon:129136kB active_file:96508kB inactive_file:954780kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107512kB mlocked:0kB dirty:136kB writeback:440kB mapped:51816kB shmem:91488kB slab_reclaimable:129856kB slab_unreclaimable:13876kB kernel_stack:2160kB pagetables:7888kB unstable:0kB bounce:0kB free_pcp:724kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
> [  501.942870] lowmem_reserve[]: 0 0 824 824
> [  501.942876] Normal free:4784kB min:1696kB low:2540kB high:3384kB active_anon:89020kB inactive_anon:90028kB active_file:28248kB inactive_file:253308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:40kB writeback:444kB mapped:14700kB shmem:26492kB slab_reclaimable:44396kB slab_unreclaimable:8620kB kernel_stack:1328kB pagetables:2280kB unstable:0kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:60 all_unreclaimable? no

Both the DMA32 and Normal zones are over their high watermarks, so this
OOM is due to memory fragmentation.

> [  501.942912] DMA32: 564*4kB (UME) 2700*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23856kB
> [  501.942921] Normal: 959*4kB (ME) 128*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4860kB

There are no usable order-2+ pages even though we know that the
compaction was active and didn't back out early. I might be missing
something, of course, and the patch might still be tweaked to be more
conservative. Tracepoints should tell us more, though.
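
For illustration, a tiny standalone program showing why plenty of free
memory can still fail the order-2 check, using the DMA32 counts quoted
above (564 order-0 and 2700 order-1 blocks, nothing at order 2 or higher);
the function is a simplified stand-in for the order-aware part of the
watermark check, not the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 11

/*
 * Only free blocks of the requested order or larger can satisfy the
 * request; smaller blocks do not help no matter how many there are.
 */
static bool has_block_of_order(const unsigned long nr_free[MAX_ORDER], int order)
{
	for (int o = order; o < MAX_ORDER; o++)
		if (nr_free[o])
			return true;
	return false;
}

int main(void)
{
	unsigned long dma32[MAX_ORDER] = { 564, 2700 };	/* orders 2+ are empty */
	unsigned long free_kb = 564 * 4 + 2700 * 8;	/* = 23856kB free */

	printf("free: %lukB, order-2 request satisfiable: %s\n",
	       free_kb, has_block_of_order(dma32, 2) ? "yes" : "no");
	return 0;
}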

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08  9:08               ` Michal Hocko
@ 2016-03-08  9:24                 ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-03-08  9:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim,
	Vlastimil Babka

On (03/08/16 10:08), Michal Hocko wrote:
> On Tue 08-03-16 12:51:04, Sergey Senozhatsky wrote:
> > Hello Michal,
> > 
> > On (03/07/16 17:08), Michal Hocko wrote:
> > > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > > Andrew,
> > > > could you queue this one as well, please? This is more a band aid than a
> > > > real solution which I will be working on as soon as I am able to
> > > > reproduce the issue but the patch should help to some degree at least.
> > > 
> > > Joonsoo wasn't very happy about this approach so let me try a different
> > > way. What do you think about the following? Hugh, Sergey does it help
> > > for your load? I have tested it with the Hugh's load and there was no
> > > major difference from the previous testing so at least nothing has blown
> > > up as I am not able to reproduce the issue here.
> > 
> > (next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")
> > 
> > seems it's significantly less likely to oom-kill now, but I still can see
> > something like this
> 
> Thanks for the testing. This is highly appreciated. If you are able to
> reproduce this then collecting compaction related tracepoints might be
> really helpful.
> 

oh, wow... compaction is disabled, somehow.

  $ zcat /proc/config.gz | grep -i CONFIG_COMPACTION
  # CONFIG_COMPACTION is not set

I should have checked that, sorry!

will enable and re-test.
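
(One way to do that, assuming a standard kbuild tree - CONFIG_COMPACTION
is a compile-time option, so it needs a rebuild and a reboot:

  $ ./scripts/config --enable COMPACTION
  $ make olddefconfig
  $ make -j"$(nproc)"
)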

	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-07 16:08           ` Michal Hocko
@ 2016-03-08  9:24             ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08  9:24 UTC (permalink / raw)
  To: Michal Hocko, Hugh Dickins, Sergey Senozhatsky, Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Joonsoo Kim

On 03/07/2016 05:08 PM, Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> Andrew,
>> could you queue this one as well, please? This is more a band aid than a
>> real solution which I will be working on as soon as I am able to
>> reproduce the issue but the patch should help to some degree at least.
> 
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
> 
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.
> ---
> From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 7 Mar 2016 15:30:37 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available, even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> that further reclaim/compaction attempts would help, but at least make
> sure that compaction was active before we go OOM: keep retrying, even
> if should_reclaim_retry tells us to OOM, as long as the last compaction
> round was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
> 
> Additionally define COMPACT_NONE which reflects cases where the
> compaction is completely disabled.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h |  2 ++
>  mm/page_alloc.c            | 41 ++++++++++++++++++++++++-----------------
>  2 files changed, 26 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..a4cec4a03f7d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> +/* compaction disabled */
> +#define COMPACT_NONE		-1
>  /* Return values for compact_zone() and try_to_compact_pages() */
>  /* compaction didn't start as it was deferred due to past failures */
>  #define COMPACT_DEFERRED	0
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f89e3cbfdf90 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,22 @@ static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		unsigned long *compact_result)
>  {
> -	unsigned long compact_result;
>  	struct page *page;
>  
> -	if (!order)
> +	if (!order) {
> +		*compact_result = COMPACT_NONE;
>  		return NULL;
> +	}
>  
>  	current->flags |= PF_MEMALLOC;
> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  						mode, contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
>  
> -	switch (compact_result) {
> -	case COMPACT_DEFERRED:
> -		*deferred_compaction = true;
> -		/* fall-through */
> -	case COMPACT_SKIPPED:
> +	if (*compact_result <= COMPACT_SKIPPED)

COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
work as expected.

>  		return NULL;
> -	default:
> -		break;
> -	}
>  
>  	/*
>  	 * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2875,8 +2869,9 @@ static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		unsigned long *compact_result)
>  {
> +	*compact_result = COMPACT_NONE;
>  	return NULL;
>  }
>  #endif /* CONFIG_COMPACTION */
> @@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int alloc_flags;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -	bool deferred_compaction = false;
> +	unsigned long compact_result;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
>  	int no_progress_loops = 0;
>  
> @@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
>  					&contended_compaction,
> -					&deferred_compaction);
> +					&compact_result);
>  	if (page)
>  		goto got_pg;
>  
> @@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * to heavily disrupt the system, so we fail the allocation
>  		 * instead of entering direct reclaim.
>  		 */
> -		if (deferred_compaction)
> +		if (compact_result == COMPACT_DEFERRED)
>  			goto nopage;
>  
>  		/*
> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 did_some_progress > 0, no_progress_loops))
>  		goto retry;
>  
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> +	 * contention before we go OOM.
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compact_result <= COMPACT_CONTINUE)

Same here.
I was going to say that this didn't have effect on Sergey's test, but
turns out it did :)

> +			goto retry;
> +		if (contended_compaction > COMPACT_CONTENDED_NONE)
> +			goto retry;
> +	}
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &contended_compaction,
> -					    &deferred_compaction);
> +					    &compact_result);
>  	if (page)
>  		goto got_pg;
>  nopage:
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08  9:24             ` Vlastimil Babka
@ 2016-03-08  9:32               ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-03-08  9:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Hugh Dickins, Sergey Senozhatsky, Andrew Morton,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Joonsoo Kim

On (03/08/16 10:24), Vlastimil Babka wrote:
[..]
> > @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  				 did_some_progress > 0, no_progress_loops))
> >  		goto retry;
> >  
> > +	/*
> > +	 * !costly allocations are really important and we have to make sure
> > +	 * the compaction wasn't deferred or didn't bail out early due to locks
> > +	 * contention before we go OOM.
> > +	 */
> > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> > +		if (compact_result <= COMPACT_CONTINUE)
> 
> Same here.
> I was going to say that this didn't have effect on Sergey's test, but
> turns out it did :)

I'm sorry, my test is not correct. I disabled compaction last weekend on
purpose - to provoke more OOM kills and OOM conditions while testing the
reworked printk() patch set (http://marc.info/?l=linux-kernel&m=145734549308803) -
and I forgot to re-enable it.

	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08  9:24             ` Vlastimil Babka
@ 2016-03-08  9:46               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08  9:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On Tue 08-03-16 10:24:56, Vlastimil Babka wrote:
[...]
> > @@ -2819,28 +2819,22 @@ static struct page *
> >  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  		int alloc_flags, const struct alloc_context *ac,
> >  		enum migrate_mode mode, int *contended_compaction,
> > -		bool *deferred_compaction)
> > +		unsigned long *compact_result)
> >  {
> > -	unsigned long compact_result;
> >  	struct page *page;
> >  
> > -	if (!order)
> > +	if (!order) {
> > +		*compact_result = COMPACT_NONE;
> >  		return NULL;
> > +	}
> >  
> >  	current->flags |= PF_MEMALLOC;
> > -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> > +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> >  						mode, contended_compaction);
> >  	current->flags &= ~PF_MEMALLOC;
> >  
> > -	switch (compact_result) {
> > -	case COMPACT_DEFERRED:
> > -		*deferred_compaction = true;
> > -		/* fall-through */
> > -	case COMPACT_SKIPPED:
> > +	if (*compact_result <= COMPACT_SKIPPED)
> 
> COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
> work as expected.

Well, COMPACT_NONE is documented as /* compaction disabled */ so we
should never get it from try_to_compact_pages.

[...]
> > @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  				 did_some_progress > 0, no_progress_loops))
> >  		goto retry;
> >  
> > +	/*
> > +	 * !costly allocations are really important and we have to make sure
> > +	 * the compaction wasn't deferred or didn't bail out early due to locks
> > +	 * contention before we go OOM.
> > +	 */
> > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> > +		if (compact_result <= COMPACT_CONTINUE)
> 
> Same here.
> I was going to say that this didn't have effect on Sergey's test, but
> turns out it did :)

This should work as expected because compact_result is unsigned long
and so this is the unsigned arithmetic. I can make
#define COMPACT_NONE            -1UL

to make the intention more obvious if you prefer, though.
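
(A minimal userspace sketch of the unsigned comparison being discussed;
the constant values below are stand-ins for illustration, not the real
kernel definitions:

	#include <stdio.h>

	#define COMPACT_NONE		-1UL	/* wraps to ULONG_MAX as unsigned long */
	#define COMPACT_CONTINUE	2UL	/* stand-in for the small positive kernel value */

	int main(void)
	{
		unsigned long compact_result = COMPACT_NONE;

		/* ULONG_MAX <= 2 is false, so COMPACT_NONE never takes the retry path */
		printf("%s\n", compact_result <= COMPACT_CONTINUE ? "retry" : "no retry");
		return 0;
	}

which prints "no retry".)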

Thanks for the review.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08  9:46               ` Michal Hocko
@ 2016-03-08  9:52                 ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08  9:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On 03/08/2016 10:46 AM, Michal Hocko wrote:
> On Tue 08-03-16 10:24:56, Vlastimil Babka wrote:
> [...]
>>> @@ -2819,28 +2819,22 @@ static struct page *
>>>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>>  		int alloc_flags, const struct alloc_context *ac,
>>>  		enum migrate_mode mode, int *contended_compaction,
>>> -		bool *deferred_compaction)
>>> +		unsigned long *compact_result)
>>>  {
>>> -	unsigned long compact_result;
>>>  	struct page *page;
>>>  
>>> -	if (!order)
>>> +	if (!order) {
>>> +		*compact_result = COMPACT_NONE;
>>>  		return NULL;
>>> +	}
>>>  
>>>  	current->flags |= PF_MEMALLOC;
>>> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>>  						mode, contended_compaction);
>>>  	current->flags &= ~PF_MEMALLOC;
>>>  
>>> -	switch (compact_result) {
>>> -	case COMPACT_DEFERRED:
>>> -		*deferred_compaction = true;
>>> -		/* fall-through */
>>> -	case COMPACT_SKIPPED:
>>> +	if (*compact_result <= COMPACT_SKIPPED)
>>
>> COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
>> work as expected.
> 
> Well, COMPACT_NONE is documented as /* compaction disabled */ so we
> should never get it from try_to_compact_pages.

Right.

>
> [...]
>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>  				 did_some_progress > 0, no_progress_loops))
>>>  		goto retry;
>>>  
>>> +	/*
>>> +	 * !costly allocations are really important and we have to make sure
>>> +	 * the compaction wasn't deferred or didn't bail out early due to locks
>>> +	 * contention before we go OOM.
>>> +	 */
>>> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
>>> +		if (compact_result <= COMPACT_CONTINUE)
>>
>> Same here.
>> I was going to say that this didn't have effect on Sergey's test, but
>> turns out it did :)
> 
> This should work as expected because compact_result is unsigned long
> and so this is the unsigned arithmetic. I can make
> #define COMPACT_NONE            -1UL
> 
> to make the intention more obvious if you prefer, though.

Well, what wasn't obvious to me is actually that here (unlike in the
test above) it was actually intended that COMPACT_NONE doesn't result in
a retry. But it makes sense, otherwise we would retry endlessly if
reclaim couldn't form a higher-order page, right.

> Thanks for the review.
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-07 16:08           ` Michal Hocko
@ 2016-03-08  9:58             ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-03-08  9:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim,
	Vlastimil Babka

On (03/07/16 17:08), Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
> 
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
> 
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.

works fine for me.

$  cat /proc/vmstat | egrep -e "compact|swap"
pgsteal_kswapd_dma 7
pgsteal_kswapd_dma32 6457075
pgsteal_kswapd_normal 1462767
pgsteal_kswapd_movable 0
pgscan_kswapd_dma 18
pgscan_kswapd_dma32 6544126
pgscan_kswapd_normal 1495604
pgscan_kswapd_movable 0
kswapd_inodesteal 29
kswapd_low_wmark_hit_quickly 1168
kswapd_high_wmark_hit_quickly 1627
compact_migrate_scanned 5762793
compact_free_scanned 54090239
compact_isolated 1303895
compact_stall 1542
compact_fail 1117
compact_success 425
compact_kcompatd_wake 0

no OOM-kills after 6 rounds of tests.

Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

thanks!

	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08  9:52                 ` Vlastimil Babka
@ 2016-03-08 10:10                   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 10:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
> On 03/08/2016 10:46 AM, Michal Hocko wrote:
[...]
> >>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>>  				 did_some_progress > 0, no_progress_loops))
> >>>  		goto retry;
> >>>  
> >>> +	/*
> >>> +	 * !costly allocations are really important and we have to make sure
> >>> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> >>> +	 * contention before we go OOM.
> >>> +	 */
> >>> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>> +		if (compact_result <= COMPACT_CONTINUE)
> >>
> >> Same here.
> >> I was going to say that this didn't have effect on Sergey's test, but
> >> turns out it did :)
> > 
> > This should work as expected because compact_result is unsigned long
> > and so this is the unsigned arithmetic. I can make
> > #define COMPACT_NONE            -1UL
> > 
> > to make the intention more obvious if you prefer, though.
> 
> Well, what wasn't obvious to me is actually that here (unlike in the
> test above) it was actually intended that COMPACT_NONE doesn't result in
> a retry. But it makes sense, otherwise we would retry endlessly if
> reclaim couldn't form a higher-order page, right.

Yeah, that was the whole point. An alternative would be moving the test
into should_compact_retry(order, compact_result, contended_compaction)
which would be CONFIG_COMPACTION specific so we can get rid of the
COMPACT_NONE altogether. Something like the following. We would lose the
always initialized compact_result but this would matter only for
order==0 and we check for that. Even gcc doesn't complain.

A more important question is whether the criteria I have chosen are
reasonable and reasonably independent of the particular implementation
of compaction. I still cannot convince myself about the convergence
here. Is it possible that compaction would keep returning
compact_result <= COMPACT_CONTINUE while not making any progress at all?

Sure, we can see a case where somebody is stealing the compacted blocks,
but that is the same as with order-0, where parallel mem eaters will
piggyback on the reclaimer and there is no upper boundary either.

---
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a4cec4a03f7d..4cd4ddf64cc7 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,8 +1,6 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* compaction disabled */
-#define COMPACT_NONE		-1
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* compaction didn't start as it was deferred due to past failures */
 #define COMPACT_DEFERRED	0
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f89e3cbfdf90..c5932a218fc6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2823,10 +2823,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 {
 	struct page *page;
 
-	if (!order) {
-		*compact_result = COMPACT_NONE;
+	if (!order)
 		return NULL;
-	}
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
@@ -2864,6 +2862,25 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, unsigned long compact_result,
+		     int contended_compaction)
+{
+	/*
+	 * !costly allocations are really important and we have to make sure
+	 * the compaction wasn't deferred or didn't bail out early due to locks
+	 * contention before we go OOM.
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		if (compact_result <= COMPACT_CONTINUE)
+			return true;
+		if (contended_compaction > COMPACT_CONTENDED_NONE)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -2871,9 +2888,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum migrate_mode mode, int *contended_compaction,
 		unsigned long *compact_result)
 {
-	*compact_result = COMPACT_NONE;
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, unsigned long compact_result,
+		     int contended_compaction)
+{
+	return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 /* Perform direct synchronous page reclaim */
@@ -3289,17 +3312,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
-	/*
-	 * !costly allocations are really important and we have to make sure
-	 * the compaction wasn't deferred or didn't bail out early due to locks
-	 * contention before we go OOM.
-	 */
-	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		if (compact_result <= COMPACT_CONTINUE)
-			goto retry;
-		if (contended_compaction > COMPACT_CONTENDED_NONE)
-			goto retry;
-	}
+	if (should_compact_retry(order, compact_result, contended_compaction))
+		goto retry;
 
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-07 16:08           ` Michal Hocko
                             ` (3 preceding siblings ...)
  (?)
@ 2016-03-08 10:36           ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-08 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim,
	Vlastimil Babka

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1308 bytes --]

On Mon, 7 Mar 2016, Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
> 
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.

Did not help with my load at all, I'm afraid: quite the reverse,
OOMed very much sooner (as usual on order=2), and with much more
noise (multiple OOMs) than your previous patch.

vmstats.xz attached; sorry, I don't have tracing built in,
and must move on to the powerpc issue before going back to bed.

I do hate replying without having something constructive to say, but
have very little time to think about this, and no bright ideas so far.

I do not understand why it's so easy for me to reproduce, yet impossible
for you - unless it's that you are still doing all your testing in a VM?
Is Sergey the only other to see this issue?

Hugh

[-- Attachment #2: Type: APPLICATION/x-xz, Size: 8200 bytes --]

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08 10:10                   ` Michal Hocko
@ 2016-03-08 11:12                     ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 11:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On 03/08/2016 11:10 AM, Michal Hocko wrote:
> On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
>> On 03/08/2016 10:46 AM, Michal Hocko wrote:
> [...]
>>>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>>>  				 did_some_progress > 0, no_progress_loops))
>>>>>  		goto retry;
>>>>>  
>>>>> +	/*
>>>>> +	 * !costly allocations are really important and we have to make sure
>>>>> +	 * the compaction wasn't deferred or didn't bail out early due to locks
>>>>> +	 * contention before we go OOM.
>>>>> +	 */
>>>>> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
>>>>> +		if (compact_result <= COMPACT_CONTINUE)
>>>>
>>>> Same here.
>>>> I was going to say that this didn't have effect on Sergey's test, but
>>>> turns out it did :)
>>>
>>> This should work as expected because compact_result is unsigned long
>>> and so this is the unsigned arithmetic. I can make
>>> #define COMPACT_NONE            -1UL
>>>
>>> to make the intention more obvious if you prefer, though.
>>
>> Well, what wasn't obvious to me is actually that here (unlike in the
>> test above) it was actually intended that COMPACT_NONE doesn't result in
>> a retry. But it makes sense, otherwise we would retry endlessly if
>> reclaim couldn't form a higher-order page, right.
> 
> Yeah, that was the whole point. An alternative would be moving the test
> into should_compact_retry(order, compact_result, contended_compaction)
> which would be CONFIG_COMPACTION specific so we can get rid of the
> COMPACT_NONE altogether. Something like the following. We would lose the
> always initialized compact_result but this would matter only for
> order==0 and we check for that. Even gcc doesn't complain.

Yeah I like this version better, you can add my Acked-By.

Thanks.

> A more important question is whether the criteria I have chosen are
> reasonable and reasonably independent of the particular implementation
> of compaction. I still cannot convince myself about the convergence
> here. Is it possible that compaction would keep returning
> compact_result <= COMPACT_CONTINUE while not making any progress at all?

Theoretically yes, if the reclaim/compaction suitability decisions and the
allocation attempts didn't use matching watermark checks, including the
alloc_flags and classzone_idx parameters. Possible scenarios:

- reclaim thinks compaction has enough to proceed, but compaction thinks
otherwise and returns COMPACT_SKIPPED
- compaction thinks it succeeded and returns COMPACT_PARTIAL, but
allocation attempt fails
- and perhaps some other combinations

> Sure we can see a case where somebody is stealing the compacted blocks
> but that is the very same with the order-0 case where parallel mem eaters
> will piggy back on the reclaimer and there is no upper boundary as well.

Yep.
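
A minimal user-space sketch of the unsigned-arithmetic point above (not
kernel code; the constants are illustrative stand-ins for the ones in
include/linux/compaction.h, with COMPACT_NONE taken as -1UL as suggested
in the quoted text). Because compact_result is unsigned, -1UL becomes the
largest possible value and therefore never passes the <= COMPACT_CONTINUE
test:

#include <stdio.h>

#define COMPACT_NONE		-1UL	/* compaction was not invoked at all */
#define COMPACT_DEFERRED	0UL
#define COMPACT_SKIPPED		1UL
#define COMPACT_CONTINUE	2UL

int main(void)
{
	unsigned long results[] = {
		COMPACT_NONE, COMPACT_DEFERRED, COMPACT_SKIPPED, COMPACT_CONTINUE,
	};

	for (int i = 0; i < 4; i++)
		printf("compact_result=%lu retry=%s\n", results[i],
		       results[i] <= COMPACT_CONTINUE ? "yes" : "no");

	/* On 64-bit, COMPACT_NONE prints as 18446744073709551615 and does
	 * not cause a retry; the three real states all do. */
	return 0;
}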

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08 11:12                     ` Vlastimil Babka
@ 2016-03-08 12:22                       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 12:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On Tue 08-03-16 12:12:20, Vlastimil Babka wrote:
> On 03/08/2016 11:10 AM, Michal Hocko wrote:
> > On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
> >> On 03/08/2016 10:46 AM, Michal Hocko wrote:
> > [...]
> >>>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>>>>  				 did_some_progress > 0, no_progress_loops))
> >>>>>  		goto retry;
> >>>>>  
> >>>>> +	/*
> >>>>> +	 * !costly allocations are really important and we have to make sure
> >>>>> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> >>>>> +	 * contention before we go OOM.
> >>>>> +	 */
> >>>>> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>>>> +		if (compact_result <= COMPACT_CONTINUE)
> >>>>
> >>>> Same here.
> >>>> I was going to say that this didn't have effect on Sergey's test, but
> >>>> turns out it did :)
> >>>
> >>> This should work as expected because compact_result is unsigned long
> >>> and so this is the unsigned arithmetic. I can make
> >>> #define COMPACT_NONE            -1UL
> >>>
> >>> to make the intention more obvious if you prefer, though.
> >>
> >> Well, what wasn't obvious to me is actually that here (unlike in the
> >> test above) it was actually intended that COMPACT_NONE doesn't result in
> >> a retry. But it makes sense, otherwise we would retry endlessly if
> >> reclaim couldn't form a higher-order page, right.
> > 
> > Yeah, that was the whole point. An alternative would be moving the test
> > into should_compact_retry(order, compact_result, contended_compaction)
> > which would be CONFIG_COMPACTION specific so we can get rid of the
> > COMPACT_NONE altogether. Something like the following. We would lose the
> > always initialized compact_result but this would matter only for
> > order==0 and we check for that. Even gcc doesn't complain.
> 
> Yeah I like this version better, you can add my Acked-By.

OK, patch updated and I will post it as a reply to the original email.
 
> Thanks.
> 
> > A more important question is whether the criteria I have chosen are
> > reasonable and reasonably independent on the particular implementation
> > of the compaction. I still cannot convince myself about the convergence
> > here. Is it possible that the compaction would keep returning 
> > compact_result <= COMPACT_CONTINUE while not making any progress at all?
> 
> Theoretically, if reclaim/compaction suitability decisions and
> allocation attempts didn't match the watermark checks, including the
> alloc_flags and classzone_idx parameters. Possible scenarios:
> 
> - reclaim thinks compaction has enough to proceed, but compaction thinks
> otherwise and returns COMPACT_SKIPPED
> - compaction thinks it succeeded and returns COMPACT_PARTIAL, but
> allocation attempt fails
> - and perhaps some other combinations

But that might happen right now as well so it wouldn't be a regression,
right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more
  2016-03-08 12:22                       ` Michal Hocko
@ 2016-03-08 12:29                         ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 12:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim

On 03/08/2016 01:22 PM, Michal Hocko wrote:
>> Thanks.
>>
>>> A more important question is whether the criteria I have chosen are
>>> reasonable and reasonably independent on the particular implementation
>>> of the compaction. I still cannot convince myself about the convergence
>>> here. Is it possible that the compaction would keep returning 
>>> compact_result <= COMPACT_CONTINUE while not making any progress at all?
>>
>> Theoretically, if reclaim/compaction suitability decisions and
>> allocation attempts didn't match the watermark checks, including the
>> alloc_flags and classzone_idx parameters. Possible scenarios:
>>
>> - reclaim thinks compaction has enough to proceed, but compaction thinks
>> otherwise and returns COMPACT_SKIPPED
>> - compaction thinks it succeeded and returns COMPACT_PARTIAL, but
>> allocation attempt fails
>> - and perhaps some other combinations
> 
> But that might happen right now as well so it wouldn't be a regression,
> right?

Maybe, somehow, I didn't study closely how the retry decisions work.
Your patch adds another way to retry so it's theoretically more
dangerous. Just hinting at what to possibly check (the watermark checks) :)

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 0/2] oom rework: high order enhancements
  2016-03-07 16:08           ` Michal Hocko
@ 2016-03-08 13:42             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

The first two patches are cleanups for the compaction code and the last
patch is updated as per Vlastimil's feedback. I didn't add his Acked-by
because I have added COMPACT_SHOULD_RETRY to make the retry logic in
the page allocator more robust against future changes.

Hugh has reported that this is still not sufficient, but I would prefer
to handle the issue he is seeing in a separate patch once we understand
what is going on there. The last patch sounds like a reasonable starting
point to me.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum
  2016-03-08 13:42             ` Michal Hocko
@ 2016-03-08 13:42               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long

but there doesn't seem to be any reason for that. None of the functions
which return/use one of those constants expects any other value, so it
really makes sense to define an enum for them and make it clear that no
other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
 mm/compaction.c            | 27 ++++++++++++++-------------
 mm/page_alloc.c            |  2 +-
 3 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf64cc7..b167801187e7 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
 #define _LINUX_COMPACTION_H
 
 /* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED	0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED		1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE	2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL		3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE	4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE	5
-#define COMPACT_NOT_SUITABLE_ZONE	6
-#define COMPACT_CONTENDED		7
 /* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+	/* compaction didn't start as it was deferred due to past failures */
+	COMPACT_DEFERRED,
+	/*
+	 * compaction didn't start as it was not possible or direct reclaim
+	 * was more suitable
+	 */
+	COMPACT_SKIPPED,
+	/* compaction should continue to another pageblock */
+	COMPACT_CONTINUE,
+	/*
+	 * direct compaction partially compacted a zone and there are suitable
+	 * pages
+	 */
+	COMPACT_PARTIAL,
+	/* The full zone was compacted */
+	COMPACT_COMPLETE,
+	/* For more detailed tracepoint output */
+	COMPACT_NO_SUITABLE_PAGE,
+	COMPACT_NOT_SUITABLE_ZONE,
+	COMPACT_CONTENDED,
+};
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
 /* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+			unsigned int order,
 			int alloc_flags, const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx);
 
 extern void defer_compaction(struct zone *zone, int order);
@@ -53,7 +62,7 @@ extern void compaction_defer_reset(struct zone *zone, int order,
 extern bool compaction_restarting(struct zone *zone, int order);
 
 #else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
 			const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended)
@@ -69,7 +78,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
 
-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
 	return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 585de54dbe8c..0f61f12d82b6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1195,7 +1195,7 @@ static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
 	unsigned int order;
@@ -1258,8 +1258,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
 	return COMPACT_NO_SUITABLE_PAGE;
 }
 
-static int compact_finished(struct zone *zone, struct compact_control *cc,
-			    const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+			struct compact_control *cc,
+			const int migratetype)
 {
 	int ret;
 
@@ -1278,7 +1279,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
  *   COMPACT_PARTIAL  - If the allocation would succeed without compaction
  *   COMPACT_CONTINUE - If compaction should run now
  */
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
 	int fragindex;
@@ -1323,10 +1324,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 	return COMPACT_CONTINUE;
 }
 
-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
 					int alloc_flags, int classzone_idx)
 {
-	unsigned long ret;
+	enum compact_result ret;
 
 	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
 	trace_mm_compaction_suitable(zone, order, ret);
@@ -1336,9 +1337,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
 	return ret;
 }
 
-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
 {
-	int ret;
+	enum compact_result ret;
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(zone);
 	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1483,11 +1484,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
 		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
 		int alloc_flags, int classzone_idx)
 {
-	unsigned long ret;
+	enum compact_result ret;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -1524,7 +1525,7 @@ int sysctl_extfrag_threshold = 500;
  *
  * This is the main entry point for direct page compaction.
  */
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			int alloc_flags, const struct alloc_context *ac,
 			enum migrate_mode mode, int *contended)
 {
@@ -1532,7 +1533,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
-	int rc = COMPACT_DEFERRED;
+	enum compact_result rc = COMPACT_DEFERRED;
 	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
 
 	*contended = COMPACT_CONTENDED_NONE;
@@ -1546,7 +1547,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
-		int status;
+		enum compact_result status;
 		int zone_contended;
 
 		if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..4acc0aa1aee0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,7 +2821,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum migrate_mode mode, int *contended_compaction,
 		bool *deferred_compaction)
 {
-	unsigned long compact_result;
+	enum compact_result compact_result;
 	struct page *page;
 
 	if (!order)
-- 
2.7.0
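
A stand-alone illustration of the practical benefit of the enum
conversion (an example, not part of the patch): once the return type is
an enum, gcc's -Wswitch (part of -Wall) can point out a switch statement
that does not handle every enumerator, which is precisely the warning
quoted in the follow-up patch. Plain #define'd integers give the compiler
nothing to check against.

enum compact_result {
	COMPACT_DEFERRED,
	COMPACT_SKIPPED,
	COMPACT_CONTINUE,
};

const char *compact_result_name(enum compact_result result)
{
	switch (result) {	/* -Wswitch: COMPACT_CONTINUE not handled */
	case COMPACT_DEFERRED:
		return "deferred";
	case COMPACT_SKIPPED:
		return "skipped";
	}
	return "unexpected";
}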

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* [PATCH 2/3] mm, compaction: cover all compaction modes in compact_zone
  2016-03-08 13:42             ` Michal Hocko
@ 2016-03-08 13:42               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The compiler complains after "mm, compaction: change COMPACT_
constants into enum":

mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
  switch (ret) {
  ^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/compaction.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0f61f12d82b6..86968d3a04e6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1347,15 +1347,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 
 	ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
 							cc->classzone_idx);
-	switch (ret) {
-	case COMPACT_PARTIAL:
-	case COMPACT_SKIPPED:
-		/* Compaction is likely to fail */
+	/* Compaction is likely to fail */
+	if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
 		return ret;
-	case COMPACT_CONTINUE:
-		/* Fall through to compaction */
-		;
-	}
+
+	/* huh, compaction_suitable is returning something unexpected */
+	VM_BUG_ON(ret != COMPACT_CONTINUE);
 
 	/*
 	 * Clear pageblock skip if there were failures recently and compaction
-- 
2.7.0
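
For readers outside the kernel tree, here is a tiny user-space analogue
of the restructured check, with assert() standing in for VM_BUG_ON()
(which only fires on CONFIG_DEBUG_VM kernels). It is a sketch, not the
patch itself; the point is that only three return values are legitimate
here, and anything else is a bug in compaction_suitable() rather than a
case to fall through silently:

#include <assert.h>

enum compact_result {
	COMPACT_DEFERRED,
	COMPACT_SKIPPED,
	COMPACT_CONTINUE,
	COMPACT_PARTIAL,
	COMPACT_COMPLETE,
};

/* Analogue of the check in compact_zone() after this patch. */
enum compact_result filter_suitable(enum compact_result ret)
{
	/* Compaction is likely to fail */
	if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
		return ret;

	/* compaction_suitable() promises nothing else can come back */
	assert(ret == COMPACT_CONTINUE);
	return ret;
}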

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-08 13:42             ` Michal Hocko
@ 2016-03-08 13:42               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 13:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where a high-order request (e.g.
the order-2 allocation required for the stack during fork) triggers OOM
too early - e.g. after the first reclaim/compaction round. Such a system
would have to be highly fragmented and there is no guarantee that further
reclaim/compaction attempts would help, but at least make sure that
compaction was active before we go OOM. Keep retrying, even when
should_reclaim_retry tells us to OOM, as long as the last compaction
round was either inactive (deferred, skipped or bailed out early due to
lock contention) or told us to continue.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h |  5 +++++
 mm/page_alloc.c            | 53 ++++++++++++++++++++++++++++++++--------------
 2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b167801187e7..49e04326dcb8 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -14,6 +14,11 @@ enum compact_result {
 	/* compaction should continue to another pageblock */
 	COMPACT_CONTINUE,
 	/*
+	 * whoever is calling compaction should retry because it was either
+	 * not active or it tells us there is more work to be done.
+	 */
+	COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
+	/*
 	 * direct compaction partially compacted a zone and there are suitable
 	 * pages
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4acc0aa1aee0..041aeb1dc3b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2819,28 +2819,20 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum compact_result *compact_result)
 {
-	enum compact_result compact_result;
 	struct page *page;
 
 	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
-	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
-	switch (compact_result) {
-	case COMPACT_DEFERRED:
-		*deferred_compaction = true;
-		/* fall-through */
-	case COMPACT_SKIPPED:
+	if (*compact_result <= COMPACT_SKIPPED)
 		return NULL;
-	default:
-		break;
-	}
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2870,15 +2862,41 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     int contended_compaction)
+{
+	/*
+	 * !costly allocations are really important and we have to make sure
+	 * the compaction wasn't deferred or didn't bail out early due to locks
+	 * contention before we go OOM.
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		if (compact_result <= COMPACT_SHOULD_RETRY)
+			return true;
+		if (contended_compaction > COMPACT_CONTENDED_NONE)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum compact_result *compact_result)
 {
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     int contended_compaction)
+{
+	return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 /* Perform direct synchronous page reclaim */
@@ -3118,7 +3136,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int alloc_flags;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
-	bool deferred_compaction = false;
+	enum compact_result compact_result;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
 	int no_progress_loops = 0;
 
@@ -3227,7 +3245,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
 					&contended_compaction,
-					&deferred_compaction);
+					&compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3240,7 +3258,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * to heavily disrupt the system, so we fail the allocation
 		 * instead of entering direct reclaim.
 		 */
-		if (deferred_compaction)
+		if (compact_result == COMPACT_DEFERRED)
 			goto nopage;
 
 		/*
@@ -3294,6 +3312,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
+	if (should_compact_retry(order, compact_result, contended_compaction))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3314,7 +3335,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
 					    ac, migration_mode,
 					    &contended_compaction,
-					    &deferred_compaction);
+					    &compact_result);
 	if (page)
 		goto got_pg;
 nopage:
-- 
2.7.0
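
To see the new heuristic in isolation, the following user-space sketch
mirrors the should_compact_retry() logic added above and exercises it for
a few representative cases. The constant values are illustrative
stand-ins; PAGE_ALLOC_COSTLY_ORDER is 3 in current kernels, so the
heuristic only applies to orders 1-3:

#include <stdio.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3	/* matches the kernel's value */
#define COMPACT_CONTENDED_NONE	0	/* illustrative stand-in */
#define COMPACT_CONTENDED_LOCK	1	/* illustrative stand-in */

enum compact_result {
	COMPACT_DEFERRED,
	COMPACT_SKIPPED,
	COMPACT_CONTINUE,
	COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
	COMPACT_PARTIAL,
	COMPACT_COMPLETE,
};

/* Same shape as the helper added by this patch. */
static bool should_compact_retry(unsigned int order,
				 enum compact_result compact_result,
				 int contended_compaction)
{
	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		if (compact_result <= COMPACT_SHOULD_RETRY)
			return true;
		if (contended_compaction > COMPACT_CONTENDED_NONE)
			return true;
	}
	return false;
}

int main(void)
{
	/* order-0 never takes this path; should_reclaim_retry covers it */
	printf("%d\n", should_compact_retry(0, COMPACT_DEFERRED,
					    COMPACT_CONTENDED_NONE));	/* 0 */
	/* order-2 (e.g. fork's stack) with deferred compaction: retry */
	printf("%d\n", should_compact_retry(2, COMPACT_DEFERRED,
					    COMPACT_CONTENDED_NONE));	/* 1 */
	/* order-2, compaction ran the whole zone and failed: allow OOM */
	printf("%d\n", should_compact_retry(2, COMPACT_COMPLETE,
					    COMPACT_CONTENDED_NONE));	/* 0 */
	/* order-2, compaction bailed out on lock contention: retry */
	printf("%d\n", should_compact_retry(2, COMPACT_COMPLETE,
					    COMPACT_CONTENDED_LOCK));	/* 1 */
	/* costly order-4: the heuristic deliberately stays out of the way */
	printf("%d\n", should_compact_retry(4, COMPACT_DEFERRED,
					    COMPACT_CONTENDED_NONE));	/* 0 */
	return 0;
}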

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08  9:58             ` Sergey Senozhatsky
@ 2016-03-08 13:57               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 13:57 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Joonsoo Kim, Vlastimil Babka

On Tue 08-03-16 18:58:24, Sergey Senozhatsky wrote:
> On (03/07/16 17:08), Michal Hocko wrote:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > Andrew,
> > > could you queue this one as well, please? This is more a band aid than a
> > > real solution which I will be working on as soon as I am able to
> > > reproduce the issue but the patch should help to some degree at least.
> > 
> > Joonsoo wasn't very happy about this approach so let me try a different
> > way. What do you think about the following? Hugh, Sergey does it help
> > for your load? I have tested it with the Hugh's load and there was no
> > major difference from the previous testing so at least nothing has blown
> > up as I am not able to reproduce the issue here.
> > 
> > Other changes in the compaction are still needed but I would like to not
> > depend on them right now.
> 
> works fine for me.
> 
> $  cat /proc/vmstat | egrep -e "compact|swap"
> pgsteal_kswapd_dma 7
> pgsteal_kswapd_dma32 6457075
> pgsteal_kswapd_normal 1462767
> pgsteal_kswapd_movable 0
> pgscan_kswapd_dma 18
> pgscan_kswapd_dma32 6544126
> pgscan_kswapd_normal 1495604
> pgscan_kswapd_movable 0
> kswapd_inodesteal 29
> kswapd_low_wmark_hit_quickly 1168
> kswapd_high_wmark_hit_quickly 1627
> compact_migrate_scanned 5762793
> compact_free_scanned 54090239
> compact_isolated 1303895
> compact_stall 1542
> compact_fail 1117
> compact_success 425
> compact_kcompatd_wake 0
> 
> no OOM-kills after 6 rounds of tests.
> 
> Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

Thanks for retesting!
-- 
Michal Hocko
SUSE Labs
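
As a quick sanity check on the quoted counters: compact_stall (1542)
counts direct compaction attempts and matches compact_fail (1117) +
compact_success (425) exactly, so just under 28% of the direct compaction
stalls succeeded across the six test rounds.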

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-08 14:19                 ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 14:19 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML,
	Michal Hocko

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> compaction code is doing weird dances between
> COMPACT_FOO -> int -> unsigned long
> 
> but there doesn't seem to be any reason for that. All functions which

I vaguely recall trying this once and running into header dependency
hell. But maybe it was something a bit different and involved storing a
value in struct compact_control.

> return/use one of those constants are not expecting any other value
> so it really makes sense to define an enum for them and make it clear
> that no other values are expected.
> 
> This is a pure cleanup and shouldn't introduce any functional changes.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-08 14:22                 ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 14:22 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML,
	Michal Hocko

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> the compiler is complaining after "mm, compaction: change COMPACT_
> constants into enum"

Potentially a squash into that patch then?

> mm/compaction.c: In function ‘compact_zone’:
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
>   switch (ret) {
>   ^
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]
> 
> compaction_suitable is allowed to return only COMPACT_PARTIAL,
> COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
> impossible. Put a VM_BUG_ON to catch an impossible return value.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
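
A sketch of what the fix amounts to, assuming the compact_zone() call site
looks roughly like the code behind the warnings above (illustrative only,
not the actual diff):

	ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
				  cc->classzone_idx);
	switch (ret) {
	case COMPACT_PARTIAL:
	case COMPACT_SKIPPED:
		/* Compaction is likely to fail */
		return ret;
	case COMPACT_CONTINUE:
		/* Fall through to compaction */
		break;
	default:
		/* compaction_suitable() is not supposed to return anything else */
		VM_BUG_ON(1);
	}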

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-08 14:34                 ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 14:34 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML,
	Michal Hocko

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if the last compaction round
> was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h |  5 +++++
>  mm/page_alloc.c            | 53 ++++++++++++++++++++++++++++++++--------------
>  2 files changed, 42 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b167801187e7..49e04326dcb8 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -14,6 +14,11 @@ enum compact_result {
>  	/* compaction should continue to another pageblock */
>  	COMPACT_CONTINUE,
>  	/*
> +	 * whoever is calling compaction should retry because it was either
> +	 * not active or it tells us there is more work to be done.
> +	 */
> +	COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,

Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
overloading of CONTINUE only applies to compaction_suitable(). But the
value that should_compact_retry() is testing comes only from
try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?
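
For reference, a sketch of the resulting value ordering, reconstructed from
the hunks in this thread rather than copied from the header (so take the
tail of the list with a grain of salt):

	enum compact_result {
		COMPACT_DEFERRED,	/* not started, deferred due to past failures */
		COMPACT_SKIPPED,	/* not started, e.g. not enough free pages */
		COMPACT_CONTINUE,	/* should continue to another pageblock */
		COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
		COMPACT_PARTIAL,	/* partially compacted, suitable pages exist */
		/* COMPACT_COMPLETE, COMPACT_CONTENDED, ... */
	};

so "compact_result <= COMPACT_SHOULD_RETRY" covers DEFERRED, SKIPPED and (in
principle) CONTINUE, even though only the first two ever come back from
try_to_compact_pages().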

> +	/*
>  	 * direct compaction partially compacted a zone and there are suitable
>  	 * pages
>  	 */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4acc0aa1aee0..041aeb1dc3b4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,20 @@ static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
> -	enum compact_result compact_result;
>  	struct page *page;
>  
>  	if (!order)
>  		return NULL;
>  
>  	current->flags |= PF_MEMALLOC;
> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  						mode, contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
>  
> -	switch (compact_result) {
> -	case COMPACT_DEFERRED:
> -		*deferred_compaction = true;
> -		/* fall-through */
> -	case COMPACT_SKIPPED:
> +	if (*compact_result <= COMPACT_SKIPPED)
>  		return NULL;
> -	default:
> -		break;
> -	}
>  
>  	/*
>  	 * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2870,15 +2862,41 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction)
> +{
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> +	 * contention before we go OOM.
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compact_result <= COMPACT_SHOULD_RETRY)
> +			return true;
> +		if (contended_compaction > COMPACT_CONTENDED_NONE)
> +			return true;
> +	}
> +
> +	return false;
> +}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
> @@ -3118,7 +3136,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int alloc_flags;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -	bool deferred_compaction = false;
> +	enum compact_result compact_result;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
>  	int no_progress_loops = 0;
>  
> @@ -3227,7 +3245,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
>  					&contended_compaction,
> -					&deferred_compaction);
> +					&compact_result);
>  	if (page)
>  		goto got_pg;
>  
> @@ -3240,7 +3258,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * to heavily disrupt the system, so we fail the allocation
>  		 * instead of entering direct reclaim.
>  		 */
> -		if (deferred_compaction)
> +		if (compact_result == COMPACT_DEFERRED)
>  			goto nopage;
>  
>  		/*
> @@ -3294,6 +3312,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 did_some_progress > 0, no_progress_loops))
>  		goto retry;
>  
> +	if (should_compact_retry(order, compact_result, contended_compaction))
> +		goto retry;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3314,7 +3335,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &contended_compaction,
> -					    &deferred_compaction);
> +					    &compact_result);
>  	if (page)
>  		goto got_pg;
>  nopage:
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-08 14:34                 ` Vlastimil Babka
@ 2016-03-08 14:48                   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 14:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML

On Tue 08-03-16 15:34:37, Vlastimil Babka wrote:
> On 03/08/2016 02:42 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and there is no guarantee
> > further reclaim/compaction attempts would help but at least make sure
> > that the compaction was active before we go OOM and keep retrying even
> > if should_reclaim_retry tells us to oom if the last compaction round
> > was either inactive (deferred, skipped or bailed out early due to
> > contention) or it told us to continue.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/compaction.h |  5 +++++
> >  mm/page_alloc.c            | 53 ++++++++++++++++++++++++++++++++--------------
> >  2 files changed, 42 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index b167801187e7..49e04326dcb8 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -14,6 +14,11 @@ enum compact_result {
> >  	/* compaction should continue to another pageblock */
> >  	COMPACT_CONTINUE,
> >  	/*
> > +	 * whoever is calling compaction should retry because it was either
> > +	 * not active or it tells us there is more work to be done.
> > +	 */
> > +	COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
> 
> Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
> COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
> overloading of CONTINUE only applies to compaction_suitable(). But the
> value that should_compact_retry() is testing comes only from
> try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?

Well, the idea was that I wanted to cover all the _possible_ cases where
compaction might want to tell us "please try again even when the last
round wasn't really successful". COMPACT_CONTINUE might not be returned
right now but we may come up with that in the future. It sounds like
sensible feedback to me. But maybe there would be a better name for such
feedback. I confess this is a bit of an oom-rework-centric name...

Also I find it better to hide details behind a more generic name.

I am open to suggestions here, of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-08 14:48                   ` Michal Hocko
@ 2016-03-08 15:03                     ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-08 15:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML

On 03/08/2016 03:48 PM, Michal Hocko wrote:
> On Tue 08-03-16 15:34:37, Vlastimil Babka wrote:
>>> --- a/include/linux/compaction.h
>>> +++ b/include/linux/compaction.h
>>> @@ -14,6 +14,11 @@ enum compact_result {
>>>  	/* compaction should continue to another pageblock */
>>>  	COMPACT_CONTINUE,
>>>  	/*
>>> +	 * whoever is calling compaction should retry because it was either
>>> +	 * not active or it tells us there is more work to be done.
>>> +	 */
>>> +	COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
>>
>> Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
>> COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
>> overloading of CONTINUE only applies to compaction_suitable(). But the
>> value that should_compact_retry() is testing comes only from
>> try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?
> 
> Well, the idea was that I wanted to cover all the _possible_ cases where
> compaction might want to tell us "please try again even when the last
> round wasn't really successful". COMPACT_CONTINUE might not be returned
> right now but we may come up with that in the future. It sounds like
> sensible feedback to me. But maybe there would be a better name for such
> feedback. I confess this is a bit of an oom-rework-centric name...

Hmm, I see. But it doesn't really tell us to please try again. That
interpretation is indeed oom-specific. What it's actually telling us is
either a) reclaim and then try again (COMPACT_SKIPPED), or b) try again
just to overcome the deferred state (COMPACT_DEFERRED). COMPACT_CONTINUE
says "go ahead", but only from compaction_suitable().
So the attempt at a generic name doesn't really work here, I'm afraid :/
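
Purely as an illustration of the alternative, a variant of the check that
names the two cases explicitly instead of overloading CONTINUE (untested
sketch, signature taken from the patch earlier in this thread):

	static inline bool
	should_compact_retry(unsigned int order, enum compact_result compact_result,
			     int contended_compaction)
	{
		if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
			return false;

		/* a) compaction was skipped: reclaim and then try again */
		if (compact_result == COMPACT_SKIPPED)
			return true;
		/* b) compaction was deferred: retry to overcome the deferred state */
		if (compact_result == COMPACT_DEFERRED)
			return true;
		/* compaction bailed out early because of lock contention */
		if (contended_compaction > COMPACT_CONTENDED_NONE)
			return true;

		return false;
	}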

> Also I find it better to hide details behind a more generic name.
> 
> I am open to suggestions here, of course.
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-07 16:08           ` Michal Hocko
@ 2016-03-08 15:19             ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-08 15:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> Andrew,
>> could you queue this one as well, please? This is more a band aid than a
>> real solution which I will be working on as soon as I am able to
>> reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey, does it help

I'm still not happy. Just ensuring one compaction run doesn't mean we
are doing our best. What is the purpose of the OOM rework? From my
understanding, you'd like to trigger the OOM kill deterministically and
*not prematurely*. This makes sense.

But what you did in the case of high-order allocations is completely
different from that original purpose. It may be deterministic but
*completely premature*. There is no way to prevent a premature OOM kill.
So, I want to ask one more time: why is an OOM kill better than retrying
reclaim when there are reclaimable pages? What is the determinism for?
Does it ensure something more?

Please see Hugh's latest vmstat. There are plenty of anon pages when the
OOM kill happens, and the system may have enough swap space. Even if
compaction runs and fails, why do we need to kill something in this case?
The OOM kill should be a last resort.

Please see Hugh's previous report and OOM dump.

[  796.540791] Mem-Info:
[  796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
 active_file:5107 inactive_file:1664 isolated_file:57
 unevictable:3067 dirty:4 writeback:75 unstable:0
 slab_reclaimable:13907 slab_unreclaimable:23236
 mapped:8889 shmem:3171 pagetables:2176 bounce:0
 free:1637 free_pcp:54 free_cma:0
[  796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
high:5948kB active_anon:588776kB inactive_anon:188816kB
active_file:20432kB inactive_file:6928kB unevictable:12268kB
isolated(anon):128kB isolated(file):8kB present:1046128kB
managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  796.685815] lowmem_reserve[]: 0 0 0
[  796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 8820kB
[  796.729696] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB

See [  796.557378] and [  796.630465].
In this 100 ms interval the number of free pages increases a lot and
there are enough high-order pages. The OOM kill happens later, so those
free pages would have come from reclaim. This shows that your previous
implementation, which uses a static retry number, causes premature OOM.
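
(For reference, the buddy breakdown quoted above sums to 969*4kB + 184*8kB +
167*16kB + 19*32kB + 3*64kB = 3876 + 1472 + 2672 + 608 + 192 = 8820kB, and
the 16kB/32kB/64kB entries are order-2, order-3 and order-4 blocks, so pages
of the order a fork needs were available at that snapshot.)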

This attempt, using the compaction result, doesn't look different to me.
It would also cause premature OOM kills.

I don't insist on endless retrying. I just want a more scientific
criterion that prevents premature OOM kills. I'm really tired of saying
the same thing again and again. Am I missing something? Is this a
situation where I totally misunderstand something? Please let me know.

Note: your current implementation doesn't consider which zone is compacted.
If the DMA zone, which easily fails to form a high-order page, is the one
that gets compacted, your implementation will not retry. That also doesn't
look like our best.

Thanks.

> for your load? I have tested it with Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
>
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.
> ---
> From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 7 Mar 2016 15:30:37 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if the last compaction round
> was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
>
> Additionally define COMPACT_NONE which reflects cases where the
> compaction is completely disabled.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h |  2 ++
>  mm/page_alloc.c            | 41 ++++++++++++++++++++++++-----------------
>  2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..a4cec4a03f7d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,8 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>
> +/* compaction disabled */
> +#define COMPACT_NONE           -1
>  /* Return values for compact_zone() and try_to_compact_pages() */
>  /* compaction didn't start as it was deferred due to past failures */
>  #define COMPACT_DEFERRED       0
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f89e3cbfdf90 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,22 @@ static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>                 int alloc_flags, const struct alloc_context *ac,
>                 enum migrate_mode mode, int *contended_compaction,
> -               bool *deferred_compaction)
> +               unsigned long *compact_result)
>  {
> -       unsigned long compact_result;
>         struct page *page;
>
> -       if (!order)
> +       if (!order) {
> +               *compact_result = COMPACT_NONE;
>                 return NULL;
> +       }
>
>         current->flags |= PF_MEMALLOC;
> -       compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +       *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>                                                 mode, contended_compaction);
>         current->flags &= ~PF_MEMALLOC;
>
> -       switch (compact_result) {
> -       case COMPACT_DEFERRED:
> -               *deferred_compaction = true;
> -               /* fall-through */
> -       case COMPACT_SKIPPED:
> +       if (*compact_result <= COMPACT_SKIPPED)
>                 return NULL;
> -       default:
> -               break;
> -       }
>
>         /*
>          * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2875,8 +2869,9 @@ static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>                 int alloc_flags, const struct alloc_context *ac,
>                 enum migrate_mode mode, int *contended_compaction,
> -               bool *deferred_compaction)
> +               unsigned long *compact_result)
>  {
> +       *compact_result = COMPACT_NONE;
>         return NULL;
>  }
>  #endif /* CONFIG_COMPACTION */
> @@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>         int alloc_flags;
>         unsigned long did_some_progress;
>         enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -       bool deferred_compaction = false;
> +       unsigned long compact_result;
>         int contended_compaction = COMPACT_CONTENDED_NONE;
>         int no_progress_loops = 0;
>
> @@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>         page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>                                         migration_mode,
>                                         &contended_compaction,
> -                                       &deferred_compaction);
> +                                       &compact_result);
>         if (page)
>                 goto got_pg;
>
> @@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                  * to heavily disrupt the system, so we fail the allocation
>                  * instead of entering direct reclaim.
>                  */
> -               if (deferred_compaction)
> +               if (compact_result == COMPACT_DEFERRED)
>                         goto nopage;
>
>                 /*
> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                                  did_some_progress > 0, no_progress_loops))
>                 goto retry;
>
> +       /*
> +        * !costly allocations are really important and we have to make sure
> +        * the compaction wasn't deferred or didn't bail out early due to locks
> +        * contention before we go OOM.
> +        */
> +       if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +               if (compact_result <= COMPACT_CONTINUE)
> +                       goto retry;
> +               if (contended_compaction > COMPACT_CONTENDED_NONE)
> +                       goto retry;
> +       }
> +
>         /* Reclaim has failed us, start killing things */
>         page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>         if (page)
> @@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>         page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>                                             ac, migration_mode,
>                                             &contended_compaction,
> -                                           &deferred_compaction);
> +                                           &compact_result);
>         if (page)
>                 goto got_pg;
>  nopage:
> --
> 2.7.0
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08 15:19             ` Joonsoo Kim
@ 2016-03-08 16:05               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-08 16:05 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
> 2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> >> Andrew,
> >> could you queue this one as well, please? This is more a band aid than a
> >> real solution which I will be working on as soon as I am able to
> >> reproduce the issue but the patch should help to some degree at least.
> >
> > Joonsoo wasn't very happy about this approach so let me try a different
> > way. What do you think about the following? Hugh, Sergey, does it help
> 
> I'm still not happy. Just ensuring one compaction run doesn't mean we
> are doing our best.

OK, let me think about it some more.

> What is the purpose of the OOM rework? From my understanding, you'd like
> to trigger the OOM kill deterministically and *not prematurely*.
> This makes sense.

Well, this is a bit awkward because we do not have any proper definition
of what prematurely actually means. We do not know whether something
changes and decides to free some memory right after we made the decision.
We also do not know whether reclaiming some more memory would help,
because we might be thrashing over the few remaining pages, so there
would still be some progress, albeit a small one. The system would be
basically unusable and the OOM killer would be a large relief. What I
want to achieve is a clear definition of _when_ we fire, and to not fire
so _often_ as to be impractical. There are loads where the new
implementation behaved slightly better (see the cover letter for my
tests) and there will surely be some where it is worse. I want this to
be reasonably good. I am not claiming we are there yet, and the
interaction with compaction seems like it needs some work, no question
about that.

> But what you did in the case of high-order allocations is completely
> different from that original purpose. It may be deterministic but
> *completely premature*. There is no way to prevent a premature OOM kill.
> So, I want to ask one more time: why is an OOM kill better than retrying
> reclaim when there are reclaimable pages? What is the determinism for?
> Does it ensure something more?

Yes. If we keep reclaiming we can soon start thrashing or over-reclaim
too much, which would hurt more processes. If you invoke the OOM killer
instead then chances are that you will release a lot of memory at once,
which would help reconcile the memory pressure as well as free some page
blocks which couldn't have been compacted before, without affecting
potentially many processes. The effect would be reduced to a single
process. If we had proper thrashing detection feedback we could of
course make much more clever decisions.

But back to the !costly OOMs. Once your system is fragmented so heavily
that there are no free blocks that would satisfy a !costly request then
something has gone terribly wrong and we should fix it. To me it sounds
like we do not care about those requests early enough and only start
caring after we hit the wall. Maybe kcompactd can help us in this
regard.

> Please see Hugh's latest vmstat. There are plenty of anon pages when
> OOM kill happens and it may have enough swap space. Even if
> compaction runs and fails, why do we need to kill something
> in this case? OOM kill should be a last resort.

Well, this would be the case even if we were thrashing over swap,
refaulting the swapped out memory all over again...

> Please see Hugh's previous report and OOM dump.
> 
> [  796.540791] Mem-Info:
> [  796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
>  active_file:5107 inactive_file:1664 isolated_file:57
>  unevictable:3067 dirty:4 writeback:75 unstable:0
>  slab_reclaimable:13907 slab_unreclaimable:23236
>  mapped:8889 shmem:3171 pagetables:2176 bounce:0
>  free:1637 free_pcp:54 free_cma:0
> [  796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
> high:5948kB active_anon:588776kB inactive_anon:188816kB
> active_file:20432kB inactive_file:6928kB unevictable:12268kB
> isolated(anon):128kB isolated(file):8kB present:1046128kB
> managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
> mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
> slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
> unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [  796.685815] lowmem_reserve[]: 0 0 0
> [  796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
> 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 8820kB
> [  796.729696] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
> 
> See [  796.557378] and [  796.630465].
> In this 100 ms time interval, freepage increase a lot and
> there are enough high order pages. OOM kill happen later
> so freepage would come from reclaim. This shows
> that your previous implementation which uses static retry number
> causes premature OOM.

Or, more likely, one of the gcc instances simply exited and freed up
some memory. As I've tried to explain in the other email, we cannot
prevent those races. We simply do not have a crystal ball. All we know
is that at the time we last checked the watermarks there were simply no
eligible high order pages available.

> This attempt using compaction result looks not different to me.
> It would also cause premature OOM kill.
> 
> I don't insist endless retry. I just want a more scientific criteria
> that prevents premature OOM kill.

That is exactly what I am trying to achieve here. Right now we are
relying on the zone_reclaimable heuristic. It relies on some pages being
freed (and NR_PAGES_SCANNED being reset) while we are scanning. With a
stream of order-0 pages this is basically unbounded. What I am trying to
achieve here is to base the decision on feedback. The first attempt was
to use the reclaim feedback. This turned out to be insufficient for
higher orders because compaction can defer and skip if we are close to
the watermarks, which was really surprising to me. So now I've tried to
make sure that we do not hit this path. I agree we can do better but
there will always be a moment to simply give up. Whatever that moment
is, we can still find loads which could theoretically go on for a little
longer and survive.
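
For reference, the heuristic being replaced boils down to roughly the
following (a simplified sketch of the current zone_reclaimable() logic,
not the exact mm/vmscan.c code):

	/*
	 * Keep retrying as long as the zone has not been scanned more
	 * than six times the amount of reclaimable memory since pages
	 * were last freed from it (freeing resets NR_PAGES_SCANNED),
	 * which is what makes the loop effectively unbounded under a
	 * slow trickle of order-0 frees.
	 */
	static bool zone_reclaimable(struct zone *zone)
	{
		return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
			zone_reclaimable_pages(zone) * 6;
	}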

> I'm really tire to say same thing again and again.
> Am I missing something? This is the situation that I totally misunderstand
> something? Please let me know.
> 
> Note: your current implementation doesn't consider which zone is compacted.
> If DMA zone which easily fail to make high order page is compacted,
> your implementation will not do retry. It also looks not our best.

Why would we even consider the DMA zone when we cannot ever allocate
from that zone?
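
To be concrete about what "eligible" means here: the retry logic only
walks the zones that the request may actually use, i.e. those within
the allocation context's high_zoneidx. A minimal illustration (the
standard zonelist walk, not a quote from the patch):

	struct zoneref *z;
	struct zone *zone;

	/* zones above ac->high_zoneidx are never visited at all */
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		/* check watermarks / compaction feedback here */
	}
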
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08 16:05               ` Michal Hocko
@ 2016-03-08 17:03                 ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-08 17:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

2016-03-09 1:05 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
>> 2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> >> Andrew,
>> >> could you queue this one as well, please? This is more a band aid than a
>> >> real solution which I will be working on as soon as I am able to
>> >> reproduce the issue but the patch should help to some degree at least.
>> >
>> > Joonsoo wasn't very happy about this approach so let me try a different
>> > way. What do you think about the following? Hugh, Sergey does it help
>>
>> I'm still not happy. Just ensuring one compaction run doesn't mean our
>> best.
>
> OK, let me think about it some more.
>
>> What's your purpose of OOM rework? From my understanding,
>> you'd like to trigger OOM kill deterministic and *not prematurely*.
>> This makes sense.
>
> Well this is a bit awkward because we do not have any proper definition
> of what prematurely actually means. We do not know whether something

If we don't have a proper definition for it, please define it first. We
need to improve the situation toward a clear goal. A certain number of
retries with no basis behind it doesn't make any sense.

> changes and decides to free some memory right after we made the decision.
> We also do not know whether reclaiming some more memory would help
> because we might be trashing over few remaining pages so there would be
> still some progress, albeit small, progress. The system would be
> basically unusable and the OOM killer would be a large relief. What I
> want to achieve is to have a clear definition of _when_ we fire and do

If we have no clear definition of premature, what's the meaning of
a clear definition of _when_? It would just mean a random time.

> not fire _often_ to be impractical. There are loads where the new
> implementation behaved slightly better (see the cover for my tests) and
> there surely be some where this will be worse. I want this to be
> reasonably good. I am not claiming we are there yet and the interaction
> with the compaction seems like it needs some work, no question about
> that.
>
>> But, what you did in case of high order allocation is completely different
>> with original purpose. It may be deterministic but *completely premature*.
>> There is no way to prevent premature OOM kill. So, I want to ask one more
>> time. Why OOM kill is better than retry reclaiming when there is reclaimable
>> page? Deterministic is for what? It ensures something more?
>
> yes, If we keep reclaiming we can soon start trashing or over reclaim
> too much which would hurt more processes. If you invoke the OOM killer
> instead then chances are that you will release a lot of memory at once
> and that would help to reconcile the memory pressure as well as free
> some page blocks which couldn't have been compacted before and not
> affect potentially many processes. The effect would be reduced to a
> single process. If we had a proper trashing detection feedback we could
> do much more clever decisions of course.

It looks like you did it for performance reasons. You'd better think
again about the effect of an OOM kill. We don't have enough knowledge
about the user space program architecture, and killing one important
process could make the whole system unusable. Moreover, an OOM kill
could cause important data loss, so it should be avoided as much as
possible. A performance reason cannot justify an OOM kill.

>
> But back to the !costly OOMs. Once your system is fragmented so heavily
> that there are no free blocks that would satisfy !costly request then
> something has gone terribly wrong and we should fix it. To me it sounds
> like we do not care about those requests early enough and only start
> carying after we hit the wall. Maybe kcompactd can help us in this
> regards.

Yes, but that's another issue. In any situation, a !costly OOM should
not happen prematurely.

>> Please see Hugh's latest vmstat. There are plenty of anon pages when
>> OOM kill happens and it may have enough swap space. Even if
>> compaction runs and fails, why do we need to kill something
>> in this case? OOM kill should be a last resort.
>
> Well this would be the case even if we were trashing over swap.
> Refaulting the swapped out memory all over again...

If thrashing is the main obstacle to deciding the proper OOM point,
we need to invent a way to handle thrashing or a reasonable metric
which isn't affected by thrashing.

>> Please see Hugh's previous report and OOM dump.
>>
>> [  796.540791] Mem-Info:
>> [  796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
>>  active_file:5107 inactive_file:1664 isolated_file:57
>>  unevictable:3067 dirty:4 writeback:75 unstable:0
>>  slab_reclaimable:13907 slab_unreclaimable:23236
>>  mapped:8889 shmem:3171 pagetables:2176 bounce:0
>>  free:1637 free_pcp:54 free_cma:0
>> [  796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
>> high:5948kB active_anon:588776kB inactive_anon:188816kB
>> active_file:20432kB inactive_file:6928kB unevictable:12268kB
>> isolated(anon):128kB isolated(file):8kB present:1046128kB
>> managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
>> mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
>> slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
>> unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
>> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> [  796.685815] lowmem_reserve[]: 0 0 0
>> [  796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
>> 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 8820kB
>> [  796.729696] Node 0 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=2048kB
>>
>> See [  796.557378] and [  796.630465].
>> In this 100 ms time interval, freepage increase a lot and
>> there are enough high order pages. OOM kill happen later
>> so freepage would come from reclaim. This shows
>> that your previous implementation which uses static retry number
>> causes premature OOM.
>
> Or simply one of the gcc simply exitted and freed up a memory which is

It doesn't matter where the free memory comes from. If free memory
increases due to gcc exiting, it implies that we can reclaim some memory
from it. There is no reason to trigger the OOM killer in this case.

> more likely. As I've tried to explain in other email, we cannot prevent
> from those races. We simply do not have a crystal ball. All we know is
> that at the time we checked the watermarks the last time there were
> simply no eligible high order pages available.
>
>> This attempt using compaction result looks not different to me.
>> It would also cause premature OOM kill.
>>
>> I don't insist endless retry. I just want a more scientific criteria
>> that prevents premature OOM kill.
>
> That is exactly what I try to achive here. Right now we are relying on
> zone_reclaimable heuristic. That relies that some pages are freed (and
> reset NR_PAGES_SCANNED) while we are scanning. With a stream of order-0
> pages this is basically unbounded. What I am trying to achieve here
> is to base the decision on the feedback. The first attempt was to use
> the reclaim feedback. This turned out to be not sufficient for higher
> orders because compaction can deffer and skip if we are close to
> watermarks which is really surprising to me. So now I've tried to make
> sure that we do not hit this path. I agree we can do better but there
> always will be a moment to simply give up. Whatever that moment will
> be we can still find loads which could theoretically go on for little
> more and survive.

The problem is that, to me, the current implementation looks like it
simply gives up. Maybe a precise definition of premature would be
helpful here. Without it, this is just subjective.

>
>> I'm really tire to say same thing again and again.
>> Am I missing something? This is the situation that I totally misunderstand
>> something? Please let me know.
>>
>> Note: your current implementation doesn't consider which zone is compacted.
>> If DMA zone which easily fail to make high order page is compacted,
>> your implementation will not do retry. It also looks not our best.
>
> Why are we even consider DMA zone when we cannot ever allocate from this
> zone?

This is just an example. It could be ZONE_NORMAL or something else. If
we don't try to compact all zones, is it a reasonable point to trigger
OOM?

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-09  3:55                 ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-03-09  3:55 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Hugh Dickins', 'Sergey Senozhatsky',
	'Vlastimil Babka', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', 'Joonsoo Kim',
	linux-mm, 'LKML', 'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> compaction code is doing weird dances between
> COMPACT_FOO -> int -> unsigned long
> 
> but there doesn't seem to be any reason for that. All functions which
> return/use one of those constants are not expecting any other value
> so it really makes sense to define an enum for them and make it clear
> that no other values are expected.
> 
> This is a pure cleanup and shouldn't introduce any functional changes.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-09  3:57                 ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-03-09  3:57 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Hugh Dickins', 'Sergey Senozhatsky',
	'Vlastimil Babka', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', 'Joonsoo Kim',
	linux-mm, 'LKML', 'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> the compiler is complaining after "mm, compaction: change COMPACT_
> constants into enum"
> 
> mm/compaction.c: In function ‘compact_zone’:
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
>   switch (ret) {
>   ^
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]
> 
> compaction_suitable is allowed to return only COMPACT_PARTIAL,
> COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
> impossible. Put a VM_BUG_ON to catch an impossible return value.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-08 17:03                 ` Joonsoo Kim
@ 2016-03-09 10:41                   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-09 10:41 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
> 2016-03-09 1:05 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
> >> 2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> >> >> Andrew,
> >> >> could you queue this one as well, please? This is more a band aid than a
> >> >> real solution which I will be working on as soon as I am able to
> >> >> reproduce the issue but the patch should help to some degree at least.
> >> >
> >> > Joonsoo wasn't very happy about this approach so let me try a different
> >> > way. What do you think about the following? Hugh, Sergey does it help
> >>
> >> I'm still not happy. Just ensuring one compaction run doesn't mean our
> >> best.
> >
> > OK, let me think about it some more.
> >
> >> What's your purpose of OOM rework? From my understanding,
> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
> >> This makes sense.
> >
> > Well this is a bit awkward because we do not have any proper definition
> > of what prematurely actually means. We do not know whether something
> 
> If we don't have proper definition to it, please define it first.

OK, I should have probably said that _there_is_no_proper_definition_...
This will always be about heuristics, as the clear cut can be pretty
subjective and what some load might see as unreasonable retries others
might see as insufficient. Our ultimate goal is to behave reasonably for
reasonable workloads. I am somewhat skeptical about formulating this
into a single equation...

> We need to improve the situation toward the clear goal. Just certain
> number of retry which has no base doesn't make any sense.

A certain number of retries is what we already have right now. And that
certain number is hard to define, even though it looks as simple as

NR_PAGES_SCANNED < 6*zone_reclaimable_pages && no_reclaimable_pages

because this is highly fragile when there are only a few pages freed
regularly, but not enough to get us out of the loop... I am trying
to formulate those retries somewhat more deterministically, considering
the feedback _and_ an estimate of the feasibility of future
reclaim/compaction. I admit that my attempts at the compaction part have
been far from ideal so far, partially because I missed many aspects of
how it works.
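
A minimal sketch of the feedback-based direction described above
(illustrative only; MAX_RECLAIM_RETRIES and the exact bookkeeping are
assumptions here, not the posted should_reclaim_retry() code):

	/*
	 * Count consecutive reclaim rounds which made no progress and
	 * give up after a fixed, well defined number of them, instead
	 * of relying on NR_PAGES_SCANNED being reset by random frees.
	 */
	if (did_some_progress)
		no_progress_loops = 0;
	else
		no_progress_loops++;

	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		goto oom;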

[...]
> > not fire _often_ to be impractical. There are loads where the new
> > implementation behaved slightly better (see the cover for my tests) and
> > there surely be some where this will be worse. I want this to be
> > reasonably good. I am not claiming we are there yet and the interaction
> > with the compaction seems like it needs some work, no question about
> > that.
> >
> >> But, what you did in case of high order allocation is completely different
> >> with original purpose. It may be deterministic but *completely premature*.
> >> There is no way to prevent premature OOM kill. So, I want to ask one more
> >> time. Why OOM kill is better than retry reclaiming when there is reclaimable
> >> page? Deterministic is for what? It ensures something more?
> >
> > yes, If we keep reclaiming we can soon start trashing or over reclaim
> > too much which would hurt more processes. If you invoke the OOM killer
> > instead then chances are that you will release a lot of memory at once
> > and that would help to reconcile the memory pressure as well as free
> > some page blocks which couldn't have been compacted before and not
> > affect potentially many processes. The effect would be reduced to a
> > single process. If we had a proper trashing detection feedback we could
> > do much more clever decisions of course.
> 
> It looks like you did it for performance reason. You'd better think again about
> effect of OOM kill. We don't have enough knowledge about user space program
> architecture and killing one important process could lead to whole
> system unusable. Moreover, OOM kill could cause important data loss so
> should be avoided as much as possible. Performance reason cannot
> justify OOM kill.

No, I am not talking about performance. I am talking about the
healthiness of the system as a whole.

> > But back to the !costly OOMs. Once your system is fragmented so heavily
> > that there are no free blocks that would satisfy !costly request then
> > something has gone terribly wrong and we should fix it. To me it sounds
> > like we do not care about those requests early enough and only start
> > carying after we hit the wall. Maybe kcompactd can help us in this
> > regards.
> 
> Yes, but, it's another issue. In any situation, !costly OOM should not happen
> prematurely.

I fully agree and I guess we also agree on the assumption that we
shouldn't retry endlessly. So let's focus on what the OOM convergence
criteria should look like. I have another proposal which I will send as
a reply to the previous one.

> >> Please see Hugh's latest vmstat. There are plenty of anon pages when
> >> OOM kill happens and it may have enough swap space. Even if
> >> compaction runs and fails, why do we need to kill something
> >> in this case? OOM kill should be a last resort.
> >
> > Well this would be the case even if we were trashing over swap.
> > Refaulting the swapped out memory all over again...
> 
> If thrashing is a main obstacle to decide proper OOM point,
> we need to invent a way to handle thrashing or invent reasonable metric
> which isn't affected by thrashing.

Great, you are welcome to come up with one. But more seriously, isn't
limiting the retries a way to reduce the chances of thrashing? It might
not be the ideal one because it doesn't work 100% of the time, but can
we really come up with one which works that reliably? This is a hard
problem which we haven't been able to solve for ages.

[...]

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-08 13:42               ` Michal Hocko
@ 2016-03-09 11:11                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-09 11:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

Joonsoo has pointed out that this attempt is still not sufficient
because we might have invoked only a single compaction round, which
might not be enough. I fully agree with that. Here is my take on
that. It is again based on a retry counter.

I was also playing with an idea of doing something similar to the
reclaim retry logic:
	if (order) {
		if (compaction_made_progress(compact_result))
			no_compact_progress = 0;
		else if (compaction_failed(compact_result))
			no_compact_progress++;
	}
but it is the compaction_failed() part which is not really
straightforward to define. Is COMPACT_NO_SUITABLE_PAGE
resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
compaction_suitable however hide this from compaction users, so it
seems like we can never see it.
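
If those states were ever exported to the caller, one possible
(untested) shape of the missing helper would be the sketch below;
whether they can actually be observed there is exactly the open
question:

	/*
	 * Hypothetical helper: treat "compaction ran but could not find
	 * or produce a suitable page" as a failure, as opposed to being
	 * deferred/skipped or making (partial) progress.
	 */
	static inline bool compaction_failed(enum compact_result result)
	{
		return result == COMPACT_NO_SUITABLE_PAGE ||
		       result == COMPACT_NOT_SUITABLE_ZONE;
	}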

Maybe we can update the feedback mechanism from the compaction side,
but a retry count seems reasonably easy to understand and pragmatic. If
we cannot form an order page after we have tried N times then it really
doesn't make much sense to continue and we are OOM for this order. I am
holding my breath to hear from Hugh on this, though. In case it doesn't
help, I would be really interested whether changing MAX_COMPACT_RETRIES
makes any difference.

I haven't preserved the Tested-by from Sergey, to be on the safe side,
even though strictly speaking this should be less prone to high order
OOMs because we clearly retry more times.
---
>From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 9 Mar 2016 10:57:42 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if
	- the last compaction round was either inactive (deferred,
	  skipped or bailed out early due to contention) or
	- we haven't completed at least MAX_COMPACT_RETRIES successful
	  (either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
	  rounds.

The first rule ensures that the very last attempt at compaction was not
ignored while the second guarantees that the compaction has done some
work. Multiple retries might be needed to prevent occasional piggy
backing by other contexts which steal the compacted pages while the
current context is retrying the allocation.

If the given number of successful retries is not sufficient for
reasonable workloads we should focus on the collected compaction
tracepoint data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 10 +++++++
 mm/page_alloc.c            | 68 +++++++++++++++++++++++++++++++++++-----------
 2 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b167801187e7..7d028ccf440a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
 				bool alloc_success);
 extern bool compaction_restarting(struct zone *zone, int order);
 
+static inline bool compaction_made_progress(enum compact_result result)
+{
+	return (result > COMPACT_SKIPPED &&
+				result < COMPACT_NO_SUITABLE_PAGE);
+}
+
 #else
 static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
@@ -93,6 +99,10 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return true;
 }
 
+static inline bool compaction_made_progress(enum compact_result result)
+{
+	return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4acc0aa1aee0..5f1fc3793836 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,34 +2813,33 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+
+/*
+ * Maximum number of compaction retries with a progress before the OOM
+ * killer is considered the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
 #ifdef CONFIG_COMPACTION
 /* Try memory compaction for high-order allocations before reclaim */
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum compact_result *compact_result)
 {
-	enum compact_result compact_result;
 	struct page *page;
 
 	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
-	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
-	switch (compact_result) {
-	case COMPACT_DEFERRED:
-		*deferred_compaction = true;
-		/* fall-through */
-	case COMPACT_SKIPPED:
+	if (*compact_result <= COMPACT_SKIPPED)
 		return NULL;
-	default:
-		break;
-	}
 
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2870,15 +2869,44 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     int contended_compaction, int compaction_retries)
+{
+	/*
+	 * !costly allocations are really important and we have to make sure
+	 * the compaction wasn't deferred or didn't bail out early due to lock
+	 * contention before we go OOM. Still cap the reclaim retry loops with
+	 * progress to prevent looping forever and potential thrashing.
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		if (compact_result <= COMPACT_SKIPPED)
+			return true;
+		if (contended_compaction > COMPACT_CONTENDED_NONE)
+			return true;
+		if (compaction_retries <= MAX_COMPACT_RETRIES)
+			return true;
+	}
+
+	return false;
+}
 #else
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		int alloc_flags, const struct alloc_context *ac,
 		enum migrate_mode mode, int *contended_compaction,
-		bool *deferred_compaction)
+		enum compact_result *compact_result)
 {
 	return NULL;
 }
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+		     int contended_compaction, int compaction_retries)
+{
+	return false;
+}
 #endif /* CONFIG_COMPACTION */
 
 /* Perform direct synchronous page reclaim */
@@ -3118,7 +3146,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int alloc_flags;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
-	bool deferred_compaction = false;
+	enum compact_result compact_result;
+	int compaction_retries = 0;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
 	int no_progress_loops = 0;
 
@@ -3227,10 +3256,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
 					&contended_compaction,
-					&deferred_compaction);
+					&compact_result);
 	if (page)
 		goto got_pg;
 
+	if (order && compaction_made_progress(compact_result))
+		compaction_retries++;
+
 	/* Checks for THP-specific high-order allocations */
 	if (is_thp_gfp_mask(gfp_mask)) {
 		/*
@@ -3240,7 +3272,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * to heavily disrupt the system, so we fail the allocation
 		 * instead of entering direct reclaim.
 		 */
-		if (deferred_compaction)
+		if (compact_result == COMPACT_DEFERRED)
 			goto nopage;
 
 		/*
@@ -3294,6 +3326,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
+	if (should_compact_retry(order, compact_result, contended_compaction,
+				 compaction_retries))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3314,7 +3350,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
 					    ac, migration_mode,
 					    &contended_compaction,
-					    &deferred_compaction);
+					    &compact_result);
 	if (page)
 		goto got_pg;
 nopage:
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
@ 2016-03-09 11:11                 ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-09 11:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

Joonsoo has pointed out that this attempt is still not sufficient
because we might have invoked only a single compaction round which
might not be enough. I fully agree with that. Here is my take on
that. It is again based on counting the retry loops.

I was also playing with an idea of doing something similar to the
reclaim retry logic:
	if (order) {
		if (compaction_made_progress(compact_result))
			no_compact_progress = 0;
		else if (compaction_failed(compact_result))
			no_compact_progress++;
	}
but it is the compaction_failed() part which is not really
straightforward to define. Is COMPACT_NO_SUITABLE_PAGE
resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
compaction_suitable however hide these results from compaction users
so it seems like we can never see them.
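
Just to illustrate what I mean, a minimal sketch of such a helper could
look like the following, assuming the COMPACT_NO_SUITABLE_PAGE resp.
COMPACT_NOT_SUITABLE_ZONE values were actually propagated to the
allocator (they are not today, so this is purely illustrative):

	static inline bool compaction_failed(enum compact_result result)
	{
		/* illustrative only: these values are currently hidden
		 * behind compact_finished/compaction_suitable */
		return result == COMPACT_NO_SUITABLE_PAGE ||
		       result == COMPACT_NOT_SUITABLE_ZONE;
	}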

Maybe we can update the feedback mechanism from the compaction, but a
retry count seems reasonably easy to understand and pragmatic. If we
cannot form a page of the requested order after we have tried N times
then it really doesn't make much sense to continue and we are OOM for
this order. I am holding my breath to hear from Hugh on this, though.
In case it doesn't help, I would be really interested whether changing
MAX_COMPACT_RETRIES makes any difference.

I haven't preserved Tested-by from Sergey to be on the safe side even
though strictly speaking this should be less prone to high order OOMs
because we clearly retry more times.
---

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-09 11:11                 ` Michal Hocko
@ 2016-03-09 14:07                   ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-09 14:07 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Hugh Dickins, Sergey Senozhatsky, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim, linux-mm, LKML

On 03/09/2016 12:11 PM, Michal Hocko wrote:
> Joonsoo has pointed out that this attempt is still not sufficient
> because we might have invoked only a single compaction round which
> might not be enough. I fully agree with that. Here is my take on
> that. It is again based on counting the retry loops.
> 
> I was also playing with an idea of doing something similar to the
> reclaim retry logic:
> 	if (order) {
> 		if (compaction_made_progress(compact_result))

Progress for compaction would probably mean counting successful
migrations. This would converge towards a definitive false (without
parallel activity) in the current implementation, but probably not for
the proposed redesigns where migration and free scanner initial
positions are not fixed.

> 			no_compact_progress = 0;
> 		else if (compaction_failed(compact_result))
> 			no_compact_progress++;
> 	}
> but it is the compaction_failed() part which is not really
> straightforward to define. Is COMPACT_NO_SUITABLE_PAGE
> resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> compaction_suitable however hide these results from compaction users
> so it seems like we can never see them.

Anything other than COMPACT_PARTIAL is "failed" :) But it doesn't itself
hint at whether retrying makes sense or not. Reclaim is simpler in this
sense...
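
As a sketch of that reading (not something to merge), it would boil
down to:

	/* "failed" in the simplest sense: no order page was formed */
	static inline bool compaction_failed(enum compact_result result)
	{
		return result != COMPACT_PARTIAL;
	}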

> Maybe we can update the feedback mechanism from the compaction, but a
> retry count seems reasonably easy to understand and pragmatic. If we
> cannot form a page of the requested order after we have tried N times
> then it really doesn't make much sense to continue and we are OOM for
> this order. I am holding my breath to hear from Hugh on this, though.
> In case it doesn't help, I would be really interested whether changing
> MAX_COMPACT_RETRIES makes any difference.
> 
> I haven't preserved Tested-by from Sergey to be on the safe side even
> though strictly speaking this should be less prone to high order OOMs
> because we clearly retry more times.
> ---
> From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 9 Mar 2016 10:57:42 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> 	- the last compaction round was either inactive (deferred,
> 	  skipped or bailed out early due to contention) or
> 	- we haven't completed at least MAX_COMPACT_RETRIES successful
> 	  (either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
> 	  rounds.
> 
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has
> done some work. Multiple retries might be needed to prevent occasional
> piggy backing of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
> 
> If the given number of successful retries is not sufficient for
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Yeah, this could work.
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/compaction.h | 10 +++++++
>  mm/page_alloc.c            | 68 +++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 62 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b167801187e7..7d028ccf440a 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>  
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	return (result > COMPACT_SKIPPED &&
> +				result < COMPACT_NO_SUITABLE_PAGE);
> +}
> +
>  #else
>  static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>  			unsigned int order, int alloc_flags,
> @@ -93,6 +99,10 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
>  
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4acc0aa1aee0..5f1fc3793836 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2813,34 +2813,33 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> +
> +/*
> + * Maximum number of compaction retries with progress before the OOM
> + * killer is considered the only way to move forward.
> + */
> +#define MAX_COMPACT_RETRIES 16
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
> -	enum compact_result compact_result;
>  	struct page *page;
>  
>  	if (!order)
>  		return NULL;
>  
>  	current->flags |= PF_MEMALLOC;
> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  						mode, contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
>  
> -	switch (compact_result) {
> -	case COMPACT_DEFERRED:
> -		*deferred_compaction = true;
> -		/* fall-through */
> -	case COMPACT_SKIPPED:
> +	if (*compact_result <= COMPACT_SKIPPED)
>  		return NULL;
> -	default:
> -		break;
> -	}
>  
>  	/*
>  	 * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2870,15 +2869,44 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction, int compaction_retries)
> +{
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to lock
> +	 * contention before we go OOM. Still cap the reclaim retry loops with
> +	 * progress to prevent looping forever and potential thrashing.
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compact_result <= COMPACT_SKIPPED)
> +			return true;
> +		if (contended_compaction > COMPACT_CONTENDED_NONE)
> +			return true;
> +		if (compaction_retries <= MAX_COMPACT_RETRIES)
> +			return true;
> +	}
> +
> +	return false;
> +}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction, int compaction_retries)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
> @@ -3118,7 +3146,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int alloc_flags;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -	bool deferred_compaction = false;
> +	enum compact_result compact_result;
> +	int compaction_retries = 0;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
>  	int no_progress_loops = 0;
>  
> @@ -3227,10 +3256,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
>  					&contended_compaction,
> -					&deferred_compaction);
> +					&compact_result);
>  	if (page)
>  		goto got_pg;
>  
> +	if (order && compaction_made_progress(compact_result))
> +		compaction_retries++;
> +
>  	/* Checks for THP-specific high-order allocations */
>  	if (is_thp_gfp_mask(gfp_mask)) {
>  		/*
> @@ -3240,7 +3272,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * to heavily disrupt the system, so we fail the allocation
>  		 * instead of entering direct reclaim.
>  		 */
> -		if (deferred_compaction)
> +		if (compact_result == COMPACT_DEFERRED)
>  			goto nopage;
>  
>  		/*
> @@ -3294,6 +3326,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 did_some_progress > 0, no_progress_loops))
>  		goto retry;
>  
> +	if (should_compact_retry(order, compact_result, contended_compaction,
> +				 compaction_retries))
> +		goto retry;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3314,7 +3350,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &contended_compaction,
> -					    &deferred_compaction);
> +					    &compact_result);
>  	if (page)
>  		goto got_pg;
>  nopage:
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2016-03-11 10:45   ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-11 10:45 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

(Posting as a reply to this thread.)

I was trying to test the side effect of "oom, oom_reaper: disable oom_reaper for
oom_kill_allocating_task" compared to "oom: clear TIF_MEMDIE after oom_reaper
managed to unmap the address space" using a reproducer shown below.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/prctl.h>
#include <signal.h>

static char buffer[4096] = { };

static int file_io(void *unused)
{
	const int fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, buffer, sizeof(buffer)) > 0);
	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	if (chdir("/tmp"))
		return 1;
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			static cpu_set_t set = { { 1 } };
			const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sched_setaffinity(0, sizeof(set), &set);
			snprintf(buffer, sizeof(buffer), "file_io.%02u", i);
			prctl(PR_SET_NAME, (unsigned long) buffer, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(file_io, malloc(1024) + 1024, CLONE_VM, NULL);
			while (1)
				pause();
		}
	{ /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long i;
		unsigned long size = 0;
		prctl(PR_SET_NAME, (unsigned long) "memeater", 0, 0, 0);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		sleep(4);
		for (i = 0; i < size; i += 4096)
			buf[i] = '\0'; /* Will cause OOM due to overcommit */
	}
	kill(-1, SIGKILL);
	return * (char *) NULL; /* Not reached. */
}
---------- Reproducer end ----------
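
(For reference, the reproducer can be built with just
"gcc -O2 -o oom_repro oom_repro.c", assuming it is saved as oom_repro.c;
it should only be run inside a disposable VM since it deliberately
drives the machine out of memory.)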

The characteristic of this reproducer is that the OOM killer chooses the same mm
multiple times due to clone(!CLONE_SIGHAND && CLONE_VM), and the OOM reaper
happily skips reaping that mm because that mm_struct is marked MMF_OOM_KILLED or
because only the first victim's signal_struct is marked OOM_SCORE_ADJ_MIN, which
means that nobody can clear TIF_MEMDIE when a non-first victim cannot terminate.
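
(The relevant pattern from the reproducer above is simply

	/* share the mm but not the signal_struct, so the OOM killer can
	 * keep selecting victims which all map the same mm */
	clone(file_io, stack_top, CLONE_VM, NULL);

where stack_top stands for the malloc()ed stack used above.)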

But the problem I can trivially hit is that kswapd gets stuck on an unkillable lock
while all allocating tasks are waiting in congestion_wait(). This situation resembles
http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
but without looping at too_many_isolated() in shrink_inactive_list().
I don't know what is happening.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160311.txt.xz .
---------- console log start ----------
[   81.282661] memeater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[   81.297589] memeater cpuset=/ mems_allowed=0
[   81.303615] CPU: 2 PID: 1239 Comm: memeater Tainted: G        W       4.5.0-rc7-next-20160310 #103
(...snipped...)
[   81.456295] Out of memory: Kill process 1240 (file_io.00) score 999 or sacrifice child
[   81.459768] Killed process 1240 (file_io.00) total-vm:4308kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[   81.682547] ksmtuned invoked oom-killer: gfp_mask=0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO), order=0, oom_score_adj=0
[   81.703992] ksmtuned cpuset=/ mems_allowed=0
[   81.709402] CPU: 1 PID: 2330 Comm: ksmtuned Tainted: G        W       4.5.0-rc7-next-20160310 #103
(...snipped...)
[   81.928733] Out of memory: Kill process 1248 (file_io.00) score 1000 or sacrifice child
[   81.932194] Killed process 1248 (file_io.00) total-vm:4308kB, anon-rss:104kB, file-rss:1044kB, shmem-rss:0kB
(...snipped...)
[  136.837273] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[  136.858075] lowmem_reserve[]: 0 953 953 953
[  136.860609] Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34720kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39068kB kernel_stack:20512kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1648kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:964952 all_unreclaimable? yes
[  136.880330] lowmem_reserve[]: 0 0 0 0
[  136.883137] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[  136.890862] Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB
(...snipped...)
[  143.721805] kswapd0         D ffff880039ffb760     0    52      2 0x00000000
[  143.724711]  ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[  143.727782]  ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[  143.730815]  ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[  143.733839] Call Trace:
[  143.735190]  [<ffffffff81666600>] schedule+0x30/0x80
[  143.737387]  [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[  143.739964]  [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[  143.742944]  [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[  143.745315]  [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[  143.747737]  [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[  143.750071]  [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[  143.752534]  [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[  143.755230]  [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[  143.757959]  [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[  143.760382]  [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[  143.762785]  [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[  143.765347]  [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[  143.767801]  [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[  143.770084]  [<ffffffff81157661>] kswapd+0x441/0x830
[  143.772193]  [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  143.774941]  [<ffffffff8109181e>] kthread+0xee/0x110
[  143.777025]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  143.779276]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  144.479298] file_io.00      D ffff88003ac97cb8     0  1248      1 0x00100084
[  144.482410]  ffff88003ac97cb8 ffff88003b8760c0 ffff88003658c040 ffff88003ac98000
[  144.485513]  ffff88003a280ac8 0000000000000246 ffff88003658c040 00000000ffffffff
[  144.488618]  ffff88003ac97cd0 ffffffff81666600 ffff88003a280ac0 ffff88003ac97ce0
[  144.491661] Call Trace:
[  144.492921]  [<ffffffff81666600>] schedule+0x30/0x80
[  144.495066]  [<ffffffff81666909>] schedule_preempt_disabled+0x9/0x10
[  144.497582]  [<ffffffff816684bf>] mutex_lock_nested+0x14f/0x3a0
[  144.500060]  [<ffffffffa0237eef>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  144.503077]  [<ffffffff810bd130>] ? __lock_acquire+0x8c0/0x1f50
[  144.505494]  [<ffffffffa0237eef>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  144.508375]  [<ffffffff8111dfca>] ? __audit_syscall_entry+0xaa/0xf0
[  144.510996]  [<ffffffffa023810a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[  144.514521]  [<ffffffff811bf327>] __vfs_write+0xc7/0x100
[  144.517230]  [<ffffffff811bfedd>] vfs_write+0x9d/0x190
[  144.519407]  [<ffffffff811df5da>] ? __fget_light+0x6a/0x90
[  144.521772]  [<ffffffff811c0713>] SyS_write+0x53/0xd0
[  144.523909]  [<ffffffff8100364d>] do_syscall_64+0x5d/0x180
[  144.526145]  [<ffffffff8166b57f>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  145.684411] kworker/3:3     D ffff88000e987878     0  2329      2 0x00000080
[  145.684415] Workqueue: events_freezable_power_ disk_events_workfn
[  145.684416]  ffff88000e987878 ffff880037d76140 ffff88000e980100 ffff88000e988000
[  145.684417]  ffff88000e9878b0 ffff88003d6d02c0 00000000fffd9bc4 ffff88003ffdf100
[  145.684417]  ffff88000e987890 ffffffff81666600 ffff88003d6d02c0 ffff88000e987938
[  145.684418] Call Trace:
[  145.684419]  [<ffffffff81666600>] schedule+0x30/0x80
[  145.684419]  [<ffffffff8166a687>] schedule_timeout+0x117/0x1c0
[  145.684420]  [<ffffffff810bc306>] ? mark_held_locks+0x66/0x90
[  145.684421]  [<ffffffff810def90>] ? init_timer_key+0x40/0x40
[  145.684422]  [<ffffffff810e5e17>] ? ktime_get+0xa7/0x130
[  145.684423]  [<ffffffff81665b41>] io_schedule_timeout+0xa1/0x110
[  145.684424]  [<ffffffff81160ccd>] congestion_wait+0x7d/0xd0
[  145.684425]  [<ffffffff810b63a0>] ? wait_woken+0x80/0x80
[  145.684426]  [<ffffffff8114a602>] __alloc_pages_nodemask+0xb42/0xd50
[  145.684427]  [<ffffffff810bc300>] ? mark_held_locks+0x60/0x90
[  145.684428]  [<ffffffff81193a26>] alloc_pages_current+0x96/0x1b0
[  145.684430]  [<ffffffff812e1b3d>] ? bio_alloc_bioset+0x20d/0x2d0
[  145.684431]  [<ffffffff812e2e74>] bio_copy_kern+0xc4/0x180
[  145.684433]  [<ffffffff812edb20>] blk_rq_map_kern+0x70/0x130
[  145.684435]  [<ffffffff8145255d>] scsi_execute+0x12d/0x160
[  145.684436]  [<ffffffff81452684>] scsi_execute_req_flags+0x84/0xf0
[  145.684438]  [<ffffffffa01ed762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[  145.684440]  [<ffffffffa01e1163>] cdrom_check_events+0x13/0x30 [cdrom]
[  145.684441]  [<ffffffffa01edba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[  145.684442]  [<ffffffff812f928b>] disk_check_events+0x5b/0x150
[  145.684443]  [<ffffffff812f9397>] disk_events_workfn+0x17/0x20
[  145.684445]  [<ffffffff8108b4c5>] process_one_work+0x1a5/0x400
[  145.684446]  [<ffffffff8108b461>] ? process_one_work+0x141/0x400
[  145.684448]  [<ffffffff8108b846>] worker_thread+0x126/0x490
[  145.684449]  [<ffffffff81665ec1>] ? __schedule+0x311/0xa20
[  145.684450]  [<ffffffff8108b720>] ? process_one_work+0x400/0x400
[  145.684451]  [<ffffffff8109181e>] kthread+0xee/0x110
[  145.684452]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  145.684453]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  208.035194] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[  208.051970] lowmem_reserve[]: 0 953 953 953
[  208.054174] Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34724kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39064kB kernel_stack:20512kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1644kB local_pcp:108kB free_cma:0kB writeback_tmp:0kB pages_scanned:1882904 all_unreclaimable? yes
[  208.072237] lowmem_reserve[]: 0 0 0 0
[  208.074340] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[  208.080915] Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB
(...snipped...)
[  290.388544] INFO: task kswapd0:52 blocked for more than 120 seconds.
[  290.391197]       Tainted: G        W       4.5.0-rc7-next-20160310 #103
[  290.393979] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  290.397150] kswapd0         D ffff880039ffb760     0    52      2 0x00000000
[  290.400194]  ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[  290.403394]  ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[  290.406715]  ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[  290.409874] Call Trace:
[  290.411242]  [<ffffffff81666600>] schedule+0x30/0x80
[  290.413423]  [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[  290.416100]  [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[  290.418835]  [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[  290.421278]  [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[  290.423672]  [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[  290.426042]  [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[  290.428569]  [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[  290.431173]  [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[  290.433753]  [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[  290.436135]  [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[  290.438583]  [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[  290.441090]  [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[  290.443500]  [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[  290.445703]  [<ffffffff81157661>] kswapd+0x441/0x830
[  290.447973]  [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  290.450676]  [<ffffffff8109181e>] kthread+0xee/0x110
[  290.452780]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  290.455018]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
[  290.457910] 1 lock held by kswapd0/52:
[  290.459813]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
(...snipped...)
[  336.562747] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[  336.589823] lowmem_reserve[]: 0 953 953 953
[  336.593296] Node 0 DMA32 free:3776kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34724kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39192kB kernel_stack:20416kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1520kB local_pcp:100kB free_cma:0kB writeback_tmp:0kB pages_scanned:1001584 all_unreclaimable? yes
[  336.618011] lowmem_reserve[]: 0 0 0 0
[  336.620073] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[  336.626844] Node 0 DMA32: 860*4kB (UME) 18*8kB (UME) 8*16kB (UM) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3776kB
(...snipped...)
[  393.774051] kswapd0         D ffff880039ffb760     0    52      2 0x00000000
[  393.777018]  ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[  393.779986]  ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[  393.783000]  ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[  393.785958] Call Trace:
[  393.787191]  [<ffffffff81666600>] schedule+0x30/0x80
[  393.789198]  [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[  393.791707]  [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[  393.794364]  [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[  393.796634]  [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[  393.798952]  [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[  393.801274]  [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[  393.803709]  [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[  393.806254]  [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[  393.808718]  [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[  393.811002]  [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[  393.813415]  [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[  393.815834]  [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[  393.818316]  [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[  393.820472]  [<ffffffff81157661>] kswapd+0x441/0x830
[  393.822658]  [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  393.825463]  [<ffffffff8109181e>] kthread+0xee/0x110
[  393.827626]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  393.829824]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  395.000240] file_io.00      D ffff88003ac97cb8     0  1248      1 0x00100084
[  395.003355]  ffff88003ac97cb8 ffff88003b8760c0 ffff88003658c040 ffff88003ac98000
[  395.006582]  ffff88003a280ac8 0000000000000246 ffff88003658c040 00000000ffffffff
[  395.010026]  ffff88003ac97cd0 ffffffff81666600 ffff88003a280ac0 ffff88003ac97ce0
[  395.013010] Call Trace:
[  395.014201]  [<ffffffff81666600>] schedule+0x30/0x80
[  395.016248]  [<ffffffff81666909>] schedule_preempt_disabled+0x9/0x10
[  395.018824]  [<ffffffff816684bf>] mutex_lock_nested+0x14f/0x3a0
[  395.021194]  [<ffffffffa0237eef>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  395.024197]  [<ffffffff810bd130>] ? __lock_acquire+0x8c0/0x1f50
[  395.026672]  [<ffffffffa0237eef>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  395.029525]  [<ffffffff8111dfca>] ? __audit_syscall_entry+0xaa/0xf0
[  395.032029]  [<ffffffffa023810a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[  395.034589]  [<ffffffff811bf327>] __vfs_write+0xc7/0x100
[  395.036723]  [<ffffffff811bfedd>] vfs_write+0x9d/0x190
[  395.038841]  [<ffffffff811df5da>] ? __fget_light+0x6a/0x90
[  395.041069]  [<ffffffff811c0713>] SyS_write+0x53/0xd0
[  395.043258]  [<ffffffff8100364d>] do_syscall_64+0x5d/0x180
[  395.045511]  [<ffffffff8166b57f>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  446.012823] kworker/3:3     D ffff88000e987878     0  2329      2 0x00000080
[  446.015632] Workqueue: events_freezable_power_ disk_events_workfn
[  446.018103]  ffff88000e987878 ffff88003cc0c040 ffff88000e980100 ffff88000e988000
[  446.021099]  ffff88000e9878b0 ffff88003d6d02c0 0000000100016c95 ffff88003ffdf100
[  446.024247]  ffff88000e987890 ffffffff81666600 ffff88003d6d02c0 ffff88000e987938
[  446.027332] Call Trace:
[  446.028568]  [<ffffffff81666600>] schedule+0x30/0x80
[  446.030748]  [<ffffffff8166a687>] schedule_timeout+0x117/0x1c0
[  446.033122]  [<ffffffff810bc306>] ? mark_held_locks+0x66/0x90
[  446.035466]  [<ffffffff810def90>] ? init_timer_key+0x40/0x40
[  446.037756]  [<ffffffff810e5e17>] ? ktime_get+0xa7/0x130
[  446.039960]  [<ffffffff81665b41>] io_schedule_timeout+0xa1/0x110
[  446.042385]  [<ffffffff81160ccd>] congestion_wait+0x7d/0xd0
[  446.044651]  [<ffffffff810b63a0>] ? wait_woken+0x80/0x80
[  446.046817]  [<ffffffff8114a602>] __alloc_pages_nodemask+0xb42/0xd50
[  446.049395]  [<ffffffff810bc300>] ? mark_held_locks+0x60/0x90
[  446.051700]  [<ffffffff81193a26>] alloc_pages_current+0x96/0x1b0
[  446.054089]  [<ffffffff812e1b3d>] ? bio_alloc_bioset+0x20d/0x2d0
[  446.056515]  [<ffffffff812e2e74>] bio_copy_kern+0xc4/0x180
[  446.058737]  [<ffffffff812edb20>] blk_rq_map_kern+0x70/0x130
[  446.061105]  [<ffffffff8145255d>] scsi_execute+0x12d/0x160
[  446.063334]  [<ffffffff81452684>] scsi_execute_req_flags+0x84/0xf0
[  446.065810]  [<ffffffffa01ed762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[  446.068343]  [<ffffffffa01e1163>] cdrom_check_events+0x13/0x30 [cdrom]
[  446.070897]  [<ffffffffa01edba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[  446.073569]  [<ffffffff812f928b>] disk_check_events+0x5b/0x150
[  446.075895]  [<ffffffff812f9397>] disk_events_workfn+0x17/0x20
[  446.078340]  [<ffffffff8108b4c5>] process_one_work+0x1a5/0x400
[  446.080696]  [<ffffffff8108b461>] ? process_one_work+0x141/0x400
[  446.083069]  [<ffffffff8108b846>] worker_thread+0x126/0x490
[  446.085395]  [<ffffffff81665ec1>] ? __schedule+0x311/0xa20
[  446.087587]  [<ffffffff8108b720>] ? process_one_work+0x400/0x400
[  446.089996]  [<ffffffff8109181e>] kthread+0xee/0x110
[  446.092242]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  446.094527]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
---------- console log end ----------

^ permalink raw reply	[flat|nested] 299+ messages in thread

[  290.436135]  [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[  290.438583]  [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[  290.441090]  [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[  290.443500]  [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[  290.445703]  [<ffffffff81157661>] kswapd+0x441/0x830
[  290.447973]  [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  290.450676]  [<ffffffff8109181e>] kthread+0xee/0x110
[  290.452780]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  290.455018]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
[  290.457910] 1 lock held by kswapd0/52:
[  290.459813]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
(...snipped...)
[  336.562747] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[  336.589823] lowmem_reserve[]: 0 953 953 953
[  336.593296] Node 0 DMA32 free:3776kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34724kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39192kB kernel_stack:20416kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1520kB local_pcp:100kB free_cma:0kB writeback_tmp:0kB pages_scanned:1001584 all_unreclaimable? yes
[  336.618011] lowmem_reserve[]: 0 0 0 0
[  336.620073] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[  336.626844] Node 0 DMA32: 860*4kB (UME) 18*8kB (UME) 8*16kB (UM) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3776kB
(...snipped...)
[  393.774051] kswapd0         D ffff880039ffb760     0    52      2 0x00000000
[  393.777018]  ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[  393.779986]  ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[  393.783000]  ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[  393.785958] Call Trace:
[  393.787191]  [<ffffffff81666600>] schedule+0x30/0x80
[  393.789198]  [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[  393.791707]  [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[  393.794364]  [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[  393.796634]  [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[  393.798952]  [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[  393.801274]  [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[  393.803709]  [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[  393.806254]  [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[  393.808718]  [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[  393.811002]  [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[  393.813415]  [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[  393.815834]  [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[  393.818316]  [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[  393.820472]  [<ffffffff81157661>] kswapd+0x441/0x830
[  393.822658]  [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[  393.825463]  [<ffffffff8109181e>] kthread+0xee/0x110
[  393.827626]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  393.829824]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  395.000240] file_io.00      D ffff88003ac97cb8     0  1248      1 0x00100084
[  395.003355]  ffff88003ac97cb8 ffff88003b8760c0 ffff88003658c040 ffff88003ac98000
[  395.006582]  ffff88003a280ac8 0000000000000246 ffff88003658c040 00000000ffffffff
[  395.010026]  ffff88003ac97cd0 ffffffff81666600 ffff88003a280ac0 ffff88003ac97ce0
[  395.013010] Call Trace:
[  395.014201]  [<ffffffff81666600>] schedule+0x30/0x80
[  395.016248]  [<ffffffff81666909>] schedule_preempt_disabled+0x9/0x10
[  395.018824]  [<ffffffff816684bf>] mutex_lock_nested+0x14f/0x3a0
[  395.021194]  [<ffffffffa0237eef>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  395.024197]  [<ffffffff810bd130>] ? __lock_acquire+0x8c0/0x1f50
[  395.026672]  [<ffffffffa0237eef>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[  395.029525]  [<ffffffff8111dfca>] ? __audit_syscall_entry+0xaa/0xf0
[  395.032029]  [<ffffffffa023810a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[  395.034589]  [<ffffffff811bf327>] __vfs_write+0xc7/0x100
[  395.036723]  [<ffffffff811bfedd>] vfs_write+0x9d/0x190
[  395.038841]  [<ffffffff811df5da>] ? __fget_light+0x6a/0x90
[  395.041069]  [<ffffffff811c0713>] SyS_write+0x53/0xd0
[  395.043258]  [<ffffffff8100364d>] do_syscall_64+0x5d/0x180
[  395.045511]  [<ffffffff8166b57f>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[  446.012823] kworker/3:3     D ffff88000e987878     0  2329      2 0x00000080
[  446.015632] Workqueue: events_freezable_power_ disk_events_workfn
[  446.018103]  ffff88000e987878 ffff88003cc0c040 ffff88000e980100 ffff88000e988000
[  446.021099]  ffff88000e9878b0 ffff88003d6d02c0 0000000100016c95 ffff88003ffdf100
[  446.024247]  ffff88000e987890 ffffffff81666600 ffff88003d6d02c0 ffff88000e987938
[  446.027332] Call Trace:
[  446.028568]  [<ffffffff81666600>] schedule+0x30/0x80
[  446.030748]  [<ffffffff8166a687>] schedule_timeout+0x117/0x1c0
[  446.033122]  [<ffffffff810bc306>] ? mark_held_locks+0x66/0x90
[  446.035466]  [<ffffffff810def90>] ? init_timer_key+0x40/0x40
[  446.037756]  [<ffffffff810e5e17>] ? ktime_get+0xa7/0x130
[  446.039960]  [<ffffffff81665b41>] io_schedule_timeout+0xa1/0x110
[  446.042385]  [<ffffffff81160ccd>] congestion_wait+0x7d/0xd0
[  446.044651]  [<ffffffff810b63a0>] ? wait_woken+0x80/0x80
[  446.046817]  [<ffffffff8114a602>] __alloc_pages_nodemask+0xb42/0xd50
[  446.049395]  [<ffffffff810bc300>] ? mark_held_locks+0x60/0x90
[  446.051700]  [<ffffffff81193a26>] alloc_pages_current+0x96/0x1b0
[  446.054089]  [<ffffffff812e1b3d>] ? bio_alloc_bioset+0x20d/0x2d0
[  446.056515]  [<ffffffff812e2e74>] bio_copy_kern+0xc4/0x180
[  446.058737]  [<ffffffff812edb20>] blk_rq_map_kern+0x70/0x130
[  446.061105]  [<ffffffff8145255d>] scsi_execute+0x12d/0x160
[  446.063334]  [<ffffffff81452684>] scsi_execute_req_flags+0x84/0xf0
[  446.065810]  [<ffffffffa01ed762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[  446.068343]  [<ffffffffa01e1163>] cdrom_check_events+0x13/0x30 [cdrom]
[  446.070897]  [<ffffffffa01edba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[  446.073569]  [<ffffffff812f928b>] disk_check_events+0x5b/0x150
[  446.075895]  [<ffffffff812f9397>] disk_events_workfn+0x17/0x20
[  446.078340]  [<ffffffff8108b4c5>] process_one_work+0x1a5/0x400
[  446.080696]  [<ffffffff8108b461>] ? process_one_work+0x141/0x400
[  446.083069]  [<ffffffff8108b846>] worker_thread+0x126/0x490
[  446.085395]  [<ffffffff81665ec1>] ? __schedule+0x311/0xa20
[  446.087587]  [<ffffffff8108b720>] ? process_one_work+0x400/0x400
[  446.089996]  [<ffffffff8109181e>] kthread+0xee/0x110
[  446.092242]  [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[  446.094527]  [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
---------- console log end ----------


^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-09 11:11                 ` Michal Hocko
@ 2016-03-11 12:17                   ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-11 12:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

On Wed, 9 Mar 2016, Michal Hocko wrote:
> Joonsoo has pointed out that this attempt is still not sufficient
> because we might have invoked only a single compaction round which
> might not be enough. I fully agree with that. Here is my take on
> that. It is again based on the number of retries loop.
> 
> I was also playing with an idea of doing something similar to the
> reclaim retry logic:
> 	if (order) {
> 		if (compaction_made_progress(compact_result)
> 			no_compact_progress = 0;
> 		else if (compaction_failed(compact_result)
> 			no_compact_progress++;
> 	}
> but it is compaction_failed() part which is not really
> straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> compaction_suitable however hide this from compaction users so it
> seems like we can never see it.
> 
> Maybe we can update the feedback mechanism from the compaction but
> retries count seems reasonably easy to understand and pragmatic. If
> we cannot form an order page after we tried for N times then it really
> doesn't make much sense to continue and we are oom for this order. I am
> holding my breath to hear from Hugh on this, though.

Never a wise strategy.  But I just got around to it tonight.

I do believe you've nailed it with this patch!  Thank you!

I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
but I've not got to them yet): so far it is looking good on all machines.

After a quick go with the simple make -j20 in tmpfs, which survived
a cycle on the laptop, I've switched back to my original tougher load,
and that's going well so far: no sign of any OOMs.  But I've interrupted
on the laptop to report back to you now, then I'll leave it running
overnight.

> In case it doesn't
> then I would be really interested whether changing MAX_COMPACT_RETRIES
> makes any difference.
> 
> I haven't preserved Tested-by from Sergey to be on the safe side even
> though strictly speaking this should be less prone to high order OOMs
> because we clearly retry more times.
> ---
> From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 9 Mar 2016 10:57:42 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> that further reclaim/compaction attempts would help, but at least make
> sure that compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to OOM, as long as
> 	- the last compaction round was either inactive (deferred,
> 	  skipped or bailed out early due to contention) or
> 	- we haven't completed at least MAX_COMPACT_RETRIES successful
> 	  (either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
> 	  rounds.
> 
> The first rule ensures that we do not go OOM if the very last compaction
> attempt was ignored, while the second guarantees that the compaction has
> done some work. Multiple retries might be needed to prevent other contexts
> from occasionally piggybacking on the compaction and stealing the compacted
> pages before the current context manages to allocate them.
> 
> If the given number of successful retries is not sufficient for
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h | 10 +++++++
>  mm/page_alloc.c            | 68 +++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 62 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b167801187e7..7d028ccf440a 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
>  				bool alloc_success);
>  extern bool compaction_restarting(struct zone *zone, int order);
>  
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	return (compact_result > COMPACT_SKIPPED &&
> +				compact_result < COMPACT_NO_SUITABLE_PAGE)

That line didn't build at all:

        return result > COMPACT_SKIPPED && result < COMPACT_NO_SUITABLE_PAGE;
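
(For reference, this is the helper from the patch above with that one-line
fix substituted in - just Hugh's correction shown in context, not a
separately tested version:)

	static inline bool compaction_made_progress(enum compact_result result)
	{
		return result > COMPACT_SKIPPED && result < COMPACT_NO_SUITABLE_PAGE;
	}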

> +}
> +
>  #else
>  static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>  			unsigned int order, int alloc_flags,
> @@ -93,6 +99,10 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return true;
>  }
>  
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4acc0aa1aee0..5f1fc3793836 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2813,34 +2813,33 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> +
> +/*
> + * Maximum number of compaction retries wit a progress before OOM
> + * killer is consider as the only way to move forward.
> + */
> +#define MAX_COMPACT_RETRIES 16
> +
>  #ifdef CONFIG_COMPACTION
>  /* Try memory compaction for high-order allocations before reclaim */
>  static struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
> -	enum compact_result compact_result;
>  	struct page *page;
>  
>  	if (!order)
>  		return NULL;
>  
>  	current->flags |= PF_MEMALLOC;
> -	compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> +	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>  						mode, contended_compaction);
>  	current->flags &= ~PF_MEMALLOC;
>  
> -	switch (compact_result) {
> -	case COMPACT_DEFERRED:
> -		*deferred_compaction = true;
> -		/* fall-through */
> -	case COMPACT_SKIPPED:
> +	if (*compact_result <= COMPACT_SKIPPED)
>  		return NULL;
> -	default:
> -		break;
> -	}
>  
>  	/*
>  	 * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2870,15 +2869,44 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction, int compaction_retries)
> +{
> +	/*
> +	 * !costly allocations are really important and we have to make sure
> +	 * the compaction wasn't deferred or didn't bail out early due to locks
> +	 * contention before we go OOM. Still cap the reclaim retry loops with
> +	 * progress to prevent from looping forever and potential trashing.
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +		if (compact_result <= COMPACT_SKIPPED)
> +			return true;
> +		if (contended_compaction > COMPACT_CONTENDED_NONE)
> +			return true;
> +		if (compaction_retries <= MAX_COMPACT_RETRIES)
> +			return true;
> +	}
> +
> +	return false;
> +}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  		int alloc_flags, const struct alloc_context *ac,
>  		enum migrate_mode mode, int *contended_compaction,
> -		bool *deferred_compaction)
> +		enum compact_result *compact_result)
>  {
>  	return NULL;
>  }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> +		     int contended_compaction, int compaction_retries)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
> @@ -3118,7 +3146,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	int alloc_flags;
>  	unsigned long did_some_progress;
>  	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> -	bool deferred_compaction = false;
> +	enum compact_result compact_result;
> +	int compaction_retries = 0;
>  	int contended_compaction = COMPACT_CONTENDED_NONE;
>  	int no_progress_loops = 0;
>  
> @@ -3227,10 +3256,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
>  					&contended_compaction,
> -					&deferred_compaction);
> +					&compact_result);
>  	if (page)
>  		goto got_pg;
>  
> +	if (order && compaction_made_progress(compact_result))
> +		compaction_retries++;
> +
>  	/* Checks for THP-specific high-order allocations */
>  	if (is_thp_gfp_mask(gfp_mask)) {
>  		/*
> @@ -3240,7 +3272,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		 * to heavily disrupt the system, so we fail the allocation
>  		 * instead of entering direct reclaim.
>  		 */
> -		if (deferred_compaction)
> +		if (compact_result == COMPACT_DEFERRED)
>  			goto nopage;
>  
>  		/*
> @@ -3294,6 +3326,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				 did_some_progress > 0, no_progress_loops))
>  		goto retry;
>  
> +	if (should_compact_retry(order, compact_result, contended_compaction,
> +				 compaction_retries))
> +		goto retry;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)
> @@ -3314,7 +3350,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
>  					    ac, migration_mode,
>  					    &contended_compaction,
> -					    &deferred_compaction);
> +					    &compact_result);
>  	if (page)
>  		goto got_pg;
>  nopage:
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-11 12:17                   ` Hugh Dickins
@ 2016-03-11 13:06                     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-11 13:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> On Wed, 9 Mar 2016, Michal Hocko wrote:
> > Joonsoo has pointed out that this attempt is still not sufficient
> > because we might have invoked only a single compaction round which
> > might not be enough. I fully agree with that. Here is my take on
> > that. It is again based on the number of retries loop.
> > 
> > I was also playing with an idea of doing something similar to the
> > reclaim retry logic:
> > 	if (order) {
> > 		if (compaction_made_progress(compact_result)
> > 			no_compact_progress = 0;
> > 		else if (compaction_failed(compact_result)
> > 			no_compact_progress++;
> > 	}
> > but it is compaction_failed() part which is not really
> > straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > compaction_suitable however hide this from compaction users so it
> > seems like we can never see it.
> > 
> > Maybe we can update the feedback mechanism from the compaction but
> > retries count seems reasonably easy to understand and pragmatic. If
> > we cannot form an order page after we tried for N times then it really
> > doesn't make much sense to continue and we are oom for this order. I am
> > holding my breath to hear from Hugh on this, though.
> 
> Never a wise strategy.  But I just got around to it tonight.
> 
> I do believe you've nailed it with this patch!  Thank you!

That's great news! Thanks for testing.

> I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> but I've not got to them yet): so far it is looking good on all machines.
> 
> After a quick go with the simple make -j20 in tmpfs, which survived
> a cycle on the laptop, I've switched back to my original tougher load,
> and that's going well so far: no sign of any OOMs.  But I've interrupted
> on the laptop to report back to you now, then I'll leave it running
> overnight.

OK, let's wait for the rest of the tests, but I find this really promising
considering how easily you could trigger the issue previously. Anyway,
I hope for your Tested-by once you are reasonably confident your loads
are behaving well.

[...]
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index b167801187e7..7d028ccf440a 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> >  				bool alloc_success);
> >  extern bool compaction_restarting(struct zone *zone, int order);
> >  
> > +static inline bool compaction_made_progress(enum compact_result result)
> > +{
> > +	return (compact_result > COMPACT_SKIPPED &&
> > +				compact_result < COMPACT_NO_SUITABLE_PAGE)
> 
> That line didn't build at all:
> 
>         return result > COMPACT_SKIPPED && result < COMPACT_NO_SUITABLE_PAGE;

those last minute changes... Sorry about that. Fixed.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 10:45   ` Tetsuo Handa
@ 2016-03-11 13:08     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-11 13:08 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> (Posting as a reply to this thread.)

I really do not see how this is related to this thread.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 13:08     ` Michal Hocko
@ 2016-03-11 13:32       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-11 13:32 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > (Posting as a reply to this thread.)
> 
> I really do not see how this is related to this thread.

All allocating tasks are looping at

                        /*
                         * If we didn't make any progress and have a lot of
                         * dirty + writeback pages then we should wait for
                         * an IO to complete to slow down the reclaim and
                         * prevent from pre mature OOM
                         */
                        if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
                                congestion_wait(BLK_RW_ASYNC, HZ/10);
                                return true;
                        }

in should_reclaim_retry().

should_reclaim_retry() was added by the OOM detection rework, wasn't it?

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-09 10:41                   ` Michal Hocko
@ 2016-03-11 14:53                     ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-11 14:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

2016-03-09 19:41 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
>> 2016-03-09 1:05 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
>> >> 2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> >> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> >> >> Andrew,
>> >> >> could you queue this one as well, please? This is more a band aid than a
>> >> >> real solution which I will be working on as soon as I am able to
>> >> >> reproduce the issue but the patch should help to some degree at least.
>> >> >
>> >> > Joonsoo wasn't very happy about this approach so let me try a different
>> >> > way. What do you think about the following? Hugh, Sergey does it help
>> >>
>> >> I'm still not happy. Just ensuring one compaction run doesn't mean our
>> >> best.
>> >
>> > OK, let me think about it some more.
>> >
>> >> What's your purpose of OOM rework? From my understanding,
>> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
>> >> This makes sense.
>> >
>> > Well this is a bit awkward because we do not have any proper definition
>> > of what prematurely actually means. We do not know whether something
>>
>> If we don't have proper definition to it, please define it first.
>
> OK, I should have probably said that _there_is_no_proper_definition_...
> This will always be about heuristics as the clear cut can be pretty
> subjective and what some load might see as unreasonable retries others
> might see as insufficient. Our ultimate goal is to behave reasonable for
> reasonable workloads. I am somehow skeptical about formulating this
> into a single equation...

I don't want a theoretically perfect definition. We need something that
can be used for judging further changes. So, how can you judge what is
reasonable behavior for a reasonable workload? What are your criteria?
If someone complains that 16 retries is too small and another complains
that 16 retries is too big, what's your decision in this case?

If you decide to increase the number of retries in this case, when can we
stop increasing it? If someone complains again that XX is too small,
do you continue to increase it?

For me, for the order-0 case, the reasonable part is the watermark check
against available (free + reclaimable) memory. It shows that we've done
our best, so it doesn't matter how many times we retry.
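
A minimal sketch of that order-0 convergence test (the standalone form and
all names here are illustrative assumptions, not the actual mm/page_alloc.c
code):

	static bool order0_retry_worthwhile(unsigned long free_pages,
					    unsigned long reclaimable_pages,
					    unsigned long min_watermark)
	{
		/* best case outcome of further reclaim: everything reclaimable gets freed */
		unsigned long available = free_pages + reclaimable_pages;

		/* once even that cannot reach the watermark, more retries cannot help */
		return available > min_watermark;
	}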

But, for the high-order case, there is no *feasible* estimation. A watermark
check as you did here isn't feasible because the high-order freepage
problem usually happens when there are enough but fragmented freepages.
It would always fail. Without a feasible estimation, N retries can't
show anything.

Your logic here is just like below.

"We've tried N times reclaim/compaction and failed. It is proved that
there is no possibility to make high order page. We should trigger OOM now."

Is it true that there is no possibility to make high order page in this case?
Can you be sure?

If someone who gets an OOM complains about a regression, can you persuade him
with the above logic?

I don't think so. This is why I ask you to make a proper definition of the
term *premature* here.

>> We need to improve the situation toward the clear goal. Just certain
>> number of retry which has no base doesn't make any sense.
>
> Certain number of retries is what we already have right now. And that
> certain number is hard to define even though it looks as simple as
>
> NR_PAGES_SCANNED < 6*zone_reclaimable_pages && no_reclaimable_pages
>
> because this is highly fragile when there are only few pages freed
> regularly but not sufficient to get us out of the loop... I am trying
> to formulate those retries somehow more deterministically considering
> the feedback _and_ an estimate about the feasibility of future
> reclaim/compaction. I admit that my attempts at compaction part have
> been far from ideal so far. Partially because I missed many aspects
> how it works.
> [...]
>> > not fire _often_ to be impractical. There are loads where the new
>> > implementation behaved slightly better (see the cover for my tests) and
>> > there surely be some where this will be worse. I want this to be
>> > reasonably good. I am not claiming we are there yet and the interaction
>> > with the compaction seems like it needs some work, no question about
>> > that.
>> >
>> >> But, what you did in case of high order allocation is completely different
>> >> with original purpose. It may be deterministic but *completely premature*.
>> >> There is no way to prevent premature OOM kill. So, I want to ask one more
>> >> time. Why OOM kill is better than retry reclaiming when there is reclaimable
>> >> page? Deterministic is for what? It ensures something more?
>> >
>> > yes, If we keep reclaiming we can soon start trashing or over reclaim
>> > too much which would hurt more processes. If you invoke the OOM killer
>> > instead then chances are that you will release a lot of memory at once
>> > and that would help to reconcile the memory pressure as well as free
>> > some page blocks which couldn't have been compacted before and not
>> > affect potentially many processes. The effect would be reduced to a
>> > single process. If we had a proper trashing detection feedback we could
>> > do much more clever decisions of course.
>>
>> It looks like you did it for performance reason. You'd better think again about
>> effect of OOM kill. We don't have enough knowledge about user space program
>> architecture and killing one important process could lead to whole
>> system unusable. Moreover, OOM kill could cause important data loss so
>> should be avoided as much as possible. Performance reason cannot
>> justify OOM kill.
>
> No I am not talking about performance. I am talking about the system
> healthiness as whole.

So, do you think that more frequent OOM kills are healthier than other approaches?

>> > But back to the !costly OOMs. Once your system is fragmented so heavily
>> > that there are no free blocks that would satisfy !costly request then
>> > something has gone terribly wrong and we should fix it. To me it sounds
>> > like we do not care about those requests early enough and only start
>> > carying after we hit the wall. Maybe kcompactd can help us in this
>> > regards.
>>
>> Yes, but, it's another issue. In any situation, !costly OOM should not happen
>> prematurely.
>
> I fully agree and I guess we also agree on the assumption that we
> shouldn't retry endlessly. So let's focus on what the OOM convergence
> criteria should look like. I have another proposal which I will send as
> a reply to the previous one.

That's also insufficient to me. It just adds one more brute-force retry
for compaction without any reasonable estimation.

>> >> Please see Hugh's latest vmstat. There are plenty of anon pages when
>> >> OOM kill happens and it may have enough swap space. Even if
>> >> compaction runs and fails, why do we need to kill something
>> >> in this case? OOM kill should be a last resort.
>> >
>> > Well this would be the case even if we were trashing over swap.
>> > Refaulting the swapped out memory all over again...
>>
>> If thrashing is a main obstacle to decide proper OOM point,
>> we need to invent a way to handle thrashing or invent reasonable metric
>> which isn't affected by thrashing.
>
> Great, you are welcome to come up with one. But more seriously, isn't

For example, we can count how many pages have been reclaimed and compare
that with the number of reclaimable pages at the start. If the number of
reclaimed pages exceeds it, we can assume that we've tried to reclaim all
reclaimable pages at least once and can go to the next step, such as OOM.
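
A rough sketch of that metric (hypothetical structure and helper names, not
code from this thread):

	struct reclaim_progress {
		unsigned long reclaimable_at_start;	/* snapshot before the first retry */
		unsigned long total_reclaimed;
	};

	static bool reclaimed_everything_once(struct reclaim_progress *rp,
					      unsigned long reclaimed_this_round)
	{
		rp->total_reclaimed += reclaimed_this_round;

		/* we have reclaimed at least as much as was reclaimable at the start */
		return rp->total_reclaimed >= rp->reclaimable_at_start;
	}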

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
@ 2016-03-11 14:53                     ` Joonsoo Kim
  0 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-11 14:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

2016-03-09 19:41 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
>> 2016-03-09 1:05 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
>> >> 2016-03-08 1:08 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> >> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> >> >> Andrew,
>> >> >> could you queue this one as well, please? This is more a band aid than a
>> >> >> real solution which I will be working on as soon as I am able to
>> >> >> reproduce the issue but the patch should help to some degree at least.
>> >> >
>> >> > Joonsoo wasn't very happy about this approach so let me try a different
>> >> > way. What do you think about the following? Hugh, Sergey does it help
>> >>
>> >> I'm still not happy. Just ensuring one compaction run doesn't mean our
>> >> best.
>> >
>> > OK, let me think about it some more.
>> >
>> >> What's your purpose of OOM rework? From my understanding,
>> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
>> >> This makes sense.
>> >
>> > Well this is a bit awkward because we do not have any proper definition
>> > of what prematurely actually means. We do not know whether something
>>
>> If we don't have proper definition to it, please define it first.
>
> OK, I should have probably said that _there_is_no_proper_definition_...
> This will always be about heuristics as the clear cut can be pretty
> subjective and what some load might see as unreasonable retries others
> might see as insufficient. Our ultimate goal is to behave reasonable for
> reasonable workloads. I am somehow skeptical about formulating this
> into a single equation...

I don't want a theoretically perfect definition. We need something that
can be used for judging further changes. So, how can you judge that
reasonable behave for reasonable workload? What's your criteria?
If someone complains 16 retries is too small and the other complains
16 retries is too big, what's your decision in this case?

If you decide to increase number of retry in this case, when can we
stop that increasing? If someone complains again that XX is too small
then do you continue to increase it?

For me, for order 0 case, reasonable part is watermark checking with
available (free + reclaimable) memory. It shows that we've done
our best so it doesn't matter that how many times we retry.

But, for high order case, there is no *feasible* estimation. Watermark
check as you did here isn't feasible because high order freepage
problem usually happen when there are enough but fragmented freepages.
It would be always failed. Without feasible estimation, N retry can't
show anything.

Your logic here is just like below.

"We've tried N times reclaim/compaction and failed. It is proved that
there is no possibility to make high order page. We should trigger OOM now."

Is it true that there is no possibility to make high order page in this case?
Can you be sure?

If someone who get OOM complains regression, can you persuade him
by above logic?

I don't think so. This is why I ask you to make proper definition on
term *premature* here.

>> We need to improve the situation toward the clear goal. Just certain
>> number of retry which has no base doesn't make any sense.
>
> Certain number of retries is what we already have right now. And that
> certain number is hard to define even though it looks as simple as
>
> NR_PAGES_SCANNED < 6*zone_reclaimable_pages && no_reclaimable_pages
>
> because this is highly fragile when there are only few pages freed
> regularly but not sufficient to get us out of the loop... I am trying
> to formulate those retries somehow more deterministically considering
> the feedback _and_ an estimate about the feasibility of future
> reclaim/compaction. I admit that my attempts at compaction part have
> been far from ideal so far. Partially because I missed many aspects
> how it works.
> [...]
>> > not fire _often_ to be impractical. There are loads where the new
>> > implementation behaved slightly better (see the cover for my tests) and
>> > there surely be some where this will be worse. I want this to be
>> > reasonably good. I am not claiming we are there yet and the interaction
>> > with the compaction seems like it needs some work, no question about
>> > that.
>> >
>> >> But, what you did in case of high order allocation is completely different
>> >> with original purpose. It may be deterministic but *completely premature*.
>> >> There is no way to prevent premature OOM kill. So, I want to ask one more
>> >> time. Why OOM kill is better than retry reclaiming when there is reclaimable
>> >> page? Deterministic is for what? It ensures something more?
>> >
>> > yes, If we keep reclaiming we can soon start trashing or over reclaim
>> > too much which would hurt more processes. If you invoke the OOM killer
>> > instead then chances are that you will release a lot of memory at once
>> > and that would help to reconcile the memory pressure as well as free
>> > some page blocks which couldn't have been compacted before and not
>> > affect potentially many processes. The effect would be reduced to a
>> > single process. If we had a proper trashing detection feedback we could
>> > do much more clever decisions of course.
>>
>> It looks like you did it for performance reason. You'd better think again about
>> effect of OOM kill. We don't have enough knowledge about user space program
>> architecture and killing one important process could lead to whole
>> system unusable. Moreover, OOM kill could cause important data loss so
>> should be avoided as much as possible. Performance reason cannot
>> justify OOM kill.
>
> No I am not talking about performance. I am talking about the system
> healthiness as whole.

So, do you think that more frequent OOM kills are healthier than the
alternatives?

>> > But back to the !costly OOMs. Once your system is fragmented so heavily
>> > that there are no free blocks that would satisfy !costly request then
>> > something has gone terribly wrong and we should fix it. To me it sounds
>> > like we do not care about those requests early enough and only start
>> > carying after we hit the wall. Maybe kcompactd can help us in this
>> > regards.
>>
>> Yes, but, it's another issue. In any situation, !costly OOM should not happen
>> prematurely.
>
> I fully agree and I guess we also agree on the assumption that we
> shouldn't retry endlessly. So let's focus on what the OOM convergence
> criteria should look like. I have another proposal which I will send as
> a reply to the previous one.

That's also insufficient to me. It just adds one more brute-force retry
for compaction without any reasonable estimate.

>> >> Please see Hugh's latest vmstat. There are plenty of anon pages when
>> >> OOM kill happens and it may have enough swap space. Even if
>> >> compaction runs and fails, why do we need to kill something
>> >> in this case? OOM kill should be a last resort.
>> >
>> > Well this would be the case even if we were trashing over swap.
>> > Refaulting the swapped out memory all over again...
>>
>> If thrashing is a main obstacle to decide proper OOM point,
>> we need to invent a way to handle thrashing or invent reasonable metric
>> which isn't affected by thrashing.
>
> Great, you are welcome to come up with one. But more seriously, isn't

For example, we can count how many pages have been reclaimed and compare
that with the number of reclaimable pages at the start. If the number of
reclaimed pages exceeds it, we can assume that we have tried to reclaim
every reclaimable page at least once and can go to the next step, such as
OOM.
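
A minimal sketch of that idea (hypothetical names, not from any posted
patch) could look like this:

	/*
	 * Sketch only: remember how much was reclaimable when the
	 * allocation slowpath started and only allow OOM once at least
	 * that many pages have been reclaimed in total.
	 */
	struct reclaim_budget {
		unsigned long reclaimable_at_start;
		unsigned long reclaimed_so_far;
	};

	static bool reclaimed_everything_once(struct reclaim_budget *budget,
					      unsigned long progress)
	{
		budget->reclaimed_so_far += progress;

		return budget->reclaimed_so_far >= budget->reclaimable_at_start;
	}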

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)
  2016-03-11 14:53                     ` Joonsoo Kim
@ 2016-03-11 15:20                       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-11 15:20 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hugh Dickins, Sergey Senozhatsky, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Vlastimil Babka

On Fri 11-03-16 23:53:18, Joonsoo Kim wrote:
> 2016-03-09 19:41 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
> >> 2016-03-09 1:05 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
[...]
> >> >> What's your purpose of OOM rework? From my understanding,
> >> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
> >> >> This makes sense.
> >> >
> >> > Well this is a bit awkward because we do not have any proper definition
> >> > of what prematurely actually means. We do not know whether something
> >>
> >> If we don't have proper definition to it, please define it first.
> >
> > OK, I should have probably said that _there_is_no_proper_definition_...
> > This will always be about heuristics as the clear cut can be pretty
> > subjective and what some load might see as unreasonable retries others
> > might see as insufficient. Our ultimate goal is to behave reasonable for
> > reasonable workloads. I am somehow skeptical about formulating this
> > into a single equation...
> 
> I don't want a theoretically perfect definition. We need something that
> can be used for judging further changes. So, how can you judge that
> reasonable behave for reasonable workload? What's your criteria?
> If someone complains 16 retries is too small and the other complains
> 16 retries is too big, what's your decision in this case?

The number of retries is an implementation detail. What matters,
really, is whether we can reason about the particular load and why it
should or shouldn't trigger the OOM killer. We can use our tracepoints
to have a look, judge the overall progress or lack of it, and see
whether we could do better. The number of retries is not the first
thing to tweak; it is reclaim/compaction that should be made more
reliable. Tweaking the retries would be the very last resort. If we can
see that compaction doesn't form high-order pages at a sufficient pace
we should find out why.

> If you decide to increase number of retry in this case, when can we
> stop that increasing? If someone complains again that XX is too small
> then do you continue to increase it?
> 
> For me, for order 0 case, reasonable part is watermark checking with
> available (free + reclaimable) memory. It shows that we've done
> our best so it doesn't matter that how many times we retry.
> 
> But, for high order case, there is no *feasible* estimation. Watermark
> check as you did here isn't feasible because high order freepage
> problem usually happen when there are enough but fragmented freepages.
> It would be always failed. Without feasible estimation, N retry can't
> show anything.

That's why, in the last patch, I made the compaction retry loop
independent of it.

> Your logic here is just like below.
> 
> "We've tried N times reclaim/compaction and failed. It is proved that
> there is no possibility to make high order page. We should trigger OOM now."

Have you seen the last patch, where I make sure that compaction has to
report _success_ at least N times before we declare OOM? I think we can
be reasonably sure that compacting again and again without any bound
doesn't make much sense when it doesn't lead to a page of the requested
order.
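
For illustration only, a simplified sketch of that rule (not the code in
the patch, and the names are made up):

	/*
	 * Give up after this many successful compaction runs that still
	 * did not produce a page of the requested order.
	 */
	#define MAX_SUCCESSFUL_COMPACT_RUNS	16

	static bool keep_retrying_compaction(bool run_reported_success,
					     int *successful_runs)
	{
		if (run_reported_success)
			(*successful_runs)++;

		/* declare OOM once N successful runs were not enough */
		return *successful_runs < MAX_SUCCESSFUL_COMPACT_RUNS;
	}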

> Is it true that there is no possibility to make high order page in this case?
> Can you be sure?

The thing I am trying to tell you, and I seem to be failing here, is that
you simply cannot be sure. Full stop. We might be staggering on the edge
of the cliff and about to fall, or we might be lucky and end up on the
safe side.

> If someone who get OOM complains regression, can you persuade him
> by above logic?

This really depends on the particular load of course.

> I don't think so. This is why I ask you to make proper definition on
> term *premature* here.

Sigh. And what if that particular reporter doesn't agree with my "proper"
definition because it doesn't suit the workload of interest? I mean,
anything we end up doing is highly subjective, and it has been like that
ever since the OOM killer was introduced.

[...]
> >> It looks like you did it for performance reason. You'd better think again about
> >> effect of OOM kill. We don't have enough knowledge about user space program
> >> architecture and killing one important process could lead to whole
> >> system unusable. Moreover, OOM kill could cause important data loss so
> >> should be avoided as much as possible. Performance reason cannot
> >> justify OOM kill.
> >
> > No I am not talking about performance. I am talking about the system
> > healthiness as whole.
> 
> So, do you think that more frequent OOM kill is healthier than other ways?

I didn't say so. And except for Hugh's testcase I haven't seen the rework
cause that. As per the last testing results it seems that this particular
case has been fixed. If you believe you can see other cases then I am
more than happy to look at them.

> >> > But back to the !costly OOMs. Once your system is fragmented so heavily
> >> > that there are no free blocks that would satisfy !costly request then
> >> > something has gone terribly wrong and we should fix it. To me it sounds
> >> > like we do not care about those requests early enough and only start
> >> > carying after we hit the wall. Maybe kcompactd can help us in this
> >> > regards.
> >>
> >> Yes, but, it's another issue. In any situation, !costly OOM should not happen
> >> prematurely.
> >
> > I fully agree and I guess we also agree on the assumption that we
> > shouldn't retry endlessly. So let's focus on what the OOM convergence
> > criteria should look like. I have another proposal which I will send as
> > a reply to the previous one.
> 
> That's also insufficient to me. It just add one more brute force retry
> for compaction
> without any reasonable estimation.

Compaction currently lacks any useful feedback mechanism. If we ever grow
one I will be more than happy to make the estimate better.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 13:32       ` Tetsuo Handa
@ 2016-03-11 15:28         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-11 15:28 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > (Posting as a reply to this thread.)
> > 
> > I really do not see how this is related to this thread.
> 
> All allocating tasks are looping at
> 
>                         /*
>                          * If we didn't make any progress and have a lot of
>                          * dirty + writeback pages then we should wait for
>                          * an IO to complete to slow down the reclaim and
>                          * prevent from pre mature OOM
>                          */
>                         if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
>                                 congestion_wait(BLK_RW_ASYNC, HZ/10);
>                                 return true;
>                         }
> 
> in should_reclaim_retry().
> 
> should_reclaim_retry() was added by OOM detection rework, wan't it?

What happens without this patch applied? In other words, it all smells
like the IO got stuck somewhere and the direct reclaim cannot perform it,
so we have to wait for the flushers to make progress for us. Are they
stuck? Is the IO making any progress at all, or is it just too slow and
would actually finish eventually? Wouldn't we just wait somewhere else in
the direct reclaim path instead?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 15:28         ` Michal Hocko
@ 2016-03-11 16:49           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-11 16:49 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > (Posting as a reply to this thread.)
> > > 
> > > I really do not see how this is related to this thread.
> > 
> > All allocating tasks are looping at
> > 
> >                         /*
> >                          * If we didn't make any progress and have a lot of
> >                          * dirty + writeback pages then we should wait for
> >                          * an IO to complete to slow down the reclaim and
> >                          * prevent from pre mature OOM
> >                          */
> >                         if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> >                                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> >                                 return true;
> >                         }
> > 
> > in should_reclaim_retry().
> > 
> > should_reclaim_retry() was added by OOM detection rework, wan't it?
> 
> What happens without this patch applied. In other words, it all smells
> like the IO got stuck somewhere and the direct reclaim cannot perform it
> so we have to wait for the flushers to make a progress for us. Are those
> stuck? Is the IO making any progress at all or it is just too slow and
> it would finish actually.  Wouldn't we just wait somewhere else in the
> direct reclaim path instead.

As of next-20160311, CPU usage becomes 0% when this problem occurs.

If I remove

  mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
  mm: use watermark checks for __GFP_REPEAT high order allocations
  mm: throttle on IO only when there are too many dirty and writeback pages
  mm-oom-rework-oom-detection-checkpatch-fixes
  mm, oom: rework oom detection

then CPU usage becomes 60% and most of the allocating tasks are
looping at

        /*
         * Acquire the oom lock.  If that fails, somebody else is
         * making progress for us.
         */
        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                schedule_timeout_uninterruptible(1);
                return NULL;
        }

in __alloc_pages_may_oom() (i.e. an OOM livelock, because the OOM reaper
is disabled).

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 16:49           ` Tetsuo Handa
@ 2016-03-11 17:00             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-11 17:00 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Sat 12-03-16 01:49:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > > (Posting as a reply to this thread.)
> > > > 
> > > > I really do not see how this is related to this thread.
> > > 
> > > All allocating tasks are looping at
> > > 
> > >                         /*
> > >                          * If we didn't make any progress and have a lot of
> > >                          * dirty + writeback pages then we should wait for
> > >                          * an IO to complete to slow down the reclaim and
> > >                          * prevent from pre mature OOM
> > >                          */
> > >                         if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > >                                 congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >                                 return true;
> > >                         }
> > > 
> > > in should_reclaim_retry().
> > > 
> > > should_reclaim_retry() was added by OOM detection rework, wan't it?
> > 
> > What happens without this patch applied. In other words, it all smells
> > like the IO got stuck somewhere and the direct reclaim cannot perform it
> > so we have to wait for the flushers to make a progress for us. Are those
> > stuck? Is the IO making any progress at all or it is just too slow and
> > it would finish actually.  Wouldn't we just wait somewhere else in the
> > direct reclaim path instead.
> 
> As of next-20160311, CPU usage becomes 0% when this problem occurs.
> 
> If I remove
> 
>   mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
>   mm: use watermark checks for __GFP_REPEAT high order allocations
>   mm: throttle on IO only when there are too many dirty and writeback pages
>   mm-oom-rework-oom-detection-checkpatch-fixes
>   mm, oom: rework oom detection
> 
> then CPU usage becomes 60% and most of allocating tasks
> are looping at
> 
>         /*
>          * Acquire the oom lock.  If that fails, somebody else is
>          * making progress for us.
>          */
>         if (!mutex_trylock(&oom_lock)) {
>                 *did_some_progress = 1;
>                 schedule_timeout_uninterruptible(1);
>                 return NULL;
>         }
> 
> in __alloc_pages_may_oom() (i.e. OOM-livelock due to the OOM reaper disabled).

OK, that would suggest that the oom rework patches are not really
related. They just moved us from a livelock to a sleep, which is good in
general IMHO. We even know that the IO is most probably the problem,
because more than half of the reclaimable memory is either dirty or under
writeback. That is where you should be looking: why the IO is not making
progress, or why it is making only such slow progress.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 17:00             ` Michal Hocko
@ 2016-03-11 17:20               ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-11 17:20 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 12-03-16 01:49:26, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > What happens without this patch applied. In other words, it all smells
> > > like the IO got stuck somewhere and the direct reclaim cannot perform it
> > > so we have to wait for the flushers to make a progress for us. Are those
> > > stuck? Is the IO making any progress at all or it is just too slow and
> > > it would finish actually.  Wouldn't we just wait somewhere else in the
> > > direct reclaim path instead.
> > 
> > As of next-20160311, CPU usage becomes 0% when this problem occurs.
> > 
> > If I remove
> > 
> >   mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
> >   mm: use watermark checks for __GFP_REPEAT high order allocations
> >   mm: throttle on IO only when there are too many dirty and writeback pages
> >   mm-oom-rework-oom-detection-checkpatch-fixes
> >   mm, oom: rework oom detection
> > 
> > then CPU usage becomes 60% and most of allocating tasks
> > are looping at
> > 
> >         /*
> >          * Acquire the oom lock.  If that fails, somebody else is
> >          * making progress for us.
> >          */
> >         if (!mutex_trylock(&oom_lock)) {
> >                 *did_some_progress = 1;
> >                 schedule_timeout_uninterruptible(1);
> >                 return NULL;
> >         }
> > 
> > in __alloc_pages_may_oom() (i.e. OOM-livelock due to the OOM reaper disabled).
> 
> OK, that would suggest that the oom rework patches are not really
> related. They just moved from the livelock to a sleep which is good in
> general IMHO. We even know that it is most probably the IO that is the
> problem because we know that more than half of the reclaimable memory is
> either dirty or under writeback. That is where you should be looking.
> Why the IO is not making progress or such a slow progress.
> 

Excuse me, but I can't understand why you think the oom rework patches are not
related. This problem occurs immediately after the OOM killer is invoked, which
means that there is little reclaimable memory.

  Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34720kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39068kB kernel_stack:20512kB pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1648kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:964952 all_unreclaimable? yes
  Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB

The OOM killer is invoked (but nothing happens due to TIF_MEMDIE) if I remove
the oom rework patches, which means that there is little reclaimable memory.

My understanding is that the memory allocations needed for doing the I/O
cannot be satisfied because free: is below min:. And since kswapd got
stuck, nobody can perform the operations needed to make
2*(writeback + dirty) > reclaimable false.
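
For illustration, plugging the numbers from the zone dump above into that
check, and approximating "reclaimable" as active_file + inactive_file
(there is no swap):

  reclaimable             ~= 33388kB + 40292kB     =  73680kB
  2 * (writeback + dirty)  = 2 * (120kB + 40232kB) =  80704kB

so 2*(writeback + dirty) > reclaimable stays true, and every allocating
task keeps sleeping in congestion_wait() while nobody is able to clean
those pages.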

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-11 13:06                     ` Michal Hocko
@ 2016-03-11 19:08                       ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-11 19:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

On Fri, 11 Mar 2016, Michal Hocko wrote:
> On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> > On Wed, 9 Mar 2016, Michal Hocko wrote:
> > > Joonsoo has pointed out that this attempt is still not sufficient
> > > becasuse we might have invoked only a single compaction round which
> > > is might be not enough. I fully agree with that. Here is my take on
> > > that. It is again based on the number of retries loop.
> > > 
> > > I was also playing with an idea of doing something similar to the
> > > reclaim retry logic:
> > > 	if (order) {
> > > 		if (compaction_made_progress(compact_result)
> > > 			no_compact_progress = 0;
> > > 		else if (compaction_failed(compact_result)
> > > 			no_compact_progress++;
> > > 	}
> > > but it is compaction_failed() part which is not really
> > > straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> > > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > > compaction_suitable however hide this from compaction users so it
> > > seems like we can never see it.
> > > 
> > > Maybe we can update the feedback mechanism from the compaction but
> > > retries count seems reasonably easy to understand and pragmatic. If
> > > we cannot form a order page after we tried for N times then it really
> > > doesn't make much sense to continue and we are oom for this order. I am
> > > holding my breath to hear from Hugh on this, though.
> > 
> > Never a wise strategy.  But I just got around to it tonight.
> > 
> > I do believe you've nailed it with this patch!  Thank you!
> 
> That's a great news! Thanks for testing.
> 
> > I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> > on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> > but I've not got to them yet): so far it is looking good on all machines.
> > 
> > After a quick go with the simple make -j20 in tmpfs, which survived
> > a cycle on the laptop, I've switched back to my original tougher load,
> > and that's going well so far: no sign of any OOMs.  But I've interrupted
> > on the laptop to report back to you now, then I'll leave it running
> > overnight.
> 
> OK, let's wait for the rest of the tests but I find it really optimistic
> considering how easily you could trigger the issue previously. Anyway
> I hope for your Tested-by after you are reasonably confident your loads
> are behaving well.

Three have been stably running load for between 6 and 7 hours now,
no problems, looking very good:

Tested-by: Hugh Dickins <hughd@google.com>

I'll be interested to see how my huge tmpfs loads fare with the rework,
but I'm not quite ready to try that today; and any issue there (I've no
reason to suppose that there will be) can be a separate investigation
for me to make at some future date.  It was this order=2 regression
that was holding me back, and I've now no objection to your patches
(though nobody should imagine that I've actually studied them).

Thank you, Michal.

Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-11 17:20               ` Tetsuo Handa
@ 2016-03-12  4:08                 ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-12  4:08 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> OK, that would suggest that the oom rework patches are not really
> related. They just moved from the livelock to a sleep which is good in
> general IMHO. We even know that it is most probably the IO that is the
> problem because we know that more than half of the reclaimable memory is
> either dirty or under writeback. That is where you should be looking.
> Why the IO is not making progress or such a slow progress.
> 

A footnote. Regarding this reproducer: before the OOM detection rework
patches the problem was "anybody can declare OOM and call out_of_memory(),
but out_of_memory() does nothing because there is a thread which has
TIF_MEMDIE"; after the OOM detection rework patches the problem is "nobody
can declare OOM and call out_of_memory() (although out_of_memory() would
do nothing anyway, because there is a thread which has TIF_MEMDIE)".

Dave Chinner wrote at http://lkml.kernel.org/r/20160211225929.GU14668@dastard :
> > Although there are memory allocating tasks passing gfp flags with
> > __GFP_KSWAPD_RECLAIM, kswapd is unable to make forward progress because
> > it is blocked at down() called from memory reclaim path. And since it is
> > legal to block kswapd from memory reclaim path (am I correct?), I think
> > we must not assume that current_is_kswapd() check will break the infinite
> > loop condition.
> 
> Right, the threads that are blocked in writeback waiting on memory
> reclaim will be using GFP_NOFS to prevent recursion deadlocks, but
> that does not avoid the problem that kswapd can then get stuck
> on those locks, too. Hence there is no guarantee that kswapd can
> make reclaim progress if it does dirty page writeback...

Unless we address the issue Dave commented on, the OOM detection rework
patches add a new livelock location (which is demonstrated by this
reproducer) in the memory allocator. It is an unfortunate change to add a
new livelock location while we are trying to solve the thrashing problem.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-12  4:08                 ` Tetsuo Handa
@ 2016-03-13 14:41                   ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-13 14:41 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, that would suggest that the oom rework patches are not really
> > related. They just moved from the livelock to a sleep which is good in
> > general IMHO. We even know that it is most probably the IO that is the
> > problem because we know that more than half of the reclaimable memory is
> > either dirty or under writeback. That is where you should be looking.
> > Why the IO is not making progress or such a slow progress.
> > 
> 
> A footnote. Regarding this reproducer, the problem was "anybody can declare
> OOM and call out_of_memory(). But out_of_memory() does nothing because there
> is a thread which has TIF_MEMDIE." before the OOM detection rework patches,
> and the problem is "nobody can declare OOM and call out_of_memory(). Although
> out_of_memory() will do nothing because there is a thread which has
> TIF_MEMDIE." after the OOM detection rework patches.

According to kmallocwd, the allocating tasks are able to call
out_of_memory(), only very slowly
( http://I-love.SAKURA.ne.jp/tmp/serial-20160313.txt.xz ). So it seems
that the OOM detection rework patches are not really related after all.

> 
> Dave Chinner wrote at http://lkml.kernel.org/r/20160211225929.GU14668@dastard :
> > > Although there are memory allocating tasks passing gfp flags with
> > > __GFP_KSWAPD_RECLAIM, kswapd is unable to make forward progress because
> > > it is blocked at down() called from memory reclaim path. And since it is
> > > legal to block kswapd from memory reclaim path (am I correct?), I think
> > > we must not assume that current_is_kswapd() check will break the infinite
> > > loop condition.
> > 
> > Right, the threads that are blocked in writeback waiting on memory
> > reclaim will be using GFP_NOFS to prevent recursion deadlocks, but
> > that does not avoid the problem that kswapd can then get stuck
> > on those locks, too. Hence there is no guarantee that kswapd can
> > make reclaim progress if it does dirty page writeback...
> 
> Unless we address the issue Dave commented, the OOM detection rework patches
> add a new location of livelock (which is demonstrated by this reproducer) in
> the memory allocator. It is an unfortunate change that we add a new location
> of livelock when we are trying to solve thrashing problem.
> 

The OOM detection rework patches did not add a new livelock location;
they just did not address the problem that the I/O cannot make progress.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 3/3] mm, oom: protect !costly allocations some more
  2016-03-11 19:08                       ` Hugh Dickins
@ 2016-03-14 16:21                         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-14 16:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Sergey Senozhatsky, Vlastimil Babka,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, Joonsoo Kim,
	linux-mm, LKML

On Fri 11-03-16 11:08:05, Hugh Dickins wrote:
> On Fri, 11 Mar 2016, Michal Hocko wrote:
> > On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> > > On Wed, 9 Mar 2016, Michal Hocko wrote:
> > > > Joonsoo has pointed out that this attempt is still not sufficient
> > > > because we might have invoked only a single compaction round, which
> > > > might not be enough. I fully agree with that. Here is my take on
> > > > that. It is again based on counting the number of retries.
> > > > 
> > > > I was also playing with an idea of doing something similar to the
> > > > reclaim retry logic:
> > > > 	if (order) {
> > > > 		if (compaction_made_progress(compact_result))
> > > > 			no_compact_progress = 0;
> > > > 		else if (compaction_failed(compact_result))
> > > > 			no_compact_progress++;
> > > > 	}
> > > > but it is the compaction_failed() part which is not really
> > > > straightforward to define. Is COMPACT_NO_SUITABLE_PAGE
> > > > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > > > compaction_suitable however hide this from compaction users so it
> > > > seems we can never see it.
> > > > 
> > > > Maybe we can update the feedback mechanism from the compaction but
> > > > retries count seems reasonably easy to understand and pragmatic. If
> > > > we cannot form an order page after we have tried N times then it really
> > > > doesn't make much sense to continue and we are oom for this order. I am
> > > > holding my breath to hear from Hugh on this, though.
> > > 
> > > Never a wise strategy.  But I just got around to it tonight.
> > > 
> > > I do believe you've nailed it with this patch!  Thank you!
> > 
> > That's great news! Thanks for testing.
> > 
> > > I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> > > on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> > > but I've not got to them yet): so far it is looking good on all machines.
> > > 
> > > After a quick go with the simple make -j20 in tmpfs, which survived
> > > a cycle on the laptop, I've switched back to my original tougher load,
> > > and that's going well so far: no sign of any OOMs.  But I've interrupted
> > > on the laptop to report back to you now, then I'll leave it running
> > > overnight.
> > 
> > OK, let's wait for the rest of the tests but I find it really optimistic
> > considering how easily you could trigger the issue previously. Anyway
> > I hope for your Tested-by after you are reasonably confident your loads
> > are behaving well.
> 
> Three have been stably running load for between 6 and 7 hours now,
> no problems, looking very good:
> 
> Tested-by: Hugh Dickins <hughd@google.com>

Thanks!

> I'll be interested to see how my huge tmpfs loads fare with the rework,
> but I'm not quite ready to try that today; and any issue there (I've no
> reason to suppose that there will be) can be a separate investigation
> for me to make at some future date.  It was this order=2 regression
> that was holding me back, and I've now no objection to your patches
> (though nobody should imagine that I've actually studied them).

I still have some work pending on top and I do not want to rush these
changes, so I am targeting 4.7. 4.6 is just too close and I would hate
to push some last minute changes. I think oom_reaper is a large enough
change for 4.6 in this area.

I will post the full series after rc1. Andrew, feel free to drop it from
the mmotm tree for now. I would prefer to have the patches reviewed all
together rather than as a larger number of fixups.

Thanks Hugh for your testing. I wish I could depend on it less but I've
not been able to reproduce the problem no matter how much I tried.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages
  2015-12-15 18:19   ` Michal Hocko
@ 2016-03-17 11:35     ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-03-17 11:35 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel, mhocko

Today I was testing

----------
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c950e6e8..aa52e23ac280 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 {
 	struct wb_writeback_work *work;
 
-	if (!wb_has_dirty_io(wb))
+	if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
 		return;
 
 	/*
----------

using next-20160317, and I got the results below.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160317.txt.xz .
---------- console log ----------
[ 1354.048836] Out of memory: Kill process 3641 (file_io.02) score 1000 or sacrifice child
[ 1354.054773] Killed process 3641 (file_io.02) total-vm:4308kB, anon-rss:104kB, file-rss:1264kB, shmem-rss:0kB
[ 1593.471245] sysrq: SysRq : Show State
(...snipped...)
[ 1595.944649] kswapd0         D ffff88003681f760     0    53      2 0x00000000
[ 1595.949872]  ffff88003681f760 ffff88003fbfa140 ffff88003681a040 ffff880036820000
[ 1595.955342]  ffff88002b5e0750 ffff88002b5e0768 ffff88003681f958 0000000000000001
[ 1595.960826]  ffff88003681f778 ffffffff81660570 ffff88003681a040 ffff88003681f7d8
[ 1595.966319] Call Trace:
[ 1595.968662]  [<ffffffff81660570>] schedule+0x30/0x80
[ 1595.972552]  [<ffffffff81663fd6>] rwsem_down_read_failed+0xd6/0x140
[ 1595.977199]  [<ffffffff81322d98>] call_rwsem_down_read_failed+0x18/0x30
[ 1595.982087]  [<ffffffff810b8b4b>] down_read_nested+0x3b/0x50
[ 1595.986370]  [<ffffffffa024bcbb>] ? xfs_ilock+0x4b/0xe0 [xfs]
[ 1595.990681]  [<ffffffffa024bcbb>] xfs_ilock+0x4b/0xe0 [xfs]
[ 1595.994898]  [<ffffffffa0236330>] xfs_map_blocks+0x80/0x150 [xfs]
[ 1595.999441]  [<ffffffffa02372db>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 1596.004138]  [<ffffffffa02376b6>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 1596.008692]  [<ffffffff811538ef>] pageout.isra.43+0x18f/0x240
[ 1596.012938]  [<ffffffff81155253>] shrink_page_list+0x803/0xae0
[ 1596.017247]  [<ffffffff81155c8b>] shrink_inactive_list+0x1fb/0x460
[ 1596.021771]  [<ffffffff81156896>] shrink_zone_memcg+0x5b6/0x780
[ 1596.026103]  [<ffffffff81156b34>] shrink_zone+0xd4/0x2f0
[ 1596.030111]  [<ffffffff811579e1>] kswapd+0x441/0x830
[ 1596.033847]  [<ffffffff811575a0>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 1596.038786]  [<ffffffff8109196e>] kthread+0xee/0x110
[ 1596.042546]  [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1596.046591]  [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1596.216946] kworker/u128:1  D ffff8800368eaf78     0    70      2 0x00000000
[ 1596.222105] Workqueue: writeback wb_workfn (flush-8:0)
[ 1596.226009]  ffff8800368eaf78 ffff88003aa4c040 ffff88003686c0c0 ffff8800368ec000
[ 1596.231502]  ffff8800368eafb0 ffff88003d610300 000000010013c47d ffff88003ffdf100
[ 1596.237003]  ffff8800368eaf90 ffffffff81660570 ffff88003d610300 ffff8800368eb038
[ 1596.242505] Call Trace:
[ 1596.244750]  [<ffffffff81660570>] schedule+0x30/0x80
[ 1596.248519]  [<ffffffff816645f7>] schedule_timeout+0x117/0x1c0
[ 1596.252841]  [<ffffffff810bc5c6>] ? mark_held_locks+0x66/0x90
[ 1596.257153]  [<ffffffff810df270>] ? init_timer_key+0x40/0x40
[ 1596.261424]  [<ffffffff810e60f7>] ? ktime_get+0xa7/0x130
[ 1596.265390]  [<ffffffff8165fab1>] io_schedule_timeout+0xa1/0x110
[ 1596.269836]  [<ffffffff8116104d>] congestion_wait+0x7d/0xd0
[ 1596.273978]  [<ffffffff810b6620>] ? wait_woken+0x80/0x80
[ 1596.278153]  [<ffffffff8114a982>] __alloc_pages_nodemask+0xb42/0xd50
[ 1596.283301]  [<ffffffff81193876>] alloc_pages_current+0x96/0x1b0
[ 1596.287737]  [<ffffffffa0270d70>] xfs_buf_allocate_memory+0x170/0x2ab [xfs]
[ 1596.292829]  [<ffffffffa023c9aa>] xfs_buf_get_map+0xfa/0x160 [xfs]
[ 1596.297457]  [<ffffffffa023cea9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1596.302034]  [<ffffffffa02670e7>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1596.307004]  [<ffffffffa02171b3>] xfs_btree_read_buf_block.constprop.29+0x73/0xc0 [xfs]
[ 1596.312736]  [<ffffffffa021727b>] xfs_btree_lookup_get_block+0x7b/0xf0 [xfs]
[ 1596.317859]  [<ffffffffa021b981>] xfs_btree_lookup+0xc1/0x580 [xfs]
[ 1596.322448]  [<ffffffffa0205dcc>] ? xfs_allocbt_init_cursor+0x3c/0xc0 [xfs]
[ 1596.327478]  [<ffffffffa0204290>] xfs_alloc_ag_vextent_near+0xb0/0x880 [xfs]
[ 1596.332841]  [<ffffffffa0204b57>] xfs_alloc_ag_vextent+0xf7/0x130 [xfs]
[ 1596.338547]  [<ffffffffa02056a2>] xfs_alloc_vextent+0x3b2/0x480 [xfs]
[ 1596.343706]  [<ffffffffa021316f>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1596.348841]  [<ffffffffa02134c9>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1596.353988]  [<ffffffffa0213eba>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1596.359255]  [<ffffffffa02493cd>] xfs_iomap_write_allocate+0x16d/0x380 [xfs]
[ 1596.365138]  [<ffffffffa02363ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1596.370046]  [<ffffffffa02372db>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 1596.375322]  [<ffffffff8114d756>] write_cache_pages+0x1f6/0x490
[ 1596.380014]  [<ffffffffa0237180>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1596.385220]  [<ffffffffa0236fa6>] xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1596.389823]  [<ffffffff8114e8bc>] do_writepages+0x1c/0x30
[ 1596.393865]  [<ffffffff811ed543>] __writeback_single_inode+0x33/0x170
[ 1596.398583]  [<ffffffff811ede3e>] writeback_sb_inodes+0x2ce/0x570
[ 1596.403200]  [<ffffffff811ee167>] __writeback_inodes_wb+0x87/0xc0
[ 1596.407955]  [<ffffffff811ee38b>] wb_writeback+0x1eb/0x220
[ 1596.412037]  [<ffffffff811eea2f>] wb_workfn+0x1df/0x2b0
[ 1596.416133]  [<ffffffff8108b2c5>] process_one_work+0x1a5/0x400
[ 1596.420437]  [<ffffffff8108b261>] ? process_one_work+0x141/0x400
[ 1596.424836]  [<ffffffff8108b646>] worker_thread+0x126/0x490
[ 1596.428948]  [<ffffffff8108b520>] ? process_one_work+0x400/0x400
[ 1596.433635]  [<ffffffff8109196e>] kthread+0xee/0x110
[ 1596.437346]  [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1596.441325]  [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1599.581883] kworker/0:2     D ffff880036743878     0  3476      2 0x00000080
[ 1599.587099] Workqueue: events_freezable_power_ disk_events_workfn
[ 1599.591615]  ffff880036743878 ffffffff81c0d540 ffff880039c02040 ffff880036744000
[ 1599.597112]  ffff8800367438b0 ffff88003d610300 000000010013d1a9 ffff88003ffdf100
[ 1599.602613]  ffff880036743890 ffffffff81660570 ffff88003d610300 ffff880036743938
[ 1599.608068] Call Trace:
[ 1599.610155]  [<ffffffff81660570>] schedule+0x30/0x80
[ 1599.613996]  [<ffffffff816645f7>] schedule_timeout+0x117/0x1c0
[ 1599.618285]  [<ffffffff810bc5c6>] ? mark_held_locks+0x66/0x90
[ 1599.622537]  [<ffffffff810df270>] ? init_timer_key+0x40/0x40
[ 1599.626721]  [<ffffffff810e60f7>] ? ktime_get+0xa7/0x130
[ 1599.630666]  [<ffffffff8165fab1>] io_schedule_timeout+0xa1/0x110
[ 1599.635108]  [<ffffffff8116104d>] congestion_wait+0x7d/0xd0
[ 1599.639234]  [<ffffffff810b6620>] ? wait_woken+0x80/0x80
[ 1599.643156]  [<ffffffff8114a982>] __alloc_pages_nodemask+0xb42/0xd50
[ 1599.647774]  [<ffffffff810bc500>] ? mark_lock+0x620/0x680
[ 1599.651785]  [<ffffffff81193876>] alloc_pages_current+0x96/0x1b0
[ 1599.656235]  [<ffffffff812e108d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 1599.660662]  [<ffffffff812e2454>] bio_copy_kern+0xc4/0x180
[ 1599.664702]  [<ffffffff812ed070>] blk_rq_map_kern+0x70/0x130
[ 1599.668864]  [<ffffffff8144c4bd>] scsi_execute+0x12d/0x160
[ 1599.672950]  [<ffffffff8144c5e4>] scsi_execute_req_flags+0x84/0xf0
[ 1599.677784]  [<ffffffffa01e8762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 1599.682744]  [<ffffffffa01ce163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 1599.687747]  [<ffffffffa01e8ba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1599.692752]  [<ffffffff812f874b>] disk_check_events+0x5b/0x150
[ 1599.697130]  [<ffffffff812f8857>] disk_events_workfn+0x17/0x20
[ 1599.701783]  [<ffffffff8108b2c5>] process_one_work+0x1a5/0x400
[ 1599.706347]  [<ffffffff8108b261>] ? process_one_work+0x141/0x400
[ 1599.710809]  [<ffffffff8108b646>] worker_thread+0x126/0x490
[ 1599.715005]  [<ffffffff8108b520>] ? process_one_work+0x400/0x400
[ 1599.719427]  [<ffffffff8109196e>] kthread+0xee/0x110
[ 1599.723220]  [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1599.727240]  [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1698.163933] 1 lock held by kswapd0/53:
[ 1698.166948]  #0:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa024bcbb>] xfs_ilock+0x4b/0xe0 [xfs]
[ 1698.174361] 5 locks held by kworker/u128:1/70:
[ 1698.177849]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.184626]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.191670]  #2:  (&type->s_umount_key#30){++++++}, at: [<ffffffff811c35d6>] trylock_super+0x16/0x50
[ 1698.198449]  #3:  (sb_internal){.+.+.?}, at: [<ffffffff811c35ac>] __sb_start_write+0xcc/0xe0
[ 1698.204743]  #4:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa024bcef>] xfs_ilock+0x7f/0xe0 [xfs]
(...snipped...)
[ 1698.222061] 2 locks held by kworker/0:2/3476:
[ 1698.225546]  #0:  ("events_freezable_power_efficient"){.+.+.+}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.233350]  #1:  ((&(&ev->dwork)->work)){+.+.+.}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
(...snipped...)
[ 1718.427909] Showing busy workqueues and worker pools:
[ 1718.432224] workqueue events: flags=0x0
[ 1718.435754]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[ 1718.440769]     in-flight: 52:mptspi_dv_renegotiate_work [mptspi]
[ 1718.445766]     pending: vmpressure_work_fn, cache_reap
[ 1718.450227] workqueue events_power_efficient: flags=0x80
[ 1718.454645]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1718.459663]     pending: fb_flashcursor
[ 1718.463133] workqueue events_freezable_power_: flags=0x84
[ 1718.467620]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1718.472552]     in-flight: 3476:disk_events_workfn
[ 1718.476643] workqueue writeback: flags=0x4e
[ 1718.480197]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[ 1718.484977]     in-flight: 70:wb_workfn
[ 1718.488671] workqueue vmstat: flags=0xc
[ 1718.492312]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 MAYDAY
[ 1718.497665]     pending: vmstat_update
[ 1718.501304] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3451 3501
[ 1718.507471] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=15s workers=2 manager: 3490
[ 1718.513528] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 manager: 3471 idle: 6
[ 1745.495540] sysrq: SysRq : Show Memory
[ 1745.508581] Mem-Info:
[ 1745.516772] active_anon:182211 inactive_anon:12238 isolated_anon:0
[ 1745.516772]  active_file:6978 inactive_file:19887 isolated_file:32
[ 1745.516772]  unevictable:0 dirty:19697 writeback:214 unstable:0
[ 1745.516772]  slab_reclaimable:2382 slab_unreclaimable:8786
[ 1745.516772]  mapped:6820 shmem:12582 pagetables:1311 bounce:0
[ 1745.516772]  free:1877 free_pcp:132 free_cma:0
[ 1745.563639] Node 0 DMA free:3868kB min:60kB low:72kB high:84kB active_anon:6184kB inactive_anon:1120kB active_file:644kB inactive_file:1784kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:1784kB writeback:0kB mapped:644kB shmem:1172kB slab_reclaimable:220kB slab_unreclaimable:660kB kernel_stack:496kB pagetables:252kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:15392 all_unreclaimable? yes
[ 1745.595872] lowmem_reserve[]: 0 953 953 953
[ 1745.599508] Node 0 DMA32 free:3640kB min:3780kB low:4752kB high:5724kB active_anon:722660kB inactive_anon:47832kB active_file:27268kB inactive_file:77764kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:1032064kB managed:980852kB mlocked:0kB dirty:77004kB writeback:856kB mapped:26636kB shmem:49156kB slab_reclaimable:9308kB slab_unreclaimable:34484kB kernel_stack:19760kB pagetables:4992kB unstable:0kB bounce:0kB free_pcp:528kB local_pcp:60kB free_cma:0kB writeback_tmp:0kB pages_scanned:1387692 all_unreclaimable? yes
[ 1745.633558] lowmem_reserve[]: 0 0 0 0
[ 1745.636871] Node 0 DMA: 25*4kB (UME) 9*8kB (UME) 7*16kB (UME) 2*32kB (ME) 3*64kB (ME) 4*128kB (UE) 3*256kB (UME) 4*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 3868kB
[ 1745.648828] Node 0 DMA32: 886*4kB (UE) 8*8kB (UM) 2*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3640kB
[ 1745.658179] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1745.664712] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1745.671127] 39477 total pagecache pages
[ 1745.674392] 0 pages in swap cache
[ 1745.677315] Swap cache stats: add 0, delete 0, find 0/0
[ 1745.681493] Free swap  = 0kB
[ 1745.684113] Total swap = 0kB
[ 1745.686786] 262013 pages RAM
[ 1745.689386] 0 pages HighMem/MovableOnly
[ 1745.692883] 12824 pages reserved
[ 1745.695779] 0 pages cma reserved
[ 1745.698763] 0 pages hwpoisoned
[ 1746.841678] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 44s!
[ 1746.866634] Showing busy workqueues and worker pools:
[ 1746.881055] workqueue events: flags=0x0
[ 1746.887480]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[ 1746.894205]     in-flight: 52:mptspi_dv_renegotiate_work [mptspi]
[ 1746.900892]     pending: vmpressure_work_fn, cache_reap
[ 1746.906938] workqueue events_power_efficient: flags=0x80
[ 1746.912780]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1746.917657]     pending: fb_flashcursor
[ 1746.920983] workqueue events_freezable_power_: flags=0x84
[ 1746.925304]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1746.930114]     in-flight: 3476:disk_events_workfn
[ 1746.934076] workqueue writeback: flags=0x4e
[ 1746.937546]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[ 1746.942258]     in-flight: 70:wb_workfn
[ 1746.945978] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3451 3501
[ 1746.952268] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=44s workers=2 manager: 3490
[ 1746.958276] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 manager: 3471 idle: 6
---------- console log ----------

This is an OOM-livelocked situation where kswapd got stuck and
allocating tasks are sleeping at

	/*
	 * If we didn't make any progress and have a lot of
	 * dirty + writeback pages then we should wait for
	 * an IO to complete to slow down the reclaim and
	 * prevent from pre mature OOM
	 */
	if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		return true;
	}

in should_reclaim_retry(). Presumably out_of_memory() is called (I didn't
confirm it using kmallocwd), and this is a situation where "we need to select
the next OOM victim" or "fail !__GFP_FS && !__GFP_NOFAIL allocation requests".

But what I found strange is what should_reclaim_retry() is doing.

Michal Hocko wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f77e283fb8c6..b2de8c8761ad 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		 */
>  		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
>  				ac->high_zoneidx, alloc_flags, available)) {
> -			/* Wait for some write requests to complete then retry */
> -			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +			unsigned long writeback;
> +			unsigned long dirty;
> +
> +			writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> +			dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> +
> +			/*
> +			 * If we didn't make any progress and have a lot of
> +			 * dirty + writeback pages then we should wait for
> +			 * an IO to complete to slow down the reclaim and
> +			 * prevent from pre mature OOM
> +			 */
> +			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +				return true;
> +			}

writeback and dirty are used only when did_some_progress == 0. Thus, we don't
need to calculate writeback and dirty using zone_page_state_snapshot() unless
did_some_progress == 0.
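
A minimal sketch of the restructuring suggested here (purely illustrative,
not the actual patch; it reuses the names from the quoted hunk):

	if (!did_some_progress) {
		/*
		 * Only pay for the expensive per-CPU snapshots when reclaim
		 * made no progress; that is the only case the check below
		 * cares about.
		 */
		unsigned long writeback = zone_page_state_snapshot(zone,
							NR_WRITEBACK);
		unsigned long dirty = zone_page_state_snapshot(zone,
							NR_FILE_DIRTY);

		/* throttle instead of retrying a reclaim that made no progress */
		if (2 * (writeback + dirty) > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;
		}
	}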

But does it make sense to take writeback and dirty into account when
disk_events_workfn (trace shown above) is doing a GFP_NOIO allocation and
wb_workfn (trace shown above) is doing a (presumably) GFP_NOFS allocation?
Shouldn't we use different thresholds for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

> +
> +			/*
> +			 * Memory allocation/reclaim might be called from a WQ
> +			 * context and the current implementation of the WQ
> +			 * concurrency control doesn't recognize that
> +			 * a particular WQ is congested if the worker thread is
> +			 * looping without ever sleeping. Therefore we have to
> +			 * do a short sleep here rather than calling
> +			 * cond_resched().
> +			 */
> +			if (current->flags & PF_WQ_WORKER)
> +				schedule_timeout(1);

This schedule_timeout(1) does not sleep, because the task state is still
TASK_RUNNING at this point. You lost that fix as of next-20160317; please update.
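
For reference, a minimal sketch of a sleep that actually takes effect here
(illustrative only, not necessarily the fix that was dropped):

	/*
	 * schedule_timeout() returns immediately unless the task state was
	 * changed first; schedule_timeout_uninterruptible() sets
	 * TASK_UNINTERRUPTIBLE internally, so the worker really yields.
	 */
	if (current->flags & PF_WQ_WORKER)
		schedule_timeout_uninterruptible(1);
	else
		cond_resched();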

> +			else
> +				cond_resched();
> +
>  			return true;
>  		}
>  	}
> -- 

^ permalink raw reply related	[flat|nested] 299+ messages in thread

* Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages
  2016-03-17 11:35     ` Tetsuo Handa
@ 2016-03-17 12:01       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-17 12:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 17-03-16 20:35:23, Tetsuo Handa wrote:
[...]
> But what I felt strange is what should_reclaim_retry() is doing.
> 
> Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f77e283fb8c6..b2de8c8761ad 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >  		 */
> >  		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> >  				ac->high_zoneidx, alloc_flags, available)) {
> > -			/* Wait for some write requests to complete then retry */
> > -			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +			unsigned long writeback;
> > +			unsigned long dirty;
> > +
> > +			writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> > +			dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> > +
> > +			/*
> > +			 * If we didn't make any progress and have a lot of
> > +			 * dirty + writeback pages then we should wait for
> > +			 * an IO to complete to slow down the reclaim and
> > +			 * prevent from pre mature OOM
> > +			 */
> > +			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +				return true;
> > +			}
> 
> writeback and dirty are used only when did_some_progress == 0. Thus, we don't
> need to calculate writeback and dirty using zone_page_state_snapshot() unless
> did_some_progress == 0.

OK, I will move this into if !did_some_progress.

> But, does it make sense to take writeback and dirty into account when
> disk_events_workfn (trace shown above) is doing GFP_NOIO allocation and
> wb_workfn (trace shown above) is doing (presumably) GFP_NOFS allocation?
> Shouldn't we use different threshold for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

I have considered skipping the throttling part for GFP_NOFS/GFP_NOIO
previously but I couldn't convince myself it would make any
difference. We know there was no progress in the reclaim, and even if the
current context is potentially doing an FS/IO allocation it obviously
cannot get its memory, so it cannot proceed. So now we are in a state
where we either busy loop or sleep for a while. So I ended up not
complicating the code even more. If you have a use case where busy
waiting makes a difference then I would vote for a separate patch with a
clear description.
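
Purely as an illustration of the variant considered and rejected above (the
__GFP_FS test is my assumption of how such a skip could be expressed, not a
proposal):

	/*
	 * Hypothetical: only throttle contexts that may issue FS/IO
	 * themselves; GFP_NOFS and GFP_NOIO callers would fall through
	 * to the short sleep below instead.
	 */
	if (!did_some_progress && (gfp_mask & __GFP_FS) &&
	    2 * (writeback + dirty) > reclaimable) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		return true;
	}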

> > +
> > +			/*
> > +			 * Memory allocation/reclaim might be called from a WQ
> > +			 * context and the current implementation of the WQ
> > +			 * concurrency control doesn't recognize that
> > +			 * a particular WQ is congested if the worker thread is
> > +			 * looping without ever sleeping. Therefore we have to
> > +			 * do a short sleep here rather than calling
> > +			 * cond_resched().
> > +			 */
> > +			if (current->flags & PF_WQ_WORKER)
> > +				schedule_timeout(1);
> 
> This schedule_timeout(1) does not sleep. You lost the fix as of next-20160317.
> Please update.

Yeah, I have that updated in my local patch already.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2015-12-15 18:19   ` Michal Hocko
@ 2016-04-04  8:23     ` Vladimir Davydov
  -1 siblings, 0 replies; 299+ messages in thread
From: Vladimir Davydov @ 2016-04-04  8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Michal Hocko

On Tue, Dec 15, 2015 at 07:19:44PM +0100, Michal Hocko wrote:
...
> @@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						&nr_soft_scanned);
>  			sc->nr_reclaimed += nr_soft_reclaimed;
>  			sc->nr_scanned += nr_soft_scanned;
> -			if (nr_soft_reclaimed)
> -				reclaimable = true;
>  			/* need some check for avoid more shrink_zone() */
>  		}
>  
> -		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
> -			reclaimable = true;
> -
> -		if (global_reclaim(sc) &&
> -		    !reclaimable && zone_reclaimable(zone))
> -			reclaimable = true;
> +		shrink_zone(zone, sc, zone_idx(zone));

Shouldn't it be

		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

?

>  	}
>  
>  	/*

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-04-04  8:23     ` Vladimir Davydov
@ 2016-04-04  9:42       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-04-04  9:42 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On Mon 04-04-16 11:23:43, Vladimir Davydov wrote:
> On Tue, Dec 15, 2015 at 07:19:44PM +0100, Michal Hocko wrote:
> ...
> > @@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  						&nr_soft_scanned);
> >  			sc->nr_reclaimed += nr_soft_reclaimed;
> >  			sc->nr_scanned += nr_soft_scanned;
> > -			if (nr_soft_reclaimed)
> > -				reclaimable = true;
> >  			/* need some check for avoid more shrink_zone() */
> >  		}
> >  
> > -		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
> > -			reclaimable = true;
> > -
> > -		if (global_reclaim(sc) &&
> > -		    !reclaimable && zone_reclaimable(zone))
> > -			reclaimable = true;
> > +		shrink_zone(zone, sc, zone_idx(zone));
> 
> Shouldn't it be
> 
> 		shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
> 
> ?

I cannot remember why I removed it, so this was most likely
unintentional. Thanks for catching it. I will fold the fix into the
original patch before I repost the full series (hopefully this week).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

end of thread, other threads:[~2016-04-04  9:42 UTC | newest]

Thread overview: 299+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-15 18:19 [PATCH 0/3] OOM detection rework v4 Michal Hocko
2015-12-15 18:19 ` Michal Hocko
2015-12-15 18:19 ` [PATCH 1/3] mm, oom: rework oom detection Michal Hocko
2015-12-15 18:19   ` Michal Hocko
2016-01-14 22:58   ` David Rientjes
2016-01-14 22:58     ` David Rientjes
2016-01-16  1:07     ` Tetsuo Handa
2016-01-16  1:07       ` Tetsuo Handa
2016-01-19 22:48       ` David Rientjes
2016-01-19 22:48         ` David Rientjes
2016-01-20 11:13         ` Tetsuo Handa
2016-01-20 11:13           ` Tetsuo Handa
2016-01-20 13:13           ` Michal Hocko
2016-01-20 13:13             ` Michal Hocko
2016-04-04  8:23   ` Vladimir Davydov
2016-04-04  8:23     ` Vladimir Davydov
2016-04-04  9:42     ` Michal Hocko
2016-04-04  9:42       ` Michal Hocko
2015-12-15 18:19 ` [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages Michal Hocko
2015-12-15 18:19   ` Michal Hocko
2016-03-17 11:35   ` Tetsuo Handa
2016-03-17 11:35     ` Tetsuo Handa
2016-03-17 12:01     ` Michal Hocko
2016-03-17 12:01       ` Michal Hocko
2015-12-15 18:19 ` [PATCH 3/3] mm: use watermark checks for __GFP_REPEAT high order allocations Michal Hocko
2015-12-15 18:19   ` Michal Hocko
2015-12-16 23:35 ` [PATCH 0/3] OOM detection rework v4 Andrew Morton
2015-12-16 23:35   ` Andrew Morton
2015-12-18 12:12   ` Michal Hocko
2015-12-18 12:12     ` Michal Hocko
2015-12-16 23:58 ` Andrew Morton
2015-12-16 23:58   ` Andrew Morton
2015-12-18 13:15   ` Michal Hocko
2015-12-18 13:15     ` Michal Hocko
2015-12-18 16:35     ` Johannes Weiner
2015-12-18 16:35       ` Johannes Weiner
2015-12-24 12:41 ` Tetsuo Handa
2015-12-24 12:41   ` Tetsuo Handa
2015-12-28 12:08   ` Tetsuo Handa
2015-12-28 12:08     ` Tetsuo Handa
2015-12-28 14:13     ` Tetsuo Handa
2015-12-28 14:13       ` Tetsuo Handa
2016-01-06 12:44       ` Vlastimil Babka
2016-01-06 12:44         ` Vlastimil Babka
2016-01-08 12:37       ` Michal Hocko
2016-01-08 12:37         ` Michal Hocko
2015-12-29 16:32     ` Michal Hocko
2015-12-29 16:32       ` Michal Hocko
2015-12-30 15:05       ` Tetsuo Handa
2015-12-30 15:05         ` Tetsuo Handa
2016-01-02 15:47         ` Tetsuo Handa
2016-01-02 15:47           ` Tetsuo Handa
2016-01-20 12:24           ` Michal Hocko
2016-01-20 12:24             ` Michal Hocko
2016-01-27 23:18             ` David Rientjes
2016-01-27 23:18               ` David Rientjes
2016-01-28 21:19               ` Michal Hocko
2016-01-28 21:19                 ` Michal Hocko
2015-12-29 16:27   ` Michal Hocko
2015-12-29 16:27     ` Michal Hocko
2016-01-28 20:40 ` [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory Michal Hocko
2016-01-28 20:40   ` Michal Hocko
2016-01-28 21:36   ` Johannes Weiner
2016-01-28 21:36     ` Johannes Weiner
2016-01-28 23:19     ` David Rientjes
2016-01-28 23:19       ` David Rientjes
2016-01-28 23:51       ` Johannes Weiner
2016-01-28 23:51         ` Johannes Weiner
2016-01-29 10:39         ` Tetsuo Handa
2016-01-29 10:39           ` Tetsuo Handa
2016-01-29 15:32         ` Michal Hocko
2016-01-29 15:32           ` Michal Hocko
2016-01-30 12:18           ` Tetsuo Handa
2016-01-30 12:18             ` Tetsuo Handa
2016-01-29 15:23       ` Michal Hocko
2016-01-29 15:23         ` Michal Hocko
2016-01-29 15:24     ` Michal Hocko
2016-01-29 15:24       ` Michal Hocko
2016-01-28 21:19 ` [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise Michal Hocko
2016-01-28 21:19   ` Michal Hocko
2016-01-28 23:20   ` David Rientjes
2016-01-28 23:20     ` David Rientjes
2016-01-29  3:41   ` Hillf Danton
2016-01-29  3:41     ` Hillf Danton
2016-01-29 10:35   ` Tetsuo Handa
2016-01-29 10:35     ` Tetsuo Handa
2016-01-29 15:17     ` Michal Hocko
2016-01-29 15:17       ` Michal Hocko
2016-01-29 21:30       ` Tetsuo Handa
2016-01-29 21:30         ` Tetsuo Handa
2016-02-03 13:27 ` [PATCH 0/3] OOM detection rework v4 Michal Hocko
2016-02-03 13:27   ` Michal Hocko
2016-02-03 22:58   ` David Rientjes
2016-02-03 22:58     ` David Rientjes
2016-02-04 12:57     ` Michal Hocko
2016-02-04 12:57       ` Michal Hocko
2016-02-04 13:10       ` Tetsuo Handa
2016-02-04 13:10         ` Tetsuo Handa
2016-02-04 13:39         ` Michal Hocko
2016-02-04 13:39           ` Michal Hocko
2016-02-04 14:24           ` Michal Hocko
2016-02-04 14:24             ` Michal Hocko
2016-02-07  4:09           ` Tetsuo Handa
2016-02-07  4:09             ` Tetsuo Handa
2016-02-15 20:06             ` Michal Hocko
2016-02-15 20:06               ` Michal Hocko
2016-02-16 13:10               ` Tetsuo Handa
2016-02-16 13:10                 ` Tetsuo Handa
2016-02-16 15:19                 ` Michal Hocko
2016-02-16 15:19                   ` Michal Hocko
2016-02-25  3:47   ` Hugh Dickins
2016-02-25  3:47     ` Hugh Dickins
2016-02-25  6:48     ` Sergey Senozhatsky
2016-02-25  6:48       ` Sergey Senozhatsky
2016-02-25  9:17       ` Hillf Danton
2016-02-25  9:17         ` Hillf Danton
2016-02-25  9:27         ` Michal Hocko
2016-02-25  9:27           ` Michal Hocko
2016-02-25  9:48           ` Hillf Danton
2016-02-25  9:48             ` Hillf Danton
2016-02-25 11:02             ` Sergey Senozhatsky
2016-02-25 11:02               ` Sergey Senozhatsky
2016-02-25  9:23     ` Michal Hocko
2016-02-25  9:23       ` Michal Hocko
2016-02-26  6:32       ` Hugh Dickins
2016-02-26  6:32         ` Hugh Dickins
2016-02-26  7:54         ` Hillf Danton
2016-02-26  7:54           ` Hillf Danton
2016-02-26  9:24           ` Michal Hocko
2016-02-26  9:24             ` Michal Hocko
2016-02-26 10:27             ` Hillf Danton
2016-02-26 10:27               ` Hillf Danton
2016-02-26 13:49               ` Michal Hocko
2016-02-26 13:49                 ` Michal Hocko
2016-02-26  9:33         ` Michal Hocko
2016-02-26  9:33           ` Michal Hocko
2016-02-29 21:02       ` Michal Hocko
2016-02-29 21:02         ` Michal Hocko
2016-03-02  2:19         ` Joonsoo Kim
2016-03-02  2:19           ` Joonsoo Kim
2016-03-02  9:50           ` Michal Hocko
2016-03-02  9:50             ` Michal Hocko
2016-03-02 13:32             ` Joonsoo Kim
2016-03-02 13:32               ` Joonsoo Kim
2016-03-02 14:06               ` Michal Hocko
2016-03-02 14:06                 ` Michal Hocko
2016-03-02 14:34                 ` Joonsoo Kim
2016-03-02 14:34                   ` Joonsoo Kim
2016-03-03  9:26                   ` Michal Hocko
2016-03-03  9:26                     ` Michal Hocko
2016-03-03 10:29                     ` Tetsuo Handa
2016-03-03 10:29                       ` Tetsuo Handa
2016-03-03 14:10                     ` Joonsoo Kim
2016-03-03 14:10                       ` Joonsoo Kim
2016-03-03 15:25                       ` Michal Hocko
2016-03-03 15:25                         ` Michal Hocko
2016-03-04  5:23                         ` Joonsoo Kim
2016-03-04  5:23                           ` Joonsoo Kim
2016-03-04 15:15                           ` Michal Hocko
2016-03-04 15:15                             ` Michal Hocko
2016-03-04 17:39                             ` Michal Hocko
2016-03-04 17:39                               ` Michal Hocko
2016-03-07  5:23                             ` Joonsoo Kim
2016-03-07  5:23                               ` Joonsoo Kim
2016-03-03 15:50                       ` Vlastimil Babka
2016-03-03 15:50                         ` Vlastimil Babka
2016-03-03 16:26                         ` Michal Hocko
2016-03-03 16:26                           ` Michal Hocko
2016-03-04  7:10                         ` Joonsoo Kim
2016-03-04  7:10                           ` Joonsoo Kim
2016-03-02 15:01             ` Minchan Kim
2016-03-02 15:01               ` Minchan Kim
2016-03-07 16:08         ` [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4) Michal Hocko
2016-03-07 16:08           ` Michal Hocko
2016-03-08  3:51           ` Sergey Senozhatsky
2016-03-08  3:51             ` Sergey Senozhatsky
2016-03-08  9:08             ` Michal Hocko
2016-03-08  9:08               ` Michal Hocko
2016-03-08  9:24               ` Sergey Senozhatsky
2016-03-08  9:24                 ` Sergey Senozhatsky
2016-03-08  9:24           ` [PATCH] mm, oom: protect !costly allocations some more Vlastimil Babka
2016-03-08  9:24             ` Vlastimil Babka
2016-03-08  9:32             ` Sergey Senozhatsky
2016-03-08  9:32               ` Sergey Senozhatsky
2016-03-08  9:46             ` Michal Hocko
2016-03-08  9:46               ` Michal Hocko
2016-03-08  9:52               ` Vlastimil Babka
2016-03-08  9:52                 ` Vlastimil Babka
2016-03-08 10:10                 ` Michal Hocko
2016-03-08 10:10                   ` Michal Hocko
2016-03-08 11:12                   ` Vlastimil Babka
2016-03-08 11:12                     ` Vlastimil Babka
2016-03-08 12:22                     ` Michal Hocko
2016-03-08 12:22                       ` Michal Hocko
2016-03-08 12:29                       ` Vlastimil Babka
2016-03-08 12:29                         ` Vlastimil Babka
2016-03-08  9:58           ` [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4) Sergey Senozhatsky
2016-03-08  9:58             ` Sergey Senozhatsky
2016-03-08 13:57             ` Michal Hocko
2016-03-08 13:57               ` Michal Hocko
2016-03-08 10:36           ` Hugh Dickins
2016-03-08 13:42           ` [PATCH 0/2] oom rework: high order enhancements Michal Hocko
2016-03-08 13:42             ` Michal Hocko
2016-03-08 13:42             ` [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum Michal Hocko
2016-03-08 13:42               ` Michal Hocko
2016-03-08 14:19               ` Vlastimil Babka
2016-03-08 14:19                 ` Vlastimil Babka
2016-03-09  3:55               ` Hillf Danton
2016-03-09  3:55                 ` Hillf Danton
2016-03-08 13:42             ` [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone Michal Hocko
2016-03-08 13:42               ` Michal Hocko
2016-03-08 14:22               ` Vlastimil Babka
2016-03-08 14:22                 ` Vlastimil Babka
2016-03-09  3:57               ` Hillf Danton
2016-03-09  3:57                 ` Hillf Danton
2016-03-08 13:42             ` [PATCH 3/3] mm, oom: protect !costly allocations some more Michal Hocko
2016-03-08 13:42               ` Michal Hocko
2016-03-08 14:34               ` Vlastimil Babka
2016-03-08 14:34                 ` Vlastimil Babka
2016-03-08 14:48                 ` Michal Hocko
2016-03-08 14:48                   ` Michal Hocko
2016-03-08 15:03                   ` Vlastimil Babka
2016-03-08 15:03                     ` Vlastimil Babka
2016-03-09 11:11               ` Michal Hocko
2016-03-09 11:11                 ` Michal Hocko
2016-03-09 14:07                 ` Vlastimil Babka
2016-03-09 14:07                   ` Vlastimil Babka
2016-03-11 12:17                 ` Hugh Dickins
2016-03-11 12:17                   ` Hugh Dickins
2016-03-11 13:06                   ` Michal Hocko
2016-03-11 13:06                     ` Michal Hocko
2016-03-11 19:08                     ` Hugh Dickins
2016-03-11 19:08                       ` Hugh Dickins
2016-03-14 16:21                       ` Michal Hocko
2016-03-14 16:21                         ` Michal Hocko
2016-03-08 15:19           ` [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4) Joonsoo Kim
2016-03-08 15:19             ` Joonsoo Kim
2016-03-08 16:05             ` Michal Hocko
2016-03-08 16:05               ` Michal Hocko
2016-03-08 17:03               ` Joonsoo Kim
2016-03-08 17:03                 ` Joonsoo Kim
2016-03-09 10:41                 ` Michal Hocko
2016-03-09 10:41                   ` Michal Hocko
2016-03-11 14:53                   ` Joonsoo Kim
2016-03-11 14:53                     ` Joonsoo Kim
2016-03-11 15:20                     ` Michal Hocko
2016-03-11 15:20                       ` Michal Hocko
2016-02-29 20:35     ` [PATCH 0/3] OOM detection rework v4 Michal Hocko
2016-03-01  7:29       ` Hugh Dickins
2016-03-01  7:29         ` Hugh Dickins
2016-03-01 13:38         ` Michal Hocko
2016-03-01 13:38           ` Michal Hocko
2016-03-01 14:40           ` Michal Hocko
2016-03-01 14:40             ` Michal Hocko
2016-03-01 18:14           ` Vlastimil Babka
2016-03-01 18:14             ` Vlastimil Babka
2016-03-02  2:55             ` Joonsoo Kim
2016-03-02  2:55               ` Joonsoo Kim
2016-03-02 12:37               ` Michal Hocko
2016-03-02 12:37                 ` Michal Hocko
2016-03-02 14:06                 ` Joonsoo Kim
2016-03-02 14:06                   ` Joonsoo Kim
2016-03-02 12:24             ` Michal Hocko
2016-03-02 13:00               ` Michal Hocko
2016-03-02 13:22               ` Vlastimil Babka
2016-03-02 13:22                 ` Vlastimil Babka
2016-03-02  2:28           ` Joonsoo Kim
2016-03-02  2:28             ` Joonsoo Kim
2016-03-02 12:39             ` Michal Hocko
2016-03-02 12:39               ` Michal Hocko
2016-03-03  9:54           ` Hugh Dickins
2016-03-03 12:32             ` Michal Hocko
2016-03-03 12:32               ` Michal Hocko
2016-03-03 20:57               ` Hugh Dickins
2016-03-03 20:57                 ` Hugh Dickins
2016-03-04  7:41                 ` Vlastimil Babka
2016-03-04  7:41                   ` Vlastimil Babka
2016-03-04  7:53             ` Joonsoo Kim
2016-03-04  7:53               ` Joonsoo Kim
2016-03-04 12:28             ` Michal Hocko
2016-03-04 12:28               ` Michal Hocko
2016-03-11 10:45 ` Tetsuo Handa
2016-03-11 10:45   ` Tetsuo Handa
2016-03-11 13:08   ` Michal Hocko
2016-03-11 13:08     ` Michal Hocko
2016-03-11 13:32     ` Tetsuo Handa
2016-03-11 13:32       ` Tetsuo Handa
2016-03-11 15:28       ` Michal Hocko
2016-03-11 15:28         ` Michal Hocko
2016-03-11 16:49         ` Tetsuo Handa
2016-03-11 16:49           ` Tetsuo Handa
2016-03-11 17:00           ` Michal Hocko
2016-03-11 17:00             ` Michal Hocko
2016-03-11 17:20             ` Tetsuo Handa
2016-03-11 17:20               ` Tetsuo Handa
2016-03-12  4:08               ` Tetsuo Handa
2016-03-12  4:08                 ` Tetsuo Handa
2016-03-13 14:41                 ` Tetsuo Handa
2016-03-13 14:41                   ` Tetsuo Handa
