* [PATCH 0/3] OOM detection rework v4
@ 2015-12-15 18:19 ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

Hi,

This is v4 of the series. The previous version was posted [1]. I have
dropped the RFC tag because this has been sitting and waiting for
fundamental objections for quite some time and there were none. I still
do not think we should rush this; it should be merged no sooner than
4.6. Having it in mmotm, and thus linux-next, would give it much larger
testing coverage. I will iron out issues as they come, but hopefully
there will be no serious ones.

* Changes since v3
- factor out the new heuristic into its own function as suggested by
  Johannes (no functional changes)
* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
  wait_iff_congested related changes which needed refresh in
  patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
  the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- fixed a theoretical bug where a high order __GFP_NOFAIL allocation
  might fail

As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
communicate reclaim progress is rather dubious. I tend to agree: not
only is it really obscure, it is also not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
making any progress, because their gfp_mask wouldn't allow them to get
that page anyway (e.g. a single GFP_ATOMIC alloc-and-free loop). This is
rare enough that it doesn't happen in practice, but the current logic is
obscure, hard to follow, and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow, because each reclaimer basically tracks its own
progress, and that tracking is implemented at the page allocator layer
rather than spread out between the allocator and the reclaim code. More
on the implementation is described in the first patch.
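
The per-reclaimer bookkeeping can be sketched as a small C model. This
is my own condensation, not the kernel code: the constant and the
reset-on-progress rule come from patch 1, while the helper name is
hypothetical.

```c
#include <stdbool.h>

/* Mirrors the bookkeeping added to __alloc_pages_slowpath in patch 1:
 * any reclaim progress resets the counter, and only after
 * MAX_RECLAIM_RETRIES consecutive no-progress rounds does the
 * allocator stop retrying and fall through to the OOM path.
 */
#define MAX_RECLAIM_RETRIES 16

static bool should_keep_retrying(int *no_progress_loops,
                                 unsigned long did_some_progress)
{
	if (did_some_progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	/* same cut-off as should_reclaim_retry() in the patch */
	return *no_progress_loops <= MAX_RECLAIM_RETRIES;
}
```

Sixteen fruitless rounds in a row are still retried; the seventeenth
converges to OOM, while a single successful round resets the count.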

I have tested several different scenarios, but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually only a tiny gap between almost-OOM and full-blown OOM, and it is
often time sensitive. Anyway, I have tested the following 3 scenarios
and I would appreciate suggestions for more tests.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs,
without any swap, to make the OOM behavior more deterministic.

1) 2 writers (each doing dd with 4M blocks to a 1G xfs partition,
   removing the files and starting over again) running in parallel for
   10s to build up a lot of dirty pages, then 100 parallel mem_eaters
   (anon private populated mmap which waits until it gets a signal)
   with 80M each.

   This causes an OOM flood, of course, and I have compared patched and
   unpatched kernels. The test is considered finished once no OOM
   conditions are detected anymore. This should tell us whether there
   are any excessive kills or premature ones:
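
For reference, the allocation core of a mem_eater along the lines
described above (anon private populated mmap) might look like the
following. This is a minimal reconstruction, not the actual test
program; the helper name and the page-size assumption are mine.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_POPULATE		/* Linux-specific; harmless no-op elsewhere */
#define MAP_POPULATE 0
#endif

/* Map `bytes` of private anonymous memory and fault it all in.
 * MAP_POPULATE pre-faults the pages; the explicit touch loop makes
 * sure the anon memory really is resident even where MAP_POPULATE is
 * only a hint. Returns the number of bytes made resident, 0 on error.
 */
static size_t eat_memory(size_t bytes)
{
	char *buf = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	for (size_t off = 0; off < bytes; off += 4096)
		buf[off] = 1;

	return bytes;
}
```

A main() would call eat_memory(80 << 20) and then pause() until
signalled, matching the 100 x 80M eaters in the scenario.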

I have performed two runs this time, each after a fresh boot.

* base kernel
$ grep "Killed process" base-oom-run1.log | tail -n1
[  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
$ grep "Killed process" base-oom-run2.log | tail -n1
[  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB

$ grep "invoked oom-killer" base-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" base-oom-run2.log | wc -l
76

The number of OOM invocations is consistent with my last measurements,
but the runtime is way too different (it took 800+s). One thing that
could have skewed the results was that I was running tail -f on the
serial log on the host system to watch the progress; I have stopped
doing that. The results are more consistent now but still too different
from last time. This is really weird, so I've retested with the last
4.2 mmotm again and I am getting a consistent ~220s, which is really
close to the above. If I apply the WQ vmstat patch on top I get close
to 160s, so the stale vmstat counters made a difference, which is to be
expected. I have a new SSD in my laptop which might have made a
difference, but I wouldn't expect it to be that large.

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
4
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
1

* patched kernel
$ grep "Killed process" patched-oom-run1.log | tail -n1
[  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
$ grep "Killed process" patched-oom-run2.log | tail -n1
[  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

$ grep "invoked oom-killer" patched-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" patched-oom-run2.log | wc -l
77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
0

So the number of OOM killer invocations is the same, but the overall
runtime of the test was much longer with the patched kernel. This can
be attributed to more retries in general. The results from the base
kernel are quite inconsistent, and I think the consistency is better
here.


2) 2 writers again, running for 10s, and then 10 mem_eaters to consume as
   much memory as possible without triggering the OOM killer. This
   required a lot of tuning, but I've considered 3 consecutive runs
   without OOM a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(15*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(9*1024)}' /proc/meminfo)

It was -14M for the base 4.2 kernel and -7500M for the patched 4.2
kernel in my last measurements.
The patched kernel handled the low memory conditions better and fired
the OOM killer later.

3) Costly high-order allocations with a limited amount of memory.
   Start 10 memeaters in parallel, each with
   size=$(awk '/MemTotal/{printf "%d\n", $2/10}' /proc/meminfo)
   This will trigger the OOM killer, which will kill one of them and
   free up 200M; then try to use all the remaining space for hugetlb
   pages. See how many of them pass, kill everything, wait 2s and try
   again.
   This tests whether we do not fail __GFP_REPEAT costly allocations
   too early now.
* base kernel
$ sort base-hugepages.log | uniq -c
      1 64
     13 65
      6 66
     20 Trying to allocate 73

* patched kernel
$ sort patched-hugepages.log | uniq -c
     17 65
      3 66
     20 Trying to allocate 73

This also doesn't look bad, but this particular test is quite timing
sensitive.

The above results do seem optimistic, but more loads should obviously
be tested. I would really appreciate feedback on the approach I have
chosen before I go into more tuning. Is this a viable way to go?

[1] http://lkml.kernel.org/r/1448974607-10208-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[3] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com



* [PATCH 1/3] mm, oom: rework oom detection
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath has traditionally relied on direct reclaim and
did_some_progress as an indicator that it makes sense to retry the
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress, to prevent
a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several
times and restarting if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps the
allocator looping while the current task is not able to allocate that
single page. The OOM killer would be more appropriate than looping
without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of
shrink_zone, which sits at too low a level for high level decisions
such as OOM, which is a per-zonelist property. It is
__alloc_pages_slowpath which knows how many attempts have been made and
what progress has been achieved so far, so it is the more appropriate
place to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because writeback might get
stuck and reclaimable pages might be pinned for a really long time, or
might even depend on the current allocation context. Therefore there is
a feedback mechanism which reduces the reclaim target after each
reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target, fail the
watermark check, and proceed to OOM. The backoff is simple and linear,
with 1/16 of the reclaimable pages for each round without any progress.
We are optimistic and reset the counter on successful reclaim rounds.
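
The linear backoff can be checked numerically with a small sketch of
the target calculation. The constants and the DIV_ROUND_UP formula
match should_reclaim_retry() in the diff below; the helper name is
mine.

```c
#define MAX_RECLAIM_RETRIES 16
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Estimate how much memory a retry may still hope to get back:
 * start from the reclaimable pages, shave off no_progress_loops/16
 * of them, and add the free pages. After 16 fruitless rounds the
 * reclaimable part reaches zero, so only the free pages remain and
 * the watermark check eventually fails, letting OOM proceed.
 */
static unsigned long retry_target(unsigned long reclaimable,
                                  unsigned long free_pages,
                                  int no_progress_loops)
{
	unsigned long available = reclaimable;

	available -= DIV_ROUND_UP(no_progress_loops * available,
				  MAX_RECLAIM_RETRIES);
	available += free_pages;

	return available;
}
```

With 1600 reclaimable and 100 free pages, the target shrinks from 1700
at round 0 to just the free pages by round 16.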

Costly high order requests mostly preserve their semantics: those
without __GFP_REPEAT fail right away, while those which have the flag
set back off once the number of reclaimed pages reaches the equivalent
of the requested order. The only difference is that if there was no
progress during the reclaim, we rely on the zone watermark check. This
is a more logical thing to do than the previous 1<<order attempts,
which were a result of zone_reclaimable faking the progress.
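
The costly-order rules above condense into a small decision helper.
This is a sketch: the conditions mirror the patch, but the stand-in
flag bit, the enum, and the function name are mine.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER	3
#define GFP_REPEAT_BIT		0x1	/* stand-in for __GFP_REPEAT */

enum retry_decision {
	RETRY_FAIL,		/* give up, take the OOM path */
	RETRY_NOW,		/* reclaim made progress, retry */
	RETRY_CHECK_WMARK,	/* fall through to the watermark check */
};

/* Mirrors the costly high-order branch of should_reclaim_retry(). */
static enum retry_decision
costly_order_decision(unsigned int gfp_flags, unsigned int order,
                      unsigned long pages_reclaimed, bool did_some_progress)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		if (!(gfp_flags & GFP_REPEAT_BIT) ||
		    pages_reclaimed >= (1UL << order))
			return RETRY_FAIL;

		if (did_some_progress)
			return RETRY_NOW;
	}

	return RETRY_CHECK_WMARK;
}
```

For example, an order-9 request without the repeat flag fails
immediately, while one with the flag retries until 512 pages have been
reclaimed or reclaim stalls, at which point the watermark check
decides.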

[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

factor out the retry logic into a separate function - per Johannes
---
 include/linux/swap.h |  1 +
 mm/page_alloc.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c          | 25 +++------------
 3 files changed, 88 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..738ae2206635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e267faad4649..f77e283fb8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Maximum number of reclaim retries without any progress before the OOM
+ * killer is considered the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+		     struct alloc_context *ac, int alloc_flags,
+		     bool did_some_progress, unsigned long pages_reclaimed,
+		     int no_progress_loops)
+{
+	struct zone *zone;
+	struct zoneref *z;
+
+	/*
+	 * Make sure we converge to OOM if we cannot make any progress
+	 * several times in a row.
+	 */
+	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+		return false;
+
+	/* Do not retry high order allocations unless they are __GFP_REPEAT */
+	if (order > PAGE_ALLOC_COSTLY_ORDER) {
+		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
+			return false;
+
+		if (did_some_progress)
+			return true;
+	}
+
+	/*
+	 * Keep reclaiming pages while there is a chance this will lead somewhere.
+	 * If none of the target zones can satisfy our allocation request even
+	 * if all reclaimable pages are considered then we are screwed and have
+	 * to go OOM.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
+		unsigned long available;
+
+		available = zone_reclaimable_pages(zone);
+		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/*
+		 * Would the allocation succeed if we reclaimed the whole available?
+		 */
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac->high_zoneidx, alloc_flags, available)) {
+			/* Wait for some write requests to complete then retry */
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2996,6 +3065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int no_progress_loops = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3155,23 +3225,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	/* Keep reclaiming pages as long as there is reasonable progress */
-	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto retry;
+	if (did_some_progress) {
+		no_progress_loops = 0;
+		pages_reclaimed += did_some_progress;
+	} else {
+		no_progress_loops++;
 	}
 
+	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+				 did_some_progress > 0, pages_reclaimed,
+				 no_progress_loops))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		no_progress_loops = 0;
 		goto retry;
+	}
 
 noretry:
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..489212252cd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,7 +192,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
@@ -2516,10 +2516,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2527,7 +2525,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
-	bool reclaimable = false;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
-			if (nr_soft_reclaimed)
-				reclaimable = true;
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
-			reclaimable = true;
-
-		if (global_reclaim(sc) &&
-		    !reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		shrink_zone(zone, sc, zone_idx(zone));
 	}
 
 	/*
@@ -2610,8 +2600,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * promoted it to __GFP_HIGHMEM.
 	 */
 	sc->gfp_mask = orig_mask;
-
-	return reclaimable;
 }
 
 /*
@@ -2636,7 +2624,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
-	bool zones_reclaimable;
 retry:
 	delayacct_freepages_start();
 
@@ -2647,7 +2634,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		zones_reclaimable = shrink_zones(zonelist, sc);
+		shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2694,10 +2681,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		goto retry;
 	}
 
-	/* Any of the zones still reclaimable?  Don't OOM. */
-	if (zones_reclaimable)
-		return 1;
-
 	return 0;
 }
 
-- 
2.6.2


^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 1/3] mm, oom: rework oom detection
@ 2015-12-15 18:19   ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent
from a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several
times and restart if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page. OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as OOM
which is per zonelist property. It is __alloc_pages_slowpath which knows
how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
feedback mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order. The only difference is that if there was no progress
during the reclaim we rely on zone watermark check. This is more logical
thing to do than previous 1<<order attempts which were a result of
zone_reclaimable faking the progress.

[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>

factor out the retry logic into separate function - per Johannes
---
 include/linux/swap.h |  1 +
 mm/page_alloc.c      | 91 +++++++++++++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c          | 25 +++------------
 3 files changed, 88 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..738ae2206635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 						struct vm_area_struct *vma);
 
 /* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e267faad4649..f77e283fb8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+		     struct alloc_context *ac, int alloc_flags,
+		     bool did_some_progress, unsigned long pages_reclaimed,
+		     int no_progress_loops)
+{
+	struct zone *zone;
+	struct zoneref *z;
+
+	/*
+	 * Make sure we converge to OOM if we cannot make any progress
+	 * several times in a row.
+	 */
+	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+		return false;
+
+	/* Do not retry high order allocations unless they are __GFP_REPEAT */
+	if (order > PAGE_ALLOC_COSTLY_ORDER) {
+		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
+			return false;
+
+		if (did_some_progress)
+			return true;
+	}
+
+	/*
+	 * Keep reclaiming pages while there is a chance this will lead somewhere.
+	 * If none of the target zones can satisfy our allocation request even
+	 * if all reclaimable pages are considered then we are screwed and have
+	 * to go OOM.
+	 */
+	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
+		unsigned long available;
+
+		available = zone_reclaimable_pages(zone);
+		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
+		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+		/*
+		 * Would the allocation succeed if we reclaimed the whole available?
+		 */
+		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+				ac->high_zoneidx, alloc_flags, available)) {
+			/* Wait for some write requests to complete then retry */
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			return true;
+		}
+	}
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -2996,6 +3065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	int no_progress_loops = 0;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3155,23 +3225,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	/* Keep reclaiming pages as long as there is reasonable progress */
-	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
-		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
-		goto retry;
+	if (did_some_progress) {
+		no_progress_loops = 0;
+		pages_reclaimed += did_some_progress;
+	} else {
+		no_progress_loops++;
 	}
 
+	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+				 did_some_progress > 0, pages_reclaimed,
+				 no_progress_loops))
+		goto retry;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		no_progress_loops = 0;
 		goto retry;
+	}
 
 noretry:
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..489212252cd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,7 +192,7 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
@@ -2516,10 +2516,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2527,7 +2525,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
 	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
-	bool reclaimable = false;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
-			if (nr_soft_reclaimed)
-				reclaimable = true;
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
-			reclaimable = true;
-
-		if (global_reclaim(sc) &&
-		    !reclaimable && zone_reclaimable(zone))
-			reclaimable = true;
+		shrink_zone(zone, sc, zone_idx(zone));
 	}
 
 	/*
@@ -2610,8 +2600,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	 * promoted it to __GFP_HIGHMEM.
 	 */
 	sc->gfp_mask = orig_mask;
-
-	return reclaimable;
 }
 
 /*
@@ -2636,7 +2624,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
-	bool zones_reclaimable;
 retry:
 	delayacct_freepages_start();
 
@@ -2647,7 +2634,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		zones_reclaimable = shrink_zones(zonelist, sc);
+		shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2694,10 +2681,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		goto retry;
 	}
 
-	/* Any of the zones still reclaimable?  Don't OOM. */
-	if (zones_reclaimable)
-		return 1;
-
 	return 0;
 }
 
-- 
2.6.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wait_iff_congested has been used to throttle the allocator before it retried
another round of direct reclaim, to allow the writeback to make some
progress and to prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback while the bdi is not congested,
so the call ends up being just a cond_resched and can end up triggering
a premature OOM.

This patch replaces the unconditional wait_iff_congested with
congestion_wait, which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages
amount to more than half of the reclaimable pages on the zone which might
be usable for our target allocation. This shouldn't reintroduce the stalls
fixed by 0e093d99763e because congestion_wait is called only when we are
getting hopeless and sleeping is a better choice than OOM with many
pages under IO.

We have to preserve the logic introduced by "mm, vmstat: allow WQ concurrency
to discover memory reclaim doesn't make any progress" in the
__alloc_pages_slowpath now that wait_iff_congested is not used anymore.
As the only remaining user of wait_iff_congested is shrink_inactive_list,
we can remove the WQ specific short sleep from wait_iff_congested
because the sleep only needs to be done once in the allocation retry
cycle.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/backing-dev.c | 19 +++----------------
 mm/page_alloc.c  | 36 +++++++++++++++++++++++++++++++++---
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7340353f8aea..d2473ce9cc57 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -980,19 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
 
-		/*
-		 * Memory allocation/reclaim might be called from a WQ
-		 * context and the current implementation of the WQ
-		 * concurrency control doesn't recognize that a particular
-		 * WQ is congested if the worker thread is looping without
-		 * ever sleeping. Therefore we have to do a short sleep
-		 * here rather than calling cond_resched().
-		 */
-		if (current->flags & PF_WQ_WORKER)
-			schedule_timeout(1);
-		else
-			cond_resched();
-
+		cond_resched();
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
 		if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f77e283fb8c6..b2de8c8761ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3034,8 +3034,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
 		unsigned long available;
+		unsigned long reclaimable;
 
-		available = zone_reclaimable_pages(zone);
+		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
@@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, available)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback;
+			unsigned long dirty;
+
+			writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
+			dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent premature OOM
+			 */
+			if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				return true;
+			}
+
+			/*
+			 * Memory allocation/reclaim might be called from a WQ
+			 * context and the current implementation of the WQ
+			 * concurrency control doesn't recognize that
+			 * a particular WQ is congested if the worker thread is
+			 * looping without ever sleeping. Therefore we have to
+			 * do a short sleep here rather than calling
+			 * cond_resched().
+			 */
+			if (current->flags & PF_WQ_WORKER)
+				schedule_timeout(1);
+			else
+				cond_resched();
+
 			return true;
 		}
 	}
-- 
2.6.2




* [PATCH 3/3] mm: use watermark checks for __GFP_REPEAT high order allocations
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-15 18:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-15 18:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_slowpath retries costly allocations until at least
2^order pages have been reclaimed, or, if the reclaim hasn't made any
progress, until the watermark check for at least one zone would succeed
after reclaiming all remaining reclaimable pages.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim
was removed quite some time ago so the assumption doesn't hold
anymore. It would be more appropriate to check the compaction progress
instead, but this patch simply removes the check and relies solely
on the watermark check.

To prevent too many retries, no_progress_loops is not reset after
a reclaim round which made progress, because we cannot assume it helped the
high order situation. Only costly allocation requests depended on
pages_reclaimed, so we can drop it.

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2de8c8761ad..268de1654128 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2994,17 +2994,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress, unsigned long pages_reclaimed,
+		     bool did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3018,13 +3018,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
-		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
-			return false;
-
-		if (did_some_progress)
-			return true;
-	}
+	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+		return false;
 
 	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
@@ -3090,7 +3085,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
-	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
@@ -3255,16 +3249,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (gfp_mask & __GFP_NORETRY)
 		goto noretry;
 
-	if (did_some_progress) {
+	/*
+	 * Costly allocations might have made progress, but due to high
+	 * fragmentation this doesn't mean a page of their order will become
+	 * available, so do not reset the no-progress counter for them
+	 */
+	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
 		no_progress_loops = 0;
-		pages_reclaimed += did_some_progress;
-	} else {
+	else
 		no_progress_loops++;
-	}
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, pages_reclaimed,
-				 no_progress_loops))
+				 did_some_progress > 0, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
-- 
2.6.2




* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-16 23:35   ` Andrew Morton
  -1 siblings, 0 replies; 299+ messages in thread
From: Andrew Morton @ 2015-12-16 23:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress which is implemented at the page allocator layer rather spread
> out between the allocator and the reclaim. The more on the implementation
> is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well.  This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything).  It's not so easy in the case of
oom-too-late-or-never.  The reporter's machine just hangs or it
twiddles thumbs for five minutes then goes oom.  But there are things
we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
  start going wrong and turns on diagnostics (this would need an enable
  knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
  the diagnostics.

- sprinkle tracepoints everywhere and provide a set of
  instructions/scripts so that people who know nothing about kernel
  internals or tracing can easily gather the info we need to understand
  issues.

- add a sysrq key to turn on diagnostics.  Pretty essential when the
  machine is comatose and doesn't respond to keystrokes.

- something else

So...  please have a think about it?  What can we add in here to make it
as easy as possible for us (ie: you ;)) to get this code working well? 
At this time, too much developer support code will be better than too
little.  We can take it out later on.



* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-16 23:58   ` Andrew Morton
  -1 siblings, 0 replies; 299+ messages in thread
From: Andrew Morton @ 2015-12-16 23:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> 
> ...
>
> * base kernel
> $ grep "Killed process" base-oom-run1.log | tail -n1
> [  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> $ grep "Killed process" base-oom-run2.log | tail -n1
> [  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
> 
> $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> 76
> 
> The number of OOM invocations is consistent with my last measurements
> but the runtime is way too different (it took 800+s).

I'm seeing 211 seconds vs 157 seconds?  If so, that's not toooo bad.  I
assume the 800+s is sum-across-multiple-CPUs?  Given that all the CPUs
are pounding away at the same data and the same disk, that doesn't
sound like very interesting info - the overall elapsed time is the
thing to look at in this case.
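
The overall elapsed time Andrew refers to can be read straight off the quoted log timestamps: the kernel timestamp on the last "Killed process" line approximates when the OOM situation settled. A minimal sketch, assuming the log format shown in the grep output in this thread:

```python
import re

def settle_time(log_lines):
    """Return the kernel timestamp (in seconds) of the last OOM kill,
    or None if no kill was logged. Format assumed from this thread."""
    t = None
    for line in log_lines:
        m = re.match(r'\[\s*([0-9.]+)\] Killed process', line)
        if m:
            t = float(m.group(1))
    return t

log = ['[  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, '
       'anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB']
print(settle_time(log))  # 211.824379
```

Comparing this per-run value is one way to quantify how consistent the time-to-settle is across runs of the same load.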

> One thing that
> could have skewed results was that I was tail -f the serial log on the
> host system to see the progress. I have stopped doing that. The results
> are more consistent now but still too different from the last time.
> This is really weird so I've retested with the last 4.2 mmotm again and
> I am getting consistent ~220s which is really close to the above. If I
> apply the WQ vmstat patch on top I am getting close to 160s so the stale
> vmstat counters made a difference which is to be expected. I have a new
> SSD in my laptop which might have made a difference but I wouldn't expect
> it to be that large.
> 
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
> 4
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
> 1
> 
> * patched kernel
> $ grep "Killed process" patched-oom-run1.log | tail -n1
> [  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> $ grep "Killed process" patched-oom-run2.log | tail -n1
> [  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

Even better.

> $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> 77
> 
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> 1
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> 0
> 
> So the number of OOM killer invocations is the same but the overall
> runtime of the test was much longer with the patched kernel. This can be
> attributed to more retries in general. The results from the base kernel
> are quite inconsistent and I think that consistency is better here.

It's hard to say how long declaration of oom should take.  Correctness
comes first.  But what is "correct"?  oom isn't a binary condition -
there's a chance that if we keep churning away for another 5 minutes
we'll be able to satisfy this allocation (but probably not the next
one).  There are tradeoffs between promptness-of-declaring-oom and
exhaustiveness-in-avoiding-it.

> 
> 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
>    memory as possible without triggering the OOM killer. This required a lot
>    of tuning but I've considered 3 consecutive runs without OOM as a success.

"a lot of tuning" sounds bad.  It means that the tuning settings you
have now for a particular workload on a particular machine will be
wrong for other workloads and machines.  uh-oh.

> ...

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-16 23:35   ` Andrew Morton
@ 2015-12-18 12:12     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-18 12:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 16-12-15 15:35:13, Andrew Morton wrote:
[...]
> So...  please have a think about it?  What can we add in here to make it
> as easy as possible for us (ie: you ;)) to get this code working well? 
> At this time, too much developer support code will be better than too
> little.  We can take it out later on.

Sure. I will think about this and get back to it early next year. I will
be mostly offline starting next week.

Thanks for looking into this!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-16 23:58   ` Andrew Morton
@ 2015-12-18 13:15     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-18 13:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > 
> > ...
> >
> > * base kernel
> > $ grep "Killed process" base-oom-run1.log | tail -n1
> > [  211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> > $ grep "Killed process" base-oom-run2.log | tail -n1
> > [  157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
> > 
> > $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> > 76
> > 
> > The number of OOM invocations is consistent with my last measurements
> > but the runtime is way too different (it took 800+s).
> 
> I'm seeing 211 seconds vs 157 seconds?  If so, that's not toooo bad.  I
> assume the 800+s is sum-across-multiple-CPUs?

This is the time until the oom situation settled down. And I really
suspect that the new SSD made a difference here.

> Given that all the CPUs
> are pounding away at the same data and the same disk, that doesn't
> sound like very interesting info - the overall elapsed time is the
> thing to look at in this case.

Which is what I was looking at when checking the timestamp in the log.

[...]
> > * patched kernel
> > $ grep "Killed process" patched-oom-run1.log | tail -n1
> > [  341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> > $ grep "Killed process" patched-oom-run2.log | tail -n1
> > [  349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB
> 
> Even better.
> 
> > $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> > 77
> > 
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> > 1
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> > 0
> > 
> > So the number of OOM killer invocations is the same but the overall
> > runtime of the test was much longer with the patched kernel. This can be
> > attributed to more retries in general. The results from the base kernel
> > are quite inconsistent and I think that consistency is better here.
> 
> It's hard to say how long declaration of oom should take.  Correctness
> comes first.  But what is "correct"?  oom isn't a binary condition -
> there's a chance that if we keep churning away for another 5 minutes
> we'll be able to satisfy this allocation (but probably not the next
> one).  There are tradeoffs between promptness-of-declaring-oom and
> exhaustiveness-in-avoiding-it.

Yes, this is really hard to tell. What I wanted to achieve here is
determinism - the same load should give comparable results. It seems
that there is an improvement in this regard. The time to settle is
much more consistent than with the original implementation.
 
> > 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
> >    memory as possible without triggering the OOM killer. This required a lot
> >    of tuning but I've considered 3 consecutive runs without OOM as a success.
> 
> "a lot of tuning" sounds bad.  It means that the tuning settings you
> have now for a particular workload on a particular machine will be
> wrong for other workloads and machines.  uh-oh.

Well, I had to tune the test to see how close to the edge I can get. I
haven't made any decisions based on this test.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-18 13:15     ` Michal Hocko
@ 2015-12-18 16:35       ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2015-12-18 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Fri, Dec 18, 2015 at 02:15:09PM +0100, Michal Hocko wrote:
> On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> > It's hard to say how long declaration of oom should take.  Correctness
> > comes first.  But what is "correct"?  oom isn't a binary condition -
> > there's a chance that if we keep churning away for another 5 minutes
> > we'll be able to satisfy this allocation (but probably not the next
> > one).  There are tradeoffs between promptness-of-declaring-oom and
> > exhaustiveness-in-avoiding-it.
> 
> Yes, this is really hard to tell. What I wanted to achieve here is
> determinism - the same load should give comparable results. It seems
> that there is an improvement in this regard. The time to settle is
> much more consistent than with the original implementation.

+1

Before that we couldn't even really make a meaningful statement about
how long we are going to try - "as long as reclaim thinks it can maybe
do some more, depending on heuristics". I think the best thing we can
strive for with OOM is to make the rules simple and predictable.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2015-12-24 12:41   ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-24 12:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

I got OOM killers while running heavy disk I/O (extracting kernel source,
running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
Do you think these OOM killers are reasonable? Or is the rework too weak against fragmentation?
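
The fragmentation suspicion can be checked against the per-order free lists in the Mem-Info dumps below (the "Node 0 DMA32: 3788*4kB ... 0*16kB ..." lines): memory is free, but none of it sits in contiguous blocks of 16kB or more, so an order=2 request has nothing to grab without compaction or further reclaim. A small parser sketch for that line format (the function name is illustrative, not from the thread):

```python
import re

def free_blocks(buddy_line):
    """Parse a 'Node N ZONE: count*sizekB (flags) ...' free-list line
    from an OOM report into {block_size_kB: free_block_count}."""
    return {int(size): int(count)
            for count, size in re.findall(r'(\d+)\*(\d+)kB', buddy_line)}

line = ("Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB "
        "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB")
blocks = free_blocks(line)
print(sum(size * count for size, count in blocks.items()))  # 16624, matching the kernel's own total
print(blocks.get(16, 0))  # 0 -> no block large enough for an order-2 (16kB) allocation
```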

[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.432780] kthreadd cpuset=/ mems_allowed=0
[ 3902.433904] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 3902.435463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3902.437541]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 3902.439129]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 3902.440779]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 3902.442436] Call Trace:
[ 3902.443094]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 3902.444188]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 3902.445301]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3902.446656]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 3902.447881]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 3902.449093]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 3902.450266]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 3902.451430]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 3902.452757]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 3902.454468]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 3902.455756]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 3902.457076]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 3902.458396]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 3902.459480]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 3902.460775]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 3902.461894]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 3902.463035]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 3902.464230]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.465502]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 3902.466648]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.467953] Mem-Info:
[ 3902.468537] active_anon:20817 inactive_anon:2098 isolated_anon:0
[ 3902.468537]  active_file:145434 inactive_file:145453 isolated_file:0
[ 3902.468537]  unevictable:0 dirty:20613 writeback:7248 unstable:0
[ 3902.468537]  slab_reclaimable:86363 slab_unreclaimable:14905
[ 3902.468537]  mapped:6670 shmem:2167 pagetables:1497 bounce:0
[ 3902.468537]  free:5422 free_pcp:75 free_cma:0
[ 3902.476541] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3268kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:36kB shmem:216kB slab_reclaimable:3708kB slab_unreclaimable:456kB kernel_stack:48kB pagetables:160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3902.486494] lowmem_reserve[]: 0 1714 1714 1714
[ 3902.487659] Node 0 DMA32 free:13760kB min:5172kB low:6464kB high:7756kB active_anon:80000kB inactive_anon:8192kB active_file:581780kB inactive_file:581848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:82312kB writeback:29588kB mapped:26648kB shmem:8452kB slab_reclaimable:341744kB slab_unreclaimable:59496kB kernel_stack:3456kB pagetables:5828kB unstable:0kB bounce:0kB free_pcp:732kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:560 all_unreclaimable? no
[ 3902.500438] lowmem_reserve[]: 0 0 0 0
[ 3902.502373] Node 0 DMA: 42*4kB (UME) 84*8kB (UM) 57*16kB (UM) 15*32kB (UM) 11*64kB (M) 9*128kB (UME) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 0*2048kB 0*4096kB = 6904kB
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 3902.511236] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3902.513938] 292144 total pagecache pages
[ 3902.515609] 0 pages in swap cache
[ 3902.517139] Swap cache stats: add 0, delete 0, find 0/0
[ 3902.519153] Free swap  = 0kB
[ 3902.520587] Total swap = 0kB
[ 3902.522095] 524157 pages RAM
[ 3902.523511] 0 pages HighMem/MovableOnly
[ 3902.525091] 80441 pages reserved
[ 3902.526580] 0 pages hwpoisoned
[ 3902.528169] Out of memory: Kill process 687 (firewalld) score 11 or sacrifice child
[ 3902.531017] Killed process 687 (firewalld) total-vm:323600kB, anon-rss:17032kB, file-rss:4896kB, shmem-rss:0kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.903629] smbd cpuset=/ mems_allowed=0
[ 5262.904725] CPU: 2 PID: 3935 Comm: smbd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5262.906401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5262.908679]  0000000000000000 00000000eaa24b41 ffff88007c37faf8 ffffffff81395bc3
[ 5262.910459]  0000000000000000 ffff88007c37fb98 ffffffff811babac 0000000000000206
[ 5262.912224]  ffffffff81810470 ffff88007c37fb38 ffffffff810bce29 0000000000000206
[ 5262.914019] Call Trace:
[ 5262.914839]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5262.916118]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5262.917493]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5262.919131]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5262.920690]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5262.922204]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5262.923863]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5262.925386]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5262.927121]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5262.928738]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5262.930438]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.932110]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5262.933410]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.935016]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 5262.936632]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 5262.938383]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 5262.940024]  [<ffffffff8106e5a4>] SyS_clone+0x14/0x20
[ 5262.941465]  [<ffffffff816fc532>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 5262.943137] Mem-Info:
[ 5262.944068] active_anon:37901 inactive_anon:2095 isolated_anon:0
[ 5262.944068]  active_file:134812 inactive_file:135474 isolated_file:0
[ 5262.944068]  unevictable:0 dirty:257 writeback:0 unstable:0
[ 5262.944068]  slab_reclaimable:90770 slab_unreclaimable:12759
[ 5262.944068]  mapped:4223 shmem:2166 pagetables:1428 bounce:0
[ 5262.944068]  free:3738 free_pcp:49 free_cma:0
[ 5262.953176] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:900kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5262.963749] lowmem_reserve[]: 0 1714 1714 1714
[ 5262.965434] Node 0 DMA32 free:8048kB min:5172kB low:6464kB high:7756kB active_anon:150704kB inactive_anon:8180kB active_file:539244kB inactive_file:541892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:1028kB writeback:0kB mapped:16860kB shmem:8448kB slab_reclaimable:357524kB slab_unreclaimable:50324kB kernel_stack:3232kB pagetables:5560kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? no
[ 5262.976879] lowmem_reserve[]: 0 0 0 0
[ 5262.978586] Node 0 DMA: 58*4kB (UME) 60*8kB (UME) 73*16kB (UME) 23*32kB (UME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6904kB
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5262.987124] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5262.989532] 272459 total pagecache pages
[ 5262.991203] 0 pages in swap cache
[ 5262.992583] Swap cache stats: add 0, delete 0, find 0/0
[ 5262.994334] Free swap  = 0kB
[ 5262.995787] Total swap = 0kB
[ 5262.997038] 524157 pages RAM
[ 5262.998270] 0 pages HighMem/MovableOnly
[ 5262.999683] 80441 pages reserved
[ 5263.001153] 0 pages hwpoisoned
[ 5263.002612] Out of memory: Kill process 26226 (genxref) score 54 or sacrifice child
[ 5263.004648] Killed process 26226 (genxref) total-vm:130348kB, anon-rss:94680kB, file-rss:4756kB, shmem-rss:0kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.767289] kthreadd cpuset=/ mems_allowed=0
[ 5269.768904] CPU: 2 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5269.770956] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5269.773754]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 5269.776088]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 5269.778213]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 5269.780497] Call Trace:
[ 5269.781796]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5269.783634]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5269.786116]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5269.788495]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5269.790538]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5269.792755]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5269.794784]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5269.796848]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5269.799038]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 5269.801073]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5269.803186]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5269.805249]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 5269.807374]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5269.809089]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 5269.811146]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 5269.812944]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 5269.814698]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 5269.816330]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.818088]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 5269.819685]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.821399] Mem-Info:
[ 5269.822430] active_anon:14280 inactive_anon:2095 isolated_anon:0
[ 5269.822430]  active_file:134344 inactive_file:134515 isolated_file:0
[ 5269.822430]  unevictable:0 dirty:2 writeback:0 unstable:0
[ 5269.822430]  slab_reclaimable:96214 slab_unreclaimable:22185
[ 5269.822430]  mapped:3512 shmem:2166 pagetables:1368 bounce:0
[ 5269.822430]  free:12388 free_pcp:51 free_cma:0
[ 5269.831310] Node 0 DMA free:6892kB min:44kB low:52kB high:64kB active_anon:856kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:768kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.840580] lowmem_reserve[]: 0 1714 1714 1714
[ 5269.842107] Node 0 DMA32 free:42660kB min:5172kB low:6464kB high:7756kB active_anon:56264kB inactive_anon:8180kB active_file:537372kB inactive_file:538056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:8kB writeback:0kB mapped:14020kB shmem:8448kB slab_reclaimable:379300kB slab_unreclaimable:87972kB kernel_stack:3232kB pagetables:5320kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.852375] lowmem_reserve[]: 0 0 0 0
[ 5269.853784] Node 0 DMA: 67*4kB (ME) 60*8kB (UME) 72*16kB (ME) 22*32kB (ME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6892kB
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB
[ 5269.861551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5269.863676] 271012 total pagecache pages
[ 5269.865100] 0 pages in swap cache
[ 5269.866366] Swap cache stats: add 0, delete 0, find 0/0
[ 5269.867996] Free swap  = 0kB
[ 5269.869363] Total swap = 0kB
[ 5269.870593] 524157 pages RAM
[ 5269.871857] 0 pages HighMem/MovableOnly
[ 5269.873604] 80441 pages reserved
[ 5269.874937] 0 pages hwpoisoned
[ 5269.876207] Out of memory: Kill process 2710 (tuned) score 7 or sacrifice child
[ 5269.878265] Killed process 2710 (tuned) total-vm:553052kB, anon-rss:10596kB, file-rss:2776kB, shmem-rss:0kB

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2015-12-24 12:41   ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-24 12:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

I got OOM killers while running heavy disk I/O (extracting a kernel source tree,
running lxr's genxref command). (Environment: 4 CPUs / 2048MB RAM / no swap / XFS)
Do you think these OOM killers are reasonable? Is this too weak against fragmentation?

[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.432780] kthreadd cpuset=/ mems_allowed=0
[ 3902.433904] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 3902.435463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3902.437541]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 3902.439129]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 3902.440779]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 3902.442436] Call Trace:
[ 3902.443094]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 3902.444188]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 3902.445301]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3902.446656]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 3902.447881]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 3902.449093]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 3902.450266]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 3902.451430]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 3902.452757]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 3902.454468]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 3902.455756]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 3902.457076]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 3902.458396]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 3902.459480]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 3902.460775]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 3902.461894]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 3902.463035]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 3902.464230]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.465502]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 3902.466648]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.467953] Mem-Info:
[ 3902.468537] active_anon:20817 inactive_anon:2098 isolated_anon:0
[ 3902.468537]  active_file:145434 inactive_file:145453 isolated_file:0
[ 3902.468537]  unevictable:0 dirty:20613 writeback:7248 unstable:0
[ 3902.468537]  slab_reclaimable:86363 slab_unreclaimable:14905
[ 3902.468537]  mapped:6670 shmem:2167 pagetables:1497 bounce:0
[ 3902.468537]  free:5422 free_pcp:75 free_cma:0
[ 3902.476541] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3268kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:36kB shmem:216kB slab_reclaimable:3708kB slab_unreclaimable:456kB kernel_stack:48kB pagetables:160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3902.486494] lowmem_reserve[]: 0 1714 1714 1714
[ 3902.487659] Node 0 DMA32 free:13760kB min:5172kB low:6464kB high:7756kB active_anon:80000kB inactive_anon:8192kB active_file:581780kB inactive_file:581848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:82312kB writeback:29588kB mapped:26648kB shmem:8452kB slab_reclaimable:341744kB slab_unreclaimable:59496kB kernel_stack:3456kB pagetables:5828kB unstable:0kB bounce:0kB free_pcp:732kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:560 all_unreclaimable? no
[ 3902.500438] lowmem_reserve[]: 0 0 0 0
[ 3902.502373] Node 0 DMA: 42*4kB (UME) 84*8kB (UM) 57*16kB (UM) 15*32kB (UM) 11*64kB (M) 9*128kB (UME) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 0*2048kB 0*4096kB = 6904kB
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 3902.511236] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3902.513938] 292144 total pagecache pages
[ 3902.515609] 0 pages in swap cache
[ 3902.517139] Swap cache stats: add 0, delete 0, find 0/0
[ 3902.519153] Free swap  = 0kB
[ 3902.520587] Total swap = 0kB
[ 3902.522095] 524157 pages RAM
[ 3902.523511] 0 pages HighMem/MovableOnly
[ 3902.525091] 80441 pages reserved
[ 3902.526580] 0 pages hwpoisoned
[ 3902.528169] Out of memory: Kill process 687 (firewalld) score 11 or sacrifice child
[ 3902.531017] Killed process 687 (firewalld) total-vm:323600kB, anon-rss:17032kB, file-rss:4896kB, shmem-rss:0kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.903629] smbd cpuset=/ mems_allowed=0
[ 5262.904725] CPU: 2 PID: 3935 Comm: smbd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5262.906401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5262.908679]  0000000000000000 00000000eaa24b41 ffff88007c37faf8 ffffffff81395bc3
[ 5262.910459]  0000000000000000 ffff88007c37fb98 ffffffff811babac 0000000000000206
[ 5262.912224]  ffffffff81810470 ffff88007c37fb38 ffffffff810bce29 0000000000000206
[ 5262.914019] Call Trace:
[ 5262.914839]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5262.916118]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5262.917493]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5262.919131]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5262.920690]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5262.922204]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5262.923863]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5262.925386]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5262.927121]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5262.928738]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5262.930438]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.932110]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5262.933410]  [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.935016]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 5262.936632]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 5262.938383]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 5262.940024]  [<ffffffff8106e5a4>] SyS_clone+0x14/0x20
[ 5262.941465]  [<ffffffff816fc532>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 5262.943137] Mem-Info:
[ 5262.944068] active_anon:37901 inactive_anon:2095 isolated_anon:0
[ 5262.944068]  active_file:134812 inactive_file:135474 isolated_file:0
[ 5262.944068]  unevictable:0 dirty:257 writeback:0 unstable:0
[ 5262.944068]  slab_reclaimable:90770 slab_unreclaimable:12759
[ 5262.944068]  mapped:4223 shmem:2166 pagetables:1428 bounce:0
[ 5262.944068]  free:3738 free_pcp:49 free_cma:0
[ 5262.953176] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:900kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5262.963749] lowmem_reserve[]: 0 1714 1714 1714
[ 5262.965434] Node 0 DMA32 free:8048kB min:5172kB low:6464kB high:7756kB active_anon:150704kB inactive_anon:8180kB active_file:539244kB inactive_file:541892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:1028kB writeback:0kB mapped:16860kB shmem:8448kB slab_reclaimable:357524kB slab_unreclaimable:50324kB kernel_stack:3232kB pagetables:5560kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? no
[ 5262.976879] lowmem_reserve[]: 0 0 0 0
[ 5262.978586] Node 0 DMA: 58*4kB (UME) 60*8kB (UME) 73*16kB (UME) 23*32kB (UME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6904kB
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5262.987124] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5262.989532] 272459 total pagecache pages
[ 5262.991203] 0 pages in swap cache
[ 5262.992583] Swap cache stats: add 0, delete 0, find 0/0
[ 5262.994334] Free swap  = 0kB
[ 5262.995787] Total swap = 0kB
[ 5262.997038] 524157 pages RAM
[ 5262.998270] 0 pages HighMem/MovableOnly
[ 5262.999683] 80441 pages reserved
[ 5263.001153] 0 pages hwpoisoned
[ 5263.002612] Out of memory: Kill process 26226 (genxref) score 54 or sacrifice child
[ 5263.004648] Killed process 26226 (genxref) total-vm:130348kB, anon-rss:94680kB, file-rss:4756kB, shmem-rss:0kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.767289] kthreadd cpuset=/ mems_allowed=0
[ 5269.768904] CPU: 2 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5269.770956] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5269.773754]  0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 5269.776088]  0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 5269.778213]  ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 5269.780497] Call Trace:
[ 5269.781796]  [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5269.783634]  [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5269.786116]  [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5269.788495]  [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5269.790538]  [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5269.792755]  [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5269.794784]  [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5269.796848]  [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5269.799038]  [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 5269.801073]  [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5269.803186]  [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5269.805249]  [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 5269.807374]  [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5269.809089]  [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 5269.811146]  [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 5269.812944]  [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 5269.814698]  [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 5269.816330]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.818088]  [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 5269.819685]  [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.821399] Mem-Info:
[ 5269.822430] active_anon:14280 inactive_anon:2095 isolated_anon:0
[ 5269.822430]  active_file:134344 inactive_file:134515 isolated_file:0
[ 5269.822430]  unevictable:0 dirty:2 writeback:0 unstable:0
[ 5269.822430]  slab_reclaimable:96214 slab_unreclaimable:22185
[ 5269.822430]  mapped:3512 shmem:2166 pagetables:1368 bounce:0
[ 5269.822430]  free:12388 free_pcp:51 free_cma:0
[ 5269.831310] Node 0 DMA free:6892kB min:44kB low:52kB high:64kB active_anon:856kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:768kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.840580] lowmem_reserve[]: 0 1714 1714 1714
[ 5269.842107] Node 0 DMA32 free:42660kB min:5172kB low:6464kB high:7756kB active_anon:56264kB inactive_anon:8180kB active_file:537372kB inactive_file:538056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:8kB writeback:0kB mapped:14020kB shmem:8448kB slab_reclaimable:379300kB slab_unreclaimable:87972kB kernel_stack:3232kB pagetables:5320kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.852375] lowmem_reserve[]: 0 0 0 0
[ 5269.853784] Node 0 DMA: 67*4kB (ME) 60*8kB (UME) 72*16kB (ME) 22*32kB (ME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6892kB
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB
[ 5269.861551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5269.863676] 271012 total pagecache pages
[ 5269.865100] 0 pages in swap cache
[ 5269.866366] Swap cache stats: add 0, delete 0, find 0/0
[ 5269.867996] Free swap  = 0kB
[ 5269.869363] Total swap = 0kB
[ 5269.870593] 524157 pages RAM
[ 5269.871857] 0 pages HighMem/MovableOnly
[ 5269.873604] 80441 pages reserved
[ 5269.874937] 0 pages hwpoisoned
[ 5269.876207] Out of memory: Kill process 2710 (tuned) score 7 or sacrifice child
[ 5269.878265] Killed process 2710 (tuned) total-vm:553052kB, anon-rss:10596kB, file-rss:2776kB, shmem-rss:0kB
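Every invocation in these reports is an order=2 (16kB contiguous) GFP_KERNEL allocation from a fork path, while the DMA32 buddy list holds only order-0 and order-1 blocks. A minimal sketch of how a buddy-list line can be checked mechanically for that condition (the helper names are illustrative, not kernel code; the sample line is the DMA32 buddy list from the last report above):

```python
import re

def parse_buddy_line(line):
    """Parse a 'Node N <zone>: c*4kB c*8kB ... = totalkB' buddy-list line
    into a dict mapping order -> number of free blocks of that order."""
    counts = {}
    for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
        # order = log2(size_kB / 4) for 4kB base pages
        order = (int(size_kb) // 4).bit_length() - 1
        counts[order] = counts.get(order, 0) + int(count)
    return counts

def can_satisfy(counts, order):
    """An order-N request needs at least one free block of order >= N."""
    return any(c > 0 for o, c in counts.items() if o >= order)

dma32 = ("Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB "
         "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB")
counts = parse_buddy_line(dma32)
print(max(o for o, c in counts.items() if c > 0))  # → 0 (only 4kB blocks free)
print(can_satisfy(counts, 2))                      # → False (no 16kB block for fork)
```

So 42MB of "free" memory in DMA32 cannot serve a single order-2 request, which is exactly the fragmentation scenario being asked about.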

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-24 12:41   ` Tetsuo Handa
@ 2015-12-28 12:08     ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-28 12:08 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers reasonable? Too weak against fragmentation?

Well, the current patch invokes the OOM killer even when more than 75% of memory
is used for file cache (active_file: + inactive_file:). I think this will be
surprising for administrators, and we should retry harder (but not forever,
please).
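The 75% figure can be checked directly against the zone counters in the excerpts below; a small sketch using the DMA32 numbers from the first excerpted line (the helper name is made up for illustration, and the kB figures are copied from that report):

```python
def file_cache_fraction(active_file_kb, inactive_file_kb, managed_kb):
    """Share of a zone's managed memory currently holding file cache."""
    return (active_file_kb + inactive_file_kb) / managed_kb

# active_file:985160kB inactive_file:615436kB managed:2021100kB
frac = file_cache_fraction(985160, 615436, 2021100)
print(f"{frac:.1%}")  # roughly 79% of DMA32 is file cache at OOM time
```

In other words, almost four fifths of the zone is nominally reclaimable page cache at the moment the OOM killer fires.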

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151228.txt.xz .
----------
[  277.863985] Node 0 DMA32 free:20128kB min:5564kB low:6952kB high:8344kB active_anon:108332kB inactive_anon:8252kB active_file:985160kB inactive_file:615436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5904kB shmem:8524kB slab_reclaimable:52088kB slab_unreclaimable:59748kB kernel_stack:31280kB pagetables:55708kB unstable:0kB bounce:0kB free_pcp:1056kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.331040] Node 0 DMA32 free:29500kB min:5564kB low:6952kB high:8344kB active_anon:126756kB inactive_anon:8252kB active_file:821500kB inactive_file:604016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:12684kB shmem:8524kB slab_reclaimable:56808kB slab_unreclaimable:99804kB kernel_stack:58448kB pagetables:92552kB unstable:0kB bounce:0kB free_pcp:2004kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.897985] Node 0 DMA32 free:28240kB min:5564kB low:6952kB high:8344kB active_anon:79344kB inactive_anon:8248kB active_file:1016568kB inactive_file:604696kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:80kB writeback:0kB mapped:13004kB shmem:8520kB slab_reclaimable:52076kB slab_unreclaimable:64064kB kernel_stack:35168kB pagetables:48552kB unstable:0kB bounce:0kB free_pcp:1384kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.014501] Node 0 DMA32 free:22820kB min:5564kB low:6952kB high:8344kB active_anon:56852kB inactive_anon:11976kB active_file:1142936kB inactive_file:582040kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:160kB writeback:0kB mapped:10796kB shmem:16640kB slab_reclaimable:48608kB slab_unreclaimable:41912kB kernel_stack:16560kB pagetables:30876kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.293371] Node 0 DMA32 free:15244kB min:5564kB low:6952kB high:8344kB active_anon:82496kB inactive_anon:11976kB active_file:1110984kB inactive_file:467400kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:9440kB shmem:16640kB slab_reclaimable:53684kB slab_unreclaimable:72536kB kernel_stack:40048kB pagetables:67672kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12 all_unreclaimable? no
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.774181] Node 0 DMA32 free:19780kB min:5564kB low:6952kB high:8344kB active_anon:68264kB inactive_anon:17816kB active_file:1155724kB inactive_file:470216kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:8kB writeback:0kB mapped:9744kB shmem:24708kB slab_reclaimable:52540kB slab_unreclaimable:63216kB kernel_stack:32464kB pagetables:51856kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.804341] Node 0 DMA32 free:22076kB min:5564kB low:6952kB high:8344kB active_anon:47616kB inactive_anon:17816kB active_file:1063272kB inactive_file:685848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:216kB writeback:0kB mapped:9708kB shmem:24708kB slab_reclaimable:48536kB slab_unreclaimable:36844kB kernel_stack:12048kB pagetables:25992kB unstable:0kB bounce:0kB free_pcp:776kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.828224] Node 0 DMA32 free:15544kB min:5564kB low:6952kB high:8344kB active_anon:63184kB inactive_anon:17784kB active_file:1215752kB inactive_file:468872kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021100kB mlocked:0kB dirty:312kB writeback:0kB mapped:9116kB shmem:24708kB slab_reclaimable:49912kB slab_unreclaimable:50068kB kernel_stack:21600kB pagetables:42384kB unstable:0kB bounce:0kB free_pcp:1364kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.054731] Node 0 DMA32 free:20512kB min:5564kB low:6952kB high:8344kB active_anon:76796kB inactive_anon:23792kB active_file:1053836kB inactive_file:618588kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1656kB writeback:0kB mapped:19768kB shmem:32784kB slab_reclaimable:49000kB slab_unreclaimable:47636kB kernel_stack:21664kB pagetables:37188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.075472] Node 0 DMA32 free:17856kB min:5564kB low:6952kB high:8344kB active_anon:117872kB inactive_anon:25588kB active_file:1022532kB inactive_file:466856kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:420kB writeback:0kB mapped:25300kB shmem:40976kB slab_reclaimable:57804kB slab_unreclaimable:79416kB kernel_stack:46784kB pagetables:78044kB unstable:0kB bounce:0kB free_pcp:1100kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.853099] Node 0 DMA32 free:22524kB min:5564kB low:6952kB high:8344kB active_anon:79156kB inactive_anon:24876kB active_file:872972kB inactive_file:738900kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:25708kB shmem:40976kB slab_reclaimable:50820kB slab_unreclaimable:62880kB kernel_stack:32048kB pagetables:49656kB unstable:0kB bounce:0kB free_pcp:524kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.261759] Node 0 DMA32 free:15888kB min:5564kB low:6952kB high:8344kB active_anon:89928kB inactive_anon:23780kB active_file:1295512kB inactive_file:358284kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1608kB writeback:0kB mapped:25376kB shmem:40976kB slab_reclaimable:47972kB slab_unreclaimable:50848kB kernel_stack:22320kB pagetables:42360kB unstable:0kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.367409] Node 0 DMA32 free:15320kB min:5564kB low:6952kB high:8344kB active_anon:76364kB inactive_anon:28712kB active_file:1061180kB inactive_file:596956kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:20kB writeback:0kB mapped:27700kB shmem:49168kB slab_reclaimable:51236kB slab_unreclaimable:51096kB kernel_stack:22912kB pagetables:40920kB unstable:0kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.207543] Node 0 DMA32 free:15224kB min:5564kB low:6952kB high:8344kB active_anon:115956kB inactive_anon:28392kB active_file:1117532kB inactive_file:359656kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:29348kB shmem:49168kB slab_reclaimable:56028kB slab_unreclaimable:85168kB kernel_stack:48592kB pagetables:81620kB unstable:0kB bounce:0kB free_pcp:1124kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.663881] Node 0 DMA32 free:12820kB min:5564kB low:6952kB high:8344kB active_anon:98924kB inactive_anon:27520kB active_file:1105780kB inactive_file:494760kB unevictable:0kB isolated(anon):4kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1412kB writeback:12kB mapped:29588kB shmem:49168kB slab_reclaimable:49836kB slab_unreclaimable:60524kB kernel_stack:32176kB pagetables:50356kB unstable:0kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:388 all_unreclaimable? no
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.871655] Node 0 DMA32 free:18432kB min:5564kB low:6952kB high:8344kB active_anon:99156kB inactive_anon:26780kB active_file:1150532kB inactive_file:408872kB unevictable:0kB isolated(anon):68kB isolated(file):80kB present:2080640kB managed:2021100kB mlocked:0kB dirty:3492kB writeback:0kB mapped:30924kB shmem:49168kB slab_reclaimable:54236kB slab_unreclaimable:68184kB kernel_stack:37392kB pagetables:63708kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.447006] Node 0 DMA32 free:12684kB min:5564kB low:6952kB high:8344kB active_anon:74296kB inactive_anon:25960kB active_file:1086404kB inactive_file:605660kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:264kB writeback:0kB mapped:30604kB shmem:49168kB slab_reclaimable:50200kB slab_unreclaimable:45212kB kernel_stack:19184kB pagetables:34500kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.426931] Node 0 DMA32 free:15396kB min:5564kB low:6952kB high:8344kB active_anon:98836kB inactive_anon:32120kB active_file:964808kB inactive_file:666224kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:33628kB shmem:57332kB slab_reclaimable:51048kB slab_unreclaimable:51824kB kernel_stack:23328kB pagetables:41896kB unstable:0kB bounce:0kB free_pcp:988kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.780159] Node 0 DMA32 free:8876kB min:5564kB low:6952kB high:8344kB active_anon:86544kB inactive_anon:31516kB active_file:965016kB inactive_file:654444kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:8408kB shmem:57332kB slab_reclaimable:48856kB slab_unreclaimable:61116kB kernel_stack:30224kB pagetables:48636kB unstable:0kB bounce:0kB free_pcp:980kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:260 all_unreclaimable? no
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.087913] Node 0 DMA32 free:22760kB min:5564kB low:6952kB high:8344kB active_anon:94544kB inactive_anon:38936kB active_file:1013576kB inactive_file:564976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:36096kB shmem:65376kB slab_reclaimable:52196kB slab_unreclaimable:60576kB kernel_stack:29888kB pagetables:56364kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.014180] Node 0 DMA32 free:11232kB min:5564kB low:6952kB high:8344kB active_anon:82868kB inactive_anon:38872kB active_file:1189912kB inactive_file:439592kB unevictable:0kB isolated(anon):12kB isolated(file):40kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:1152kB mapped:35948kB shmem:65376kB slab_reclaimable:51224kB slab_unreclaimable:56664kB kernel_stack:27696kB pagetables:43180kB unstable:0kB bounce:0kB free_pcp:380kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731001] Node 0 DMA32 free:15056kB min:5564kB low:6952kB high:8344kB active_anon:90532kB inactive_anon:42716kB active_file:1204248kB inactive_file:377196kB unevictable:0kB isolated(anon):12kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5552kB shmem:73568kB slab_reclaimable:52956kB slab_unreclaimable:68304kB kernel_stack:39936kB pagetables:47472kB unstable:0kB bounce:0kB free_pcp:624kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070851] Node 0 DMA32 free:8852kB min:5564kB low:6952kB high:8344kB active_anon:90412kB inactive_anon:42664kB active_file:1179304kB inactive_file:371316kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5544kB shmem:73568kB slab_reclaimable:55136kB slab_unreclaimable:80080kB kernel_stack:55456kB pagetables:52692kB unstable:0kB bounce:0kB free_pcp:312kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:348 all_unreclaimable? no
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245192] Node 0 DMA32 free:10832kB min:5564kB low:6952kB high:8344kB active_anon:97756kB inactive_anon:42664kB active_file:1082048kB inactive_file:417012kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5248kB shmem:73568kB slab_reclaimable:62816kB slab_unreclaimable:88964kB kernel_stack:61408kB pagetables:62908kB unstable:0kB bounce:0kB free_pcp:696kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB
----------

Since I cannot reproduce the workload that caused December 24's natural OOM
killers, I used the following stressor to generate a similar situation.

fileio.c fills up all memory with file cache and tries to keep it in
memory. fork.c generates a flood of order-2 allocation requests, because
December 24's OOM killers were triggered by copy_process(), which involves
an order-2 allocation request.

---------- fileio.c start ----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

int main(int argc, char *argv[])
{
	int i;
	static char buffer[4096];
	signal(SIGCHLD, SIG_IGN);
	for (i = 0; i < 2; i++) {
		int fd;
		int j;
		snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
		fd = open(buffer, O_RDWR | O_CREAT, 0600);
		memset(buffer, 0, sizeof(buffer));
		for (j = 0; j < 1048576 * 1000 / 4096; j++) /* 1000 is MemTotal / 2 */
			write(fd, buffer, sizeof(buffer));
		close(fd);
	}
	for (i = 0; i < 2; i++) {
		if (fork() == 0) {
			int fd;
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
			fd = open(buffer, O_RDWR);
			memset(buffer, 0, sizeof(buffer));
			while (fd != -1) { /* loop forever rereading; fd is -1 only if open() failed */
				lseek(fd, 0, SEEK_SET);
				while (read(fd, buffer, sizeof(buffer)) == sizeof(buffer));
			}
			_exit(0);
		}
	}
	if (fork() == 0) {
		execl("./fork", "./fork", NULL);
		_exit(1);
	}
	if (fork() == 0) {
		sleep(1);
		execl("./fork", "./fork", NULL);
		_exit(1);
	}
	while (1)
		system("pidof fork | wc");
	return 0;
}
---------- fileio.c end ----------

---------- fork.c start ----------
#include <unistd.h>
#include <signal.h>

int main(int argc, char *argv[])
{
	int i;
	signal(SIGCHLD, SIG_IGN);
	while (1) {
		sleep(5);
		for (i = 0; i < 2000; i++) {
			if (fork() == 0) {
				sleep(3);
				_exit(0);
			}
		}
	}
}
---------- fork.c end ----------

This reproducer also showed that once the OOM killer is invoked,
subsequent OOM killers tend to occur shortly afterwards because the
file cache does not decrease.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 12:08     ` Tetsuo Handa
@ 2015-12-28 14:13       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-28 14:13 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
>
> Since I cannot establish workload that caused December 24's natural OOM
> killers, I used the following stressor for generating similar situation.
>

I have come to suspect that I am observing a different problem, one
currently hidden behind the "too small to fail" memory-allocation rule.
That is, tasks requesting order > 0 pages continuously lose the competition
when tasks requesting order = 0 pages dominate, because reclaimed pages are
stolen by the order = 0 requesters before they can be merged into
order > 0 pages (or perhaps order > 0 pages are immediately split back into
order = 0 pages by the order = 0 requesters).

Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations retry implicitly
unless the task is chosen by the OOM killer. Therefore, even if a task
requesting order = 2 pages loses the competition to tasks requesting
order = 0 pages, the order = 2 allocation request is implicitly retried
and the OOM killer is not invoked (though there is a problem that
order > 0 requesters will stall for as long as order = 0 requesters
dominate).

But this patchset introduces a limit of 16 retries. Thus, if a task
requesting order = 2 pages loses the competition 16 times to tasks
requesting order = 0 pages, it invokes the OOM killer. To avoid the
OOM killer, we need to make sure that pages reclaimed for order > 0
allocations are not stolen by order = 0 requesters.

Is my feeling plausible?

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-24 12:41   ` Tetsuo Handa
@ 2015-12-29 16:27     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-29 16:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 24-12-15 21:41:19, Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers reasonable? Too weak against fragmentation?

I will have a look at the oom report more closely early next week (I am
still in holiday mode), but it would be good to compare how the same load
behaves with the original implementation. It would also be interesting
to see how stable the results are (is there any variability across
multiple runs?).

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 12:08     ` Tetsuo Handa
@ 2015-12-29 16:32       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2015-12-29 16:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
> 
> Well, current patch invokes OOM killers when more than 75% of memory is used
> for file cache (active_file: + inactive_file:). I think this is a surprising
> thing for administrators and we want to retry harder (but not forever,
> please).

Here again, it would be good to see what the comparison is between
the original and the new behavior. 75% of a page cache is certainly
unexpected but those pages might be pinned for other reasons and so
unreclaimable and basically IO bound. This is hard to optimize for
without causing any undesirable side effects for other loads. I will
have a look at the oom reports later but having a comparison would be
a great start.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-29 16:32       ` Michal Hocko
@ 2015-12-30 15:05         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2015-12-30 15:05 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > Do you think these OOM killers are reasonable? Too weak against fragmentation?
> > 
> > Well, current patch invokes OOM killers when more than 75% of memory is used
> > for file cache (active_file: + inactive_file:). I think this is a surprising
> > thing for administrators and we want to retry harder (but not forever,
> > please).
> 
> Here again, it would be good to see what the comparison is between
> the original and the new behavior. 75% of a page cache is certainly
> unexpected but those pages might be pinned for other reasons and so
> unreclaimable and basically IO bound. This is hard to optimize for
> without causing any undesirable side effects for other loads. I will
> have a look at the oom reports later but having a comparison would be
> a great start.

Prior to the "mm, oom: rework oom detection" patch (the original), this stressor
never invoked the OOM killer. After this patch (the new), this stressor easily
invokes the OOM killer. In both the original and the new case, active_file: +
inactive_file: occupy nearly 75% of memory. I think we lost the invisible retry
logic for order > 0 allocation requests.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-30 15:05         ` Tetsuo Handa
@ 2016-01-02 15:47           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-02 15:47 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > > Tetsuo Handa wrote:
> > > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > > Do you think these OOM killers are reasonable? Too weak against fragmentation?
> > > 
> > > Well, current patch invokes OOM killers when more than 75% of memory is used
> > > for file cache (active_file: + inactive_file:). I think this is a surprising
> > > thing for administrators and we want to retry harder (but not forever,
> > > please).
> > 
> > Here again, it would be good to see what the comparison is between
> > the original and the new behavior. 75% of a page cache is certainly
> > unexpected but those pages might be pinned for other reasons and so
> > unreclaimable and basically IO bound. This is hard to optimize for
> > without causing any undesirable side effects for other loads. I will
> > have a look at the oom reports later but having a comparison would be
> > a great start.
> 
> Prior to the "mm, oom: rework oom detection" patch (the original), this stressor
> never invoked the OOM killer. After this patch (the new), this stressor easily
> invokes the OOM killer. In both the original and the new case, active_file: +
> inactive_file: occupy nearly 75% of memory. I think we lost the invisible retry
> logic for order > 0 allocation requests.
> 

I retested with the debug printk() patch below.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d70a80..e433504 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3014,7 +3014,7 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress,
+		     unsigned long did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3024,8 +3024,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
 	 */
-	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+	if (no_progress_loops > MAX_RECLAIM_RETRIES) {
+		printk(KERN_INFO "Reached MAX_RECLAIM_RETRIES.\n");
 		return false;
+	}
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
@@ -3086,6 +3088,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 
 			return true;
 		}
+		printk(KERN_INFO "zone=%s reclaimable=%lu available=%lu no_progress_loops=%u did_some_progress=%lu\n",
+		       zone->name, reclaimable, available, no_progress_loops, did_some_progress);
 	}
 
 	return false;
@@ -3273,7 +3277,7 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
----------

The output showed that __zone_watermark_ok() returning false on both the DMA32
and DMA zones is what triggers the OOM killer invocation. Direct reclaim is
constantly reclaiming some pages, but I guess the freelists for
2 <= order < MAX_ORDER are empty. That trigger was introduced by commit
97a16fc82a7c5b0c ("mm, page_alloc: only enforce watermarks for order-0
allocations"), and the "mm, oom: rework oom detection" patch hits the trigger.

----------
[  154.547143] zone=DMA32 reclaimable=323478 available=325894 no_progress_loops=0 did_some_progress=58
[  154.551119] zone=DMA32 reclaimable=323153 available=325770 no_progress_loops=0 did_some_progress=58
[  154.571983] zone=DMA32 reclaimable=319582 available=322161 no_progress_loops=0 did_some_progress=56
[  154.576121] zone=DMA32 reclaimable=319647 available=322016 no_progress_loops=0 did_some_progress=56
[  154.583523] zone=DMA32 reclaimable=319467 available=321801 no_progress_loops=0 did_some_progress=55
[  154.593948] zone=DMA32 reclaimable=317400 available=320988 no_progress_loops=0 did_some_progress=56
[  154.730880] zone=DMA32 reclaimable=312385 available=313952 no_progress_loops=0 did_some_progress=48
[  154.733226] zone=DMA32 reclaimable=312337 available=313919 no_progress_loops=0 did_some_progress=48
[  154.737270] zone=DMA32 reclaimable=312417 available=313871 no_progress_loops=0 did_some_progress=48
[  154.739569] zone=DMA32 reclaimable=312369 available=313844 no_progress_loops=0 did_some_progress=48
[  154.743195] zone=DMA32 reclaimable=312385 available=313790 no_progress_loops=0 did_some_progress=48
[  154.745534] zone=DMA32 reclaimable=312365 available=313813 no_progress_loops=0 did_some_progress=48
[  154.748431] zone=DMA32 reclaimable=312272 available=313728 no_progress_loops=0 did_some_progress=48
[  154.750973] zone=DMA32 reclaimable=312273 available=313760 no_progress_loops=0 did_some_progress=48
[  154.753503] zone=DMA32 reclaimable=312289 available=313958 no_progress_loops=0 did_some_progress=48
[  154.753584] zone=DMA32 reclaimable=312241 available=313958 no_progress_loops=0 did_some_progress=48
[  154.753660] zone=DMA32 reclaimable=312193 available=313958 no_progress_loops=0 did_some_progress=48
[  154.781574] zone=DMA32 reclaimable=312147 available=314095 no_progress_loops=0 did_some_progress=48
[  154.784281] zone=DMA32 reclaimable=311539 available=314015 no_progress_loops=0 did_some_progress=49
[  154.786639] zone=DMA32 reclaimable=311498 available=314040 no_progress_loops=0 did_some_progress=49
[  154.788761] zone=DMA32 reclaimable=311432 available=314040 no_progress_loops=0 did_some_progress=49
[  154.791047] zone=DMA32 reclaimable=311366 available=314040 no_progress_loops=0 did_some_progress=49
[  154.793388] zone=DMA32 reclaimable=311300 available=314040 no_progress_loops=0 did_some_progress=49
[  154.795802] zone=DMA32 reclaimable=311153 available=314006 no_progress_loops=0 did_some_progress=49
[  154.804685] zone=DMA32 reclaimable=309950 available=313140 no_progress_loops=0 did_some_progress=49
[  154.807039] zone=DMA32 reclaimable=309867 available=313138 no_progress_loops=0 did_some_progress=49
[  154.809440] zone=DMA32 reclaimable=309761 available=313080 no_progress_loops=0 did_some_progress=49
[  154.811583] zone=DMA32 reclaimable=309735 available=313120 no_progress_loops=0 did_some_progress=49
[  154.814090] zone=DMA32 reclaimable=309561 available=313068 no_progress_loops=0 did_some_progress=49
[  154.817381] zone=DMA32 reclaimable=309463 available=313030 no_progress_loops=0 did_some_progress=49
[  154.824387] zone=DMA32 reclaimable=309414 available=313030 no_progress_loops=0 did_some_progress=49
[  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
[  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
[  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[  154.841167] fork cpuset=/ mems_allowed=0
[  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
[  154.844308] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  154.846654]  0000000000000000 0000000045061c6b ffff88007a5dbb00 ffffffff81398b83
[  154.848559]  0000000000000000 ffff88007a5dbba0 ffffffff811bc81c 0000000000000206
[  154.850488]  ffffffff818104b0 ffff88007a5dbb40 ffffffff810bdd79 0000000000000206
[  154.852386] Call Trace:
[  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
[  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
[  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
[  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
[  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
[  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
[  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
[  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
[  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
[  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
[  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
[  154.875357]  [<ffffffff8106d441>] copy_process.part.31+0x131/0x1b40
[  154.877845]  [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[  154.880397]  [<ffffffff8106f01b>] _do_fork+0xdb/0x5d0
[  154.882259]  [<ffffffff8111d8da>] ? __audit_syscall_entry+0xaa/0xf0
[  154.884722]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  154.887201]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  154.889666]  [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[  154.891519]  [<ffffffff8106f594>] SyS_clone+0x14/0x20
[  154.893059]  [<ffffffff816feeb2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  154.894859] Mem-Info:
[  154.895851] active_anon:31807 inactive_anon:2093 isolated_anon:0
[  154.895851]  active_file:242656 inactive_file:67266 isolated_file:0
[  154.895851]  unevictable:0 dirty:8 writeback:0 unstable:0
[  154.895851]  slab_reclaimable:15100 slab_unreclaimable:20839
[  154.895851]  mapped:1681 shmem:2162 pagetables:18491 bounce:0
[  154.895851]  free:4243 free_pcp:343 free_cma:0
[  154.905459] Node 0 DMA free:6908kB min:44kB low:52kB high:64kB active_anon:3408kB inactive_anon:120kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:64kB shmem:124kB slab_reclaimable:872kB slab_unreclaimable:3032kB kernel_stack:176kB pagetables:328kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  154.916097] lowmem_reserve[]: 0 1714 1714 1714
[  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB active_anon:121688kB inactive_anon:8252kB active_file:970620kB inactive_file:269060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758944kB mlocked:0kB dirty:32kB writeback:0kB mapped:6660kB shmem:8524kB slab_reclaimable:59528kB slab_unreclaimable:80460kB kernel_stack:47312kB pagetables:70972kB unstable:0kB bounce:0kB free_pcp:1356kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  154.929908] lowmem_reserve[]: 0 0 0 0
[  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
[  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
[  154.941617] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  154.944167] 312171 total pagecache pages
[  154.945926] 0 pages in swap cache
[  154.947521] Swap cache stats: add 0, delete 0, find 0/0
[  154.949436] Free swap  = 0kB
[  154.950920] Total swap = 0kB
[  154.952531] 524157 pages RAM
[  154.954063] 0 pages HighMem/MovableOnly
[  154.955785] 80445 pages reserved
[  154.957362] 0 pages hwpoisoned
----------

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 14:13       ` Tetsuo Handa
@ 2016-01-06 12:44         ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-01-06 12:44 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

On 12/28/2015 03:13 PM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Tetsuo Handa wrote:
>> > I got OOM killers while running heavy disk I/O (extracting kernel source,
>> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
>> > Do you think these OOM killers are reasonable? Too weak against fragmentation?
>>
>> Since I cannot establish the workload that caused December 24's natural OOM
>> killers, I used the following stressor to generate a similar situation.
>>
> 
> I came to feel that I am observing a different problem which is currently
> hidden behind the "too small to fail" memory-allocation rule. That is, tasks
> requesting order > 0 pages are continuously losing the competition when
> tasks requesting order = 0 pages dominate, for reclaimed pages are stolen
> by tasks requesting order = 0 pages before the reclaimed pages are combined into
> order > 0 pages (or maybe order > 0 pages are immediately split into
> order = 0 pages due to tasks requesting order = 0 pages).

Hm, I would expect that as long as there are some reserves left that your
reproducer cannot grab, there are some free pages left and the allocator should
thus preserve the order-2 pages that get combined, since order-0 allocations will
use existing order-0 pages before splitting higher orders. Compaction should also
be able to successfully combine order-2 pages without racing allocators, thanks
to per-cpu caching (but I'd have to check).

So I think the problem is not the higher-order pages themselves, but that order-2
needs 4 pages and thus has to pass a slightly higher watermark, putting it at a
disadvantage to order-0 allocations. Thus I would expect the order-2 pages to be
there, but not available for allocation due to watermarks.

> Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations implicitly retry
> unless chosen by the OOM killer. Therefore, even if tasks requesting
> order = 2 pages lost the competition when there are tasks requesting
> order = 0 pages, the order = 2 allocation request is implicitly retried
> and therefore the OOM killer is not invoked (though there is a problem that
> tasks requesting order > 0 allocation will stall as long as tasks requesting
> order = 0 pages dominate).
> 
> But this patchset introduced a limit of 16 retries. Thus, if tasks requesting
> order = 2 pages lost the competition for 16 times due to tasks requesting
> order = 0 pages, tasks requesting order = 2 pages invoke the OOM killer.
> To avoid the OOM killer, we need to make sure that pages reclaimed for
> order > 0 allocations will not be stolen by tasks requesting order = 0
> allocations.
> 
> Is my feeling plausible?
> 


^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-28 14:13       ` Tetsuo Handa
@ 2016-01-08 12:37         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-08 12:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Mon 28-12-15 23:13:31, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I got OOM killers while running heavy disk I/O (extracting the kernel source,
> > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > Do you think these OOM killers are reasonable? Too weak against fragmentation?
> >
> > Since I cannot establish the workload that caused December 24's natural OOM
> > killers, I used the following stressor for generating a similar situation.
> >
> 
> I came to feel that I am observing a different problem which is currently
> hidden behind the "too small to fail" memory-allocation rule. That is, tasks
> requesting order > 0 pages are continuously losing the competition when
> tasks requesting order = 0 pages dominate, for reclaimed pages are stolen
> by tasks requesting order = 0 pages before reclaimed pages are combined to
> order > 0 pages (or maybe order > 0 pages are immediately split into
> order = 0 pages due to tasks requesting order = 0 pages).
> 
> Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations implicitly retry
> unless chosen by the OOM killer. Therefore, even if tasks requesting
> order = 2 pages lost the competition when there are tasks requesting
> order = 0 pages, the order = 2 allocation request is implicitly retried
> and therefore the OOM killer is not invoked (though there is a problem that
> tasks requesting order > 0 allocation will stall as long as tasks requesting
> order = 0 pages dominate).

Yes, this is possible and nothing new. High-order allocations (even small
orders) are never free and are more expensive than order-0. I have seen the
OOM killer strike while there were megabytes of free memory on a larger
machine, just because of high fragmentation.

> But this patchset introduced a limit of 16 retries.

We retry 16 times _only_ if the reclaim hasn't made _any_ progress, which
means it hasn't reclaimed a single page. We can still fail the watermark
check for the required order, but I think this is correct and desirable
behavior because there is no guarantee that lower-order pages will get
coalesced after more retries. The primary point of this rework is to make
the whole thing more deterministic.

So we may see some OOM reports for high orders (< COSTLY) which would
previously have survived just because we retried so many times that we ended
up allocating that single high-order page, but that was pure luck and
nondeterministic behavior. That being said, I agree we might end up doing
some more tuning for non-costly high-order allocations, but it should be
bounded as well and based on failures in some reasonable workloads. I haven't
gotten to the OOM reports you have posted yet but I definitely plan to check
them soon.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 1/3] mm, oom: rework oom detection
  2015-12-15 18:19   ` Michal Hocko
@ 2016-01-14 22:58     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-14 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Tue, 15 Dec 2015, Michal Hocko wrote:

> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 457181844b6e..738ae2206635 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  						struct vm_area_struct *vma);
>  
>  /* linux/mm/vmscan.c */
> +extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  					gfp_t gfp_mask, nodemask_t *mask);
>  extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e267faad4649..f77e283fb8c6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
>  	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
>  }
>  
> +/*
> + * Maximum number of reclaim retries without any progress before OOM killer
> + * is consider as the only way to move forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
> + * Checks whether it makes sense to retry the reclaim to make a forward progress
> + * for the given allocation request.
> + * The reclaim feedback represented by did_some_progress (any progress during
> + * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
> + * pages) and no_progress_loops (number of reclaim rounds without any progress
> + * in a row) is considered as well as the reclaimable pages on the applicable
> + * zone list (with a backoff mechanism which is a function of no_progress_loops).
> + *
> + * Returns true if a retry is viable or false to enter the oom path.
> + */
> +static inline bool
> +should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> +		     struct alloc_context *ac, int alloc_flags,
> +		     bool did_some_progress, unsigned long pages_reclaimed,
> +		     int no_progress_loops)
> +{
> +	struct zone *zone;
> +	struct zoneref *z;
> +
> +	/*
> +	 * Make sure we converge to OOM if we cannot make any progress
> +	 * several times in the row.
> +	 */
> +	if (no_progress_loops > MAX_RECLAIM_RETRIES)
> +		return false;
> +
> +	/* Do not retry high order allocations unless they are __GFP_REPEAT */
> +	if (order > PAGE_ALLOC_COSTLY_ORDER) {
> +		if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
> +			return false;
> +
> +		if (did_some_progress)
> +			return true;
> +	}
> +
> +	/*
> +	 * Keep reclaiming pages while there is a chance this will lead somewhere.
> +	 * If none of the target zones can satisfy our allocation request even
> +	 * if all reclaimable pages are considered then we are screwed and have
> +	 * to go OOM.
> +	 */
> +	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
> +		unsigned long available;
> +
> +		available = zone_reclaimable_pages(zone);
> +		available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
> +		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +
> +		/*
> +		 * Would the allocation succeed if we reclaimed the whole available?
> +		 */
> +		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +				ac->high_zoneidx, alloc_flags, available)) {
> +			/* Wait for some write requests to complete then retry */
> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +			return true;
> +		}
> +	}

Tetsuo's log of an early oom in this thread shows that this check is 
wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
with only ZONE_DMA and ZONE_DMA32:

	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50

and the watermarks:

	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
	lowmem_reserve[]: 0 1714 1714 1714
	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
	lowmem_reserve[]: 0 0 0 0

and the scary thing is that this triggers when no_progress_loops == 0, so 
this is the first time trying the allocation after progress has been made.

Watermarks clearly indicate that memory is available; the problem is 
fragmentation for the order-2 allocation.  This is not a situation we want 
the oom killer to solve immediately, since we have no guarantee it is going 
to free contiguous memory (in fact it wouldn't be used at all for orders 
above PAGE_ALLOC_COSTLY_ORDER).

There is order-2 memory available however:

	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB

The failure for ZONE_DMA makes sense given the lowmem_reserve ratio; it is 
oom for this allocation.  ZONE_DMA32, however, is not.

I'm wondering if this has to do with the z->nr_reserved_highatomic 
estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
to 1%, or 20806kB.  That failure would make sense if free is 17996kB.

Tetsuo, would it be possible to try your workload with just this match and 
also show z->nr_reserved_highatomic?

This patch would need to at least have knowledge of the heuristics used by 
__zone_watermark_ok() since it's making an inference on reclaimability 
based on numbers that include pageblocks that are reserved from usage.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-14 22:58     ` David Rientjes
@ 2016-01-16  1:07       ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-16  1:07 UTC (permalink / raw)
  To: rientjes, mhocko
  Cc: akpm, torvalds, hannes, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel, mhocko

David Rientjes wrote:
> Tetsuo's log of an early oom in this thread shows that this check is 
> wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
> with only ZONE_DMA and ZONE_DMA32:
> 
> 	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> 	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> 
> and the watermarks:
> 
> 	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
> 	lowmem_reserve[]: 0 1714 1714 1714
> 	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
> 	lowmem_reserve[]: 0 0 0 0
> 
> and the scary thing is that this triggers when no_progress_loops == 0, so 
> this is the first time trying the allocation after progress has been made.
> 
> Watermarks clearly indicate that memory is available, the problem is 
> fragmentation for the order-2 allocation.  This is not a situation where 
> we want to immediately call the oom killer to solve since we have no 
> guarantee it is going to free contiguous memory (in fact it wouldn't be 
> used at all for PAGE_ALLOC_COSTLY_ORDER).
> 
> There is order-2 memory available however:
> 
> 	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> 
> The failure for ZONE_DMA makes sense for the lowmem_reserve ratio, it's 
> oom for this allocation.  ZONE_DMA32 is not, however.
> 
> I'm wondering if this has to do with the z->nr_reserved_highatomic 
> estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
> to 1%, or 20806kB.  That failure would make sense if free is 17996kB.
> 
> Tetsuo, would it be possible to try your workload with just this match and 
> also show z->nr_reserved_highatomic?

I don't know what "try your workload with just this match" expects, but
zone->nr_reserved_highatomic is always 0.

----------
[  178.058803] zone=DMA32 reclaimable=367474 available=369923 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  178.061350] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  178.132174] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:3256kB inactive_anon:172kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:56kB shmem:180kB slab_reclaimable:2056kB slab_unreclaimable:1096kB kernel_stack:192kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  178.145589] Node 0 DMA32 free:11532kB min:5564kB low:6952kB high:8344kB active_anon:133896kB inactive_anon:8204kB active_file:1001828kB inactive_file:462944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:8kB writeback:0kB mapped:8572kB shmem:8468kB slab_reclaimable:57136kB slab_unreclaimable:86380kB kernel_stack:50080kB pagetables:83600kB unstable:0kB bounce:0kB free_pcp:1268kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[  198.457718] zone=DMA32 reclaimable=381991 available=386237 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  198.460111] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0
[  198.507204] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:3088kB inactive_anon:172kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:180kB slab_reclaimable:976kB slab_unreclaimable:1468kB kernel_stack:672kB pagetables:336kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  198.507209] Node 0 DMA32 free:19992kB min:5564kB low:6952kB high:8344kB active_anon:104176kB inactive_anon:8204kB active_file:905320kB inactive_file:617264kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021064kB mlocked:0kB dirty:176kB writeback:0kB mapped:12772kB shmem:8468kB slab_reclaimable:60372kB slab_unreclaimable:77856kB kernel_stack:44144kB pagetables:69180kB unstable:0kB bounce:0kB free_pcp:1104kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  198.647075] zone=DMA32 reclaimable=374429 available=378945 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  198.647076] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  198.652177] Node 0 DMA free:7928kB min:40kB low:48kB high:60kB active_anon:588kB inactive_anon:172kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:88kB shmem:180kB slab_reclaimable:1008kB slab_unreclaimable:2576kB kernel_stack:1840kB pagetables:408kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  198.652182] Node 0 DMA32 free:17608kB min:5564kB low:6952kB high:8344kB active_anon:89528kB inactive_anon:8204kB active_file:1025084kB inactive_file:472512kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021064kB mlocked:0kB dirty:176kB writeback:0kB mapped:12848kB shmem:8468kB slab_reclaimable:60372kB slab_unreclaimable:86628kB kernel_stack:50880kB pagetables:82336kB unstable:0kB bounce:0kB free_pcp:236kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  207.045450] zone=DMA32 reclaimable=386923 available=392299 no_progress_loops=0 did_some_progress=38 nr_reserved_highatomic=0
[  207.045451] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=38 nr_reserved_highatomic=0
[  207.050241] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:732kB inactive_anon:336kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:140kB shmem:436kB slab_reclaimable:456kB slab_unreclaimable:3536kB kernel_stack:1584kB pagetables:188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  207.050246] Node 0 DMA32 free:20092kB min:5564kB low:6952kB high:8344kB active_anon:91600kB inactive_anon:18620kB active_file:921896kB inactive_file:626544kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:964kB writeback:0kB mapped:17016kB shmem:24584kB slab_reclaimable:51908kB slab_unreclaimable:72792kB kernel_stack:40832kB pagetables:67396kB unstable:0kB bounce:0kB free_pcp:472kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  221.034713] zone=DMA32 reclaimable=389283 available=393245 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0
[  221.037103] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0
[  221.105952] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:416kB inactive_anon:304kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:132kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:3156kB kernel_stack:2352kB pagetables:212kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  221.119016] Node 0 DMA32 free:7220kB min:5564kB low:6952kB high:8344kB active_anon:74480kB inactive_anon:23544kB active_file:946560kB inactive_file:618900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:1056kB writeback:0kB mapped:14760kB shmem:32768kB slab_reclaimable:51328kB slab_unreclaimable:75692kB kernel_stack:42960kB pagetables:66732kB unstable:0kB bounce:0kB free_pcp:196kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:248 all_unreclaimable? no
[  224.072875] zone=DMA32 reclaimable=397667 available=401058 no_progress_loops=0 did_some_progress=56 nr_reserved_highatomic=0
[  224.075212] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=56 nr_reserved_highatomic=0
[  224.133813] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:664kB inactive_anon:296kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:3760kB kernel_stack:1136kB pagetables:376kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  224.145691] Node 0 DMA32 free:12160kB min:5564kB low:6952kB high:8344kB active_anon:69352kB inactive_anon:23140kB active_file:1191992kB inactive_file:399408kB unevictable:0kB isolated(anon):0kB isolated(file):104kB present:2080640kB managed:2021064kB mlocked:0kB dirty:844kB writeback:0kB mapped:4916kB shmem:32768kB slab_reclaimable:51288kB slab_unreclaimable:68392kB kernel_stack:38560kB pagetables:61820kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  234.291285] zone=DMA32 reclaimable=403563 available=407626 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0
[  234.293557] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0
[  234.357091] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:312kB inactive_anon:296kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:144kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:2596kB kernel_stack:2992kB pagetables:204kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  234.370106] Node 0 DMA32 free:6804kB min:5564kB low:6952kB high:8344kB active_anon:77364kB inactive_anon:23140kB active_file:1168356kB inactive_file:454384kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:2021064kB mlocked:0kB dirty:0kB writeback:0kB mapped:11884kB shmem:32768kB slab_reclaimable:51292kB slab_unreclaimable:61492kB kernel_stack:32016kB pagetables:49248kB unstable:0kB bounce:0kB free_pcp:760kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:696 all_unreclaimable? no
[  246.183836] zone=DMA32 reclaimable=405496 available=410200 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  246.186069] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  246.246157] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:1144kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:436kB slab_reclaimable:424kB slab_unreclaimable:2404kB kernel_stack:1392kB pagetables:660kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  246.260159] Node 0 DMA32 free:11564kB min:5564kB low:6952kB high:8344kB active_anon:74360kB inactive_anon:23036kB active_file:1173248kB inactive_file:456000kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:732kB writeback:0kB mapped:14812kB shmem:32768kB slab_reclaimable:51292kB slab_unreclaimable:59884kB kernel_stack:31824kB pagetables:47960kB unstable:0kB bounce:0kB free_pcp:136kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  258.994846] zone=DMA32 reclaimable=403441 available=407544 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  258.997488] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0
[  259.055818] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:848kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:2692kB kernel_stack:1872kB pagetables:476kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  259.067950] Node 0 DMA32 free:29136kB min:5564kB low:6952kB high:8344kB active_anon:71476kB inactive_anon:23032kB active_file:1129276kB inactive_file:485324kB unevictable:0kB isolated(anon):0kB isolated(file):112kB present:2080640kB managed:2021064kB mlocked:0kB dirty:0kB writeback:0kB mapped:14340kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:61680kB kernel_stack:34704kB pagetables:44856kB unstable:0kB bounce:0kB free_pcp:1996kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  271.392099] zone=DMA32 reclaimable=399774 available=406049 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  271.394646] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0
[  271.459049] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:832kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:2824kB kernel_stack:2320kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  271.472413] Node 0 DMA32 free:21848kB min:5564kB low:6952kB high:8344kB active_anon:77144kB inactive_anon:23032kB active_file:1148420kB inactive_file:462308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021064kB mlocked:0kB dirty:664kB writeback:0kB mapped:14700kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:61672kB kernel_stack:32064kB pagetables:50888kB unstable:0kB bounce:0kB free_pcp:848kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  274.428858] zone=DMA32 reclaimable=404186 available=408756 no_progress_loops=0 did_some_progress=52 nr_reserved_highatomic=0
[  274.431146] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=52 nr_reserved_highatomic=0
[  274.487864] Node 0 DMA free:7924kB min:40kB low:48kB high:60kB active_anon:600kB inactive_anon:284kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:436kB slab_reclaimable:428kB slab_unreclaimable:3504kB kernel_stack:1120kB pagetables:532kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  274.499779] Node 0 DMA32 free:17040kB min:5564kB low:6952kB high:8344kB active_anon:60480kB inactive_anon:23032kB active_file:1277956kB inactive_file:339528kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021064kB mlocked:0kB dirty:664kB writeback:0kB mapped:5912kB shmem:32768kB slab_reclaimable:51312kB slab_unreclaimable:64216kB kernel_stack:37520kB pagetables:52096kB unstable:0kB bounce:0kB free_pcp:308kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160116.txt.xz .

> 
> This patch would need to at least have knowledge of the heuristics used by 
> __zone_watermark_ok() since it's making an inference on reclaimability 
> based on numbers that include pageblocks that are reserved from usage.
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-16  1:07       ` Tetsuo Handa
@ 2016-01-19 22:48         ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-19 22:48 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel, mhocko

On Sat, 16 Jan 2016, Tetsuo Handa wrote:

> > Tetsuo's log of an early oom in this thread shows that this check is 
> > wrong.  The allocation in question is an order-2 GFP_KERNEL on a system 
> > with only ZONE_DMA and ZONE_DMA32:
> > 
> > 	zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > 	zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > 
> > and the watermarks:
> > 
> > 	Node 0 DMA free:6908kB min:44kB low:52kB high:64kB ...
> > 	lowmem_reserve[]: 0 1714 1714 1714
> > 	Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB  ...
> > 	lowmem_reserve[]: 0 0 0 0
> > 
> > and the scary thing is that this triggers when no_progress_loops == 0, so 
> > this is the first time trying the allocation after progress has been made.
> > 
> > Watermarks clearly indicate that memory is available, the problem is 
> > fragmentation for the order-2 allocation.  This is not a situation where 
> > we want to immediately call the oom killer to solve since we have no 
> > guarantee it is going to free contiguous memory (in fact it wouldn't be 
> > used at all for PAGE_ALLOC_COSTLY_ORDER).
> > 
> > There is order-2 memory available however:
> > 
> > 	Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> > 
> > The failure for ZONE_DMA makes sense for the lowmem_reserve ratio, it's 
> > oom for this allocation.  ZONE_DMA32 is not, however.
> > 
> > I'm wondering if this has to do with the z->nr_reserved_highatomic 
> > estimate.  ZONE_DMA32 present pages is 2080640kB, so this would be limited 
> > to 1%, or 20806kB.  That failure would make sense if free is 17996kB.
> > 
> > Tetsuo, would it be possible to try your workload with just this match and 
> > also show z->nr_reserved_highatomic?
> 
> I don't know what "try your workload with just this match" expects, but
> zone->nr_reserved_highatomic is always 0.
> 

My point about z->nr_reserved_highatomic still stands: pageblocks may be 
reserved from allocation, so __zone_watermark_ok() may fail and cause a 
premature oom condition for this patch's calculation of "available".  It 
may not have caused a problem on your specific workload, however.

Are you able to precisely identify why __zone_watermark_ok() is failing 
and triggering the oom in the log you posted January 3?

[  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
[  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
// here //
[  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[  154.841167] fork cpuset=/ mems_allowed=0
[  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
...
[  154.852386] Call Trace:
[  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
[  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
[  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
[  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
[  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
[  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
[  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
[  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
[  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
[  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
[  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
// and also here, if we didn't serialize the oom killer //

I think that would help in fixing the issue you reported.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-19 22:48         ` David Rientjes
@ 2016-01-20 11:13           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-20 11:13 UTC (permalink / raw)
  To: rientjes
  Cc: mhocko, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel, mhocko

David Rientjes wrote:
> Are you able to precisely identify why __zone_watermark_ok() is failing 
> and triggering the oom in the log you posted January 3?
> 
> [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> // here //
> [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> [  154.841167] fork cpuset=/ mems_allowed=0
> [  154.842348] CPU: 1 PID: 9599 Comm: fork Tainted: G        W       4.4.0-rc7-next-20151231+ #273
> ...
> [  154.852386] Call Trace:
> [  154.853350]  [<ffffffff81398b83>] dump_stack+0x4b/0x68
> [  154.854731]  [<ffffffff811bc81c>] dump_header+0x5b/0x3b0
> [  154.856309]  [<ffffffff810bdd79>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  154.858046]  [<ffffffff810bde4d>] ? trace_hardirqs_on+0xd/0x10
> [  154.859593]  [<ffffffff81143d36>] oom_kill_process+0x366/0x540
> [  154.861142]  [<ffffffff8114414f>] out_of_memory+0x1ef/0x5a0
> [  154.862655]  [<ffffffff8114420d>] ? out_of_memory+0x2ad/0x5a0
> [  154.864194]  [<ffffffff81149c72>] __alloc_pages_nodemask+0xda2/0xde0
> [  154.865852]  [<ffffffff810bdd00>] ? trace_hardirqs_on_caller+0x80/0x1c0
> [  154.867844]  [<ffffffff81149e6c>] alloc_kmem_pages_node+0x4c/0xc0
> [  154.868726] zone=DMA32 reclaimable=309003 available=312677 no_progress_loops=0 did_some_progress=48
> [  154.868727] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=48
> // and also here, if we didn't serialize the oom killer //
> 
> I think that would help in fixing the issue you reported.
> 
Does "why __zone_watermark_ok() is failing" mean "which 'return false;' statement
in __zone_watermark_ok() I'm hitting on my specific workload"? If so, the answer
is the former for the DMA zone and the latter for the DMA32 zone.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d70a80..dd36f01 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2390,7 +2390,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  */
 static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
-			long free_pages)
+				long free_pages, bool *no_free)
 {
 	long min = mark;
 	int o;
@@ -2423,6 +2423,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 	 * are not met, then a high-order request also cannot go ahead
 	 * even if a suitable page happened to be free.
 	 */
+	*no_free = false;
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 
@@ -2453,26 +2454,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		}
 #endif
 	}
+	*no_free = true;
 	return false;
 }
 
 bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
+	bool unused;
+
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
-					zone_page_state(z, NR_FREE_PAGES));
+				   zone_page_state(z, NR_FREE_PAGES), &unused);
 }
 
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx)
 {
+	bool unused;
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
 		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
 
 	return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
-								free_pages);
+				   free_pages, &unused);
 }
 
 #ifdef CONFIG_NUMA
@@ -3014,7 +3019,7 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 static inline bool
 should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
-		     bool did_some_progress,
+		     unsigned long did_some_progress,
 		     int no_progress_loops)
 {
 	struct zone *zone;
@@ -3024,8 +3029,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
 	 */
-	if (no_progress_loops > MAX_RECLAIM_RETRIES)
+	if (no_progress_loops > MAX_RECLAIM_RETRIES) {
+		printk(KERN_INFO "Reached MAX_RECLAIM_RETRIES.\n");
 		return false;
+	}
 
 	/* Do not retry high order allocations unless they are __GFP_REPEAT */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
@@ -3039,6 +3046,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
 			ac->high_zoneidx, ac->nodemask) {
+		bool no_free;
 		unsigned long available;
 		unsigned long reclaimable;
 
@@ -3052,7 +3060,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 * available?
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac->high_zoneidx, alloc_flags, available)) {
+					ac->high_zoneidx, alloc_flags, available, &no_free)) {
 			unsigned long writeback;
 			unsigned long dirty;
 
@@ -3086,6 +3094,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 
 			return true;
 		}
+		printk(KERN_INFO "zone=%s reclaimable=%lu available=%lu no_progress_loops=%u did_some_progress=%lu nr_reserved_highatomic=%lu no_free=%u\n",
+		       zone->name, reclaimable, available, no_progress_loops, did_some_progress, zone->nr_reserved_highatomic, no_free);
 	}
 
 	return false;
@@ -3273,7 +3283,7 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress, no_progress_loops))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160120.txt.xz .
----------
[  141.987548] zone=DMA32 reclaimable=367085 available=371232 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
[  141.990091] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
[  141.997360] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  142.055908] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:3208kB inactive_anon:188kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB\
 dirty:0kB writeback:0kB mapped:60kB shmem:188kB slab_reclaimable:2792kB slab_unreclaimable:360kB kernel_stack:224kB pagetables:260kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_\
unreclaimable? no
[  142.066690] lowmem_reserve[]: 0 1970 1970 1970

[  142.914557] zone=DMA32 reclaimable=345975 available=348821 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=1
[  142.914558] zone=DMA reclaimable=2 available=1980 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=0
[  142.921113] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  153.615466] zone=DMA32 reclaimable=385567 available=389678 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=1
[  153.615467] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=0
[  153.620507] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  153.658621] zone=DMA32 reclaimable=384064 available=388833 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=1
[  153.658623] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=0
[  153.663401] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  159.614894] zone=DMA32 reclaimable=356635 available=361925 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
[  159.614895] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
[  159.622374] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  164.781516] zone=DMA32 reclaimable=393457 available=397561 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=1
[  164.781518] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=0
[  164.786560] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  171.006952] zone=DMA32 reclaimable=405821 available=410137 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=1
[  171.006954] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=0
[  171.010690] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  171.030121] zone=DMA32 reclaimable=405016 available=409801 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=1
[  171.030123] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=34 nr_reserved_highatomic=0 no_free=0
[  171.033530] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  184.631660] zone=DMA32 reclaimable=356652 available=359338 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
[  184.634207] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
[  184.642800] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  190.499877] zone=DMA32 reclaimable=382152 available=384996 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
[  190.499878] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
[  190.504901] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  196.146728] zone=DMA32 reclaimable=371941 available=374605 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0 no_free=1
[  196.146730] zone=DMA reclaimable=1 available=1982 no_progress_loops=0 did_some_progress=61 nr_reserved_highatomic=0 no_free=0
[  196.152546] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  201.837825] zone=DMA32 reclaimable=364569 available=370359 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0 no_free=1
[  201.837826] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=59 nr_reserved_highatomic=0 no_free=0
[  201.844879] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  212.862325] zone=DMA32 reclaimable=381542 available=387785 no_progress_loops=0 did_some_progress=39 nr_reserved_highatomic=0 no_free=1
[  212.862327] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=39 nr_reserved_highatomic=0 no_free=0
[  212.866857] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)

[  212.866914] Node 0 DMA free:7920kB min:40kB low:48kB high:60kB active_anon:440kB inactive_anon:196kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB \
dirty:8kB writeback:0kB mapped:0kB shmem:280kB slab_reclaimable:480kB slab_unreclaimable:3856kB kernel_stack:1776kB pagetables:240kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_u\
nreclaimable? no
[  212.866915] lowmem_reserve[]: 0 1970 1970 1970
----------

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-02 15:47           ` Tetsuo Handa
@ 2016-01-20 12:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-20 12:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Sun 03-01-16 00:47:30, Tetsuo Handa wrote:
[...]
> The output showed that __zone_watermark_ok() returning false on both DMA32 and DMA
> zones is the trigger of the OOM killer invocation. Direct reclaim is constantly
> reclaiming some pages, but I guess freelist for 2 <= order < MAX_ORDER are empty.

Yes and this is to be expected. Direct reclaim doesn't guarantee any
progress for high order allocations. We might be reclaiming pages which
cannot be coalesced.

> That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> patch hits the trigger.
[....]
> [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> [  154.841167] fork cpuset=/ mems_allowed=0
[...]
> [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
[...]
> [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB

It is really strange that __zone_watermark_ok claimed DMA32 unusable
here. The target of 312734 should easily pass the wmark check for the
particular order, and there are 116*16kB 15*32kB 1*64kB blocks "usable"
for our request because GFP_KERNEL can use both Unmovable and Movable
blocks. So it makes sense to wait for more order-0 allocations to pass
the basic (NR_FREE_PAGES) watermark and continue with this particular
allocation request.

The nr_reserved_highatomic reserve might be large enough to matter, but
then you see [1] the reserve being 0. So this doesn't make much sense to
me. I will dig into it some more.

[1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 1/3] mm, oom: rework oom detection
  2016-01-20 11:13           ` Tetsuo Handa
@ 2016-01-20 13:13             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-20 13:13 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Wed 20-01-16 20:13:32, Tetsuo Handa wrote:
[...]
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160120.txt.xz .

> [  141.987548] zone=DMA32 reclaimable=367085 available=371232 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1

Ok, so we really do not have _any_ pages on the order 2+ free lists and
that is why __zone_watermark_ok failed.

> [  141.990091] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0

DMA zone is not even interesting because it is fully protected by the
lowmem reserves.

> [  141.997360] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  142.086897] Node 0 DMA32: 1796*4kB (M) 763*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13288kB

And indeed we still do not have any order-2+ available. OOM seems
reasonable.

> [  142.914557] zone=DMA32 reclaimable=345975 available=348821 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=1
> [  142.914558] zone=DMA reclaimable=2 available=1980 no_progress_loops=0 did_some_progress=58 nr_reserved_highatomic=0 no_free=0
> [  142.921113] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  142.921192] Node 0 DMA32: 1794*4kB (UME) 464*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB

Ditto

> [  153.615466] zone=DMA32 reclaimable=385567 available=389678 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=1
> [  153.615467] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=36 nr_reserved_highatomic=0 no_free=0
> [  153.620507] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  153.620582] Node 0 DMA32: 1241*4kB (UME) 1280*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15204kB

Ditto

> [  153.658621] zone=DMA32 reclaimable=384064 available=388833 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=1
> [  153.658623] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=37 nr_reserved_highatomic=0 no_free=0
> [  153.663401] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  153.663480] Node 0 DMA32: 554*4kB (UME) 2148*8kB (UM) 3*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19448kB

Now we have __zone_watermark_ok claiming no order-2+ blocks are
available, but the OOM report a little bit later sees 3 such blocks.
This suggests it is just a matter of timing: the children exit and free
their stacks, which are order-2 allocations.

> [  159.614894] zone=DMA32 reclaimable=356635 available=361925 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=1
> [  159.614895] zone=DMA reclaimable=2 available=1983 no_progress_loops=0 did_some_progress=32 nr_reserved_highatomic=0 no_free=0
> [  159.622374] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  159.622451] Node 0 DMA32: 2141*4kB (UM) 1435*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20044kB

Again no high order pages.

> [  164.781516] zone=DMA32 reclaimable=393457 available=397561 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=1
> [  164.781518] zone=DMA reclaimable=1 available=1983 no_progress_loops=0 did_some_progress=40 nr_reserved_highatomic=0 no_free=0
> [  164.786560] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  164.786643] Node 0 DMA32: 2961*4kB (UME) 432*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15300kB

Ditto

> [  184.631660] zone=DMA32 reclaimable=356652 available=359338 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=1
> [  184.634207] zone=DMA reclaimable=2 available=1982 no_progress_loops=0 did_some_progress=60 nr_reserved_highatomic=0 no_free=0
> [  184.642800] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[...]
> [  184.728695] Node 0 DMA32: 3144*4kB (UME) 971*8kB (UME) 43*16kB (UM) 3*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21128kB

Again we have order >= 2 pages available here after the allocator has
seen none earlier, and the pattern repeats later on. So I would say
that in this particular load it is the timing which plays the decisive
role. I am not sure we can tune for such a load because any difference
in the timing would result in a different behavior, basically breaking
such a tuning.

The current heuristic is based on the assumption that retrying a high
order allocation only makes sense if it is hidden behind the min
watermark and the currently reclaimable pages would get us above that
watermark. We cannot assume that the order-0 reclaimable pages will
form the required high order blocks because there is no such guarantee.
I think such a heuristic makes sense because we have already passed
both direct reclaim and compaction by the time we check for the retry,
so the chances of getting the required block from reclaim are not that
high.
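The heuristic described above can be sketched in simplified user-space
C. This is purely illustrative: the structures, field names, and the
simplified watermark logic are stand-ins for the kernel's zone counters
and __zone_watermark_ok()/should_reclaim_retry(), not the actual
implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's per-zone counters (in pages). */
struct zone_stats {
	unsigned long free_pages;        /* like NR_FREE_PAGES */
	unsigned long reclaimable_pages; /* like zone_reclaimable_pages() */
	unsigned long min_wmark;         /* min watermark */
	unsigned long free_highorder;    /* free blocks at requested order+ */
};

/*
 * Simplified watermark check: order-0 only compares the page count
 * against the watermark; higher orders additionally require at least
 * one free block of the requested order or larger.
 */
static bool watermark_ok(const struct zone_stats *z, unsigned long target,
			 int order)
{
	if (target <= z->min_wmark)
		return false;
	if (order > 0 && z->free_highorder == 0)
		return false;
	return true;
}

/*
 * Simplified retry decision: keep retrying only if reclaiming
 * everything reclaimable could get the zone above the watermark.
 */
static bool should_reclaim_retry(const struct zone_stats *z, int order)
{
	unsigned long target = z->free_pages + z->reclaimable_pages;

	return watermark_ok(z, target, order);
}
```

In this model a zone with hundreds of thousands of reclaimable order-0
pages but zero free order-2 blocks fails the retry check for an order-2
request, which is exactly the situation in the fork() traces above.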

So I am not really sure what to do here now. On one hand the previous
heuristic would probably happen to work better here because we would
keep looping in the allocator, exiting processes would reset the
counter and keep the retries going, and sooner or later the fork would
be lucky, see its order-2 block and continue. We could starve in this
state for a basically unbounded amount of time though, which is exactly
what I would like to get rid of. I guess we might want to allow a few
retry attempts for all order > 0 requests. Let me think about it some
more.
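One possible shape of such a bounded retry for order > 0 requests,
purely as an illustration of the idea (the constant, names, and
structure here are made up, not taken from any posted patch):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_HIGHORDER_RETRIES 16 /* illustrative bound, not a kernel value */

/*
 * Illustration only: give order>0 requests a fixed number of extra
 * retries before declaring OOM, so a transient lack of high-order
 * blocks (as in the fork() traces above) does not trigger the OOM
 * killer immediately, while the bound still guarantees the allocation
 * loop terminates instead of starving indefinitely.
 */
static bool retry_highorder(int order, int *retries)
{
	if (order == 0)
		return false;	/* order-0 handled by the main heuristic */
	if (*retries >= MAX_HIGHORDER_RETRIES)
		return false;	/* bounded: eventually give up and go OOM */
	(*retries)++;
	return true;
}
```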

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-20 12:24             ` Michal Hocko
@ 2016-01-27 23:18               ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-27 23:18 UTC (permalink / raw)
  To: Michal Hocko, Joonsoo Kim
  Cc: Tetsuo Handa, Andrew Morton, torvalds, hannes, mgorman, hillf.zj,
	Kamezawa Hiroyuki, linux-mm, linux-kernel

On Wed, 20 Jan 2016, Michal Hocko wrote:

> > That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> > enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> > patch hits the trigger.
> [....]
> > [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> > [  154.841167] fork cpuset=/ mems_allowed=0
> [...]
> > [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
> [...]
> > [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> > [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> 
> It is really strange that __zone_watermark_ok claimed DMA32 unusable
> here. The target of 312734 should easily pass the watermark check for
> the particular order, and there are 116*16kB 15*32kB 1*64kB blocks
> "usable" for our request because GFP_KERNEL can use both Unmovable and
> Movable blocks. So it makes sense to wait for more order-0 allocations
> to pass the basic (NR_FREE_PAGES) watermark and continue with this
> particular allocation request.
> 
> The nr_reserved_highatomic reserve might be high enough to matter, but
> then you see [1] the reserve being 0. So this doesn't make much sense
> to me. I will dig into it some more.
> 
> [1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp

There's another issue in the use of zone_reclaimable_pages().  I think 
should_reclaim_retry() using zone_page_state_snapshot() is appropriate, 
as I indicated before, but notice that zone_reclaimable_pages() only uses 
zone_page_state().  It means that the heuristic is based on some 
up-to-date members and some stale members.  If we are relying on 
NR_ISOLATED_* to be accurate, for example, in zone_reclaimable_pages(), 
then it may take up to 1s for that to actually occur and may quickly 
exhaust the retry counter in should_reclaim_retry() before that happens.
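The mismatch David points out can be modeled with a toy user-space
version of a vmstat counter (the structure and names here are
hypothetical; the real kernel keeps per-cpu deltas per counter that are
only folded into the global value by the periodic vmstat_update, which
can lag by up to 1s):

```c
#include <assert.h>

#define NR_CPUS 4

/*
 * Toy model of a vmstat counter: a global value plus per-cpu deltas
 * that have not yet been folded back into the global value.
 */
struct counter {
	long global;             /* what zone_page_state() reads */
	long pcpu_diff[NR_CPUS]; /* pending, not yet folded back */
};

/* Like zone_page_state(): cheap, but possibly up to 1s stale. */
static long page_state(const struct counter *c)
{
	return c->global;
}

/* Like zone_page_state_snapshot(): walks all CPUs for pending diffs. */
static long page_state_snapshot(const struct counter *c)
{
	long v = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		v += c->pcpu_diff[cpu];
	return v < 0 ? 0 : v;
}
```

Mixing the two in one heuristic means comparing an up-to-date free-page
count against a stale reclaimable count, which is how the retry counter
can be exhausted before the stale members catch up.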

This is the same issue that Joonsoo reported with the use of 
zone_page_state(NR_ISOLATED_*) in the too_many_isolated() loops of reclaim 
and compaction.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2015-12-15 18:19 ` Michal Hocko
@ 2016-01-28 20:40   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 20:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__alloc_pages_may_oom has been doing get_page_from_freelist with the
ALLOC_WMARK_HIGH target before going out_of_memory and invoking the OOM
killer. There are two reasons for this, as explained by Andrea:
"
: the reason for the high wmark is to reduce the likelihood of livelocks
: and be sure to invoke the OOM killer, if we're still under pressure
: and reclaim just failed. The high wmark is used to be sure the failure
: of reclaim isn't going to be ignored. If using the min wmark like
: you propose there's risk of livelock or anyway of delayed OOM killer
: invocation.
:
: The reason for doing one last wmark check (regardless of the wmark
: used) before invoking the oom killer, was just to be sure another OOM
: killer invocation hasn't already freed a ton of memory while we were
: stuck in reclaim. A lot of free memory generated by the OOM killer,
: won't make a parallel reclaim more likely to succeed, it just creates
: free memory, but reclaim only succeeds when it finds "freeable" memory
: and it makes progress in converting it to free memory. So for the
: purpose of this last check, the high wmark would work fine as lots of
: free memory would have been generated in such case.
"

This is no longer a concern after "mm, oom: rework oom detection"
because should_reclaim_retry performs the watermark check right before
__alloc_pages_may_oom is invoked. Remove the last-moment allocation
request as it just makes the code more confusing and doesn't really
serve any purpose: a success is basically impossible here, otherwise
should_reclaim_retry would have forced the reclaim to retry. So this is
merely a code cleanup rather than a functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 268de1654128..f82941c0ac4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2743,16 +2743,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
-	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure.
-	 */
-	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
-	if (page)
-		goto out;
-
 	if (!(gfp_mask & __GFP_NOFAIL)) {
 		/* Coredumps can quickly deplete all memory reserves */
 		if (current->flags & PF_DUMPCORE)
-- 
2.7.0.rc3

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-01-27 23:18               ` David Rientjes
@ 2016-01-28 21:19                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 21:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Tetsuo Handa, Andrew Morton, torvalds, hannes,
	mgorman, hillf.zj, Kamezawa Hiroyuki, linux-mm, linux-kernel

On Wed 27-01-16 15:18:11, David Rientjes wrote:
> On Wed, 20 Jan 2016, Michal Hocko wrote:
> 
> > > That trigger was introduced by commit 97a16fc82a7c5b0c ("mm, page_alloc: only
> > > enforce watermarks for order-0 allocations"), and "mm, oom: rework oom detection"
> > > patch hits the trigger.
> > [....]
> > > [  154.829582] zone=DMA32 reclaimable=308907 available=312734 no_progress_loops=0 did_some_progress=50
> > > [  154.831562] zone=DMA reclaimable=2 available=1728 no_progress_loops=0 did_some_progress=50
> > > [  154.838499] fork invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
> > > [  154.841167] fork cpuset=/ mems_allowed=0
> > [...]
> > > [  154.917857] Node 0 DMA32 free:17996kB min:5172kB low:6464kB high:7756kB ....
> > [...]
> > > [  154.931918] Node 0 DMA: 107*4kB (UME) 72*8kB (ME) 47*16kB (UME) 19*32kB (UME) 9*64kB (ME) 1*128kB (M) 3*256kB (M) 2*512kB (E) 2*1024kB (UM) 0*2048kB 0*4096kB = 6908kB
> > > [  154.937453] Node 0 DMA32: 1113*4kB (UME) 1400*8kB (UME) 116*16kB (UM) 15*32kB (UM) 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18052kB
> > 
> > It is really strange that __zone_watermark_ok claimed DMA32 unusable
> > here. The target of 312734 should easily pass the watermark check for
> > the particular order, and there are 116*16kB 15*32kB 1*64kB blocks
> > "usable" for our request because GFP_KERNEL can use both Unmovable and
> > Movable blocks. So it makes sense to wait for more order-0 allocations
> > to pass the basic (NR_FREE_PAGES) watermark and continue with this
> > particular allocation request.
> > 
> > The nr_reserved_highatomic reserve might be high enough to matter, but
> > then you see [1] the reserve being 0. So this doesn't make much sense
> > to me. I will dig into it some more.
> > 
> > [1] http://lkml.kernel.org/r/201601161007.DDG56185.QOHMOFOLtSFJVF@I-love.SAKURA.ne.jp
> 
> There's another issue in the use of zone_reclaimable_pages().  I think 
> should_reclaim_retry() using zone_page_state_snapshot() is approrpriate, 
> as I indicated before, but notice that zone_reclaimable_pages() only uses 
> zone_page_state().  It means that the heuristic is based on some 
> up-to-date members and some stale members.  If we are relying on 
> NR_ISOLATED_* to be accurate, for example, in zone_reclaimable_pages(), 
> then it may take up to 1s for that to actually occur and may quickly 
> exhaust the retry counter in should_reclaim_retry() before that happens.

You are right. I will post a patch to fix that.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2015-12-15 18:19 ` Michal Hocko
@ 2016-01-28 21:19   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-28 21:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

zone_reclaimable_pages is used in should_reclaim_retry which uses it to
calculate the target for the watermark check. This means that precise
numbers are important for the correct decision. zone_reclaimable_pages
uses zone_page_state which can contain stale data with per-cpu diffs
not synced yet (the last vmstat_update might have run 1s in the past).

Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
of the current callers is in a hot path where getting the precise value
(which involves per-cpu iteration) would cause an unreasonable overhead.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 489212252cd6..9145e3f89eab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -196,21 +196,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE) +
-	     zone_page_state(zone, NR_ISOLATED_FILE);
+	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
+	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
+	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON) +
-		      zone_page_state(zone, NR_ISOLATED_ANON);
+		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
+		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
+		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
 
 	return nr;
 }
 
 bool zone_reclaimable(struct zone *zone)
 {
-	return zone_page_state(zone, NR_PAGES_SCANNED) <
+	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
 		zone_reclaimable_pages(zone) * 6;
 }
 
-- 
2.7.0.rc3

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 20:40   ` Michal Hocko
@ 2016-01-28 21:36     ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2016-01-28 21:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, Jan 28, 2016 at 09:40:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __alloc_pages_may_oom has been doing get_page_from_freelist with
> ALLOC_WMARK_HIGH target before going out_of_memory and invoking the oom
> killer. This has two reasons as explained by Andrea:
> "
> : the reason for the high wmark is to reduce the likelihood of livelocks
> : and be sure to invoke the OOM killer, if we're still under pressure
> : and reclaim just failed. The high wmark is used to be sure the failure
> : of reclaim isn't going to be ignored. If using the min wmark like
> : you propose there's risk of livelock or anyway of delayed OOM killer
> : invocation.
> :
> : The reason for doing one last wmark check (regardless of the wmark
> : used) before invoking the oom killer, was just to be sure another OOM
> : killer invocation hasn't already freed a ton of memory while we were
> : stuck in reclaim. A lot of free memory generated by the OOM killer,
> : won't make a parallel reclaim more likely to succeed, it just creates
> : free memory, but reclaim only succeeds when it finds "freeable" memory
> : and it makes progress in converting it to free memory. So for the
> : purpose of this last check, the high wmark would work fine as lots of
> : free memory would have been generated in such case.
> "
> 
> This is no longer a concern after "mm, oom: rework oom detection"
> because should_reclaim_retry performs the watermark check right before
> __alloc_pages_may_oom is invoked. Remove the last-moment allocation
> attempt, as it just makes the code more confusing and doesn't really
> serve any purpose: success is basically impossible there, because
> otherwise should_reclaim_retry would have forced the reclaim to retry.
> So this is merely a code cleanup rather than a functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

The check has to happen while holding the OOM lock, otherwise we'll
end up killing much more than necessary when there are many racing
allocations.

Please drop this patch.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 21:36     ` Johannes Weiner
@ 2016-01-28 23:19       ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-28 23:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, 28 Jan 2016, Johannes Weiner wrote:

> The check has to happen while holding the OOM lock, otherwise we'll
> end up killing much more than necessary when there are many racing
> allocations.
> 

Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
acquired.

The situation is still somewhat fragile, but I think it's 
tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
thread isn't visible during the oom killer's tasklist scan because it has 
exited, we still end up killing more than we should.  The likelihood of 
this happening grows with the length of the tasklist.

Perhaps we should try testing watermarks after a victim has been selected 
and immediately before killing?  (Aside: we actually carry an internal 
patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
victim because we have been hit with this before in the memcg path.)

I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
to deem that we aren't going to immediately reenter an oom condition so 
the deferred killing is a waste of time.

The downside is how sloppy this would be because it's blurring the line 
between oom killer and page allocator.  We'd need the oom killer to return 
the selected victim to the page allocator, try the allocation, and then 
call oom_kill_process() if necessary.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-28 23:20     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-01-28 23:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, 28 Jan 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:19       ` David Rientjes
@ 2016-01-28 23:51         ` Johannes Weiner
  -1 siblings, 0 replies; 299+ messages in thread
From: Johannes Weiner @ 2016-01-28 23:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Michal Hocko

On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.
> 
> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

https://lkml.org/lkml/2015/3/25/40

We could have out_of_memory() wait until the number of outstanding OOM
victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
the lock until its kill has been finalized:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 914451a..4dc5b9d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
 		 */
-		schedule_timeout_killable(1);
+		if (!test_thread_flag(TIF_MEMDIE))
+			wait_event_timeout(oom_victims_wait,
+					   !atomic_read(&oom_victims), HZ);
 	}
 	return true;
 }

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-29  3:41     ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-01-29  3:41 UTC (permalink / raw)
  To: 'Michal Hocko', 'Andrew Morton'
  Cc: 'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Michal Hocko'

> 
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 489212252cd6..9145e3f89eab 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -196,21 +196,21 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>  	unsigned long nr;
> 
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE) +
> -	     zone_page_state(zone, NR_ISOLATED_FILE);
> +	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);
> 
>  	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON) +
> -		      zone_page_state(zone, NR_ISOLATED_ANON);
> +		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_ISOLATED_ANON);
> 
>  	return nr;
>  }
> 
>  bool zone_reclaimable(struct zone *zone)
>  {
> -	return zone_page_state(zone, NR_PAGES_SCANNED) <
> +	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
>  		zone_reclaimable_pages(zone) * 6;
>  }
> 
> --
> 2.7.0.rc3

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-28 21:19   ` Michal Hocko
@ 2016-01-29 10:35     ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:35 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: torvalds, hannes, mgorman, rientjes, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> calculate the target for the watermark check. This means that precise
> numbers are important for the correct decision. zone_reclaimable_pages
> uses zone_page_state which can contain stale data with per-cpu diffs
> not synced yet (the last vmstat_update might have run 1s in the past).
> 
> Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> of the current callers is in a hot path where getting the precise value
> (which involves per-cpu iteration) would cause an unreasonable overhead.
> 
> Suggested-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmscan.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 

I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
was forgotten. Anyway,

Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:51         ` Johannes Weiner
@ 2016-01-29 10:39           ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 10:39 UTC (permalink / raw)
  To: mhocko, hannes, rientjes
  Cc: akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu, linux-mm,
	linux-kernel, mhocko

Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> >
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > >
> >
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been
> > acquired.
> >
> > The situation is still somewhat fragile, however, but I think it's
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE
> > thread isn't visible during the oom killer's tasklist scan because it has
> > exited, we still end up killing more than we should.  The likelihood of
> > this happening grows with the length of the tasklist.
> >
> > Perhaps we should try testing watermarks after a victim has been selected
> > and immediately before killing?  (Aside: we actually carry an internal
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a
> > victim because we have been hit with this before in the memcg path.)

Yes. Moving the final test to after selecting an OOM victim can reduce the
possibility of killing more OOM victims than we need. But unfortunately, it is
likely that memory becomes available (i.e. get_page_from_freelist() succeeds)
while dump_header() is printing the OOM messages via printk(), because printk()
is slow compared to selecting a victim. That point is much later than the
moment the previous victim cleared TIF_MEMDIE.

We can avoid killing more OOM victims than we need if we move the final test
to after printing the OOM messages, but then we can't avoid printing the OOM
messages even when we end up not killing a victim. Maybe this is not a problem
if we do

  pr_err("But did not kill any process ...")

instead of

  do_send_sig_info(SIGKILL);
  mark_oom_victim();
  pr_err("Killed process %d (%s) ...")

when the final test succeeds.

> >
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory
> > to deem that we aren't going to immediately reenter an oom condition so
> > the deferred killing is a waste of time.
> >
> > The downside is how sloppy this would be because it's blurring the line
> > between oom killer and page allocator.  We'd need the oom killer to return
> > the selected victim to the page allocator, try the allocation, and then
> > call oom_kill_process() if necessary.

I assumed that Michal wants to preserve the boundary between the OOM killer
and the page allocator. Therefore, I proposed a patch
( http://lkml.kernel.org/r/201512291559.HGA46749.VFOFSOHLMtFJQO@I-love.SAKURA.ne.jp )
which tries to manage it without returning a victim and without depending on
TIF_MEMDIE or oom_victims.

>
> https://lkml.org/lkml/2015/3/25/40
>
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }
>

oom_victims dropping to 0 does not mean that memory became available (i.e.
that get_page_from_freelist() will succeed). I think this patch needs some
more effort to reduce the possibility of killing more OOM victims than we
need.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-29 10:35     ` Tetsuo Handa
@ 2016-01-29 15:17       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:17 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > calculate the target for the watermark check. This means that precise
> > numbers are important for the correct decision. zone_reclaimable_pages
> > uses zone_page_state which can contain stale data with per-cpu diffs
> > not synced yet (the last vmstat_update might have run 1s in the past).
> > 
> > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > of the current callers is in a hot path where getting the precise value
> > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > 
> > Suggested-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmscan.c | 14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> 
> I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> was forgotten. Anyway,

OK, that explains why this sounded so familiar. Sorry, I completely
forgot about it.

> Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Can I change it to your Signed-off-by?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:19       ` David Rientjes
@ 2016-01-29 15:23         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 15:19:08, David Rientjes wrote:
> On Thu, 28 Jan 2016, Johannes Weiner wrote:
> 
> > The check has to happen while holding the OOM lock, otherwise we'll
> > end up killing much more than necessary when there are many racing
> > allocations.
> > 
> 
> Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> acquired.
> 
> The situation is still somewhat fragile, however, but I think it's 
> tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> thread isn't visible during the oom killer's tasklist scan because it has 
> exited, we still end up killing more than we should.  The likelihood of 
> this happening grows with the length of the tasklist.

Yes, exactly the point I made in the original thread which raised the
question about ALLOC_WMARK_HIGH in the first place. The race window after
the last attempt is much larger than the one between the last wmark check
and the attempt.

> Perhaps we should try testing watermarks after a victim has been selected 
> and immediately before killing?  (Aside: we actually carry an internal 
> patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> victim because we have been hit with this before in the memcg path.)
> 
> I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> to deem that we aren't going to immediately reenter an oom condition so 
> the deferred killing is a waste of time.
> 
> The downside is how sloppy this would be because it's blurring the line 
> between oom killer and page allocator.  We'd need the oom killer to return 
> the selected victim to the page allocator, try the allocation, and then 
> call oom_kill_process() if necessary.

Yes the layer violation is definitely not nice.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 21:36     ` Johannes Weiner
@ 2016-01-29 15:24       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 16:36:34, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 09:40:03PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __alloc_pages_may_oom has been doing get_page_from_freelist with
> > ALLOC_WMARK_HIGH target before going out_of_memory and invoking the oom
> > killer. This has two reasons as explained by Andrea:
> > "
> > : the reason for the high wmark is to reduce the likelihood of livelocks
> > : and be sure to invoke the OOM killer, if we're still under pressure
> > : and reclaim just failed. The high wmark is used to be sure the failure
> > : of reclaim isn't going to be ignored. If using the min wmark like
> > : you propose there's risk of livelock or anyway of delayed OOM killer
> > : invocation.
> > :
> > : The reason for doing one last wmark check (regardless of the wmark
> > : used) before invoking the oom killer, was just to be sure another OOM
> > : killer invocation hasn't already freed a ton of memory while we were
> > : stuck in reclaim. A lot of free memory generated by the OOM killer,
> > : won't make a parallel reclaim more likely to succeed, it just creates
> > : free memory, but reclaim only succeeds when it finds "freeable" memory
> > : and it makes progress in converting it to free memory. So for the
> > : purpose of this last check, the high wmark would work fine as lots of
> > : free memory would have been generated in such case.
> > "
> > 
> > This is no longer a concern after "mm, oom: rework oom detection"
> > because should_reclaim_retry performs the water mark check right before
> > __alloc_pages_may_oom is invoked. Remove the last moment allocation
> > request as it just makes the code more confusing and doesn't really
> > serve any purpose because a success is basically impossible otherwise
> > should_reclaim_retry would force the reclaim to retry. So this is
> > merely a code cleanup rather than a functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> The check has to happen while holding the OOM lock, otherwise we'll
> end up killing much more than necessary when there are many racing
> allocations.

My testing shows that this doesn't trigger even during oom flood
testing. So I am not really convinced it does anything useful.

> Please drop this patch.

Sure I do not insist...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-28 23:51         ` Johannes Weiner
@ 2016-01-29 15:32           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-01-29 15:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, Andrew Morton, Linus Torvalds, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Thu 28-01-16 18:51:10, Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> > 
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > > 
> > 
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been 
> > acquired.
> > 
> > The situation is still somewhat fragile, however, but I think it's 
> > tangential to this patch series.  If the ALLOC_WMARK_HIGH allocation fails 
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE 
> > thread isn't visible during the oom killer's tasklist scan because it has 
> > exited, we still end up killing more than we should.  The likelihood of 
> > this happening grows with the length of the tasklist.
> > 
> > Perhaps we should try testing watermarks after a victim has been selected 
> > and immediately before killing?  (Aside: we actually carry an internal 
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a 
> > victim because we have been hit with this before in the memcg path.)
> > 
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory 
> > to deem that we aren't going to immediately reenter an oom condition so 
> > the deferred killing is a waste of time.
> > 
> > The downside is how sloppy this would be because it's blurring the line 
> > between oom killer and page allocator.  We'd need the oom killer to return 
> > the selected victim to the page allocator, try the allocation, and then 
> > call oom_kill_process() if necessary.
> 
> https://lkml.org/lkml/2015/3/25/40
> 
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
>  		 * Give the killed process a good chance to exit before trying
>  		 * to allocate memory again.
>  		 */
> -		schedule_timeout_killable(1);
> +		if (!test_thread_flag(TIF_MEMDIE))
> +			wait_event_timeout(oom_victims_wait,
> +					   !atomic_read(&oom_victims), HZ);
>  	}
>  	return true;
>  }

Yes this makes sense to me
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 5/3] mm, vmscan: make zone_reclaimable_pages more precise
  2016-01-29 15:17       ` Michal Hocko
@ 2016-01-29 21:30         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-29 21:30 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, torvalds, hannes, mgorman, rientjes, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 29-01-16 19:35:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > zone_reclaimable_pages is used in should_reclaim_retry which uses it to
> > > calculate the target for the watermark check. This means that precise
> > > numbers are important for the correct decision. zone_reclaimable_pages
> > > uses zone_page_state which can contain stale data with per-cpu diffs
> > > not synced yet (the last vmstat_update might have run 1s in the past).
> > > 
> > > Use zone_page_state_snapshot in zone_reclaimable_pages instead. None
> > > of the current callers is in a hot path where getting the precise value
> > > (which involves per-cpu iteration) would cause an unreasonable overhead.
> > > 
> > > Suggested-by: David Rientjes <rientjes@google.com>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmscan.c | 14 +++++++-------
> > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > 
> > 
> > I didn't know http://lkml.kernel.org/r/20151021130323.GC8805@dhcp22.suse.cz
> > was forgotten. Anyway,
> 
> OK, that explains why this sounded so familiar. Sorry I comepletely
> forgot about it.
> 
> > Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> Can I change it to your Signed-off-by?

No problem.

> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory
  2016-01-29 15:32           ` Michal Hocko
@ 2016-01-30 12:18             ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-01-30 12:18 UTC (permalink / raw)
  To: mhocko, hannes
  Cc: rientjes, akpm, torvalds, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > https://lkml.org/lkml/2015/3/25/40
> > 
> > We could have out_of_memory() wait until the number of outstanding OOM
> > victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> > the lock until its kill has been finalized:
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 914451a..4dc5b9d 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
> >  		 * Give the killed process a good chance to exit before trying
> >  		 * to allocate memory again.
> >  		 */
> > -		schedule_timeout_killable(1);
> > +		if (!test_thread_flag(TIF_MEMDIE))
> > +			wait_event_timeout(oom_victims_wait,
> > +					   !atomic_read(&oom_victims), HZ);
> >  	}
> >  	return true;
> >  }
> 
> Yes this makes sense to me

I think schedule_timeout_killable(1) was used for handling cases
where the current thread did not get TIF_MEMDIE but got SIGKILL due to
sharing the victim's memory. If the current thread is blocking the
TIF_MEMDIE thread, this can become a needless delay.

Also, I don't know whether using wait_event_*() helps with the
problem that schedule_timeout_killable(1) can sleep for many minutes
with oom_lock held when there are a lot of tasks. The details are
explained in my proposed patch.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2015-12-15 18:19 ` Michal Hocko
@ 2016-02-03 13:27   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-03 13:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

Hi,
this thread has gone mostly quiet. Are all the main concerns clarified?
Are there any new concerns? Are there any objections to targeting
this for the next merge window?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 13:27   ` Michal Hocko
@ 2016-02-03 22:58     ` David Rientjes
  -1 siblings, 0 replies; 299+ messages in thread
From: David Rientjes @ 2016-02-03 22:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed, 3 Feb 2016, Michal Hocko wrote:

> Hi,
> this thread has gone mostly quiet. Are all the main concerns clarified?
> Are there any new concerns? Are there any objections to targeting
> this for the next merge window?

Did we ever figure out what was causing the oom killer to be called much 
earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and
http://marc.info/?l=linux-kernel&m=145130454913757 ?  I'd like to take a 
look at the patch(es) that fixed it.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 22:58     ` David Rientjes
@ 2016-02-04 12:57       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 12:57 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 03-02-16 14:58:06, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > Hi,
> > this thread has gone mostly quiet. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Did we ever figure out what was causing the oom killer to be called much 
> earlier in Tetsuo's http://marc.info/?l=linux-kernel&m=145096089726481 and

From the OOM report:
[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB

> http://marc.info/?l=linux-kernel&m=145130454913757 ?

[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB

There are cases where order-2 has some pages, but I have commented on
that here [1].

> I'd like to take a look at the patch(es) that fixed it.

I am not sure we can fix these pathological loads where we hit
higher-order depletion and there is a chance that one of the thousands
of tasks terminates in an unpredictable way which happens to race with
the OOM killer. As I've pointed out in [1], once the watermark check for
the higher-order allocation fails for the given order, we cannot rely
on the reclaimable pages ever constructing the required order. The
current zone_reclaimable approach just happens to work for this
particular load because NR_PAGES_SCANNED gets reset too often, with a
side effect of nondeterministic behavior.

[1] http://lkml.kernel.org/r/20160120131355.GE14187@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 12:57       ` Michal Hocko
@ 2016-02-04 13:10         ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-04 13:10 UTC (permalink / raw)
  To: mhocko, rientjes
  Cc: akpm, torvalds, hannes, mgorman, hillf.zj, kamezawa.hiroyu,
	linux-mm, linux-kernel

Michal Hocko wrote:
> I am not sure we can fix these pathological loads where we hit the
> higher order depletion and there is a chance that one of the thousands
> of tasks terminates in an unpredictable way which happens to race with the
> OOM killer.

When I hit this problem on Dec 24th, I didn't run thousands of tasks.
I think there were fewer than one hundred tasks in the system and only
a few tasks were running. Not a pathological load at all.

I'm running thousands of tasks only to increase the probability of
hitting this in the reproducer.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:10         ` Tetsuo Handa
@ 2016-02-04 13:39           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 13:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I am not sure we can fix these pathological loads where we hit the
> > higher order depletion and there is a chance that one of the thousands
> > of tasks terminates in an unpredictable way which happens to race with the
> > OOM killer.
> 
> When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> I think there were less than one hundred tasks in the system and only
> a few tasks were running. Not a pathological load at all.

But as the OOM report clearly stated, there were no > order-1 pages
available in that particular case. And that happened after direct
reclaim and compaction had already been invoked.

As I've mentioned in the referenced email, we can try to do multiple
retries, e.g. not give up on the higher-order requests until we hit
the maximum number of retries, but I consider it quite ugly to be honest.
I think that proper communication with compaction is a more
appropriate way to go long term. E.g. I find it interesting that
try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
and treats it as any other high-order request.

Something like the following:
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..1ae5b7da821b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:39           ` Michal Hocko
@ 2016-02-04 14:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 14:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 04-02-16 14:39:05, Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > of tasks terminates in an unpredictable way which happens to race with the
> > > OOM killer.
> > 
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
> 
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
> 
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treat is as any other high order request.
> 
> Something like the following:

Here it is again, this time with a patch description. Please note I
haven't tested this yet, so this is more of an RFC than something I am
really convinced about. I can live with it because the number of retries
is nicely bounded, but it sounds too hackish because it makes the
decision rather blindly. I will talk to Vlastimil and Mel about whether
they see some way to communicate the compaction state in a reasonable
way. But I guess this is something that can come up later. What do you
think?
---
From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 4 Feb 2016 14:56:59 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and the OOM killer is just a
matter of time, but let's stick to our MAX_RECLAIM_RETRIES for the high
order and not costly requests to make sure we do not fail prematurely.

This also means that we do not reset no_progress_loops at the
__alloc_pages_slowpath for high order allocations to guarantee a bounded
number of retries.

Long term it would be much better to communicate with compaction
and retry only if compaction considers it meaningful.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f05aca36469b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-04 14:24             ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-04 14:24 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Thu 04-02-16 14:39:05, Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > tasks terminates in an unpredictable way which happens to race with the
> > > OOM killer.
> > 
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
> 
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
> 
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treats it as any other high order request.
> 
> Something like the following:

Here it is with a patch description. Please note I haven't tested this
yet, so this is more an RFC than something I am really convinced about.
I can live with it because the number of retries is nicely bounded, but
it sounds too hackish because it makes the decision rather blindly. I
will talk to Vlastimil and Mel about whether they see some way to
communicate the compaction state in a reasonable way. But I guess this
is something that can come up later. What do you think?
---


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-04 13:39           ` Michal Hocko
@ 2016-02-07  4:09             ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-07  4:09 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 04-02-16 22:10:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I am not sure we can fix these pathological loads where we hit the
> > > higher order depletion and there is a chance that one of the thousands
> > > tasks terminates in an unpredictable way which happens to race with the
> > > OOM killer.
> > 
> > When I hit this problem on Dec 24th, I didn't run thousands of tasks.
> > I think there were less than one hundred tasks in the system and only
> > a few tasks were running. Not a pathological load at all.
> 
> But as the OOM report clearly stated there were no > order-1 pages
> available in that particular case. And that happened after the direct
> reclaim and compaction were already invoked.
> 
> As I've mentioned in the referenced email, we can try to do multiple
> retries e.g. do not give up on the higher order requests until we hit
> the maximum number of retries but I consider it quite ugly to be honest.
> I think that a proper communication with compaction is a more
> appropriate way to go long term. E.g. I find it interesting that
> try_to_compact_pages doesn't even care about PAGE_ALLOC_COSTLY_ORDER
> and treats it as any other high order request.
> 

FYI, I again hit an unexpected OOM-killer during genxref on the linux-4.5-rc2 source.
I think the current patchset is too fragile to merge.
----------------------------------------
[ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 3101.629148] smbd cpuset=/ mems_allowed=0
[ 3101.630332] CPU: 1 PID: 3941 Comm: smbd Not tainted 4.5.0-rc2-next-20160205 #293
[ 3101.632335] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3101.634567]  0000000000000286 000000005784a8f9 ffff88007c47bad0 ffffffff8139abbd
[ 3101.636533]  0000000000000000 ffff88007c47bd00 ffff88007c47bb70 ffffffff811bdc6c
[ 3101.638381]  0000000000000206 ffffffff81810b30 ffff88007c47bb10 ffffffff810be079
[ 3101.640215] Call Trace:
[ 3101.641169]  [<ffffffff8139abbd>] dump_stack+0x85/0xc8
[ 3101.642560]  [<ffffffff811bdc6c>] dump_header+0x5b/0x3b0
[ 3101.643983]  [<ffffffff810be079>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3101.645616]  [<ffffffff810be14d>] ? trace_hardirqs_on+0xd/0x10
[ 3101.647081]  [<ffffffff81143fb6>] oom_kill_process+0x366/0x550
[ 3101.648631]  [<ffffffff811443df>] out_of_memory+0x1ef/0x5a0
[ 3101.650081]  [<ffffffff8114449d>] ? out_of_memory+0x2ad/0x5a0
[ 3101.651624]  [<ffffffff81149d0d>] __alloc_pages_nodemask+0xbad/0xd90
[ 3101.653207]  [<ffffffff8114a0ac>] alloc_kmem_pages_node+0x4c/0xc0
[ 3101.654767]  [<ffffffff8106d5c1>] copy_process.part.31+0x131/0x1b40
[ 3101.656381]  [<ffffffff8111d9ea>] ? __audit_syscall_entry+0xaa/0xf0
[ 3101.657952]  [<ffffffff810e8119>] ? current_kernel_time64+0xa9/0xc0
[ 3101.659492]  [<ffffffff8106f19b>] _do_fork+0xdb/0x5d0
[ 3101.660814]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 3101.662305]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 3101.663988]  [<ffffffff81703d2c>] ? return_from_SYSCALL_64+0x2d/0x7a
[ 3101.665572]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[ 3101.667067]  [<ffffffff8106f714>] SyS_clone+0x14/0x20
[ 3101.668510]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[ 3101.669931]  [<ffffffff81703cff>] entry_SYSCALL64_slow_path+0x25/0x25
[ 3101.671642] Mem-Info:
[ 3101.672612] active_anon:46842 inactive_anon:2094 isolated_anon:0
 active_file:108974 inactive_file:131350 isolated_file:0
 unevictable:0 dirty:1174 writeback:0 unstable:0
 slab_reclaimable:107536 slab_unreclaimable:14287
 mapped:4199 shmem:2166 pagetables:1524 bounce:0
 free:6260 free_pcp:31 free_cma:0
[ 3101.681294] Node 0 DMA free:6884kB min:44kB low:52kB high:64kB active_anon:3488kB inactive_anon:100kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:100kB slab_reclaimable:3852kB slab_unreclaimable:444kB kernel_stack:80kB pagetables:112kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3101.691319] lowmem_reserve[]: 0 1714 1714 1714
[ 3101.692847] Node 0 DMA32 free:18156kB min:5172kB low:6464kB high:7756kB active_anon:183880kB inactive_anon:8276kB active_file:435896kB inactive_file:525396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759480kB mlocked:0kB dirty:4696kB writeback:0kB mapped:16792kB shmem:8564kB slab_reclaimable:426292kB slab_unreclaimable:56704kB kernel_stack:3328kB pagetables:5984kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3101.704239] lowmem_reserve[]: 0 0 0 0
[ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
[ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
[ 3101.713857] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3101.716332] 242517 total pagecache pages
[ 3101.717878] 0 pages in swap cache
[ 3101.719332] Swap cache stats: add 0, delete 0, find 0/0
[ 3101.721577] Free swap  = 0kB
[ 3101.722980] Total swap = 0kB
[ 3101.724364] 524157 pages RAM
[ 3101.725697] 0 pages HighMem/MovableOnly
[ 3101.727165] 80311 pages reserved
[ 3101.728482] 0 pages hwpoisoned
[ 3101.729754] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 3101.732071] [  492]     0   492     9206      975      20       4        0             0 systemd-journal
[ 3101.734357] [  520]     0   520    10479      631      22       3        0         -1000 systemd-udevd
[ 3101.737036] [  527]     0   527    12805      682      24       3        0         -1000 auditd
[ 3101.739505] [ 1174]     0  1174     4830      556      14       3        0             0 irqbalance
[ 3101.741876] [ 1180]    81  1180     6672      604      20       3        0          -900 dbus-daemon
[ 3101.744728] [ 1817]     0  1817    56009      880      40       4        0             0 rsyslogd
[ 3101.747164] [ 1818]     0  1818     1096      349       8       3        0             0 rngd
[ 3101.749788] [ 1820]     0  1820    52575     1074      56       3        0             0 abrtd
[ 3101.752135] [ 1821]     0  1821    80901     5160      80       4        0             0 firewalld
[ 3101.754532] [ 1823]     0  1823     6602      681      20       3        0             0 systemd-logind
[ 3101.757342] [ 1825]    70  1825     6999      458      20       3        0             0 avahi-daemon
[ 3101.759784] [ 1827]     0  1827    51995      986      55       3        0             0 abrt-watch-log
[ 3101.762465] [ 1838]     0  1838    31586      647      21       3        0             0 crond
[ 3101.764797] [ 1946]    70  1946     6999       58      19       3        0             0 avahi-daemon
[ 3101.767262] [ 2043]     0  2043    65187      858      43       3        0             0 vmtoolsd
[ 3101.769665] [ 2618]     0  2618    27631     3112      53       3        0             0 dhclient
[ 3101.772203] [ 2622]   999  2622   130827     2570      56       3        0             0 polkitd
[ 3101.774645] [ 2704]     0  2704   138263     3351      91       4        0             0 tuned
[ 3101.777114] [ 2709]     0  2709    20640      773      45       3        0         -1000 sshd
[ 3101.779428] [ 2711]     0  2711     7328      551      19       3        0             0 xinetd
[ 3101.782016] [ 3883]     0  3883    22785      827      45       3        0             0 master
[ 3101.784576] [ 3884]    89  3884    22811      924      46       4        0             0 pickup
[ 3101.786898] [ 3885]    89  3885    22828      886      44       3        0             0 qmgr
[ 3101.789287] [ 3916]     0  3916    23203      736      50       3        0             0 login
[ 3101.791666] [ 3927]     0  3927    27511      381      13       3        0             0 agetty
[ 3101.794116] [ 3930]     0  3930    79392     1063     105       3        0             0 nmbd
[ 3101.796387] [ 3941]     0  3941    96485     1544     138       4        0             0 smbd
[ 3101.798602] [ 3944]     0  3944    96485     1290     131       4        0             0 smbd
[ 3101.800783] [ 7471]     0  7471    28886      732      15       3        0             0 bash
[ 3101.803013] [ 7580]     0  7580     2380      613      10       3        0             0 makelxr.sh
[ 3101.805147] [ 7786]     0  7786    27511      395      10       3        0             0 agetty
[ 3101.807198] [ 8139]     0  8139    35888      974      72       3        0             0 sshd
[ 3101.809255] [ 8144]     0  8144    28896      761      15       4        0             0 bash
[ 3101.811335] [15286]     0 15286    38294    30474      81       3        0             0 genxref
[ 3101.813512] Out of memory: Kill process 15286 (genxref) score 66 or sacrifice child
[ 3101.815659] Killed process 15286 (genxref) total-vm:153176kB, anon-rss:117092kB, file-rss:4804kB, shmem-rss:0kB
----------------------------------------

> Something like the following:
Yes, I do think we need something like it.



* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-07  4:09             ` Tetsuo Handa
@ 2016-02-15 20:06               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-15 20:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
[...]
> FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> I think current patchset is too fragile to merge.
> ----------------------------------------
> [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> [ 3101.629148] smbd cpuset=/ mems_allowed=0
[...]
> [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB

How come this is an unexpected OOM? There is clearly no order-2+ page
available for the allocation request.

> > Something like the following:
> Yes, I do think we need something like it.

Was the patch applied?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-15 20:06               ` Michal Hocko
@ 2016-02-16 13:10                 ` Tetsuo Handa
  -1 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-16 13:10 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
> [...]
> > FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> > I think current patchset is too fragile to merge.
> > ----------------------------------------
> > [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> > [ 3101.629148] smbd cpuset=/ mems_allowed=0
> [...]
> > [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> > [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
> 
> How come this is an unexpected OOM? There is clearly no order-2+ page
> available for the allocation request.

I used "unexpected" because there were only 35 userspace processes and
genxref was the only process which did a lot of memory allocation
(modulo kernel threads woken by file I/O), and most memory was reclaimable.

> 
> > > Something like the following:
> > Yes, I do think we need something like it.
> 
> Was the patch applied?

No for the above result.

A result with the patch (20160204142400.GC14425@dhcp22.suse.cz) applied on
today's linux-next is shown below. It seems that the protection is not enough.

----------
[  118.584571] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  118.586684] fork cpuset=/ mems_allowed=0
[  118.588254] CPU: 2 PID: 9565 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  118.589795] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  118.591941]  0000000000000286 0000000085a9ed62 ffff88007b3d3ad0 ffffffff8139e82d
[  118.593616]  0000000000000000 ffff88007b3d3d00 ffff88007b3d3b70 ffffffff811bedec
[  118.595273]  0000000000000206 ffffffff81810b70 ffff88007b3d3b10 ffffffff810be8f9
[  118.596970] Call Trace:
[  118.597634]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  118.598787]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  118.599979]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  118.601421]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  118.602713]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  118.604882]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  118.606940]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  118.608275]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  118.609698]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  118.611166]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  118.612589]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  118.614203]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  118.615689]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  118.617151]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  118.618391]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  118.619875]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  118.621642]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  118.622920]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  118.624262]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  118.625661]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  118.626959]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  118.628340]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  118.630002] Mem-Info:
[  118.630853] active_anon:27270 inactive_anon:2094 isolated_anon:0
[  118.630853]  active_file:253575 inactive_file:89021 isolated_file:22
[  118.630853]  unevictable:0 dirty:0 writeback:0 unstable:0
[  118.630853]  slab_reclaimable:14202 slab_unreclaimable:13906
[  118.630853]  mapped:1622 shmem:2162 pagetables:10587 bounce:0
[  118.630853]  free:5328 free_pcp:356 free_cma:0
[  118.639774] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3280kB inactive_anon:156kB active_file:684kB inactive_file:2292kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:420kB shmem:164kB slab_reclaimable:564kB slab_unreclaimable:800kB kernel_stack:256kB pagetables:200kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.650132] lowmem_reserve[]: 0 1714 1714 1714
[  118.651763] Node 0 DMA32 free:14256kB min:5172kB low:6464kB high:7756kB active_anon:105924kB inactive_anon:8220kB active_file:1026268kB inactive_file:340844kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:1759460kB mlocked:0kB dirty:0kB writeback:0kB mapped:6436kB shmem:8484kB slab_reclaimable:56740kB slab_unreclaimable:54824kB kernel_stack:28112kB pagetables:42148kB unstable:0kB bounce:0kB free_pcp:1440kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.663101] lowmem_reserve[]: 0 0 0 0
[  118.664704] Node 0 DMA: 83*4kB (ME) 51*8kB (UME) 9*16kB (UME) 2*32kB (UM) 1*64kB (M) 4*128kB (UME) 5*256kB (UME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6900kB
[  118.670166] Node 0 DMA32: 2327*4kB (ME) 621*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14292kB
[  118.673742] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  118.676297] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  118.678610] 344508 total pagecache pages
[  118.680163] 0 pages in swap cache
[  118.681567] Swap cache stats: add 0, delete 0, find 0/0
[  118.681567] Free swap  = 0kB
[  118.681568] Total swap = 0kB
[  118.681625] 524157 pages RAM
[  118.681625] 0 pages HighMem/MovableOnly
[  118.681625] 80316 pages reserved
----------

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-16 13:10                 ` Tetsuo Handa
  0 siblings, 0 replies; 299+ messages in thread
From: Tetsuo Handa @ 2016-02-16 13:10 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

Michal Hocko wrote:
> On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
> [...]
> > FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> > I think current patchset is too fragile to merge.
> > ----------------------------------------
> > [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> > [ 3101.629148] smbd cpuset=/ mems_allowed=0
> [...]
> > [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> > [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
> 
> How come this is an unexpected OOM? There is clearly no order-2+ page
> available for the allocation request.

I used "unexpected" because there were only 35 userspace processes and
genxref was the only process which did a lot of memory allocation
(modulo kernel threads woken by file I/O), and most of the memory was reclaimable.

> 
> > > Something like the following:
> > Yes, I do think we need something like it.
> 
> Was the patch applied?

No for above result.

A result with the patch (20160204142400.GC14425@dhcp22.suse.cz) applied on
today's linux-next is shown below. It seems that protection is not enough.

----------
[  118.584571] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  118.586684] fork cpuset=/ mems_allowed=0
[  118.588254] CPU: 2 PID: 9565 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  118.589795] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  118.591941]  0000000000000286 0000000085a9ed62 ffff88007b3d3ad0 ffffffff8139e82d
[  118.593616]  0000000000000000 ffff88007b3d3d00 ffff88007b3d3b70 ffffffff811bedec
[  118.595273]  0000000000000206 ffffffff81810b70 ffff88007b3d3b10 ffffffff810be8f9
[  118.596970] Call Trace:
[  118.597634]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  118.598787]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  118.599979]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  118.601421]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  118.602713]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  118.604882]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  118.606940]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  118.608275]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  118.609698]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  118.611166]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  118.612589]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  118.614203]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  118.615689]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  118.617151]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  118.618391]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  118.619875]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  118.621642]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  118.622920]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  118.624262]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  118.625661]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  118.626959]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  118.628340]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  118.630002] Mem-Info:
[  118.630853] active_anon:27270 inactive_anon:2094 isolated_anon:0
[  118.630853]  active_file:253575 inactive_file:89021 isolated_file:22
[  118.630853]  unevictable:0 dirty:0 writeback:0 unstable:0
[  118.630853]  slab_reclaimable:14202 slab_unreclaimable:13906
[  118.630853]  mapped:1622 shmem:2162 pagetables:10587 bounce:0
[  118.630853]  free:5328 free_pcp:356 free_cma:0
[  118.639774] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3280kB inactive_anon:156kB active_file:684kB inactive_file:2292kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:420kB shmem:164kB slab_reclaimable:564kB slab_unreclaimable:800kB kernel_stack:256kB pagetables:200kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.650132] lowmem_reserve[]: 0 1714 1714 1714
[  118.651763] Node 0 DMA32 free:14256kB min:5172kB low:6464kB high:7756kB active_anon:105924kB inactive_anon:8220kB active_file:1026268kB inactive_file:340844kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:1759460kB mlocked:0kB dirty:0kB writeback:0kB mapped:6436kB shmem:8484kB slab_reclaimable:56740kB slab_unreclaimable:54824kB kernel_stack:28112kB pagetables:42148kB unstable:0kB bounce:0kB free_pcp:1440kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  118.663101] lowmem_reserve[]: 0 0 0 0
[  118.664704] Node 0 DMA: 83*4kB (ME) 51*8kB (UME) 9*16kB (UME) 2*32kB (UM) 1*64kB (M) 4*128kB (UME) 5*256kB (UME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6900kB
[  118.670166] Node 0 DMA32: 2327*4kB (ME) 621*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14292kB
[  118.673742] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  118.676297] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  118.678610] 344508 total pagecache pages
[  118.680163] 0 pages in swap cache
[  118.681567] Swap cache stats: add 0, delete 0, find 0/0
[  118.681567] Free swap  = 0kB
[  118.681568] Total swap = 0kB
[  118.681625] 524157 pages RAM
[  118.681625] 0 pages HighMem/MovableOnly
[  118.681625] 80316 pages reserved
[  118.681626] 0 pages hwpoisoned

[  120.117093] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  120.117097] fork cpuset=/ mems_allowed=0
[  120.117099] CPU: 0 PID: 9566 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  120.117100] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  120.117102]  0000000000000286 00000000be6c9129 ffff880035dabad0 ffffffff8139e82d
[  120.117103]  0000000000000000 ffff880035dabd00 ffff880035dabb70 ffffffff811bedec
[  120.117104]  0000000000000206 ffffffff81810b70 ffff880035dabb10 ffffffff810be8f9
[  120.117104] Call Trace:
[  120.117111]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  120.117113]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  120.117116]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  120.117117]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  120.117119]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  120.117121]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  120.117122]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  120.117123]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  120.117124]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  120.117125]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  120.117128]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  120.117130]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  120.117132]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  120.117133]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  120.117136]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  120.117137]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  120.117139]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  120.117142]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  120.117143]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  120.117144]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  120.117145]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  120.117147]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  120.117147] Mem-Info:
[  120.117150] active_anon:30895 inactive_anon:2094 isolated_anon:0
[  120.117150]  active_file:183306 inactive_file:118692 isolated_file:18
[  120.117150]  unevictable:0 dirty:47 writeback:0 unstable:0
[  120.117150]  slab_reclaimable:14405 slab_unreclaimable:22372
[  120.117150]  mapped:3101 shmem:2162 pagetables:20154 bounce:0
[  120.117150]  free:7231 free_pcp:108 free_cma:0
[  120.117154] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:1172kB inactive_anon:156kB active_file:684kB inactive_file:1356kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:420kB shmem:164kB slab_reclaimable:564kB slab_unreclaimable:2244kB kernel_stack:1376kB pagetables:436kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  120.117156] lowmem_reserve[]: 0 1714 1714 1714
[  120.117172] Node 0 DMA32 free:22020kB min:5172kB low:6464kB high:7756kB active_anon:122408kB inactive_anon:8220kB active_file:732540kB inactive_file:473412kB unevictable:0kB isolated(anon):0kB isolated(file):72kB present:2080640kB managed:1759460kB mlocked:0kB dirty:188kB writeback:0kB mapped:11984kB shmem:8484kB slab_reclaimable:57056kB slab_unreclaimable:87244kB kernel_stack:52048kB pagetables:80180kB unstable:0kB bounce:0kB free_pcp:432kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  120.117230] lowmem_reserve[]: 0 0 0 0
[  120.117238] Node 0 DMA: 46*4kB (UME) 82*8kB (ME) 37*16kB (UME) 13*32kB (M) 3*64kB (UM) 2*128kB (ME) 2*256kB (ME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6904kB
[  120.117242] Node 0 DMA32: 709*4kB (UME) 2374*8kB (UME) 0*16kB 10*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22148kB
[  120.117244] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  120.117244] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  120.117245] 304244 total pagecache pages
[  120.117246] 0 pages in swap cache
[  120.117246] Swap cache stats: add 0, delete 0, find 0/0
[  120.117247] Free swap  = 0kB
[  120.117247] Total swap = 0kB
[  120.117248] 524157 pages RAM
[  120.117248] 0 pages HighMem/MovableOnly
[  120.117248] 80316 pages reserved
[  120.117249] 0 pages hwpoisoned

[  126.034913] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[  126.034918] fork cpuset=/ mems_allowed=0
[  126.034920] CPU: 2 PID: 9566 Comm: fork Not tainted 4.5.0-rc4-next-20160216+ #306
[  126.034921] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  126.034923]  0000000000000286 00000000be6c9129 ffff880035dabad0 ffffffff8139e82d
[  126.034925]  0000000000000000 ffff880035dabd00 ffff880035dabb70 ffffffff811bedec
[  126.034926]  0000000000000206 ffffffff81810b70 ffff880035dabb10 ffffffff810be8f9
[  126.034926] Call Trace:
[  126.034932]  [<ffffffff8139e82d>] dump_stack+0x85/0xc8
[  126.034935]  [<ffffffff811bedec>] dump_header+0x5b/0x3b0
[  126.034938]  [<ffffffff810be8f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  126.034939]  [<ffffffff810be9cd>] ? trace_hardirqs_on+0xd/0x10
[  126.034941]  [<ffffffff811447f6>] oom_kill_process+0x366/0x550
[  126.034943]  [<ffffffff81144c1f>] out_of_memory+0x1ef/0x5a0
[  126.034944]  [<ffffffff81144cdd>] ? out_of_memory+0x2ad/0x5a0
[  126.034945]  [<ffffffff8114a63b>] __alloc_pages_nodemask+0xb3b/0xd80
[  126.034947]  [<ffffffff810be800>] ? mark_held_locks+0x90/0x90
[  126.034948]  [<ffffffff8114aa3c>] alloc_kmem_pages_node+0x4c/0xc0
[  126.034950]  [<ffffffff8106d661>] copy_process.part.33+0x131/0x1be0
[  126.034952]  [<ffffffff8111e20a>] ? __audit_syscall_entry+0xaa/0xf0
[  126.034954]  [<ffffffff810e8939>] ? current_kernel_time64+0xa9/0xc0
[  126.034956]  [<ffffffff8106f2db>] _do_fork+0xdb/0x5d0
[  126.034958]  [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[  126.034959]  [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[  126.034961]  [<ffffffff810bae1a>] ? up_read+0x1a/0x40
[  126.034965]  [<ffffffff817093ce>] ? retint_user+0x18/0x23
[  126.034965]  [<ffffffff810035ec>] ? do_syscall_64+0x1c/0x180
[  126.034967]  [<ffffffff8106f854>] SyS_clone+0x14/0x20
[  126.034968]  [<ffffffff8100362d>] do_syscall_64+0x5d/0x180
[  126.034969]  [<ffffffff81708abf>] entry_SYSCALL64_slow_path+0x25/0x25
[  126.034970] Mem-Info:
[  126.034973] active_anon:27060 inactive_anon:2093 isolated_anon:0
[  126.034973]  active_file:206123 inactive_file:85224 isolated_file:32
[  126.034973]  unevictable:0 dirty:47 writeback:0 unstable:0
[  126.034973]  slab_reclaimable:13214 slab_unreclaimable:26604
[  126.034973]  mapped:2421 shmem:2161 pagetables:24889 bounce:0
[  126.034973]  free:4649 free_pcp:30 free_cma:0
[  126.034986] Node 0 DMA free:6924kB min:44kB low:52kB high:64kB active_anon:1156kB inactive_anon:156kB active_file:728kB inactive_file:1060kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:368kB shmem:164kB slab_reclaimable:468kB slab_unreclaimable:2496kB kernel_stack:832kB pagetables:704kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[  126.034988] lowmem_reserve[]: 0 1714 1714 1714
[  126.034992] Node 0 DMA32 free:11672kB min:5172kB low:6464kB high:7756kB active_anon:107084kB inactive_anon:8216kB active_file:823764kB inactive_file:339836kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1759460kB mlocked:0kB dirty:188kB writeback:0kB mapped:9316kB shmem:8480kB slab_reclaimable:52388kB slab_unreclaimable:103920kB kernel_stack:66016kB pagetables:98852kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  126.034993] lowmem_reserve[]: 0 0 0 0
[  126.035000] Node 0 DMA: 70*4kB (UME) 16*8kB (UME) 59*16kB (UME) 34*32kB (ME) 14*64kB (UME) 2*128kB (UE) 1*256kB (E) 2*512kB (M) 2*1024kB (ME) 0*2048kB 0*4096kB = 6920kB
[  126.035005] Node 0 DMA32: 2372*4kB (UME) 290*8kB (UM) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11856kB
[  126.035006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  126.035006] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  126.035007] 293674 total pagecache pages
[  126.035008] 0 pages in swap cache
[  126.035008] Swap cache stats: add 0, delete 0, find 0/0
[  126.035009] Free swap  = 0kB
[  126.035009] Total swap = 0kB
[  126.035010] 524157 pages RAM
[  126.035010] 0 pages HighMem/MovableOnly
[  126.035010] 80316 pages reserved
[  126.035011] 0 pages hwpoisoned
----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-16 13:10                 ` Tetsuo Handa
@ 2016-02-16 15:19                   ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-16 15:19 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, akpm, torvalds, hannes, mgorman, hillf.zj,
	kamezawa.hiroyu, linux-mm, linux-kernel

On Tue 16-02-16 22:10:01, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 07-02-16 13:09:33, Tetsuo Handa wrote:
> > [...]
> > > FYI, I again hit unexpected OOM-killer during genxref on linux-4.5-rc2 source.
> > > I think current patchset is too fragile to merge.
> > > ----------------------------------------
> > > [ 3101.626995] smbd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
> > > [ 3101.629148] smbd cpuset=/ mems_allowed=0
> > [...]
> > > [ 3101.705887] Node 0 DMA: 75*4kB (UME) 69*8kB (UME) 43*16kB (UM) 23*32kB (UME) 8*64kB (UM) 4*128kB (UME) 2*256kB (UM) 0*512kB 1*1024kB (U) 1*2048kB (M) 0*4096kB = 6884kB
> > > [ 3101.710581] Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB
> > 
> > How come this is an unexpected OOM? There is clearly no order-2+ page
> > available for the allocation request.
> 
> I used "unexpected" because there were only 35 userspace processes and
> genxref was the only process which did a lot of memory allocation
> (modulo kernel threads woken by file I/O) and most memory is reclaimable.

The memory is reclaimable but that doesn't mean that order-2 page block
will get formed even if all of it gets reclaimed. The memory is simply
too fragmented. That is why I think the OOM makes sense.

> > > > Something like the following:
> > > Yes, I do think we need something like it.
> > 
> > Was the patch applied?
> 
> No for above result.
> 
> A result with the patch (20160204142400.GC14425@dhcp22.suse.cz) applied on
> today's linux-next is shown below. It seems that protection is not enough.
> 
> ----------
> [  118.584571] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  118.664704] Node 0 DMA: 83*4kB (ME) 51*8kB (UME) 9*16kB (UME) 2*32kB (UM) 1*64kB (M) 4*128kB (UME) 5*256kB (UME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6900kB
> [  118.670166] Node 0 DMA32: 2327*4kB (ME) 621*8kB (M) 1*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14292kB
[...]
> [  120.117093] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  120.117238] Node 0 DMA: 46*4kB (UME) 82*8kB (ME) 37*16kB (UME) 13*32kB (M) 3*64kB (UM) 2*128kB (ME) 2*256kB (ME) 2*512kB (UM) 1*1024kB (E) 1*2048kB (M) 0*4096kB = 6904kB
> [  120.117242] Node 0 DMA32: 709*4kB (UME) 2374*8kB (UME) 0*16kB 10*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22148kB
[...]
> [  126.034913] fork invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [  126.035000] Node 0 DMA: 70*4kB (UME) 16*8kB (UME) 59*16kB (UME) 34*32kB (ME) 14*64kB (UME) 2*128kB (UE) 1*256kB (E) 2*512kB (M) 2*1024kB (ME) 0*2048kB 0*4096kB = 6920kB
> [  126.035005] Node 0 DMA32: 2372*4kB (UME) 290*8kB (UM) 3*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11856kB

As you can see, in all cases we had order-2 requests and no order-2+
free blocks even after all the retries. I think the OOM is appropriate
at that point. We could have tried N+1 times, but we have to draw a line
somewhere. The reason why we do not have any high-order block available
is a completely different question IMO. Maybe compaction just gets
deferred and doesn't do anything; that would of course be interesting
to investigate further. Anyway, my point is that with the current
fragmentation, going OOM is simply the only choice.
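
As an aside, the per-order accounting in the buddy dump lines quoted above
can be checked mechanically. A minimal sketch (illustrative only, not part
of the thread; it assumes the usual 4kB base page and the "count*sizekB"
column format the page allocator prints):

```python
import re

def parse_buddy_line(line):
    """Parse a 'Node N <zone>: c0*4kB c1*8kB ... = totalkB' dump line
    into {order: count}; the 4kB column is order 0."""
    counts = {}
    for m in re.finditer(r"(\d+)\*(\d+)kB", line):
        count, size_kb = int(m.group(1)), int(m.group(2))
        counts[size_kb.bit_length() - 3] = count  # 4kB -> 0, 8kB -> 1, ...
    return counts

def has_order(counts, order):
    """True if any free block of `order` or higher is reported."""
    return any(c for o, c in counts.items() if o >= order)

# The DMA32 line Michal replied to: nothing at 16kB (order 2) or above.
line = ("Node 0 DMA32: 4513*4kB (UME) 15*8kB (U) 0*16kB 0*32kB 0*64kB "
        "0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18172kB")
counts = parse_buddy_line(line)
print(has_order(counts, 2))  # False: an order-2 request cannot be satisfied
```

Running this against the first dump confirms the point: no order-2 or
larger block exists, so the allocation can only succeed after compaction
or a defragmenting reclaim, neither of which made progress here.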
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-03 13:27   ` Michal Hocko
@ 2016-02-25  3:47     ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-02-25  3:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On Wed, 3 Feb 2016, Michal Hocko wrote:
> Hi,
> this thread went mostly quiet. Are all the main concerns clarified?
> Are there any new concerns? Are there any objections to targeting
> this for the next merge window?

Sorry to say at this late date, but I do have one concern: hopefully
you can tweak something somewhere, or point me to some tunable that
I can adjust (I've not studied the patches, sorry).

This rework makes it impossible to run my tmpfs swapping loads:
they're soon OOM-killed, whereas before they ran forever, so swapping
does not get the exercise on mmotm that it used to.  (But I'm not
so arrogant as to expect you to optimize for my load!)

Maybe it's just that I'm using tmpfs, and there's code that's conscious
of file and anon, but doesn't cope properly with the awkward shmem case.

(Of course, tmpfs is and always has been a problem for OOM-killing,
given that it takes up memory, but none is freed by killing processes:
but although that is a tiresome problem, it's not what either of us is
attacking here.)

Taking many of the irrelevancies out of my load, here's something you
could try, first on v4.5-rc5 and then on mmotm.

Boot with mem=1G (or boot your usual way, and do something to occupy
most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
way to gobble up most of the memory, though it's not how I've done it).

Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
kernel source tree into a tmpfs: size=2G is more than enough.
make defconfig there, then make -j20.
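Spelled out as a script, that recipe is roughly the following. This is
a sketch, not Hugh's exact setup: the tmpfs mount point and kernel
source path are placeholders, it needs root, and -j20 may need
adjusting per machine:

```shell
#!/bin/sh
# Reproducer sketch: build a kernel in tmpfs on a memory-constrained
# box (boot with mem=1G and have ~2G of swap enabled first).
set -e
mkdir -p /mnt/build
mount -t tmpfs -o size=2G tmpfs /mnt/build
cp -a /path/to/linux-4.5-rc5 /mnt/build/linux   # placeholder source path
cd /mnt/build/linux
make defconfig
make -j20    # reportedly completes on v4.5-rc5, is OOM-killed on mmotm
```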

On a v4.5-rc5 kernel the build completes fine; on mmotm it is soon OOM-killed.

Except that you'll probably need to fiddle around with that j20,
it's true for my laptop but not for my workstation.  j20 just happens
to be what I've had there for years, that I now see breaking down
(I can lower to j6 to proceed, perhaps could go a bit higher,
but it still doesn't exercise swap very much).

This OOM detection rework significantly lowers the number of jobs
which can be run in parallel without being OOM-killed.  Which would
be welcome if it were choosing to abort in place of thrashing, but
the system was far from thrashing: j20 took a few seconds more than
j6, and even j30 didn't take 50% longer.

(I have /proc/sys/vm/swappiness 100, if that matters.)

I hope there's an easy answer to this: thanks!
Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
@ 2016-02-25  6:48       ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25  6:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky,
	Sergey Senozhatsky

Hello,

On (02/24/16 19:47), Hugh Dickins wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> > Hi,
> > this thread went mostly quiet. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Sorry to say at this late date, but I do have one concern: hopefully
> you can tweak something somewhere, or point me to some tunable that
> I can adjust (I've not studied the patches, sorry).
> 
> This rework makes it impossible to run my tmpfs swapping loads:
> they're soon OOM-killed when they ran forever before, so swapping
> does not get the exercise on mmotm that it used to.  (But I'm not
> so arrogant as to expect you to optimize for my load!)
> 
> Maybe it's just that I'm using tmpfs, and there's code that's conscious
> of file and anon, but doesn't cope properly with the awkward shmem case.
> 
> (Of course, tmpfs is and always has been a problem for OOM-killing,
> given that it takes up memory, but none is freed by killing processes:
> but although that is a tiresome problem, it's not what either of us is
> attacking here.)
> 
> Taking many of the irrelevancies out of my load, here's something you
> could try, first on v4.5-rc5 and then on mmotm.
> 

FWIW,

I have recently noticed the same change while testing zram-zsmalloc:
next/mmots are much more likely to OOM-kill apps now, and, unlike
before, I don't see a lot of shrinker->zsmalloc->zs_shrinker_scan()
calls or swapouts; the kernel just oom-kills Xorg, etc.

The test script just creates a zram device (ext4 fs, lzo compression)
and fills it with some data, nothing special.
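For reference, a zram setup of that shape can be sketched as below.
The device name, size, and mount point are illustrative, it needs
root, and the sysfs ordering (compressor before disksize) follows the
zram interface of this era:

```shell
#!/bin/sh
# Illustrative zram test setup: lzo-compressed zram device with ext4,
# then filled with some data. All sizes and paths are example values.
set -e
modprobe zram
echo lzo > /sys/block/zram0/comp_algorithm   # choose compressor first
echo 2G > /sys/block/zram0/disksize          # then set the device size
mkfs.ext4 -q /dev/zram0
mkdir -p /mnt/zram
mount /dev/zram0 /mnt/zram
dd if=/dev/urandom of=/mnt/zram/fill bs=1M count=1024
```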


OOM example:

[ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2392.663187] Call Trace:
[ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2392.663230] Mem-Info:
[ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
                active_file:161111 inactive_file:320022 isolated_file:0
                unevictable:0 dirty:51 writeback:0 unstable:0
                slab_reclaimable:80335 slab_unreclaimable:5920
                mapped:30115 shmem:29235 pagetables:2589 bounce:0
                free:10949 free_pcp:189 free_cma:0
[ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
[ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
[ 2392.663249] lowmem_reserve[]: 0 0 824 824
[ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 2392.663257] lowmem_reserve[]: 0 0 0 0
[ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
[ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
[ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2392.663303] 510384 total pagecache pages
[ 2392.663305] 31 pages in swap cache
[ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
[ 2392.663307] Free swap  = 8388268kB
[ 2392.663308] Total swap = 8388604kB
[ 2392.663308] 1032092 pages RAM
[ 2392.663309] 0 pages HighMem/MovableOnly
[ 2392.663310] 40110 pages reserved
[ 2392.663311] 0 pages hwpoisoned
[ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
[ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
[ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
[ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
[ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
[ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
[ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
[ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
[ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB


[ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2400.152481] Call Trace:
[ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
[ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2400.152527] Mem-Info:
[ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
                active_file:160072 inactive_file:275086 isolated_file:0
                unevictable:0 dirty:49 writeback:0 unstable:0
                slab_reclaimable:54096 slab_unreclaimable:5978
                mapped:13650 shmem:29234 pagetables:2058 bounce:0
                free:13017 free_pcp:134 free_cma:0
[ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
[ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
[ 2400.152546] lowmem_reserve[]: 0 0 824 824
[ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
[ 2400.152555] lowmem_reserve[]: 0 0 0 0
[ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
[ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
[ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2400.152593] 464295 total pagecache pages
[ 2400.152594] 31 pages in swap cache
[ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
[ 2400.152596] Free swap  = 8388268kB
[ 2400.152597] Total swap = 8388604kB
[ 2400.152598] 1032092 pages RAM
[ 2400.152599] 0 pages HighMem/MovableOnly
[ 2400.152600] 40110 pages reserved
[ 2400.152600] 0 pages hwpoisoned
[ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
[ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
[ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
[ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
[ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
[ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
[ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
[ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
[ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB




$ free
              total        used        free      shared  buff/cache   available
Mem:        3967928     1563132      310548      116936     2094248     2207584
Swap:       8388604         332     8388272


	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  6:48       ` Sergey Senozhatsky
  0 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25  6:48 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Michal Hocko, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky,
	Sergey Senozhatsky

Hello,

On (02/24/16 19:47), Hugh Dickins wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> > Hi,
> > this thread went mostly quite. Are all the main concerns clarified?
> > Are there any new concerns? Are there any objections to targeting
> > this for the next merge window?
> 
> Sorry to say at this late date, but I do have one concern: hopefully
> you can tweak something somewhere, or point me to some tunable that
> I can adjust (I've not studied the patches, sorry).
> 
> This rework makes it impossible to run my tmpfs swapping loads:
> they're soon OOM-killed when they ran forever before, so swapping
> does not get the exercise on mmotm that it used to.  (But I'm not
> so arrogant as to expect you to optimize for my load!)
> 
> Maybe it's just that I'm using tmpfs, and there's code that's conscious
> of file and anon, but doesn't cope properly with the awkward shmem case.
> 
> (Of course, tmpfs is and always has been a problem for OOM-killing,
> given that it takes up memory, but none is freed by killing processes:
> but although that is a tiresome problem, it's not what either of us is
> attacking here.)
> 
> Taking many of the irrelevancies out of my load, here's something you
> could try, first on v4.5-rc5 and then on mmotm.
> 

FWIW,

I have recently noticed the same change while testing zram-zsmalloc: next/mmots
are much more likely to OOM-kill apps now. And, unlike before, I don't see many
shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts; the kernel just
OOM-kills Xorg, etc.

The test script just creates a zram device (ext4 fs, lzo compression) and fills
it with some data, nothing special.
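For reference, a minimal sketch of that kind of test (the actual script is not
shown in the thread; device size, mount point and fill amount here are
assumptions) might look like:

```shell
#!/bin/sh
# Hypothetical reconstruction of the zram test described above:
# create a zram device, format it with ext4 (lzo compression on the
# zram side), mount it and fill it with data. Requires root and the
# zram module; sizes/paths are illustrative only.
set -e

modprobe zram num_devices=1
echo lzo > /sys/block/zram0/comp_algorithm   # must be set before disksize
echo 2G  > /sys/block/zram0/disksize         # size is an assumption

mkfs.ext4 -q /dev/zram0
mkdir -p /mnt/zram-test
mount /dev/zram0 /mnt/zram-test

# fill it with some data, nothing special
dd if=/dev/urandom of=/mnt/zram-test/fill bs=1M count=1024

umount /mnt/zram-test
echo 1 > /sys/block/zram0/reset
```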


OOM example:

[ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2392.663187] Call Trace:
[ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2392.663230] Mem-Info:
[ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
                active_file:161111 inactive_file:320022 isolated_file:0
                unevictable:0 dirty:51 writeback:0 unstable:0
                slab_reclaimable:80335 slab_unreclaimable:5920
                mapped:30115 shmem:29235 pagetables:2589 bounce:0
                free:10949 free_pcp:189 free_cma:0
[ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
[ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
[ 2392.663249] lowmem_reserve[]: 0 0 824 824
[ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 2392.663257] lowmem_reserve[]: 0 0 0 0
[ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
[ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
[ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2392.663303] 510384 total pagecache pages
[ 2392.663305] 31 pages in swap cache
[ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
[ 2392.663307] Free swap  = 8388268kB
[ 2392.663308] Total swap = 8388604kB
[ 2392.663308] 1032092 pages RAM
[ 2392.663309] 0 pages HighMem/MovableOnly
[ 2392.663310] 40110 pages reserved
[ 2392.663311] 0 pages hwpoisoned
[ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
[ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
[ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
[ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
[ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
[ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
[ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
[ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
[ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB


[ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
[ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
[ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
[ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
[ 2400.152481] Call Trace:
[ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
[ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
[ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
[ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
[ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
[ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
[ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
[ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
[ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
[ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
[ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
[ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
[ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
[ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
[ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
[ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 2400.152527] Mem-Info:
[ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
                active_file:160072 inactive_file:275086 isolated_file:0
                unevictable:0 dirty:49 writeback:0 unstable:0
                slab_reclaimable:54096 slab_unreclaimable:5978
                mapped:13650 shmem:29234 pagetables:2058 bounce:0
                free:13017 free_pcp:134 free_cma:0
[ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
[ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
[ 2400.152546] lowmem_reserve[]: 0 0 824 824
[ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
[ 2400.152555] lowmem_reserve[]: 0 0 0 0
[ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
[ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
[ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5708kB
[ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 2400.152593] 464295 total pagecache pages
[ 2400.152594] 31 pages in swap cache
[ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
[ 2400.152596] Free swap  = 8388268kB
[ 2400.152597] Total swap = 8388604kB
[ 2400.152598] 1032092 pages RAM
[ 2400.152599] 0 pages HighMem/MovableOnly
[ 2400.152600] 40110 pages reserved
[ 2400.152600] 0 pages hwpoisoned
[ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
[ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
[ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
[ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
[ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
[ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
[ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
[ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
[ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
[ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
[ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
[ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
[ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
[ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
[ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
[ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
[ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
[ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
[ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
[ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
[ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
[ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
[ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
[ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
[ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
[ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
[ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
[ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
[ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
[ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
[ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
[ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
[ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
[ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
[ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
[ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
[ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
[ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
[ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
[ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
[ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
[ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
[ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
[ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
[ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
[ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
[ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
[ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
[ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
[ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
[ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
[ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
[ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
[ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
[ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
[ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
[ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
[ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
[ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
[ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
[ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
[ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
[ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
[ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
[ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
[ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
[ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
[ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
[ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
[ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
[ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
[ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
[ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
[ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
[ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
[ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
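
As a side note on the traces above: both kills are for order=2 GFP_KERNEL
allocations from copy_process() (fork allocates a physically contiguous
kernel stack), and the buddy lists show DMA32/Normal holding only order-0
and order-1 blocks, so the request cannot be satisfied despite free memory.
A quick sketch of the arithmetic, assuming 4kB pages:

```python
PAGE_SIZE = 4096  # 4kB pages on x86_64

def alloc_bytes(order):
    """Size of a buddy allocation of the given order: 2**order contiguous pages."""
    return (1 << order) * PAGE_SIZE

# the order-2 request from copy_process() in the traces above
print(alloc_bytes(2))  # 16384 bytes, i.e. a 16kB contiguous block
```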




$ free
              total        used        free      shared  buff/cache   available
Mem:        3967928     1563132      310548      116936     2094248     2207584
Swap:       8388604         332     8388272


	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  6:48       ` Sergey Senozhatsky
@ 2016-02-25  9:17         ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:17 UTC (permalink / raw)
  To: 'Sergey Senozhatsky', 'Hugh Dickins'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

> 
> On (02/24/16 19:47), Hugh Dickins wrote:
> > On Wed, 3 Feb 2016, Michal Hocko wrote:
> > > Hi,
> > > this thread went mostly quiet. Are all the main concerns clarified?
> > > Are there any new concerns? Are there any objections to targeting
> > > this for the next merge window?
> >
> > Sorry to say at this late date, but I do have one concern: hopefully
> > you can tweak something somewhere, or point me to some tunable that
> > I can adjust (I've not studied the patches, sorry).
> >
> > This rework makes it impossible to run my tmpfs swapping loads:
> > they're soon OOM-killed when they ran forever before, so swapping
> > does not get the exercise on mmotm that it used to.  (But I'm not
> > so arrogant as to expect you to optimize for my load!)
> >
> > Maybe it's just that I'm using tmpfs, and there's code that's conscious
> > of file and anon, but doesn't cope properly with the awkward shmem case.
> >
> > (Of course, tmpfs is and always has been a problem for OOM-killing,
> > given that it takes up memory, but none is freed by killing processes:
> > but although that is a tiresome problem, it's not what either of us is
> > attacking here.)
> >
> > Taking many of the irrelevancies out of my load, here's something you
> > could try, first on v4.5-rc5 and then on mmotm.
> >
> 
> FWIW,
> 
> I have recently noticed the same change while testing zram-zsmalloc: next/mmots
> are much more likely to OOM-kill apps now. And, unlike before, I don't see many
> shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts; the kernel just
> OOM-kills Xorg, etc.
> 
> The test script just creates a zram device (ext4 fs, lzo compression) and fills
> it with some data, nothing special.
> 
> 
> OOM example:
> 
> [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2392.663187] Call Trace:
> [ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2392.663230] Mem-Info:
> [ 2392.663233] active_anon:87788 inactive_anon:69289 isolated_anon:0
>                 active_file:161111 inactive_file:320022 isolated_file:0
>                 unevictable:0 dirty:51 writeback:0 unstable:0
>                 slab_reclaimable:80335 slab_unreclaimable:5920
>                 mapped:30115 shmem:29235 pagetables:2589 bounce:0
>                 free:10949 free_pcp:189 free_cma:0
> [ 2392.663239] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2392.663240] lowmem_reserve[]: 0 3031 3855 3855
> [ 2392.663247] DMA32 free:22876kB min:6232kB low:9332kB high:12432kB active_anon:316384kB inactive_anon:172076kB
> active_file:512592kB inactive_file:1011992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:148kB writeback:0kB mapped:93284kB shmem:90904kB slab_reclaimable:248836kB
> slab_unreclaimable:14620kB kernel_stack:2208kB pagetables:7796kB unstable:0kB bounce:0kB free_pcp:628kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:256 all_unreclaimable? no
> [ 2392.663249] lowmem_reserve[]: 0 0 824 824
> [ 2392.663256] Normal free:5824kB min:1696kB low:2540kB high:3384kB active_anon:34768kB inactive_anon:105080kB
> active_file:131820kB inactive_file:267720kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:56kB writeback:0kB mapped:27040kB shmem:26036kB slab_reclaimable:72456kB
> slab_unreclaimable:8968kB kernel_stack:1296kB pagetables:2560kB unstable:0kB bounce:0kB free_pcp:128kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
> [ 2392.663257] lowmem_reserve[]: 0 0 0 0
> [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 23260kB
> [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 6060kB
> [ 2392.663302] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2392.663303] 510384 total pagecache pages
> [ 2392.663305] 31 pages in swap cache
> [ 2392.663306] Swap cache stats: add 113, delete 82, find 47/62
> [ 2392.663307] Free swap  = 8388268kB
> [ 2392.663308] Total swap = 8388604kB
> [ 2392.663308] 1032092 pages RAM
> [ 2392.663309] 0 pages HighMem/MovableOnly
> [ 2392.663310] 40110 pages reserved
> [ 2392.663311] 0 pages hwpoisoned
> [ 2392.663312] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2392.663316] [  149]     0   149     9683     1612      20       3        4             0 systemd-journal
> [ 2392.663319] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2392.663321] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2392.663323] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2392.663326] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2392.663328] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2392.663330] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2392.663332] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2392.663334] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2392.663336] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2392.663339] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2392.663341] [  336]  1000   336    47841    19104      94       3        0             0 Xorg
> [ 2392.663343] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2392.663345] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2392.663347] [  355]  1000   355    34168     5710      57       3        0             0 urxvt
> [ 2392.663349] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2392.663351] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2392.663353] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2392.663355] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2392.663357] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2392.663359] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2392.663362] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2392.663364] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2392.663366] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2392.663368] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2392.663370] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2392.663372] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2392.663374] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2392.663376] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2392.663378] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2392.663380] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2392.663382] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2392.663385] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2392.663387] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2392.663389] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2392.663391] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2392.663393] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2392.663395] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2392.663397] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2392.663399] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2392.663401] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2392.663403] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2392.663405] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2392.663408] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2392.663410] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2392.663412] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2392.663414] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2392.663416] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2392.663418] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2392.663420] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2392.663422] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
> [ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
> [ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
> [ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
> [ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB
> 
> 
> [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2400.152481] Call Trace:
> [ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
> [ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2400.152527] Mem-Info:
> [ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
>                 active_file:160072 inactive_file:275086 isolated_file:0
>                 unevictable:0 dirty:49 writeback:0 unstable:0
>                 slab_reclaimable:54096 slab_unreclaimable:5978
>                 mapped:13650 shmem:29234 pagetables:2058 bounce:0
>                 free:13017 free_pcp:134 free_cma:0
> [ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
> [ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB
> active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB
> slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
> [ 2400.152546] lowmem_reserve[]: 0 0 824 824
> [ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB
> active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB
> slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
> [ 2400.152555] lowmem_reserve[]: 0 0 0 0
> [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 31780kB
> [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 5708kB
> [ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2400.152593] 464295 total pagecache pages
> [ 2400.152594] 31 pages in swap cache
> [ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
> [ 2400.152596] Free swap  = 8388268kB
> [ 2400.152597] Total swap = 8388604kB
> [ 2400.152598] 1032092 pages RAM
> [ 2400.152599] 0 pages HighMem/MovableOnly
> [ 2400.152600] 40110 pages reserved
> [ 2400.152600] 0 pages hwpoisoned
> [ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
> [ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
> [ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
> [ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
> [ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
> [ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
> [ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
> [ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> 
> 
> 
> $ free
>               total        used        free      shared  buff/cache   available
> Mem:        3967928     1563132      310548      116936     2094248     2207584
> Swap:       8388604         332     8388272
> 
Hi Sergey

Thanks for your info.

Can you please schedule a run with the diff attached, in which
non-costly allocations (order <= PAGE_ALLOC_COSTLY_ORDER) are allowed
to burn more CPU cycles before falling back to the OOM killer?

thanks
Hillf

--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		no_progress_loops /= 2;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-02-25  9:17         ` Hillf Danton
  0 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:17 UTC (permalink / raw)
  To: 'Sergey Senozhatsky', 'Hugh Dickins'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

> 
> On (02/24/16 19:47), Hugh Dickins wrote:
> > On Wed, 3 Feb 2016, Michal Hocko wrote:
> > > Hi,
> > > this thread went mostly quiet. Are all the main concerns clarified?
> > > Are there any new concerns? Are there any objections to targeting
> > > this for the next merge window?
> >
> > Sorry to say at this late date, but I do have one concern: hopefully
> > you can tweak something somewhere, or point me to some tunable that
> > I can adjust (I've not studied the patches, sorry).
> >
> > This rework makes it impossible to run my tmpfs swapping loads:
> > they're soon OOM-killed when they ran forever before, so swapping
> > does not get the exercise on mmotm that it used to.  (But I'm not
> > so arrogant as to expect you to optimize for my load!)
> >
> > Maybe it's just that I'm using tmpfs, and there's code that's conscious
> > of file and anon, but doesn't cope properly with the awkward shmem case.
> >
> > (Of course, tmpfs is and always has been a problem for OOM-killing,
> > given that it takes up memory, but none is freed by killing processes:
> > but although that is a tiresome problem, it's not what either of us is
> > attacking here.)
> >
> > Taking many of the irrelevancies out of my load, here's something you
> > could try, first on v4.5-rc5 and then on mmotm.
> >
> 
> FWIW,
> 
> I have recently noticed the same change while testing zram-zsmalloc. next/mmots
> are much more likely to OOM-kill apps now. And, unlike before, I don't see a lot
> of shrinker->zsmalloc->zs_shrinker_scan() calls or swapouts; the kernel just
> oom-kills Xorg, etc.
> 
> the test script just creates a zram device (ext4 fs, lzo compression) and fills
> it with some data, nothing special.
> 
> 
> OOM example:
> 
> [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2392.663175] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2392.663178]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2392.663181]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2392.663184]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2392.663187] Call Trace:
> [ 2392.663191]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2392.663195]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2392.663197]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2392.663201]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2392.663204]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2392.663206]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2392.663208]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2392.663211]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2392.663213]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2392.663216]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2392.663218]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2392.663220]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2392.663223]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2392.663224]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2392.663226]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2392.663228]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2392.663230] Mem-Info:
> [ 2392.663424] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2392.663426] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2392.663429] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2392.663431] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2392.663433] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2392.663435] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2392.663437] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2392.663439] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2392.663441] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2392.663443] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2392.663445] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2392.663447] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2392.663449] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2392.663451] [  808]  1000   808   341606    79367     532       5        0             0 firefox
> [ 2392.663454] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2392.663456] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2392.663458] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2392.663460] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2392.663462] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2392.663464] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2392.663482] [ 9517]     0  9517     3750      830      13       3        0             0 zram-test.sh
> [ 2392.663485] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2392.663487] [13623]  1000 13623     1764      186       9       3        0             0 sleep
> [ 2392.663489] Out of memory: Kill process 808 (firefox) score 25 or sacrifice child
> [ 2392.663769] Killed process 808 (firefox) total-vm:1366424kB, anon-rss:235572kB, file-rss:82320kB, shmem-rss:8kB
> 
> 
> [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,
> oom_score_adj=0
> [ 2400.152470] CPU: 1 PID: 9517 Comm: zram-test.sh Not tainted 4.5.0-rc5-next-20160225-dbg-00009-g334f687-dirty #190
> [ 2400.152473]  0000000000000000 ffff88000b4efb88 ffffffff81237bac 0000000000000000
> [ 2400.152476]  ffff88000b4efd28 ffff88000b4efbf8 ffffffff8113a077 ffff88000b4efba8
> [ 2400.152479]  ffffffff81080e24 ffff88000b4efbc8 ffffffff8151584e ffffffff81a48460
> [ 2400.152481] Call Trace:
> [ 2400.152487]  [<ffffffff81237bac>] dump_stack+0x67/0x90
> [ 2400.152490]  [<ffffffff8113a077>] dump_header.isra.5+0x54/0x351
> [ 2400.152493]  [<ffffffff81080e24>] ? trace_hardirqs_on+0xd/0xf
> [ 2400.152496]  [<ffffffff8151584e>] ? _raw_spin_unlock_irqrestore+0x4b/0x60
> [ 2400.152500]  [<ffffffff810f7ae7>] oom_kill_process+0x89/0x4ff
> [ 2400.152502]  [<ffffffff810f8319>] out_of_memory+0x36c/0x387
> [ 2400.152504]  [<ffffffff810fc9c2>] __alloc_pages_nodemask+0x9ba/0xaa8
> [ 2400.152506]  [<ffffffff810fcca8>] alloc_kmem_pages_node+0x1b/0x1d
> [ 2400.152509]  [<ffffffff81040216>] copy_process.part.9+0xfe/0x183f
> [ 2400.152511]  [<ffffffff81083178>] ? lock_acquire+0x11f/0x1c7
> [ 2400.152513]  [<ffffffff81041aea>] _do_fork+0xbd/0x5f1
> [ 2400.152515]  [<ffffffff81117402>] ? __might_fault+0x40/0x8d
> [ 2400.152517]  [<ffffffff81515f52>] ? entry_SYSCALL_64_fastpath+0x5/0xa8
> [ 2400.152520]  [<ffffffff81001844>] ? do_syscall_64+0x18/0xe6
> [ 2400.152522]  [<ffffffff810420a4>] SyS_clone+0x19/0x1b
> [ 2400.152524]  [<ffffffff81001886>] do_syscall_64+0x5a/0xe6
> [ 2400.152526]  [<ffffffff8151601a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 2400.152527] Mem-Info:
> [ 2400.152531] active_anon:37648 inactive_anon:59709 isolated_anon:0
>                 active_file:160072 inactive_file:275086 isolated_file:0
>                 unevictable:0 dirty:49 writeback:0 unstable:0
>                 slab_reclaimable:54096 slab_unreclaimable:5978
>                 mapped:13650 shmem:29234 pagetables:2058 bounce:0
>                 free:13017 free_pcp:134 free_cma:0
> [ 2400.152536] DMA free:15096kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:32kB
> inactive_file:120kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB
> writeback:0kB mapped:136kB shmem:0kB slab_reclaimable:48kB slab_unreclaimable:92kB kernel_stack:0kB pagetables:0kB
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 2400.152537] lowmem_reserve[]: 0 3031 3855 3855
> [ 2400.152545] DMA32 free:31504kB min:6232kB low:9332kB high:12432kB active_anon:129548kB inactive_anon:172076kB
> active_file:508480kB inactive_file:872492kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB
> managed:3107516kB mlocked:0kB dirty:132kB writeback:0kB mapped:42296kB shmem:90900kB slab_reclaimable:165548kB
> slab_unreclaimable:14964kB kernel_stack:1712kB pagetables:6176kB unstable:0kB bounce:0kB free_pcp:428kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:424 all_unreclaimable? no
> [ 2400.152546] lowmem_reserve[]: 0 0 824 824
> [ 2400.152553] Normal free:5468kB min:1696kB low:2540kB high:3384kB active_anon:21044kB inactive_anon:66760kB
> active_file:131776kB inactive_file:227732kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB
> managed:844512kB mlocked:0kB dirty:64kB writeback:0kB mapped:12168kB shmem:26036kB slab_reclaimable:50788kB
> slab_unreclaimable:8856kB kernel_stack:912kB pagetables:2056kB unstable:0kB bounce:0kB free_pcp:108kB local_pcp:0kB
> free_cma:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
> [ 2400.152555] lowmem_reserve[]: 0 0 0 0
> [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)
> 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 31780kB
> [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 5708kB
> [ 2400.152592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [ 2400.152593] 464295 total pagecache pages
> [ 2400.152594] 31 pages in swap cache
> [ 2400.152595] Swap cache stats: add 113, delete 82, find 47/62
> [ 2400.152596] Free swap  = 8388268kB
> [ 2400.152597] Total swap = 8388604kB
> [ 2400.152598] 1032092 pages RAM
> [ 2400.152599] 0 pages HighMem/MovableOnly
> [ 2400.152600] 40110 pages reserved
> [ 2400.152600] 0 pages hwpoisoned
> [ 2400.152601] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 2400.152605] [  149]     0   149     9683     1990      20       3        4             0 systemd-journal
> [ 2400.152608] [  183]     0   183     8598     1103      19       3       18         -1000 systemd-udevd
> [ 2400.152610] [  285]    81   285     8183      911      20       3        0          -900 dbus-daemon
> [ 2400.152613] [  288]     0   288     3569      653      13       3        0             0 crond
> [ 2400.152615] [  289]     0   289     3855      649      12       3        0             0 systemd-logind
> [ 2400.152617] [  291]     0   291    22469      967      48       3        0             0 login
> [ 2400.152619] [  299]  1000   299     8493     1140      21       3        0             0 systemd
> [ 2400.152621] [  301]  1000   301    24226      416      47       3       20             0 (sd-pam)
> [ 2400.152623] [  306]  1000   306     4471     1126      14       3        0             0 bash
> [ 2400.152626] [  313]  1000   313     3717      739      13       3        0             0 startx
> [ 2400.152628] [  335]  1000   335     3981      236      14       3        0             0 xinit
> [ 2400.152630] [  336]  1000   336    47713    19103      93       3        0             0 Xorg
> [ 2400.152632] [  338]  1000   338    39714     4302      80       3        0             0 openbox
> [ 2400.152634] [  349]  1000   349    43472     3280      88       3        0             0 tint2
> [ 2400.152636] [  355]  1000   355    34168     5754      58       3        0             0 urxvt
> [ 2400.152638] [  356]  1000   356     4533     1248      15       3        0             0 bash
> [ 2400.152640] [  435]     0   435     3691     2168      10       3        0             0 dhclient
> [ 2400.152642] [  451]  1000   451     4445     1111      14       4        0             0 bash
> [ 2400.152644] [  459]  1000   459    45577     6121      59       3        0             0 urxvt
> [ 2400.152646] [  460]  1000   460     4445     1070      15       3        0             0 bash
> [ 2400.152648] [  463]  1000   463     5207      728      16       3        0             0 tmux
> [ 2400.152650] [  465]  1000   465     6276     1299      18       3        0             0 tmux
> [ 2400.152653] [  466]  1000   466     4445     1113      14       3        0             0 bash
> [ 2400.152655] [  473]  1000   473     4445     1087      15       3        0             0 bash
> [ 2400.152657] [  476]  1000   476     5207      760      15       3        0             0 tmux
> [ 2400.152659] [  477]  1000   477     4445     1080      14       3        0             0 bash
> [ 2400.152661] [  484]  1000   484     4445     1076      14       3        0             0 bash
> [ 2400.152663] [  487]  1000   487     4445     1129      14       3        0             0 bash
> [ 2400.152665] [  490]  1000   490     4445     1115      14       3        0             0 bash
> [ 2400.152667] [  493]  1000   493    10206     1135      24       3        0             0 top
> [ 2400.152669] [  495]  1000   495     4445     1146      15       3        0             0 bash
> [ 2400.152671] [  502]  1000   502     3745      814      13       3        0             0 coretemp-sensor
> [ 2400.152673] [  536]  1000   536    27937     4429      53       3        0             0 urxvt
> [ 2400.152675] [  537]  1000   537     4445     1092      14       3        0             0 bash
> [ 2400.152677] [  543]  1000   543    29981     4138      53       3        0             0 urxvt
> [ 2400.152680] [  544]  1000   544     4445     1095      14       3        0             0 bash
> [ 2400.152682] [  549]  1000   549    29981     4132      53       3        0             0 urxvt
> [ 2400.152684] [  550]  1000   550     4445     1121      13       3        0             0 bash
> [ 2400.152686] [  555]  1000   555    45194     5728      62       3        0             0 urxvt
> [ 2400.152688] [  556]  1000   556     4445     1116      14       3        0             0 bash
> [ 2400.152690] [  561]  1000   561    30173     4317      51       3        0             0 urxvt
> [ 2400.152692] [  562]  1000   562     4445     1075      14       3        0             0 bash
> [ 2400.152694] [  586]  1000   586    57178     7499      65       4        0             0 urxvt
> [ 2400.152696] [  587]  1000   587     4478     1156      14       3        0             0 bash
> [ 2400.152698] [  593]     0   593    17836     1213      39       3        0             0 sudo
> [ 2400.152700] [  594]     0   594   136671     1794     188       4        0             0 journalctl
> [ 2400.152702] [  616]  1000   616    29981     4140      54       3        0             0 urxvt
> [ 2400.152705] [  617]  1000   617     4445     1122      14       3        0             0 bash
> [ 2400.152707] [  622]  1000   622    34169     8473      60       3        0             0 urxvt
> [ 2400.152709] [  623]  1000   623     4445     1116      14       3        0             0 bash
> [ 2400.152711] [  646]  1000   646     4445     1124      15       3        0             0 bash
> [ 2400.152713] [  668]  1000   668     4445     1090      15       3        0             0 bash
> [ 2400.152715] [  671]  1000   671     4445     1090      13       3        0             0 bash
> [ 2400.152717] [  674]  1000   674     4445     1083      13       3        0             0 bash
> [ 2400.152719] [  677]  1000   677     4445     1124      15       3        0             0 bash
> [ 2400.152721] [  720]  1000   720     3717      707      12       3        0             0 build99
> [ 2400.152723] [  721]  1000   721     9107     1244      21       3        0             0 ssh
> [ 2400.152725] [  768]     0   768    17827     1292      40       3        0             0 sudo
> [ 2400.152727] [  771]     0   771     4640      622      14       3        0             0 screen
> [ 2400.152729] [  772]     0   772     4673      505      11       3        0             0 screen
> [ 2400.152731] [  775]  1000   775     4445     1120      14       3        0             0 bash
> [ 2400.152733] [  778]  1000   778     4445     1097      14       3        0             0 bash
> [ 2400.152735] [  781]  1000   781     4445     1088      13       3        0             0 bash
> [ 2400.152737] [  784]  1000   784     4445     1109      13       3        0             0 bash
> [ 2400.152740] [  845]  1000   845     8144      799      20       3        0             0 dbus-daemon
> [ 2400.152742] [  852]  1000   852    83828     1216      31       4        0             0 at-spi-bus-laun
> [ 2400.152744] [ 9064]  1000  9064     4478     1154      13       3        0             0 bash
> [ 2400.152746] [ 9068]  1000  9068     4478     1135      15       3        0             0 bash
> [ 2400.152748] [ 9460]  1000  9460    11128      767      26       3        0             0 su
> [ 2400.152750] [ 9463]     0  9463     4474     1188      14       4        0             0 bash
> [ 2400.152752] [ 9517]     0  9517     3783      832      13       3        0             0 zram-test.sh
> [ 2400.152754] [ 9917]  1000  9917     4444     1124      14       3        0             0 bash
> [ 2400.152757] [14052]  1000 14052     1764      162       9       3        0             0 sleep
> [ 2400.152758] Out of memory: Kill process 336 (Xorg) score 6 or sacrifice child
> [ 2400.152767] Killed process 336 (Xorg) total-vm:190852kB, anon-rss:58728kB, file-rss:17684kB, shmem-rss:0kB
> [ 2400.161723] oom_reaper: reaped process 336 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> 
> 
> 
> $ free
>               total        used        free      shared  buff/cache   available
> Mem:        3967928     1563132      310548      116936     2094248     2207584
> Swap:       8388604         332     8388272
> 
Hi Sergey

Thanks for your info.

Can you please schedule a run for the diff attached, in which 
non-expensive allocators are allowed to burn more CPU cycles.

thanks
Hillf

--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		no_progress_loops /= 2;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
@ 2016-02-25  9:23       ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:23 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
[...]
> Boot with mem=1G (or boot your usual way, and do something to occupy
> most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> way to gobble up most of the memory, though it's not how I've done it).
> 
> Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> kernel source tree into a tmpfs: size=2G is more than enough.
> make defconfig there, then make -j20.
> 
> On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> 
> Except that you'll probably need to fiddle around with that j20,
> it's true for my laptop but not for my workstation.  j20 just happens
> to be what I've had there for years, that I now see breaking down
> (I can lower to j6 to proceed, perhaps could go a bit higher,
> but it still doesn't exercise swap very much).
> 
> This OOM detection rework significantly lowers the number of jobs
> which can be run in parallel without being OOM-killed. 

This all smells like a premature OOM because of a high-order allocation
(order-2 for fork) which Tetsuo has seen already. Sergey Senozhatsky is
reporting order-2 OOMs as well. It is true that what we have in
mmotm right now is quite fragile if all order-N+ free lists are completely
depleted. That was the case for both Tetsuo and Sergey. I have tried to
mitigate this at least to some degree by
http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
with the full changelog) but I haven't heard back whether it helped,
so I haven't posted the official patch yet.

I also suspect that something is not quite right with compaction and
that it gives up too early even though we have quite a lot of reclaimable
pages. I do not have any numbers for that because I haven't had a load
to reproduce this problem yet. I will try your setup and see what I can
do about it. It would be great if you could give the patch below a try
and see if it helps.
---
>From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 4 Feb 2016 14:56:59 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where a high-order request (e.g.
the order-2 required for the stack allocation during fork) triggers
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and the OOM killer is just a
matter of time, but let's stick to our MAX_RECLAIM_RETRIES for high-order,
non-costly requests to make sure we do not fail prematurely.

This also means that we do not reset no_progress_loops in
__alloc_pages_slowpath for high-order allocations, to guarantee a
bounded number of retries.

Long term it would be much better to communicate with compaction
and retry only if compaction considers it meaningful.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f05aca36469b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		}
 	}
 
+	/*
+	 * OK, so the watermark check has failed. Make sure we do all the
+	 * retries for !costly high order requests and hope that multiple
+	 * runs of compaction will generate some high order ones for us.
+	 *
+	 * XXX: ideally we should teach the compaction to try _really_ hard
+	 * if we are in the retry path - something like priority 0 for the
+	 * reclaim
+	 */
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
+
 	return false;
 }
 
@@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto noretry;
 
 	/*
-	 * Costly allocations might have made a progress but this doesn't mean
-	 * their order will become available due to high fragmentation so do
-	 * not reset the no progress counter for them
+	 * High order allocations might have made a progress but this doesn't
+	 * mean their order will become available due to high fragmentation so
+	 * do not reset the no progress counter for them
 	 */
-	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
+	if (did_some_progress && !order)
 		no_progress_loops = 0;
 	else
 		no_progress_loops++;
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:17         ` Hillf Danton
@ 2016-02-25  9:27           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-25  9:27 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

On Thu 25-02-16 17:17:45, Hillf Danton wrote:
[...]
> > OOM example:
> > 
> > [ 2392.663170] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2,  oom_score_adj=0
[...]
> > [ 2392.663260] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2392.663284] DMA32: 5809*4kB (UME) 3*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23260kB
> > [ 2392.663293] Normal: 1515*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB

[...]
> > [ 2400.152464] zram-test.sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> > [ 2400.152558] DMA: 4*4kB (M) 1*8kB (M) 4*16kB (ME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 3*256kB (UME) 3*512kB (UME)  2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 15096kB
> > [ 2400.152573] DMA32: 7835*4kB (UME) 55*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 31780kB
> > [ 2400.152582] Normal: 1383*4kB (UM) 22*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =  5708kB
[...]
> Thanks for your info.
> 
> Can you please schedule a run for the diff attached, in which 
> non-expensive allocators are allowed to burn more CPU cycles.

I do not think your patch will help. As you can see, both OOMs were for
order-2 and there simply are no order-2+ free blocks usable for the
allocation request, so the watermark check will fail for all eligible
zones and no_progress_loops is simply ignored. This is what I've tried
to address by the patch I have just posted as a reply to Hugh's email:
http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz

> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Thu Feb 25 16:46:05 2016
> @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
>  	struct zone *zone;
>  	struct zoneref *z;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		no_progress_loops /= 2;
>  	/*
>  	 * Make sure we converge to OOM if we cannot make any progress
>  	 * several times in the row.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:27           ` Michal Hocko
@ 2016-02-25  9:48             ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-25  9:48 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Sergey Senozhatsky', 'Hugh Dickins',
	'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

> >
> > Can you please schedule a run for the diff attached, in which
> > non-expensive allocators are allowed to burn more CPU cycles.
> 
> I do not think your patch will help. As you can see, both OOMs were for
> order-2 and there simply are no order-2+ free blocks usable for the
> allocation request so the watermark check will fail for all eligible
> zones and no_progress_loops is simply ignored. This is what I've tried
> to address by patch I have just posted as a reply to Hugh's email
> http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> 
Hm, Mr. Swap can tell us more.

Hillf

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:48             ` Hillf Danton
@ 2016-02-25 11:02               ` Sergey Senozhatsky
  -1 siblings, 0 replies; 299+ messages in thread
From: Sergey Senozhatsky @ 2016-02-25 11:02 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Michal Hocko', 'Sergey Senozhatsky',
	'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On (02/25/16 17:48), Hillf Danton wrote:
> > > Can you please schedule a run for the diff attached, in which
> > > non-expensive allocators are allowed to burn more CPU cycles.
> > 
> > I do not think your patch will help. As you can see, both OOMs were for
> > order-2 and there simply are no order-2+ free blocks usable for the
> > allocation request so the watermark check will fail for all eligible
> > zones and no_progress_loops is simply ignored. This is what I've tried
> > to address by patch I have just posted as a reply to Hugh's email
> > http://lkml.kernel.org/r/20160225092315.GD17573@dhcp22.suse.cz
> > 
> Hm, Mr. Swap can tell us more.


Hi,

After *preliminary testing* both patches seem to work. At least I don't
see OOM kills, and there are some swapouts.

Michal Hocko's
              total        used        free      shared  buff/cache   available
Mem:        3836880     2458020       35992      115984     1342868     1181484
Swap:       8388604        2008     8386596

              total        used        free      shared  buff/cache   available
Mem:        3836880     2459516       39616      115880     1337748     1180156
Swap:       8388604        2052     8386552

              total        used        free      shared  buff/cache   available
Mem:        3836880     2460584       33944      115880     1342352     1179004
Swap:       8388604        2132     8386472
...




Hillf Danton's
              total        used        free      shared  buff/cache   available
Mem:        3836880     1661000      554236      116448     1621644     1978872
Swap:       8388604        1548     8387056

              total        used        free      shared  buff/cache   available
Mem:        3836880     1660500      554740      116448     1621640     1979376
Swap:       8388604        1548     8387056

...


I'll do more tests tomorrow.


	-ss

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:23       ` Michal Hocko
@ 2016-02-26  6:32         ` Hugh Dickins
  -1 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-02-26  6:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Thu, 25 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> > 
> > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> > 
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > 
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation.  j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
> > 
> > This OOM detection rework significantly lowers the number of jobs
> > which can be run in parallel without being OOM-killed. 
> 
> This all smells like premature OOM because of a high-order allocation
> (order-2 for fork) which Tetsuo has seen already. Sergey Senozhatsky is

You're absolutely right, and I'm ashamed not to have noticed that, nor
your comments and patch earlier in this thread, before bothering you.
Order 2 they are.

> reporting order-2 OOMs as well. It is true that what we have in the
> mmotm right now is quite fragile if all order-N+ blocks are completely
> depleted. That was the case for both Tetsuo and Sergey. I have tried to
> mitigate this at least to some degree by
> http://lkml.kernel.org/r/20160204133905.GB14425@dhcp22.suse.cz (below
> with the full changelog) but I haven't heard back whether it helped
> so I haven't posted the official patch yet.
> 
> I also suspect that something is not quite right with the compaction and
> it gives up too early even though we have quite a lot of reclaimable pages.
> I do not have any numbers for that because I didn't have a load to
> reproduce this problem yet. I will try your setup and see what I can do

Thanks a lot.

> about that. It would be great if you could give the patch below a try
> and see if it helps.
> ---
> From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 4 Feb 2016 14:56:59 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and the OOM killer is just a
> matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> order and not costly requests to make sure we do not fail prematurely.
> 
> This also means that we do not reset no_progress_loops at the
> __alloc_pages_slowpath for high order allocations to guarantee a bounded
> number of retries.
> 
> Long term it would be much better to communicate with the compaction
> and retry only if the compaction considers it meaningful.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

It didn't really help, I'm afraid: it reduces the actual number of OOM
kills which occur before the job is terminated, but doesn't stop the
job from being terminated very soon.

I also tried Hillf's patch (separately) too, but as you expected,
it didn't seem to make any difference.

(I haven't tried on the PowerMac G5 yet, since that's busy with
other testing; but expect that to tell the same story.)

Hugh

> ---
>  mm/page_alloc.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f05aca36469b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		}
>  	}
>  
> +	/*
> +	 * OK, so the watermak check has failed. Make sure we do all the
> +	 * retries for !costly high order requests and hope that multiple
> +	 * runs of compaction will generate some high order ones for us.
> +	 *
> +	 * XXX: ideally we should teach the compaction to try _really_ hard
> +	 * if we are in the retry path - something like priority 0 for the
> +	 * reclaim
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto noretry;
>  
>  	/*
> -	 * Costly allocations might have made a progress but this doesn't mean
> -	 * their order will become available due to high fragmentation so do
> -	 * not reset the no progress counter for them
> +	 * High order allocations might have made a progress but this doesn't
> +	 * mean their order will become available due to high fragmentation so
> +	 * do not reset the no progress counter for them
>  	 */
> -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> +	if (did_some_progress && !order)
>  		no_progress_loops = 0;
>  	else
>  		no_progress_loops++;
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  6:32         ` Hugh Dickins
@ 2016-02-26  7:54           ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-26  7:54 UTC (permalink / raw)
  To: 'Hugh Dickins', 'Michal Hocko'
  Cc: 'Andrew Morton', 'Linus Torvalds',
	'Johannes Weiner', 'Mel Gorman',
	'David Rientjes', 'Tetsuo Handa',
	'KAMEZAWA Hiroyuki', linux-mm, 'LKML',
	'Sergey Senozhatsky'

> 
> It didn't really help, I'm afraid: it reduces the actual number of OOM
> kills which occur before the job is terminated, but doesn't stop the
> job from being terminated very soon.
> 
> I also tried Hillf's patch (separately) too, but as you expected,
> it didn't seem to make any difference.
> 
Perhaps non-costly means NOFAIL, as shown by folding the two
patches into one. Does it make any sense?

thanks
Hillf
--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
@@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 	struct zone *zone;
 	struct zoneref *z;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return true;
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
 	 * several times in the row.
--

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  7:54           ` Hillf Danton
@ 2016-02-26  9:24             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26  9:24 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On Fri 26-02-16 15:54:19, Hillf Danton wrote:
> > 
> > It didn't really help, I'm afraid: it reduces the actual number of OOM
> > kills which occur before the job is terminated, but doesn't stop the
> > job from being terminated very soon.
> > 
> > I also tried Hillf's patch (separately) too, but as you expected,
> > it didn't seem to make any difference.
> > 
> Perhaps non-costly means NOFAIL as shown by folding the two

Nofail only means that the page allocator doesn't return NULL.
The OOM killer is still not put aside...

> patches into one. Can it make any sense?
> 
> thanks
> Hillf
> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
>  	struct zone *zone;
>  	struct zoneref *z;
>  
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;

This is defeating the whole purpose of the rework, which is to behave
deterministically. You have just disabled the OOM killer completely.
This is not the way to go.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  6:32         ` Hugh Dickins
@ 2016-02-26  9:33           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26  9:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

On Thu 25-02-16 22:32:54, Hugh Dickins wrote:
> On Thu, 25 Feb 2016, Michal Hocko wrote:
[...]
> > From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 4 Feb 2016 14:56:59 +0100
> > Subject: [PATCH] mm, oom: protect !costly allocations some more
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and the OOM killer is just a
> > matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> > order and not costly requests to make sure we do not fail prematurely.
> > 
> > This also means that we do not reset no_progress_loops at the
> > __alloc_pages_slowpath for high order allocations to guarantee a bounded
> > number of retries.
> > 
> > Long term it would be much better to communicate with the compaction
> > and retry only if the compaction considers it meaningful.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> It didn't really help, I'm afraid: it reduces the actual number of OOM
> kills which occur before the job is terminated, but doesn't stop the
> job from being terminated very soon.

Yeah, this is not a magic bullet. I am happy to hear that the patch
actually helped to reduce the number of OOM kills, though, because that is
what it aims to do. I also believe that supports (at least partially) my
suspicion that it is compaction which doesn't try hard enough.
Order-0 reclaim, even when done repeatedly, doesn't have a great
chance to form higher-order pages, especially when there is a lot of
migratable memory. I have already talked about this with Vlastimil and
he said that compaction can indeed back off too early because it doesn't
care much about !costly requests at all. We will have a look into this
more next week.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26  9:24             ` Michal Hocko
@ 2016-02-26 10:27               ` Hillf Danton
  -1 siblings, 0 replies; 299+ messages in thread
From: Hillf Danton @ 2016-02-26 10:27 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

>> 
> > --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> > +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> > @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
> >  	struct zone *zone;
> >  	struct zoneref *z;
> >
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return true;
> 
> This is defeating the whole purpose of the rework - to behave
> deterministically. You have just disabled the oom killer completely.
> This is not the way to go
> 
Then in another direction, below is what I can do.

thanks
Hillf
--- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
+++ b/mm/page_alloc.c	Fri Feb 26 18:14:59 2016
@@ -3366,8 +3366,11 @@ retry:
 		no_progress_loops++;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
+				 did_some_progress > 0, no_progress_loops)) {
+		/* Burn more cycles if any zone seems to satisfy our request */
+		no_progress_loops /= 2;
 		goto retry;
+	}
 
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
--

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-26 10:27               ` Hillf Danton
@ 2016-02-26 13:49                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-26 13:49 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Hugh Dickins', 'Andrew Morton',
	'Linus Torvalds', 'Johannes Weiner',
	'Mel Gorman', 'David Rientjes',
	'Tetsuo Handa', 'KAMEZAWA Hiroyuki',
	linux-mm, 'LKML', 'Sergey Senozhatsky'

On Fri 26-02-16 18:27:16, Hillf Danton wrote:
> >> 
> > > --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> > > +++ b/mm/page_alloc.c	Fri Feb 26 15:18:55 2016
> > > @@ -3113,6 +3113,8 @@ should_reclaim_retry(gfp_t gfp_mask, uns
> > >  	struct zone *zone;
> > >  	struct zoneref *z;
> > >
> > > +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > 
> > This is defeating the whole purpose of the rework - to behave
> > deterministically. You have just disabled the oom killer completely.
> > This is not the way to go
> > 
> Then in another direction, below is what I can do.
> 
> thanks
> Hillf
> --- a/mm/page_alloc.c	Thu Feb 25 15:43:18 2016
> +++ b/mm/page_alloc.c	Fri Feb 26 18:14:59 2016
> @@ -3366,8 +3366,11 @@ retry:
>  		no_progress_loops++;
>  
>  	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> -				 did_some_progress > 0, no_progress_loops))
> +				 did_some_progress > 0, no_progress_loops)) {
> +		/* Burn more cycles if any zone seems to satisfy our request */
> +		no_progress_loops /= 2;

No, I do not think this makes any sense. If we need more retry loops
then we can do it by increasing MAX_RECLAIM_RETRIES.

>  		goto retry;
> +	}
>  
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  3:47     ` Hugh Dickins
                       ` (2 preceding siblings ...)
  (?)
@ 2016-02-29 20:35     ` Michal Hocko
  2016-03-01  7:29         ` Hugh Dickins
  -1 siblings, 1 reply; 299+ messages in thread
From: Michal Hocko @ 2016-02-29 20:35 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1845 bytes --]

On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
[...]
> Boot with mem=1G (or boot your usual way, and do something to occupy
> most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> way to gobble up most of the memory, though it's not how I've done it).
> 
> Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> kernel source tree into a tmpfs: size=2G is more than enough.
> make defconfig there, then make -j20.
> 
> On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> 
> Except that you'll probably need to fiddle around with that j20,
> it's true for my laptop but not for my workstation.  j20 just happens
> to be what I've had there for years, that I now see breaking down
> (I can lower to j6 to proceed, perhaps could go a bit higher,
> but it still doesn't exercise swap very much).

I have tried to reproduce and failed in a virtual machine on my laptop. I
will try with another host with more CPUs (because my laptop has only
two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
(16, 10 no difference really). I was also collecting vmstat in the
background. The compilation takes ages but the behavior seems consistent
and stable.

If I try 900M for huge pages then I get OOMs but this happens with the
mmotm without my oom rework patch set as well.

It would be great if you could retry and collect /proc/vmstat data
around the OOM time to see what compaction did. (I was using the
attached little program to reduce interference during OOM: no forks, the
code locked in and the resulting file preallocated - e.g.
read_vmstat vmstat.log 1s 10M and interrupt it with ctrl+c after the OOM
hits.)

Thanks!
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: read_vmstat.c --]
[-- Type: text/x-csrc, Size: 5025 bytes --]

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <unistd.h>
#include <time.h>

/*
 * A simple /proc/vmstat collector into a file. It tries hard to guarantee
 * that the content will get into the output file even under strong memory
 * pressure.
 *
 * Usage
 * ./read_vmstat output_file timeout output_size
 *
 * output_file can be either a non-existing file or - for stdout
 * timeout - time period between two snapshots. An s (seconds), ms (milliseconds)
 * 	     or m (minutes) suffix is allowed
 * output_size - size of the output file. The file is preallocated and pre-filled.
 *
 * If the output reaches the end of the file it will start over overwriting the oldest
 * data. Each snapshot is enclosed by header and footer.
 * =S timestamp
 * [...]
 * E=
 *
 * Please note that your ulimit has to be sufficient to allow mlocking the
 * code + all the buffers.
 *
 * This comes under GPL v2
 * Copyright: Michal Hocko <mhocko@suse.cz> 2015 
 */

#define NS_PER_MS (1000*1000)
#define NS_PER_SEC (1000*NS_PER_MS)

int open_file(const char *str)
{
	int fd;

	fd = open(str, O_CREAT|O_EXCL|O_RDWR, 0755);
	if (fd == -1) {
		perror("open input");
		return 1;
	}

	return fd;
}

int read_timeout(const char *str, struct timespec *timeout)
{
	char *end;
	unsigned long val;

	val = strtoul(str, &end, 10);
	if (val == ULONG_MAX) {
		perror("Invalid number");
		return 1;
	}
	switch(*end) {
		case 's':
			timeout->tv_sec = val;
			break;
		case 'm':
			/* ms vs minute*/
			if (*(end+1) == 's') {
				timeout->tv_sec = (val * NS_PER_MS) / NS_PER_SEC;
				timeout->tv_nsec = (val * NS_PER_MS) % NS_PER_SEC;
			} else {
				timeout->tv_sec = val*60;
			}
			break;
		default:
			fprintf(stderr, "Unknown number %s\n", str);
			return 1;
	}

	return 0;
}

size_t read_size(const char *str)
{
	char *end;
	size_t val = strtoul(str, &end, 10);

	switch (*end) {
		case 'K':
			val <<= 10;
			break;
		case 'M':
			val <<= 20;
			break;
		case 'G':
			val <<= 30;
			break;
	}

	return val;
}

size_t dump_str(char *buffer, size_t buffer_size, size_t pos, const char *in, size_t size)
{
	size_t i;
	for (i = 0; i < size; i++) {
		buffer[pos] = in[i];
		pos = (pos + 1) % buffer_size;
	}

	return pos;
}

/* buffer == NULL -> stdout */
int __collect_logs(const struct timespec *timeout, char *buffer, size_t buffer_size)
{
	char buff[4096]; /* dump to the file automatically */
	time_t before, after;
	int in_fd = open("/proc/vmstat", O_RDONLY);
	size_t out_pos = 0;
	size_t in_pos = 0;
	size_t size = 0;
	int estimate = 0;

	if (in_fd == -1) {
		perror("open vmstat:");
		return 1;
	}

	/* lock everything in */
	if (mlockall(MCL_CURRENT) == -1) {
		perror("mlockall. Continuing anyway");
	}

	while (1) {
		before = time(NULL);

		size = snprintf(buff, sizeof(buff), "=S %lu\n", (unsigned long)before);
		lseek(in_fd, 0, SEEK_SET);
		size += read(in_fd, buff + size, sizeof(buff) - size);
		size += snprintf(buff + size, sizeof(buff) - size, "E=\n");
		if (buffer && !estimate) {
			printf("Estimated %zu entries fit to the buffer\n", buffer_size / size);
			estimate = 1;
		}

		/* Dump to stdout */
		if (!buffer) {
			printf("%s", buff);
		} else {
			size_t pos;
			pos = dump_str(buffer, buffer_size, out_pos, buff, size);
			if (pos < out_pos)
				fprintf(stderr, "%lu: Buffer wrapped\n", (unsigned long)before);
			out_pos = pos;
		}

		after = time(NULL);

		if (after - before > 2) {
			fprintf(stderr, "%ld: Snapshotting took %ld!!!\n",
					(long)before, (long)(after - before));
		}
		if (nanosleep(timeout, NULL) == -1)
			if (errno == EINTR)
				return 0;
		/* kick in the flushing */
		if (buffer)
			msync(buffer, buffer_size, MS_ASYNC);
	}
}

int collect_logs(int fd, const struct timespec *timeout, size_t buffer_size)
{
	char *buffer = NULL;

	if (fd != -1) {
		if (ftruncate(fd, buffer_size) == -1) {
			perror("ftruncate");
			return 1;
		}

		if (fallocate(fd, 0, 0, buffer_size) && errno != EOPNOTSUPP) {
			perror("fallocate");
			return 1;
		}

		/* commit it to the disk */
		sync();

		buffer = mmap(NULL, buffer_size, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_POPULATE, fd, 0);
		if (buffer == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}

	return __collect_logs(timeout, buffer, buffer_size);
}

int main(int argc, char **argv)
{
	struct timespec timeout = {.tv_sec = 1};
	int fd = -1;
	size_t buffer_size = 10UL<<20;

	if (argc > 1) {
		/* output file */
		if (strcmp(argv[1], "-")) {
			fd = open_file(argv[1]);
			if (fd == -1)
				return 1;
		}

		/* timeout */
		if (argc > 2) {
			if (read_timeout(argv[2], &timeout))
				return 1;

			/* buffer size */
			if (argc > 3) {
				buffer_size = read_size(argv[3]);
				if (buffer_size == -1UL)
					return 1;
			}
		}
	}
	printf("file:%s timeout:%lu.%03lds buffer_size:%zu\n",
			(fd == -1) ? "stdout" : argv[1],
			(unsigned long)timeout.tv_sec, timeout.tv_nsec / NS_PER_MS,
			buffer_size);

	return collect_logs(fd, &timeout, buffer_size);
}

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-25  9:23       ` Michal Hocko
@ 2016-02-29 21:02         ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-02-29 21:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML, Sergey Senozhatsky

Andrew,
could you queue this one as well, please? This is more of a band-aid than
a real solution, which I will be working on as soon as I am able to
reproduce the issue, but the patch should help to some degree at least.

On Thu 25-02-16 10:23:15, Michal Hocko wrote:
> From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 4 Feb 2016 14:56:59 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
> 
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
> 
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and the OOM killer is just a
> matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> order and not costly requests to make sure we do not fail prematurely.
> 
> This also means that we do not reset no_progress_loops at the
> __alloc_pages_slowpath for high order allocations to guarantee a bounded
> number of retries.
> 
> Longterm it would be much better to communicate with the compaction
> and retry only if the compaction considers it meaningful.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f05aca36469b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		}
>  	}
>  
> +	/*
> +	 * OK, so the watermark check has failed. Make sure we do all the
> +	 * retries for !costly high order requests and hope that multiple
> +	 * runs of compaction will generate some high order ones for us.
> +	 *
> +	 * XXX: ideally we should teach the compaction to try _really_ hard
> +	 * if we are in the retry path - something like priority 0 for the
> +	 * reclaim
> +	 */
> +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto noretry;
>  
>  	/*
> -	 * Costly allocations might have made a progress but this doesn't mean
> -	 * their order will become available due to high fragmentation so do
> -	 * not reset the no progress counter for them
> +	 * High order allocations might have made a progress but this doesn't
> +	 * mean their order will become available due to high fragmentation so
> +	 * do not reset the no progress counter for them
>  	 */
> -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> +	if (did_some_progress && !order)
>  		no_progress_loops = 0;
>  	else
>  		no_progress_loops++;
> -- 
> 2.7.0
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-29 20:35     ` [PATCH 0/3] OOM detection rework v4 Michal Hocko
@ 2016-03-01  7:29         ` Hugh Dickins
  0 siblings, 0 replies; 299+ messages in thread
From: Hugh Dickins @ 2016-03-01  7:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Andrew Morton, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML

On Mon, 29 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> > 
> > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> > 
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > 
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation.  j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
> 
> I have tried to reproduce and failed in a virtual machine on my laptop. I
> will try with another host with more CPUs (because my laptop has only
> two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> (16, 10 no difference really). I was also collecting vmstat in the
> background. The compilation takes ages but the behavior seems consistent
> and stable.

Thanks a lot for giving it a go.

I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
are less than 2M big??  But probably that's just a misunderstanding
or typo somewhere.

Ignoring that, you're successfully doing a make -j20 defconfig build
in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
at all surprised that it takes ages, but I am very surprised that it
does not OOM.  I suppose by rights it ought not to OOM, the built
tree occupies only a little more than 1G, so you do have enough swap;
but I wouldn't get anywhere near that myself without OOMing - I give
myself 1G of RAM (well, minus whatever the booted system takes up)
to do that build in, four times your RAM, yet in my case it OOMs.

That source tree alone occupies more than 700M, so just copying it
into your tmpfs would take a long time.  I'd expect a build in 224M
RAM plus 2G of swap to take so long, that I'd be very grateful to be
OOM killed, even if there is technically enough space.  Unless
perhaps it's some superfast swap that you have?

I was only suggesting to allocate hugetlb pages, if you preferred
not to reboot with artificially reduced RAM.  Not an issue if you're
booting VMs.

It's true that my testing has been done on the physical machines,
no virtualization involved: I expect that accounts for some difference
between us, but as much difference as we're seeing?  That's strange.

> 
> If I try 900M for huge pages then I get OOMs but this happens with the
> mmotm without my oom rework patch set as well.

Right, not at all surprising.

> 
> It would be great if you could retry and collect /proc/vmstat data
> around the OOM time to see what compaction did? (I was using the
> attached little program to reduce interference during OOM (no forks, the
> code locked in and the resulting file preallocated - e.g.
> read_vmstat vmstat.log 1s 10M and interrupt it by ctrl+c after the OOM
> hits).
> 
> Thanks!

I'll give it a try, thanks, but not tonight.

Hugh

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01  7:29         ` Hugh Dickins
@ 2016-03-01 13:38           ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 13:38 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > > 
> > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > > 
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > 
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation.  j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> > 
> > I have tried to reproduce and failed in a virtual machine on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
> 
> Thanks a lot for giving it a go.
> 
> I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big??  But probably that's just a misunderstanding
> or typo somewhere.

A typo. 445 was from the 900M test which I was doing while writing the
email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -j20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM.  I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time. 

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M    /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     941904      66556          0       5092     806760
-/+ buffers/cache:     130052     878408
Swap:      2097148      42648    2054500
$ du -sh linux-3.7-rc5/
799M    linux-3.7-rc5/

Sorry about that but this is what my other tests were using and I forgot
to check. Now let's try the same with the current linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M    /mnt/tmpfs/linux-4.5-rc6/
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     962976      45484          0       7236     820064
-/+ buffers/cache:     135676     872784
Swap:      2097148         16    2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]

real    8m36.621s
user    14m1.642s
sys     2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space.  Unless
> perhaps it's some superfast swap that you have?

the swap partition is a standard qcow image stored on my SSD disk. So
I guess the IO should be quite fast. This smells like a potential
contributor because my reclaim seems to be much faster and that should
lead to a more efficient reclaim (in the scanned/reclaimed sense).
I realize I might be boring already when blaming compaction but let me
try again ;)
$ grep compact /proc/vmstat 
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done the direct compaction only 128 times during
that test. This doesn't sound much to me
$ grep allocstall /proc/vmstat
allocstall 1061

we entered the direct reclaim much more but most of the load will be
order-0 so this might be still ok. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
 	switch (compact_result) {
 	case COMPACT_DEFERRED:
 		*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
             gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
             gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages
so I have interrupted that after one hour (there was still no OOM). The
trace log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat 
compact_stall 190

so the compaction was still ignored more than actually invoked for
!costly allocations:
sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
    190 2 1
    122 2 3
     59 2 4

#define COMPACT_SKIPPED         1               
#define COMPACT_PARTIAL         3
#define COMPACT_COMPLETE        4

that means that compaction is even not tried in half cases! This
doesn't sounds right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
then we simply rely on the order-0 reclaim to automagically form higher
blocks. This might indeed work when we retry many times but I guess this
is not a good approach. It leads to a excessive reclaim and the stall
for allocation can be really large.

One of the suspicious places is __compaction_suitable which does order-0
watermark check (increased by 2<<order). I have put another trace_printk
there and it clearly pointed out this was the case.

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real    8m46.626s
user    14m15.823s
sys     2m45.471s

the time increased but I haven't checked how stable the result is. 

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

this is worse because we have scanned more pages for migration but the
overall success rate was much smaller and the direct reclaim was invoked
more. I do not have a good theory for that and will play with this some
more. Maybe other changes are needed deeper in the compaction code.

I will play with this some more but I would be really interested to hear
whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM.  Not an issue if you're
> booting VMs.

Ohh, I see.
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-03-01 13:38           ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 13:38 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/alpine.LSU.2.11.1602241832160.15564@eggly.anvils]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > > 
> > > Make sure you have swap: 2G is more than enough.  Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > > 
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > 
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation.  j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> > 
> > I have tried to reproduce and failed in a virtual on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
> 
> Thanks a lot for giving it a go.
> 
> I'm puzzled.  445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big??  But probably that's just a misunderstanding
> or typo somewhere.

A typo. The 445 figure was from a 900M test I was doing while writing the
email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap?  I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM.  I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time. 

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M    /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     941904      66556          0       5092     806760
-/+ buffers/cache:     130052     878408
Swap:      2097148      42648    2054500
$ du -sh linux-3.7-rc5/
799M    linux-3.7-rc5/

Sorry about that, but this is what my other tests were using and I forgot
to check. Now let's try the same with the current Linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M    /mnt/tmpfs/linux-4.5-rc6/
$ free
             total       used       free     shared    buffers     cached
Mem:       1008460     962976      45484          0       7236     820064
-/+ buffers/cache:     135676     872784
Swap:      2097148         16    2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]

real    8m36.621s
user    14m1.642s
sys     2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space.  Unless
> perhaps it's some superfast swap that you have?

The swap partition is a standard qcow image stored on my SSD, so I guess
the IO should be quite fast. This smells like a potential contributor,
because my reclaim seems to be much faster, which should make it more
efficient (in the scanned/reclaimed sense). I realize I might be getting
boring by blaming compaction again, but let me try once more ;)
$ grep compact /proc/vmstat 
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done direct compaction only 128 times during
that test. That doesn't sound like much to me:
$ grep allocstall /proc/vmstat
allocstall 1061

We entered direct reclaim much more often, but most of the load will be
order-0, so this might still be OK. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 						mode, contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
 	switch (compact_result) {
 	case COMPACT_DEFERRED:
 		*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
             gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
             gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages,
so I interrupted it after one hour (there was still no OOM). The trace
log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat 
compact_stall 190

so compaction was still skipped more often than actually invoked for
!costly allocations:
sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c 
    190 2 1
    122 2 3
     59 2 4

#define COMPACT_SKIPPED         1               
#define COMPACT_PARTIAL         3
#define COMPACT_COMPLETE        4

that means that compaction is not even tried in half of the cases! That
doesn't sound right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests, which are implicitly nofail, because
then we simply rely on order-0 reclaim to automagically form higher-order
blocks. This might indeed work when we retry many times, but I do not
think it is a good approach. It leads to excessive reclaim, and the
allocation stall can be really large.

One of the suspicious places is __compaction_suitable, which does an
order-0 watermark check (increased by 2<<order). I have put another
trace_printk there and it clearly confirmed this was the case.

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real    8m46.626s
user    14m15.823s
sys     2m45.471s

the time increased but I haven't checked how stable the result is. 

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

this is worse because we have scanned more pages for migration, but the
overall success rate was much lower, and direct reclaim was invoked more
often. I do not have a good theory for that yet and will play with this
some more. Maybe other changes are needed deeper in the compaction code.

I will play with this some more, but I would be really interested to hear
whether this helps Hugh with his setup. Vlastimil, Joonsoo, does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM.  Not an issue if you're
> booting VMs.

Ohh, I see.
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-01 14:40             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-01 14:40 UTC (permalink / raw)
  To: Hugh Dickins, Vlastimil Babka, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On Tue 01-03-16 14:38:46, Michal Hocko wrote:
[...]
> the time increased but I haven't checked how stable the result is. 

And those results vary a lot (even when executed from a fresh boot) as
per my further testing. Sure, it might be related to the virtual
environment, but I do not think this particular test should be used for
performance regression comparisons.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-01 18:14             ` Vlastimil Babka
  -1 siblings, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-01 18:14 UTC (permalink / raw)
  To: Michal Hocko, Hugh Dickins, Joonsoo Kim
  Cc: Andrew Morton, Linus Torvalds, Johannes Weiner, Mel Gorman,
	David Rientjes, Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	linux-mm, LKML

On 03/01/2016 02:38 PM, Michal Hocko wrote:
> $ grep compact /proc/vmstat
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
>
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
>
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   						mode, contended_compaction);
>   	current->flags &= ~PF_MEMALLOC;
>
> +	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> +		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
>   	switch (compact_result) {
>   	case COMPACT_DEFERRED:
>   		*deferred_compaction = true;
>
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
>               gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>               gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
>
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
>
> $ grep compact_stall /proc/vmstat
> compact_stall 190
>
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
>      190 2 1
>      122 2 3
>       59 2 4
>
> #define COMPACT_SKIPPED         1
> #define COMPACT_PARTIAL         3
> #define COMPACT_COMPLETE        4
>
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
>
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.

Yes, compaction has historically been quite careful to avoid making low
memory conditions worse, and to avoid doing work when it doesn't look like
the allocation can ultimately succeed (too few base pages means that
compacting them is considered pointless). That this behavior also ends up
preventing non-zero-order OOMs is somewhat unexpected :)

> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
>   								alloc_flags))
>   		return COMPACT_PARTIAL;
>
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> +		return COMPACT_CONTINUE;
> +
>   	/*
>   	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
>   	 * This is because during migration, copies of pages need to be
>
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
>
> real    8m46.626s
> user    14m15.823s
> sys     2m45.471s
>
> the time increased but I haven't checked how stable the result is.
>
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1
>
> $ grep allocstall /proc/vmstat
> allocstall 1665
>
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.

I was under the impression that checks similar to compaction_suitable()
were also done in compact_finished(), to stop compacting if memory got low
due to parallel activity. But I guess that was a patch from Joonsoo that
didn't get merged.

My only other theory so far is that the watermark checks fail in
__isolate_free_page() when we want to grab page(s) as migration targets.
I would suggest enabling all the compaction tracepoints and the migration
tracepoint. Looking at the trace should get us further, faster, than
adding one trace_printk() per attempt.

Once we learn all the relevant places/checks, we can think about how to
communicate to them that this compaction attempt is "important" and should
continue as long as possible even in low-memory conditions. Maybe not just
a costly-order check; we also have alloc_flags, or we could add something
to compact_control, etc.

> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?
>
>> I was only suggesting to allocate hugetlb pages, if you preferred
>> not to reboot with artificially reduced RAM.  Not an issue if you're
>> booting VMs.
>
> Ohh, I see.
>
>

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-02-29 21:02         ` Michal Hocko
@ 2016-03-02  2:19           ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> Andrew,
> could you queue this one as well, please? This is more a band aid than a
> real solution which I will be working on as soon as I am able to
> reproduce the issue but the patch should help to some degree at least.

I'm not sure that this is a way to go. See below.

> 
> On Thu 25-02-16 10:23:15, Michal Hocko wrote:
> > From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 4 Feb 2016 14:56:59 +0100
> > Subject: [PATCH] mm, oom: protect !costly allocations some more
> > 
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> > 
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and the OOM killer is just a
> > matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> > order and not costly requests to make sure we do not fail prematurely.
> > 
> > This also means that we do not reset no_progress_loops at the
> > __alloc_pages_slowpath for high order allocations to guarantee a bounded
> > number of retries.
> > 
> > Long term it would be much better to communicate with compaction
> > and retry only if compaction considers it meaningful.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 20 ++++++++++++++++----
> >  1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 269a04f20927..f05aca36469b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * OK, so the watermark check has failed. Make sure we do all the
> > +	 * retries for !costly high order requests and hope that multiple
> > +	 * runs of compaction will generate some high order ones for us.
> > +	 *
> > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > +	 * if we are in the retry path - something like priority 0 for the
> > +	 * reclaim
> > +	 */
> > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > +		return true;
> > +
> >  	return false;

This does not seem like a proper fix. Checking the watermark with a high
order has another meaning: whether a high-order page already exists or
not. That isn't what we want here, so the following fix is needed.

The 'if (order)' check isn't strictly needed; it is only there to clarify
the intent of the fix, and you can remove it.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894..8c80375 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                return false;
 
+       /* To check whether compaction is available or not */
+       if (order)
+               order = 0;
+
        /*
         * Keep reclaiming pages while there is a chance this will lead
         * somewhere.  If none of the target zones can satisfy our allocation

> >  }
> >  
> > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		goto noretry;
> >  
> >  	/*
> > -	 * Costly allocations might have made a progress but this doesn't mean
> > -	 * their order will become available due to high fragmentation so do
> > -	 * not reset the no progress counter for them
> > +	 * High order allocations might have made a progress but this doesn't
> > +	 * mean their order will become available due to high fragmentation so
> > +	 * do not reset the no progress counter for them
> >  	 */
> > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > +	if (did_some_progress && !order)
> >  		no_progress_loops = 0;
> >  	else
> >  		no_progress_loops++;

This unconditionally increases no_progress_loops for high-order
allocations, so after 16 iterations the allocation will fail. If
compaction isn't enabled in Kconfig, 16 reclaim attempts may not be
sufficient to form a high-order page. Should we consider this case as well?

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 13:38           ` Michal Hocko
@ 2016-03-02  2:28             ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Vlastimil Babka, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space.  Unless
> > perhaps it's some superfast swap that you have?
> 
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).

Hmm... this looks like one potential culprit. If a page is under
writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
In that case the page acts like a pinned page and prevents compaction.
It'd be worth checking whether switching to 'migration_mode = MIGRATE_SYNC'
once 'no_progress_loops > XXX' helps in this situation.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 18:14             ` Vlastimil Babka
@ 2016-03-02  2:55               ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02  2:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> On 03/01/2016 02:38 PM, Michal Hocko wrote:
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 113983
> >compact_free_scanned 1433503
> >compact_isolated 134307
> >compact_stall 128
> >compact_fail 26
> >compact_success 102
> >compact_kcompatd_wake 0
> >
> >So the whole load has done the direct compaction only 128 times during
> >that test. This doesn't sound much to me
> >$ grep allocstall /proc/vmstat
> >allocstall 1061
> >
> >we entered the direct reclaim much more but most of the load will be
> >order-0 so this might be still ok. So I've tried the following:
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 1993894b4219..107d444afdb1 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> >  						mode, contended_compaction);
> >  	current->flags &= ~PF_MEMALLOC;
> >
> >+	if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> >+		trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> >+
> >  	switch (compact_result) {
> >  	case COMPACT_DEFERRED:
> >  		*deferred_compaction = true;
> >
> >And the result was:
> >$ cat /debug/tracing/trace_pipe | tee ~/trace.log
> >              gcc-8707  [001] ....   137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >              gcc-8726  [000] ....   138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >
> >this shows that order-2 memory pressure is not overly high in my
> >setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> >
> >So I went back to 800M of hugetlb pages and tried again. It took ages
> >so I have interrupted that after one hour (there was still no OOM). The
> >trace log is quite interesting regardless:
> >$ wc -l ~/trace.log
> >371 /root/trace.log
> >
> >$ grep compact_stall /proc/vmstat
> >compact_stall 190
> >
> >so the compaction was still ignored more than actually invoked for
> >!costly allocations:
> >sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> >     190 2 1
> >     122 2 3
> >      59 2 4
> >
> >#define COMPACT_SKIPPED         1
> >#define COMPACT_PARTIAL         3
> >#define COMPACT_COMPLETE        4
> >
> >that means that compaction is not even tried in half of the cases! This
> >doesn't sound right to me, especially when we are talking about
> ><= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> >then we simply rely on order-0 reclaim to automagically form higher
> >blocks. This might indeed work when we retry many times but I guess this
> >is not a good approach. It leads to excessive reclaim and the stall
> >for an allocation can be really large.
> >
> >One of the suspicious places is __compaction_suitable which does order-0
> >watermark check (increased by 2<<order). I have put another trace_printk
> >there and it clearly pointed out this was the case.
> 
> Yes, compaction is historically quite careful to avoid making low
> memory conditions worse, and to prevent work if it doesn't look like
> it can ultimately succeed the allocation (so having not enough base
> pages means that compacting them is considered pointless). This
> aspect of preventing non-zero-order OOMs is somewhat unexpected :)

It's better not to assume that compaction will succeed all the time.
Compaction has some limitations, so it sometimes fails.
For example, in a lowmem situation it only scans a small part of memory,
and if that part is fragmented by non-movable pages, compaction will fail.
And compaction defers requests up to 64 times at maximum if successive
compaction failures have happened before.

Depending heavily on compaction is the right direction to go, but I think
it's not ready for that yet. More reclaim would relieve the problem.

I tried to fix this situation but haven't finished yet.

http://thread.gmane.org/gmane.linux.kernel.mm/142364
https://lkml.org/lkml/2015/8/23/182


> >So I have tried the following:
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index 4d99e1f5055c..7364e48cf69a 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> >  								alloc_flags))
> >  		return COMPACT_PARTIAL;
> >
> >+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
> >+		return COMPACT_CONTINUE;
> >+
> >  	/*
> >  	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
> >  	 * This is because during migration, copies of pages need to be
> >
> >and retried the same test (without huge pages):
> >$ time make -j20 > /dev/null
> >
> >real    8m46.626s
> >user    14m15.823s
> >sys     2m45.471s
> >
> >the time increased but I haven't checked how stable the result is.
> >
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 139822
> >compact_free_scanned 1661642
> >compact_isolated 139407
> >compact_stall 129
> >compact_fail 58
> >compact_success 71
> >compact_kcompatd_wake 1
> >
> >$ grep allocstall /proc/vmstat
> >allocstall 1665
> >
> >this is worse because we have scanned more pages for migration but the
> >overall success rate was much smaller and the direct reclaim was invoked
> >more. I do not have a good theory for that and will play with this some
> >more. Maybe other changes are needed deeper in the compaction code.
> 
> I was under impression that similar checks to compaction_suitable()
> were done also in compact_finished(), to stop compacting if memory
> got low due to parallel activity. But I guess it was a patch from
> Joonsoo that didn't get merged.
> 
> My only other theory so far is that watermark checks fail in
> __isolate_free_page() when we want to grab page(s) as migration
> targets. I would suggest enabling all compaction tracepoint and the
> migration tracepoint. Looking at the trace could hopefully help
> faster than going one trace_printk() per attempt.

Agreed. It's the best thing to do now.

Thanks.

> 
> Once we learn all the relevant places/checks, we can think about how
> to communicate to them that this compaction attempt is "important"
> and should continue as long as possible even in low-memory
> conditions. Maybe not just a costly order check, but we also have
> alloc_flags or could add something to compact_control, etc.
> 
> >I will play with this some more but I would be really interested to hear
> >whether this helped Hugh with his setup. Vlastimil, Joonsoo, does this
> >even make sense to you?
> >
> >>I was only suggesting to allocate hugetlb pages, if you preferred
> >>not to reboot with artificially reduced RAM.  Not an issue if you're
> >>booting VMs.
> >
> >Ohh, I see.
> >
> >
> 

^ permalink raw reply	[flat|nested] 299+ messages in thread


* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:19           ` Joonsoo Kim
@ 2016-03-02  9:50             ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02  9:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
[...]
> > > +	/*
> > > +	 * OK, so the watermark check has failed. Make sure we do all the
> > > +	 * retries for !costly high order requests and hope that multiple
> > > +	 * runs of compaction will generate some high order ones for us.
> > > +	 *
> > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > +	 * if we are in the retry path - something like priority 0 for the
> > > +	 * reclaim
> > > +	 */
> > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > > +
> > >  	return false;
> 
> This does not seem like a proper fix. Checking the watermark with a high
> order has another meaning: whether a high-order page already exists or
> not. This isn't what we want here.

Why not? Why should we retry the reclaim if we do not have a >= order page
available? Reclaim itself doesn't guarantee that any of the freed pages will
form the requested order. The ordering on the LRU lists is pretty much
random wrt. pfn ordering. On the other hand, if we have a page available
which is just hidden by the watermarks, then it makes perfect sense to retry
and free even order-0 pages.

> So, following fix is needed.

> 'if (order)' check isn't needed. It is used to clarify the meaning of
> this fix. You can remove it.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..8c80375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 return false;
>  
> +       /* To check whether compaction is available or not */
> +       if (order)
> +               order = 0;
> +

This would enforce the order 0 wmark check which is IMHO not correct as
per above.

>         /*
>          * Keep reclaiming pages while there is a chance this will lead
>          * somewhere.  If none of the target zones can satisfy our allocation
> 
> > >  }
> > >  
> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  		goto noretry;
> > >  
> > >  	/*
> > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > -	 * their order will become available due to high fragmentation so do
> > > -	 * not reset the no progress counter for them
> > > +	 * High order allocations might have made a progress but this doesn't
> > > +	 * mean their order will become available due to high fragmentation so
> > > +	 * do not reset the no progress counter for them
> > >  	 */
> > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +	if (did_some_progress && !order)
> > >  		no_progress_loops = 0;
> > >  	else
> > >  		no_progress_loops++;
> 
> This unconditionally increases no_progress_loops for high-order
> allocations, so after 16 iterations the allocation will fail. If
> compaction isn't enabled in Kconfig, 16 reclaim attempts may not be
> sufficient to form a high-order page. Should we consider this case as well?

How many retries would help? I do not think any number will work
reliably. Configurations without compaction enabled are asking for
problems by definition, IMHO. Relying on order-0 reclaim for high-order
allocations simply cannot work.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-03-02  9:50             ` Michal Hocko
  0 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02  9:50 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, Johannes Weiner,
	Mel Gorman, David Rientjes, Tetsuo Handa, Hillf Danton,
	KAMEZAWA Hiroyuki, linux-mm, LKML, Sergey Senozhatsky

On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
[...]
> > > +	/*
> > > +	 * OK, so the watermark check has failed. Make sure we do all the
> > > +	 * retries for !costly high order requests and hope that multiple
> > > +	 * runs of compaction will generate some high order ones for us.
> > > +	 *
> > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > +	 * if we are in the retry path - something like priority 0 for the
> > > +	 * reclaim
> > > +	 */
> > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +		return true;
> > > +
> > >  	return false;
> 
> This seems not a proper fix. Checking the watermark with a high order
> also tells whether a high order page exists or not. This isn't
> what we want here.

Why not? Why should we retry the reclaim if we do not have >=order page
available? Reclaim itself doesn't guarantee any of the freed pages will
form the requested order. The ordering on the LRU lists is pretty much
random wrt. pfn ordering. On the other hand if we have a page available
which is just hidden by watermarks then it makes perfect sense to retry
and free even order-0 pages.
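
To make the argument concrete, here is a standalone model of the heuristic
(a simplification with made-up struct and function names, not the kernel
code): retrying only makes sense when the requested block is merely hidden
by watermarks, not when the zone is fragmented into scattered order-0 pages.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model only -- not the kernel implementation.
 * A zone can have plenty of free and reclaimable order-0 pages yet no
 * contiguous order-N block; in that case more order-0 reclaim cannot
 * satisfy an order-N request, so retrying is pointless.
 */
struct zone_model {
	unsigned long free_pages;	/* currently free order-0 pages */
	unsigned long reclaimable;	/* pages reclaim could still free */
	unsigned int largest_free_order;/* largest contiguous free block */
};

static bool retry_makes_sense(const struct zone_model *z,
			      unsigned int order, unsigned long wmark)
{
	/* Can free + reclaimable pages ever get us over the watermark? */
	if (z->free_pages + z->reclaimable <= wmark)
		return false;
	/* A >= order block must actually exist (just hidden by wmarks). */
	return z->largest_free_order >= order;
}
```

Under this model, the "hidden by watermarks" case retries while the
fragmented case gives up, which is the distinction argued for above.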

> So, following fix is needed.

> 'if (order)' check isn't needed. It is used to clarify the meaning of
> this fix. You can remove it.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..8c80375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 return false;
>  
> +       /* To check whether compaction is available or not */
> +       if (order)
> +               order = 0;
> +

This would enforce the order 0 wmark check which is IMHO not correct as
per above.

>         /*
>          * Keep reclaiming pages while there is a chance this will lead
>          * somewhere.  If none of the target zones can satisfy our allocation
> 
> > >  }
> > >  
> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >  		goto noretry;
> > >  
> > >  	/*
> > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > -	 * their order will become available due to high fragmentation so do
> > > -	 * not reset the no progress counter for them
> > > +	 * High order allocations might have made a progress but this doesn't
> > > +	 * mean their order will become available due to high fragmentation so
> > > +	 * do not reset the no progress counter for them
> > >  	 */
> > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > +	if (did_some_progress && !order)
> > >  		no_progress_loops = 0;
> > >  	else
> > >  		no_progress_loops++;
> 
> This unconditionally increases no_progress_loops for high order
> allocations, so, after 16 iterations, it will fail. If compaction isn't
> enabled in Kconfig, 16 reclaim attempts would not be sufficient
> to make a high order page. Should we consider this case also?

How many retries would help? I do not think any number will work
reliably. Configurations without compaction enabled are asking for
problems by definition IMHO. Relying on order-0 reclaim for high order
allocations simply cannot work.
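
For reference, the retry accounting under discussion can be modeled like
this (a simplification, not the kernel code; only the 16-loop limit is
taken from the thread, the rest of the names are illustrative):

```c
#include <assert.h>

/* Illustrative model of the proposed "!order" condition: every
 * order > 0 allocation bumps the counter on each pass and gives up
 * after 16 loops, while order-0 keeps resetting the counter as long
 * as reclaim makes progress. */
#define MAX_RECLAIM_RETRIES 16

/* Returns the loop on which the allocator gives up, or -1 if it never
 * hits the limit within 'cap' iterations. */
static int loops_until_giveup(unsigned int order, int did_some_progress,
			      int cap)
{
	int no_progress_loops = 0;
	int loop;

	for (loop = 1; loop <= cap; loop++) {
		if (did_some_progress && !order)
			no_progress_loops = 0;
		else
			no_progress_loops++;
		if (no_progress_loops >= MAX_RECLAIM_RETRIES)
			return loop;
	}
	return -1;
}
```

This is exactly the behavior being questioned: without compaction, an
order-2 request fails after a fixed 16 passes no matter how much order-0
progress reclaim made.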

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-01 18:14             ` Vlastimil Babka
  (?)
  (?)
@ 2016-03-02 12:24             ` Michal Hocko
  2016-03-02 13:00               ` Michal Hocko
  2016-03-02 13:22                 ` Vlastimil Babka
  -1 siblings, 2 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
> On 03/01/2016 02:38 PM, Michal Hocko wrote:
[...]
> >that means that compaction is not even tried in half of the cases! This
> >doesn't sound right to me, especially when we are talking about
> ><= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> >then we simply rely on the order-0 reclaim to automagically form higher
> >blocks. This might indeed work when we retry many times but I guess this
> >is not a good approach. It leads to excessive reclaim and the stall
> >for allocation can be really large.
> >
> >One of the suspicious places is __compaction_suitable which does order-0
> >watermark check (increased by 2<<order). I have put another trace_printk
> >there and it clearly pointed out this was the case.
> 
> Yes, compaction is historically quite careful to avoid making low memory
> conditions worse, and to avoid work if it doesn't look like it can
> ultimately satisfy the allocation (so not having enough base pages means
> that compacting them is considered pointless).

The compaction is running in PF_MEMALLOC context so it shouldn't fail
the allocation. Moreover the additional memory is only temporary until
the migration finishes. Or am I missing something?

> This aspect of preventing non-zero-order OOMs is somewhat unexpected
> :)

I hope we can do something about it then...
 
[...]
> >this is worse because we have scanned more pages for migration but the
> >overall success rate was much smaller and the direct reclaim was invoked
> >more. I do not have a good theory for that and will play with this some
> >more. Maybe other changes are needed deeper in the compaction code.
> 
> I was under the impression that similar checks to compaction_suitable() were
> done also in compact_finished(), to stop compacting if memory got low due to
> parallel activity. But I guess it was a patch from Joonsoo that didn't get
> merged.
> 
> My only other theory so far is that watermark checks fail in
> __isolate_free_page() when we want to grab page(s) as migration targets.

yes this certainly contributes to the problem and in my case it
triggered a lot:
$ grep __isolate_free_page trace.log | wc -l
181
$ grep __alloc_pages_direct_compact: trace.log | wc -l
7

> I would suggest enabling all compaction tracepoints and the migration
> tracepoint. Looking at the trace could hopefully help faster than
> going one trace_printk() per attempt.

OK, here we go with both watermark checks removed and hopefully all the
compaction related tracepoints enabled:
echo 1 > /debug/tracing/events/compaction/enable
echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

this was without the hugetlb handicap. See the trace log and vmstat
after the run attached.

Thanks
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: vmstat.log --]
[-- Type: text/plain, Size: 2333 bytes --]

nr_free_pages 151306
nr_alloc_batch 123
nr_inactive_anon 12815
nr_active_anon 44507
nr_inactive_file 1160
nr_active_file 5910
nr_unevictable 0
nr_mlock 0
nr_anon_pages 232
nr_mapped 1025
nr_file_pages 64246
nr_dirty 2
nr_writeback 0
nr_slab_reclaimable 12344
nr_slab_unreclaimable 21129
nr_page_table_pages 260
nr_kernel_stack 90
nr_unstable 0
nr_bounce 0
nr_vmscan_write 362270
nr_vmscan_immediate_reclaim 43
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 54592
nr_dirtied 5363
nr_written 364001
nr_pages_scanned 0
workingset_refault 16574
workingset_activate 9062
workingset_nodereclaim 640
nr_anon_transparent_hugepages 0
nr_free_cma 0
nr_dirty_threshold 31188
nr_dirty_background_threshold 15594
pgpgin 564127
pgpgout 1457932
pswpin 85569
pswpout 362180
pgalloc_dma 226916
pgalloc_dma32 21472873
pgalloc_normal 0
pgalloc_movable 0
pgfree 22057596
pgactivate 174766
pgdeactivate 919764
pgfault 23950701
pgmajfault 31819
pglazyfreed 0
pgrefill_dma 15589
pgrefill_dma32 999305
pgrefill_normal 0
pgrefill_movable 0
pgsteal_kswapd_dma 5339
pgsteal_kswapd_dma32 322951
pgsteal_kswapd_normal 0
pgsteal_kswapd_movable 0
pgsteal_direct_dma 334
pgsteal_direct_dma32 71877
pgsteal_direct_normal 0
pgsteal_direct_movable 0
pgscan_kswapd_dma 11213
pgscan_kswapd_dma32 653096
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 670
pgscan_direct_dma32 137488
pgscan_direct_normal 0
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 1920
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 351
kswapd_high_wmark_hit_quickly 13
pageoutrun 458
allocstall 1376
pgrotated 360480
drop_pagecache 0
drop_slab 0
pgmigrate_success 204875
pgmigrate_fail 169
compact_migrate_scanned 343087
compact_free_scanned 3597902
compact_isolated 412234
compact_stall 163
compact_fail 99
compact_success 64
compact_kcompatd_wake 2
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 1089
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1561
unevictable_pgs_mlocked 1561
unevictable_pgs_munlocked 1561
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 152
thp_fault_fallback 39
thp_collapse_alloc 69
thp_collapse_alloc_failed 11
thp_split_page 1
thp_split_page_failed 0
thp_deferred_split_page 212
thp_split_pmd 10
thp_zero_page_alloc 2
thp_zero_page_alloc_failed 1

[-- Attachment #3: trace.log.gz --]
[-- Type: application/gzip, Size: 472143 bytes --]

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:55               ` Joonsoo Kim
@ 2016-03-02 12:37                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vlastimil Babka, Hugh Dickins, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
[...]
> > Yes, compaction is historically quite careful to avoid making low
> > memory conditions worse, and to prevent work if it doesn't look like
> > it can ultimately succeed the allocation (so having not enough base
> > pages means that compacting them is considered pointless). This
> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
> 
> It's better not to assume that compaction would succeed all the time.
> Compaction has some limitations so it sometimes fails.
> For example, in a lowmem situation, it only scans small parts of memory
> and if that part is fragmented by non-movable pages, compaction would fail.
> And, compaction would defer requests 64 times at maximum if successive
> compaction failures happened before.
> 
> Depending on compaction heavily is the right direction to go, but I think
> that it's not ready for now. More reclaim would relieve the problem.

I really fail to see why. The reclaimable memory can be migrated as
well, no? Relying on the order-0 reclaim only makes sense to get over
wmarks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  2:28             ` Joonsoo Kim
@ 2016-03-02 12:39               ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 12:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hugh Dickins, Vlastimil Babka, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On Wed 02-03-16 11:28:46, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > > I'd expect a build in 224M
> > > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > > OOM killed, even if there is technically enough space.  Unless
> > > perhaps it's some superfast swap that you have?
> > 
> > the swap partition is a standard qcow image stored on my SSD disk. So
> > I guess the IO should be quite fast. This smells like a potential
> > contributor because my reclaim seems to be much faster and that should
> > lead to a more efficient reclaim (in the scanned/reclaimed sense).
> 
> Hmm... This looks like one potential culprit. If a page is under
> writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
> In this case, this page works as a pinned page and prevents compaction.
> It'd be better to check whether changing 'migration_mode = MIGRATE_SYNC' at
> 'no_progress_loops > XXX' will help in this situation.

Would it make sense to use MIGRATE_SYNC for !costly allocations by
default?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:24             ` Michal Hocko
@ 2016-03-02 13:00               ` Michal Hocko
  2016-03-02 13:22                 ` Vlastimil Babka
  1 sibling, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 13:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 3955 bytes --]

On Wed 02-03-16 13:24:10, Michal Hocko wrote:
> On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
[...]
> > I would suggest enabling all compaction tracepoint and the migration
> > tracepoint. Looking at the trace could hopefully help faster than
> > going one trace_printk() per attempt.
> 
> OK, here we go with both watermarks checks removed and hopefully all the
> compaction related tracepoints enabled:
> echo 1 > /debug/tracing/events/compaction/enable
> echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable
> 
> this was without the hugetlb handicap. See the trace log and vmstat
> after the run attached.

Just for reference, the above was with:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
 								alloc_flags))
 		return COMPACT_PARTIAL;
 
+	if (order <= PAGE_ALLOC_COSTLY_ORDER)
+		return COMPACT_CONTINUE;
+
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..50954a9a4433 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2245,7 +2245,6 @@ EXPORT_SYMBOL_GPL(split_page);
 
 int __isolate_free_page(struct page *page, unsigned int order)
 {
-	unsigned long watermark;
 	struct zone *zone;
 	int mt;
 
@@ -2254,14 +2253,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
 
-	if (!is_migrate_isolate(mt)) {
-		/* Obey watermarks as if the page was being allocated */
-		watermark = low_wmark_pages(zone) + (1 << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
-			return 0;
-
+	if (!is_migrate_isolate(mt))
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
-	}
 
 	/* Remove page from free list */
 	list_del(&page->lru);

And I reran the same with the clean mmotm tree and the results are
attached.

As we can see there was less scanning on dma32 both in direct and kswapd
reclaim.
$ grep direct vmstat.*
vmstat.mmotm.log:pgsteal_direct_dma 420
vmstat.mmotm.log:pgsteal_direct_dma32 71234
vmstat.mmotm.log:pgsteal_direct_normal 0
vmstat.mmotm.log:pgsteal_direct_movable 0
vmstat.mmotm.log:pgscan_direct_dma 990
vmstat.mmotm.log:pgscan_direct_dma32 144376
vmstat.mmotm.log:pgscan_direct_normal 0
vmstat.mmotm.log:pgscan_direct_movable 0
vmstat.mmotm.log:pgscan_direct_throttle 0
vmstat.updated.log:pgsteal_direct_dma 334
vmstat.updated.log:pgsteal_direct_dma32 71877
vmstat.updated.log:pgsteal_direct_normal 0
vmstat.updated.log:pgsteal_direct_movable 0
vmstat.updated.log:pgscan_direct_dma 670
vmstat.updated.log:pgscan_direct_dma32 137488
vmstat.updated.log:pgscan_direct_normal 0
vmstat.updated.log:pgscan_direct_movable 0
vmstat.updated.log:pgscan_direct_throttle 0
$ grep kswapd vmstat.*
vmstat.mmotm.log:pgsteal_kswapd_dma 5602
vmstat.mmotm.log:pgsteal_kswapd_dma32 332336
vmstat.mmotm.log:pgsteal_kswapd_normal 0
vmstat.mmotm.log:pgsteal_kswapd_movable 0
vmstat.mmotm.log:pgscan_kswapd_dma 12187
vmstat.mmotm.log:pgscan_kswapd_dma32 679667
vmstat.mmotm.log:pgscan_kswapd_normal 0
vmstat.mmotm.log:pgscan_kswapd_movable 0
vmstat.mmotm.log:kswapd_inodesteal 0
vmstat.mmotm.log:kswapd_low_wmark_hit_quickly 339
vmstat.mmotm.log:kswapd_high_wmark_hit_quickly 10
vmstat.updated.log:pgsteal_kswapd_dma 5339
vmstat.updated.log:pgsteal_kswapd_dma32 322951
vmstat.updated.log:pgsteal_kswapd_normal 0
vmstat.updated.log:pgsteal_kswapd_movable 0
vmstat.updated.log:pgscan_kswapd_dma 11213
vmstat.updated.log:pgscan_kswapd_dma32 653096
vmstat.updated.log:pgscan_kswapd_normal 0
vmstat.updated.log:pgscan_kswapd_movable 0
vmstat.updated.log:kswapd_inodesteal 0
vmstat.updated.log:kswapd_low_wmark_hit_quickly 351
vmstat.updated.log:kswapd_high_wmark_hit_quickly 13
-- 
Michal Hocko
SUSE Labs

[-- Attachment #2: trace.mmotm.log.gz --]
[-- Type: application/gzip, Size: 512968 bytes --]

[-- Attachment #3: vmstat.mmotm.log --]
[-- Type: text/plain, Size: 2335 bytes --]

nr_free_pages 149226
nr_alloc_batch 114
nr_inactive_anon 13962
nr_active_anon 46754
nr_inactive_file 634
nr_active_file 5010
nr_unevictable 0
nr_mlock 0
nr_anon_pages 219
nr_mapped 793
nr_file_pages 66233
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 12355
nr_slab_unreclaimable 21208
nr_page_table_pages 320
nr_kernel_stack 92
nr_unstable 0
nr_bounce 0
nr_vmscan_write 358705
nr_vmscan_immediate_reclaim 111
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 58505
nr_dirtied 5516
nr_written 360677
nr_pages_scanned 0
workingset_refault 17291
workingset_activate 11908
workingset_nodereclaim 644
nr_anon_transparent_hugepages 0
nr_free_cma 0
nr_dirty_threshold 30487
nr_dirty_background_threshold 15243
pgpgin 525267
pgpgout 1444464
pswpin 75386
pswpout 358705
pgalloc_dma 241466
pgalloc_dma32 21491760
pgalloc_normal 0
pgalloc_movable 0
pgfree 22110844
pgactivate 204005
pgdeactivate 1033621
pgfault 23929641
pgmajfault 27748
pglazyfreed 0
pgrefill_dma 18759
pgrefill_dma32 1122090
pgrefill_normal 0
pgrefill_movable 0
pgsteal_kswapd_dma 5602
pgsteal_kswapd_dma32 332336
pgsteal_kswapd_normal 0
pgsteal_kswapd_movable 0
pgsteal_direct_dma 420
pgsteal_direct_dma32 71234
pgsteal_direct_normal 0
pgsteal_direct_movable 0
pgscan_kswapd_dma 12187
pgscan_kswapd_dma32 679667
pgscan_kswapd_normal 0
pgscan_kswapd_movable 0
pgscan_direct_dma 990
pgscan_direct_dma32 144376
pgscan_direct_normal 0
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 0
slabs_scanned 2052
kswapd_inodesteal 0
kswapd_low_wmark_hit_quickly 339
kswapd_high_wmark_hit_quickly 10
pageoutrun 448
allocstall 1376
pgrotated 357091
drop_pagecache 0
drop_slab 0
pgmigrate_success 227102
pgmigrate_fail 142
compact_migrate_scanned 374515
compact_free_scanned 4000566
compact_isolated 456131
compact_stall 133
compact_fail 73
compact_success 60
compact_kcompatd_wake 0
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 1087
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1530
unevictable_pgs_mlocked 1530
unevictable_pgs_munlocked 1529
unevictable_pgs_cleared 1
unevictable_pgs_stranded 0
thp_fault_alloc 164
thp_fault_fallback 26
thp_collapse_alloc 159
thp_collapse_alloc_failed 11
thp_split_page 0
thp_split_page_failed 0
thp_deferred_split_page 309
thp_split_pmd 7
thp_zero_page_alloc 3
thp_zero_page_alloc_failed 0

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:24             ` Michal Hocko
@ 2016-03-02 13:22                 ` Vlastimil Babka
  2016-03-02 13:22                 ` Vlastimil Babka
  1 sibling, 0 replies; 299+ messages in thread
From: Vlastimil Babka @ 2016-03-02 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Joonsoo Kim, Andrew Morton, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML

On 03/02/2016 01:24 PM, Michal Hocko wrote:
> On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
>>
>> I was under the impression that similar checks to compaction_suitable() were
>> done also in compact_finished(), to stop compacting if memory got low due to
>> parallel activity. But I guess it was a patch from Joonsoo that didn't get
>> merged.
>>
>> My only other theory so far is that watermark checks fail in
>> __isolate_free_page() when we want to grab page(s) as migration targets.
>
> yes this certainly contributes to the problem and triggered in my case a
> lot:
> $ grep __isolate_free_page trace.log | wc -l
> 181
> $ grep __alloc_pages_direct_compact: trace.log | wc -l
> 7
>
>> I would suggest enabling all compaction tracepoint and the migration
>> tracepoint. Looking at the trace could hopefully help faster than
>> going one trace_printk() per attempt.
>
> OK, here we go with both watermarks checks removed and hopefully all the
> compaction related tracepoints enabled:
> echo 1 > /debug/tracing/events/compaction/enable
> echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

The trace shows only 4 direct compaction attempts with order=2. The rest
is order=9, i.e. THP, which has little chance of success under such
pressure, and thus those failures and defers. The few order=2 attempts
appear all successful (defer_reset is called).

So it seems your system is mostly fine with just reclaim, and there's
little need for order-2 compaction, and that's also why you can't
reproduce the OOMs. So I'm afraid we'll learn nothing here, and it looks
like Hugh will have to try those watermark check adjustments/removals
and/or provide the same kind of trace.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  9:50             ` Michal Hocko
@ 2016-03-02 13:32               ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 13:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
>> > > + /*
>> > > +  * OK, so the watermark check has failed. Make sure we do all the
>> > > +  * retries for !costly high order requests and hope that multiple
>> > > +  * runs of compaction will generate some high order ones for us.
>> > > +  *
>> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> > > +  * if we are in the retry path - something like priority 0 for the
>> > > +  * reclaim
>> > > +  */
>> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > +         return true;
>> > > +
>> > >   return false;
>>
>> This seems not a proper fix. Checking the watermark with a high order
>> also tells whether a high order page exists or not. This isn't
>> what we want here.
>
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.

If we have a >= order page available, we would not reach here. We would
just allocate it.

And, should_reclaim_retry() is not just for reclaim. It is also for
retrying compaction.

> That watermark check is to check whether further reclaim/compaction
> is meaningful. And, for the high order case, if there is enough freepage,
> compaction could make a high order page even if there is no high order
> page now.
> 
> Adding freeable memory and checking the watermark with it doesn't help
> in this case because the number of high order pages isn't changed by it.
> 
> I just did a quick review of your patches so maybe I am wrong.
> Am I missing something?
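
The distinction can be illustrated with a toy model (made-up names, not
kernel code): a fragmented zone passes the order-0 watermark check, so
compaction has raw material to work with, while the order-2 check fails
even though further compaction could still build such a block.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy zone state: plenty of free base pages, but fragmented so that no
 * contiguous high order block exists yet. */
struct frag_zone {
	unsigned long free_pages;	/* total free order-0 pages */
	unsigned int largest_order;	/* largest contiguous free block */
};

/* A watermark check at the requested order conflates "a block already
 * exists" with "compaction could build one". */
static bool order_wmark_ok(const struct frag_zone *z,
			   unsigned int order, unsigned long wmark)
{
	return z->free_pages > wmark && z->largest_order >= order;
}
```

In this model the order-0 check succeeds while the order-2 check fails,
which is the case where more compaction attempts would still be
meaningful.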

>> So, following fix is needed.
>
>> 'if (order)' check isn't needed. It is used to clarify the meaning of
>> this fix. You can remove it.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 1993894..8c80375 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>>                 return false;
>>
>> +       /* To check whether compaction is available or not */
>> +       if (order)
>> +               order = 0;
>> +
>
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
>
>>         /*
>>          * Keep reclaiming pages while there is a chance this will lead
>>          * somewhere.  If none of the target zones can satisfy our allocation
>>
>> > >  }
>> > >
>> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > >           goto noretry;
>> > >
>> > >   /*
>> > > -  * Costly allocations might have made a progress but this doesn't mean
>> > > -  * their order will become available due to high fragmentation so do
>> > > -  * not reset the no progress counter for them
>> > > +  * High order allocations might have made a progress but this doesn't
>> > > +  * mean their order will become available due to high fragmentation so
>> > > +  * do not reset the no progress counter for them
>> > >    */
>> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > + if (did_some_progress && !order)
>> > >           no_progress_loops = 0;
>> > >   else
>> > >           no_progress_loops++;
>>
>> This unconditionally increases no_progress_loops for high order
>> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> to make high order page. Should we consider this case also?
>
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

At least, reset no_progress_loops when did_some_progress is non-zero.
High order allocations up to PAGE_ALLOC_COSTLY_ORDER are as important
as order 0, and reclaiming something would increase the probability of
compaction success. Why do we limit the retry to 16 times when we have
no evidence that making a high order page is impossible?

And, 16 retries don't look good to me because compaction could defer
actually doing anything up to 64 times.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 13:32               ` Joonsoo Kim
@ 2016-03-02 14:06                 ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-02 14:06 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> > [...]
> >> > > + /*
> >> > > +  * OK, so the watermark check has failed. Make sure we do all the
> >> > > +  * retries for !costly high order requests and hope that multiple
> >> > > +  * runs of compaction will generate some high order ones for us.
> >> > > +  *
> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> >> > > +  * if we are in the retry path - something like priority 0 for the
> >> > > +  * reclaim
> >> > > +  */
> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > +         return true;
> >> > > +
> >> > >   return false;
> >>
> >> This seems not a proper fix. Checking watermark with high order has
> >> another meaning that there is high order page or not. This isn't
> >> what we want here.
> >
> > Why not? Why should we retry the reclaim if we do not have >=order page
> > available? Reclaim itself doesn't guarantee any of the freed pages will
> > form the requested order. The ordering on the LRU lists is pretty much
> > random wrt. pfn ordering. On the other hand if we have a page available
> > which is just hidden by watermarks then it makes perfect sense to retry
> > and free even order-0 pages.
> 
> If we have >= order page available, we would not reach here. We would
> just allocate it.

Not really, we can still be under the low watermark. Note that the
target for the should_reclaim_retry watermark check also includes the
reclaimable memory.
 
> And, should_reclaim_retry() is not just for reclaim. It is also for
> retrying compaction.
> 
> That watermark check is to check further reclaim/compaction
> is meaningful. And, for high order case, if there is enough freepage,
> compaction could make high order page even if there is no high order
> page now.
> 
> Adding freeable memory and checking watermark with it doesn't help
> in this case because number of high order page isn't changed with it.
> 
> I just did quick review to your patches so maybe I am wrong.
> Am I missing something?

The core idea behind should_reclaim_retry is to check whether reclaiming
all the reclaimable pages would help us get over the watermark while
there is at least one >= order page available; then it really makes
sense to retry. As compaction has already been performed before this is
called, we should have created some high order pages already. The decay
guarantees that we eventually trigger the OOM killer after some number
of attempts.

If the compaction can back off and ignore our requests then we are
screwed of course, and that should be addressed imho at the compaction
layer. Maybe we can tell the compaction to try harder, but I would like
to understand why this shouldn't be the default behavior for !costly
orders.
 
[...]
> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > >           goto noretry;
> >> > >
> >> > >   /*
> >> > > -  * Costly allocations might have made a progress but this doesn't mean
> >> > > -  * their order will become available due to high fragmentation so do
> >> > > -  * not reset the no progress counter for them
> >> > > +  * High order allocations might have made a progress but this doesn't
> >> > > +  * mean their order will become available due to high fragmentation so
> >> > > +  * do not reset the no progress counter for them
> >> > >    */
> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > + if (did_some_progress && !order)
> >> > >           no_progress_loops = 0;
> >> > >   else
> >> > >           no_progress_loops++;
> >>
> >> This unconditionally increases no_progress_loops for high order
> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> >> to make high order page. Should we consider this case also?
> >
> > How many retries would help? I do not think any number will work
> > reliably. Configurations without compaction enabled are asking for
> > problems by definition IMHO. Relying on order-0 reclaim for high order
> > allocations simply cannot work.
> 
> At least, reset no_progress_loops when did_some_progress. High
> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> as order 0. And, reclaim something would increase probability of
> compaction success.

This is something I still do not understand. Why would reclaiming
random order-0 pages help compaction? Could you clarify this please?

> Why do we limit retry as 16 times with no evidence of potential
> impossibility of making high order page?

If we tried to compact 16 times without any progress then that sounds
like sufficient evidence to me. Well, this number is somewhat arbitrary,
but the main point is to limit it to _some_ number; if we can show that
a larger value would work better then we can update it of course.

> And, 16 retry looks not good to me because compaction could defer
> actual doing up to 64 times.

OK, this is something that needs to be handled in a better way. The
primary question would be why we defer compaction for <=
PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
why it makes sense for the best effort mode of operation, but !costly
orders should be trying much harder as they are nofail, no?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 12:37                 ` Michal Hocko
@ 2016-03-02 14:06                   ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Vlastimil Babka, Hugh Dickins, Andrew Morton,
	Linus Torvalds, Johannes Weiner, Mel Gorman, David Rientjes,
	Tetsuo Handa, Hillf Danton, KAMEZAWA Hiroyuki,
	Linux Memory Management List, LKML

2016-03-02 21:37 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
>> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> [...]
>> > Yes, compaction is historically quite careful to avoid making low
>> > memory conditions worse, and to prevent work if it doesn't look like
>> > it can ultimately succeed the allocation (so having not enough base
>> > pages means that compacting them is considered pointless). This
>> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
>>
>> It's better not to assume that compaction would succeed all the times.
>> Compaction has some limitations so it sometimes fails.
>> For example, in lowmem situation, it only scans small parts of memory
>> and if that part is fragmented by non-movable page, compaction would fail.
>> And, compaction would defer requests 64 times at maximum if successive
>> compaction failure happens before.
>>
>> Depending on compaction heavily is right direction to go but I think
>> that it's not ready for now. More reclaim would relieve problem.
>
> I really fail to see why. The reclaimable memory can be migrated as
> well, no? Relying on the order-0 reclaim makes only sense to get over
> wmarks.

The link attached to my previous reply mentioned a limitation of the
current compaction implementation. Briefly speaking, it would not scan
the whole range of memory due to an algorithmic limitation, so even if
there is reclaimable memory that could also be migrated, compaction
could fail.

There is no such limitation on reclaim, and that's why I think that
compaction is not ready for now.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 14:06                 ` Michal Hocko
@ 2016-03-02 14:34                   ` Joonsoo Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> > [...]
>> >> > > + /*
>> >> > > +  * OK, so the watermark check has failed. Make sure we do all the
>> >> > > +  * retries for !costly high order requests and hope that multiple
>> >> > > +  * runs of compaction will generate some high order ones for us.
>> >> > > +  *
>> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> >> > > +  * if we are in the retry path - something like priority 0 for the
>> >> > > +  * reclaim
>> >> > > +  */
>> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > +         return true;
>> >> > > +
>> >> > >   return false;
>> >>
>> >> This seems not a proper fix. Checking watermark with high order has
>> >> another meaning that there is high order page or not. This isn't
>> >> what we want here.
>> >
>> > Why not? Why should we retry the reclaim if we do not have >=order page
>> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> > form the requested order. The ordering on the LRU lists is pretty much
>> > random wrt. pfn ordering. On the other hand if we have a page available
>> > which is just hidden by watermarks then it makes perfect sense to retry
>> > and free even order-0 pages.
>>
>> If we have >= order page available, we would not reach here. We would
>> just allocate it.
>
> not really, we can still be under the low watermark. Note that the

You mean the min watermark?

> target for the should_reclaim_retry watermark check includes also the
> reclaimable memory.

I guess that the usual case of high order allocation failure has enough free pages.

>> And, should_reclaim_retry() is not just for reclaim. It is also for
>> retrying compaction.
>>
>> That watermark check is to check further reclaim/compaction
>> is meaningful. And, for high order case, if there is enough freepage,
>> compaction could make high order page even if there is no high order
>> page now.
>>
>> Adding freeable memory and checking watermark with it doesn't help
>> in this case because number of high order page isn't changed with it.
>>
>> I just did quick review to your patches so maybe I am wrong.
>> Am I missing something?
>
> The core idea behind should_reclaim_retry is to check whether the
> reclaiming all the pages would help to get over the watermark and there
> is at least one >= order page. Then it really makes sense to retry. As

How can you judge that reclaiming all the pages would help, when what we
need to know is whether there is at least one >= order page?

> the compaction has already been performed before this is called we should
> have created some high order pages already. The decay guarantees that we

Not really. Compaction could fail.

> eventually trigger the OOM killer after some attempts.

Yep.

> If the compaction can backoff and ignore our requests then we are
> screwed of course and that should be addressed imho at the compaction
> layer. Maybe we can tell the compaction to try harder but I would like
> to understand why this shouldn't be a default behavior for !costly
> orders.

Yes, I agree that.

> [...]
>> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >> > >           goto noretry;
>> >> > >
>> >> > >   /*
>> >> > > -  * Costly allocations might have made a progress but this doesn't mean
>> >> > > -  * their order will become available due to high fragmentation so do
>> >> > > -  * not reset the no progress counter for them
>> >> > > +  * High order allocations might have made a progress but this doesn't
>> >> > > +  * mean their order will become available due to high fragmentation so
>> >> > > +  * do not reset the no progress counter for them
>> >> > >    */
>> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > + if (did_some_progress && !order)
>> >> > >           no_progress_loops = 0;
>> >> > >   else
>> >> > >           no_progress_loops++;
>> >>
>> >> This unconditionally increases no_progress_loops for high order
>> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> >> to make high order page. Should we consider this case also?
>> >
>> > How many retries would help? I do not think any number will work
>> > reliably. Configurations without compaction enabled are asking for
>> > problems by definition IMHO. Relying on order-0 reclaim for high order
>> > allocations simply cannot work.
>>
>> At least, reset no_progress_loops when did_some_progress. High
>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> as order 0. And, reclaim something would increase probability of
>> compaction success.
>
> This is something I still do not understand. Why would reclaiming
> random order-0 pages help compaction? Could you clarify this please?

I can only give the simple version here; please check the link in my
other reply. Compaction could scan a wider range of memory if we have
more free pages. This is due to an algorithmic limitation. Anyway, that
is why reclaiming random order-0 pages helps compaction.

>> Why do we limit retry as 16 times with no evidence of potential
>> impossibility of making high order page?
>
> If we tried to compact 16 times without any progress then this sounds
> like a sufficient evidence to me. Well, this number is somehow arbitrary
> but the main point is to limit it to _some_ number, if we can show that
> a larger value would work better then we can update it of course.

My argument is about your band-aid patch.
My point is: why is the retry counter for order-0 reset when there is
some progress, while the retry counter for orders up to costly isn't
reset even when there is some progress?

>> And, 16 retry looks not good to me because compaction could defer
>> actual doing up to 64 times.
>
> OK, this is something that needs to be handled in a better way. The
> primary question would be why to defer the compaction for <=
> PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
> why it makes sense it for the best effort mode of operation but !costly
> orders should be trying much harder as they are nofail, no?

Make sense.

Thanks.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
@ 2016-03-02 14:34                   ` Joonsoo Kim
  0 siblings, 0 replies; 299+ messages in thread
From: Joonsoo Kim @ 2016-03-02 14:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
>> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> > [...]
>> >> > > + /*
>> >> > > +  * OK, so the watermark check has failed. Make sure we do all the
>> >> > > +  * retries for !costly high order requests and hope that multiple
>> >> > > +  * runs of compaction will generate some high order ones for us.
>> >> > > +  *
>> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
>> >> > > +  * if we are in the retry path - something like priority 0 for the
>> >> > > +  * reclaim
>> >> > > +  */
>> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > +         return true;
>> >> > > +
>> >> > >   return false;
>> >>
>> >> This seems not a proper fix. Checking watermark with high order has
>> >> another meaning that there is high order page or not. This isn't
>> >> what we want here.
>> >
>> > Why not? Why should we retry the reclaim if we do not have >=order page
>> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> > form the requested order. The ordering on the LRU lists is pretty much
>> > random wrt. pfn ordering. On the other hand if we have a page available
>> > which is just hidden by watermarks then it makes perfect sense to retry
>> > and free even order-0 pages.
>>
>> If we have >= order page available, we would not reach here. We would
>> just allocate it.
>
> not really, we can still be under the low watermark. Note that the

You mean the min watermark?

> target for the should_reclaim_retry watermark check includes also the
> reclaimable memory.

I guess that the usual case of high-order allocation failure is one with enough free pages.

>> And, should_reclaim_retry() is not just for reclaim. It is also for
>> retrying compaction.
>>
>> That watermark check is to check further reclaim/compaction
>> is meaningful. And, for high order case, if there is enough freepage,
>> compaction could make high order page even if there is no high order
>> page now.
>>
>> Adding freeable memory and checking watermark with it doesn't help
>> in this case because number of high order page isn't changed with it.
>>
>> I just did quick review to your patches so maybe I am wrong.
>> Am I missing something?
>
> The core idea behind should_reclaim_retry is to check whether the
> reclaiming all the pages would help to get over the watermark and there
> is at least one >= order page. Then it really makes sense to retry. As

How can you judge that reclaiming all the pages would help, and that there
would be at least one >= order page afterwards?

> the compaction has already was performed before this is called we should
> have created some high order pages already. The decay guarantees that we

Not really. Compaction could fail.

> eventually trigger the OOM killer after some attempts.

Yep.

> If the compaction can backoff and ignore our requests then we are
> screwed of course and that should be addressed imho at the compaction
> layer. Maybe we can tell the compaction to try harder but I would like
> to understand why this shouldn't be a default behavior for !costly
> orders.

Yes, I agree with that.

> [...]
>> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >> > >           goto noretry;
>> >> > >
>> >> > >   /*
>> >> > > -  * Costly allocations might have made a progress but this doesn't mean
>> >> > > -  * their order will become available due to high fragmentation so do
>> >> > > -  * not reset the no progress counter for them
>> >> > > +  * High order allocations might have made a progress but this doesn't
>> >> > > +  * mean their order will become available due to high fragmentation so
>> >> > > +  * do not reset the no progress counter for them
>> >> > >    */
>> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > + if (did_some_progress && !order)
>> >> > >           no_progress_loops = 0;
>> >> > >   else
>> >> > >           no_progress_loops++;
>> >>
>> >> This unconditionally increases no_progress_loops for high order
>> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> >> to make high order page. Should we consider this case also?
>> >
>> > How many retries would help? I do not think any number will work
>> > reliably. Configurations without compaction enabled are asking for
>> > problems by definition IMHO. Relying on order-0 reclaim for high order
>> > allocations simply cannot work.
>>
>> At least, reset no_progress_loops when did_some_progress. High
>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> as order 0. And, reclaim something would increase probability of
>> compaction success.
>
> This is something I still do not understand. Why would reclaiming
> random order-0 pages help compaction? Could you clarify this please?

I can only give the simple version here; please check the link in my other reply.
Compaction can scan a wider range of memory if we have more free pages.
This is due to a limitation of the algorithm. Anyway, that is why reclaiming
random order-0 pages helps compaction.

>> Why do we limit retry as 16 times with no evidence of potential
>> impossibility of making high order page?
>
> If we tried to compact 16 times without any progress then this sounds
> like a sufficient evidence to me. Well, this number is somehow arbitrary
> but the main point is to limit it to _some_ number, if we can show that
> a larger value would work better then we can update it of course.

My argument concerns your band-aid patch.
My point is: why is the retry count for order-0 reset when there is some
progress, while the retry counter for orders up to costly is not reset even
when there is some progress?

>> And, 16 retry looks not good to me because compaction could defer
>> actual doing up to 64 times.
>
> OK, this is something that needs to be handled in a better way. The
> primary question would be why to defer the compaction for <=
> PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
> why it makes sense it for the best effort mode of operation but !costly
> orders should be trying much harder as they are nofail, no?

Makes sense.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02  9:50             ` Michal Hocko
@ 2016-03-02 15:01               ` Minchan Kim
  -1 siblings, 0 replies; 299+ messages in thread
From: Minchan Kim @ 2016-03-02 15:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, linux-mm, LKML,
	Sergey Senozhatsky

On Wed, Mar 02, 2016 at 10:50:56AM +0100, Michal Hocko wrote:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> > On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
> > > > +	/*
> > > > +	 * OK, so the watermak check has failed. Make sure we do all the
> > > > +	 * retries for !costly high order requests and hope that multiple
> > > > +	 * runs of compaction will generate some high order ones for us.
> > > > +	 *
> > > > +	 * XXX: ideally we should teach the compaction to try _really_ hard
> > > > +	 * if we are in the retry path - something like priority 0 for the
> > > > +	 * reclaim
> > > > +	 */
> > > > +	if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > +		return true;
> > > > +
> > > >  	return false;
> > 
> > This seems not a proper fix. Checking watermark with high order has
> > another meaning that there is high order page or not. This isn't
> > what we want here.
> 
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.
> 
> > So, following fix is needed.
> 
> > 'if (order)' check isn't needed. It is used to clarify the meaning of
> > this fix. You can remove it.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1993894..8c80375 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> >         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> >                 return false;
> >  
> > +       /* To check whether compaction is available or not */
> > +       if (order)
> > +               order = 0;
> > +
> 
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
> 
> >         /*
> >          * Keep reclaiming pages while there is a chance this will lead
> >          * somewhere.  If none of the target zones can satisfy our allocation
> > 
> > > >  }
> > > >  
> > > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >  		goto noretry;
> > > >  
> > > >  	/*
> > > > -	 * Costly allocations might have made a progress but this doesn't mean
> > > > -	 * their order will become available due to high fragmentation so do
> > > > -	 * not reset the no progress counter for them
> > > > +	 * High order allocations might have made a progress but this doesn't
> > > > +	 * mean their order will become available due to high fragmentation so
> > > > +	 * do not reset the no progress counter for them
> > > >  	 */
> > > > -	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > +	if (did_some_progress && !order)
> > > >  		no_progress_loops = 0;
> > > >  	else
> > > >  		no_progress_loops++;
> > 
> > This unconditionally increases no_progress_loops for high order
> > allocation, so, after 16 iterations, it will fail. If compaction isn't
> > enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> > to make high order page. Should we consider this case also?
> 
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

I have been away from the compaction code for a long time, so a super hero
might have made it perfect by now, but I don't think that dream has come
true yet. I believe any algorithm has drawbacks, so we end up relying on a
fallback approach for the cases where compaction doesn't work correctly.

My suggestion is to reintroduce *lumpy reclaim* and kick it in only when
compaction has given up for some reason. That would be better than relying
on a random number of reclaim retries.

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection rework v4
  2016-03-02 14:34                   ` Joonsoo Kim
@ 2016-03-03  9:26                     ` Michal Hocko
  -1 siblings, 0 replies; 299+ messages in thread
From: Michal Hocko @ 2016-03-03  9:26 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Hugh Dickins, Linus Torvalds,
	Johannes Weiner, Mel Gorman, David Rientjes, Tetsuo Handa,
	Hillf Danton, KAMEZAWA Hiroyuki, Linux Memory Management List,
	LKML, Sergey Senozhatsky

On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> 2016-03-02 23:06 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> >> > [...]
> >> >> > > + /*
> >> >> > > +  * OK, so the watermak check has failed. Make sure we do all the
> >> >> > > +  * retries for !costly high order requests and hope that multiple
> >> >> > > +  * runs of compaction will generate some high order ones for us.
> >> >> > > +  *
> >> >> > > +  * XXX: ideally we should teach the compaction to try _really_ hard
> >> >> > > +  * if we are in the retry path - something like priority 0 for the
> >> >> > > +  * reclaim
> >> >> > > +  */
> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> >> > > +         return true;
> >> >> > > +
> >> >> > >   return false;
> >> >>
> >> >> This seems not a proper fix. Checking watermark with high order has
> >> >> another meaning that there is high order page or not. This isn't
> >> >> what we want here.
> >> >
> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> >> > form the requested order. The ordering on the LRU lists is pretty much
> >> > random wrt. pfn ordering. On the other hand if we have a page available
> >> > which is just hidden by watermarks then it makes perfect sense to retry
> >> > and free even order-0 pages.
> >>
> >> If we have >= order page available, we would not reach here. We would
> >> just allocate it.
> >
> > not really, we can still be under the low watermark. Note that the
> 
> you mean min watermark?

ohh, right...
 
> > target for the should_reclaim_retry watermark check includes also the
> > reclaimable memory.
> 
> I guess that usual case for high order allocation failure has enough freepage.

I am not sure I understand what you mean here, but I wouldn't be surprised
if a high-order allocation failed even with enough free pages. And that is
exactly why I am claiming that reclaiming more pages is no free ticket to
high-order pages.

[...]
> >> I just did quick review to your patches so maybe I am wrong.
> >> Am I missing something?
> >
> > The core idea behind should_reclaim_retry is to check whether the
> > reclaiming all the pages would help to get over the watermark and there
> > is at least one >= order page. Then it really makes sense to retry. As
> 
> How you can judge that reclaiming all the pages would help to check
> there is at least one >= order page?

Again, I am not sure I understand you here. __zone_watermark_ok checks both
the watermark and the availability of a page of sufficient order. While the
increased free_pages (which includes reclaimable pages as well) tells us
whether we have a chance to get over the min watermark, the order check
tells us whether we will have something to allocate from once we reach the
min watermark.
 
> > the compaction has already was performed before this is called we should
> > have created some high order pages already. The decay guarantees that we
> 
> Not really. Compaction could fail.

Yes, it could have failed. But then what is the point of retrying endlessly?

[...]
> >> At least, reset no_progress_loops when did_some_progress. High
> >> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >> as order 0. And, reclaim something would increase probability of
> >> compaction success.
> >
> > This is something I still do not understand. Why would reclaiming
> > random order-0 pages help compaction? Could you clarify this please?
> 
> I just can tell simple version. Please check the link from me on another reply.
> Compaction could scan more range of memory if we have more freepage.
> This is due to algorithm limitation. Anyway, so, reclaiming random
> order-0 pages helps compaction.

I will have a look at that code, but this just doesn't make any sense to me.
Compaction should be reshuffling pages; this shouldn't be a function of
free memory.

> >> Why do we limit retry as 16 times with no evidence of potential
> >> impossibility of making high order page?
> >
> > If we tried to compact 16 times without any progress then this sounds
> > like a sufficient evidence to me. Well, this number is somehow arbitrary
> > but the main point is to limit it to _some_ number, if we can show that
> > a larger value would work better then we can update it of course.
> 
> My arguing is for your band aid patch.
> My point is that why retry count for order-0 is reset if there is some progress,
> but, retry counter for order up to costly isn't reset even if there is
> some progress

Because we know that order-0 requests have a chance to proceed if we keep
reclaiming order-0 pages, while this is not true for order > 0. If we did
reset no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER,
then we would be back to the zone_reclaimable heuristic. Why? Because
order-0 reclaim progress would keep !costly requests in the reclaim loop
while compaction still might not make any progress. So we either have to
fail when __zone_watermark_ok fails for the order (which turned out to be
too easy to trigger) or have a fixed number of retries regardless of the
watermark check result. We cannot relax both unless we have other
measures in place.

Sure, we can be more intelligent and reset the counter if the
feedback from compaction is optimistic and we are making some
progress. That would be less hackish, and the XXX comment points in
that direction. For now I would like this to catch most loads reasonably
and to build better heuristics on top. I would like to do as much as
possible to close the obvious regressions, but I guess we have to expect
there will be cases where the OOM killer fires where it hasn't before, and
vice versa.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 299+ messages in thread

* Re: [PATCH 0/3] OOM detection