linux-mm.kvack.org archive mirror
* Re: Silent hang up caused by pages being not scanned?
@ 2015-10-14  8:03 Hillf Danton
  0 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2015-10-14  8:03 UTC (permalink / raw)
  To: 'Tetsuo Handa'
  Cc: Linus Torvalds, Michal Hocko, David Rientjes, Johannes Weiner,
	linux-kernel, linux-mm

> > 
> > In particular, I think that you'll find that you will have to change
> > the heuristics in __alloc_pages_slowpath() where we currently do
> > 
> >         if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
> > 
> > when the "did_some_progress" logic changes that radically.
> > 
> 
> Yes. But we can't simply do
> 
>	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> 
> because we won't be able to call out_of_memory(), can we?
>
Can you please try a simplified retry logic?

thanks
Hillf
--- a/mm/page_alloc.c	Wed Oct 14 14:45:28 2015
+++ b/mm/page_alloc.c	Wed Oct 14 15:43:31 2015
@@ -3154,8 +3154,7 @@ retry:
 
 	/* Keep reclaiming pages as long as there is reasonable progress */
 	pages_reclaimed += did_some_progress;
-	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
-	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
+	if (did_some_progress) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto retry;
--


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:49                       ` Tetsuo Handa
@ 2015-10-19 12:57                         ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:57 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: torvalds, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina, mgorman, riel

On Sat 17-10-15 03:49:39, Tetsuo Handa wrote:
> Linus Torvalds wrote:
> > Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> > you have? Does it seem to improve on your situation?
> 
> Yes, I already tried it and just replied to Michal.
> 
> I tested for one hour using various memory stressing programs.
> As far as I tested, I did not hit a silent hang up (

Thank you for your testing!

[...]

> The only problem I noticed is that the ratio of inactive_file/writeback
> (shown below) was high (compared to the one shown above) when I did

Yes, this is due to the lack of congestion on the bdi, as Linus expected.
Another patch I've just posted should help in that regard. At least it
seems to help in my testing.

[...]

> I can still hit an OOM livelock (
> 
>  MemAlloc-Info: X stalling task, Y dying task, Z victim task.
> 
> where X > 0 && Y > 0).

This seems to be a separate issue, though.
-- 
Michal Hocko
SUSE Labs



* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34                     ` Linus Torvalds
  2015-10-16 18:49                       ` Tetsuo Handa
@ 2015-10-19 12:53                       ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2015-10-19 12:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri 16-10-15 11:34:48, Linus Torvalds wrote:
> On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > OK so here is what I am playing with currently. It is not complete
> > yet.
> 
> So this looks like it's going in a reasonable direction. However:
> 
> > +               if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > +                               ac->high_zoneidx, alloc_flags, target)) {
> > +                       /* Wait for some write requests to complete then retry */
> > +                       wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > +                       goto retry;
> > +               }
> 
> I still think we should at least spend some time re-thinking that
> "wait_iff_congested()" thing.

You are right. I thought we would be congested most of the time because
of the heavy IO, but a quick test has shown that the zone is marked
congested while nr_wb_congested is zero all the time. That is most
probably because the IO is throttled severely by the lack of memory as
well.

> We may not actually be congested, but
> might be unable to write anything out because of our allocation flags
> (ie not allowed to recurse into the filesystems), so we might be in
> the situation that we have a lot of dirty pages that we can't directly
> do anything about.
> 
> Now, we will have woken kswapd, so something *will* hopefully be done
> about them eventually, but at no time do we actually really wait for
> it. We'll just busy-loop.
> 
> So at a minimum, I think we should yield to kswapd. We do do that
> "cond_resched()" in wait_iff_congested(), but I'm not entirely
> convinced that is at all enough to wait for kswapd to *do* something.

I went with congestion_wait, which is what we used to do in the past
before wait_iff_congested was introduced. The primary reason for
the change was that congestion_wait used to cause unhealthy stalls in
the direct reclaim when the bdi wasn't really congested, so we were
sleeping for the full timeout.

Now I think we can do better even with congestion_wait. We do not have
to wait when we made some progress (did_some_progress), so we won't affect
the regular direct reclaim path, and we can reduce the sleeping to:

dirty+writeback > reclaimable/2

This is a good signal that the most likely reason for the lack of
progress is stale IO, and that we need to wait even if the bdi itself
is not congested. We can also increase the timeout to HZ/10 because this
is an extremely slow path - we are not making any progress and stalling
is better than OOM.

This is a diff on top of the previous patch. I even think that this part
would deserve a separate patch for better bisectability. My testing
shows that the close-to-OOM case behaves better (I can use more memory
for memeaters without hitting OOM).

What do you think?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e28028681c59..fed1bb7ea43a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,8 +3187,21 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 */
 		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
 				ac->high_zoneidx, alloc_flags, target)) {
-			/* Wait for some write requests to complete then retry */
-			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+			unsigned long writeback = zone_page_state(zone, NR_WRITEBACK),
+				      dirty = zone_page_state(zone, NR_FILE_DIRTY);
+			if (did_some_progress)
+				goto retry;
+
+			/*
+			 * If we didn't make any progress and have a lot of
+			 * dirty + writeback pages then we should wait for
+			 * an IO to complete to slow down the reclaim and
+			 * prevent a premature OOM
+			 */
+			if (2*(writeback + dirty) > reclaimable)
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+			else
+				cond_resched();
 			goto retry;
 		}
 	}

-- 
Michal Hocko
SUSE Labs



* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 18:34                     ` Linus Torvalds
@ 2015-10-16 18:49                       ` Tetsuo Handa
  2015-10-19 12:57                         ` Michal Hocko
  2015-10-19 12:53                       ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-16 18:49 UTC (permalink / raw)
  To: torvalds, mhocko
  Cc: rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov, linux-mm,
	linux-kernel, skozina, mgorman, riel

Linus Torvalds wrote:
> Tetsuo, mind trying it out and maybe tweaking it a bit for the load
> you have? Does it seem to improve on your situation?

Yes, I already tried it and just replied to Michal.

I tested for one hour using various memory stressing programs.
As far as I tested, I did not hit a silent hang up (

 MemAlloc-Info: X stalling task, 0 dying task, 0 victim task.

where X > 0).

----------------------------------------
[  134.510993] Mem-Info:
[  134.511940] active_anon:408777 inactive_anon:2088 isolated_anon:24
[  134.511940]  active_file:15 inactive_file:24 isolated_file:0
[  134.511940]  unevictable:0 dirty:4 writeback:1 unstable:0
[  134.511940]  slab_reclaimable:3109 slab_unreclaimable:5594
[  134.511940]  mapped:679 shmem:2156 pagetables:2077 bounce:0
[  134.511940]  free:12911 free_pcp:31 free_cma:0
[  134.521256] Node 0 DMA free:7256kB min:400kB low:500kB high:600kB active_anon:6560kB inactive_anon:180kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:80kB shmem:184kB slab_reclaimable:236kB slab_unreclaimable:296kB kernel_stack:48kB pagetables:556kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  134.532779] lowmem_reserve[]: 0 1714 1714 1714
[  134.534455] Node 0 DMA32 free:44388kB min:44652kB low:55812kB high:66976kB active_anon:1628548kB inactive_anon:8172kB active_file:60kB inactive_file:96kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:16kB writeback:4kB mapped:2636kB shmem:8440kB slab_reclaimable:12200kB slab_unreclaimable:22080kB kernel_stack:3584kB pagetables:7752kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1016 all_unreclaimable? yes
[  134.545830] lowmem_reserve[]: 0 0 0 0
[  134.547404] Node 0 DMA: 16*4kB (UME) 16*8kB (UME) 10*16kB (UME) 6*32kB (UME) 1*64kB (M) 2*128kB (UE) 1*256kB (M) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7264kB
[  134.552766] Node 0 DMA32: 1158*4kB (UME) 638*8kB (UE) 244*16kB (UME) 163*32kB (UE) 73*64kB (UE) 34*128kB (UME) 17*256kB (UME) 10*512kB (UME) 7*1024kB (UM) 0*2048kB 0*4096kB = 44520kB
[  134.558111] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  134.560358] 2195 total pagecache pages
[  134.562043] 0 pages in swap cache
[  134.563604] Swap cache stats: add 0, delete 0, find 0/0
[  134.565441] Free swap  = 0kB
[  134.567015] Total swap = 0kB
[  134.568628] 524157 pages RAM
[  134.570034] 0 pages HighMem/MovableOnly
[  134.571681] 80368 pages reserved
[  134.573467] 0 pages hwpoisoned
----------------------------------------

The only problem I noticed is that the ratio of inactive_file/writeback
(shown below) was high (compared to the one shown above) when I did

  $ cat < /dev/zero > /tmp/file1 & cat < /dev/zero > /tmp/file2 & cat < /dev/zero > /tmp/file3 & sleep 10; ./a.out; killall cat

but I think this patch is better than the current code.

----------------------------------------
[ 1135.909600] Mem-Info:
[ 1135.910686] active_anon:321011 inactive_anon:4664 isolated_anon:0
[ 1135.910686]  active_file:3170 inactive_file:78035 isolated_file:512
[ 1135.910686]  unevictable:0 dirty:0 writeback:78618 unstable:0
[ 1135.910686]  slab_reclaimable:5739 slab_unreclaimable:6170
[ 1135.910686]  mapped:4666 shmem:8300 pagetables:1966 bounce:0
[ 1135.910686]  free:12938 free_pcp:0 free_cma:0
[ 1135.925255] Node 0 DMA free:7232kB min:400kB low:500kB high:600kB active_anon:5852kB inactive_anon:196kB active_file:120kB inactive_file:980kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:968kB mapped:248kB shmem:388kB slab_reclaimable:316kB slab_unreclaimable:272kB kernel_stack:64kB pagetables:100kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:7444 all_unreclaimable? yes
[ 1135.936728] lowmem_reserve[]: 0 1714 1714 1714
[ 1135.938486] Node 0 DMA32 free:44520kB min:44652kB low:55812kB high:66976kB active_anon:1278192kB inactive_anon:18460kB active_file:12560kB inactive_file:313176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1759252kB mlocked:0kB dirty:0kB writeback:313504kB mapped:18416kB shmem:32812kB slab_reclaimable:22640kB slab_unreclaimable:24408kB kernel_stack:4240kB pagetables:7764kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2957668 all_unreclaimable? yes
[ 1135.950355] lowmem_reserve[]: 0 0 0 0
[ 1135.952011] Node 0 DMA: 7*4kB (U) 14*8kB (UM) 13*16kB (UM) 6*32kB (UME) 1*64kB (M) 4*128kB (UME) 2*256kB (UM) 3*512kB (UME) 2*1024kB (UE) 1*2048kB (M) 0*4096kB = 7260kB
[ 1135.957169] Node 0 DMA32: 241*4kB (UE) 929*8kB (UE) 496*16kB (UME) 277*32kB (UE) 135*64kB (UME) 17*128kB (UME) 3*256kB (E) 16*512kB (ME) 0*1024kB 0*2048kB 0*4096kB = 44972kB
[ 1135.963047] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1135.965472] 90009 total pagecache pages
[ 1135.967078] 0 pages in swap cache
[ 1135.968581] Swap cache stats: add 0, delete 0, find 0/0
[ 1135.970424] Free swap  = 0kB
[ 1135.971828] Total swap = 0kB
[ 1135.973248] 524157 pages RAM
[ 1135.974655] 0 pages HighMem/MovableOnly
[ 1135.976230] 80368 pages reserved
[ 1135.977745] 0 pages hwpoisoned
----------------------------------------

I can still hit an OOM livelock (

 MemAlloc-Info: X stalling task, Y dying task, Z victim task.

where X > 0 && Y > 0).



* Re: Silent hang up caused by pages being not scanned?
  2015-10-16 15:57                   ` Michal Hocko
@ 2015-10-16 18:34                     ` Linus Torvalds
  2015-10-16 18:49                       ` Tetsuo Handa
  2015-10-19 12:53                       ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-16 18:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Fri, Oct 16, 2015 at 8:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> OK so here is what I am playing with currently. It is not complete
> yet.

So this looks like it's going in a reasonable direction. However:

> +               if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +                               ac->high_zoneidx, alloc_flags, target)) {
> +                       /* Wait for some write requests to complete then retry */
> +                       wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> +                       goto retry;
> +               }

I still think we should at least spend some time re-thinking that
"wait_iff_congested()" thing. We may not actually be congested, but
might be unable to write anything out because of our allocation flags
(ie not allowed to recurse into the filesystems), so we might be in
the situation that we have a lot of dirty pages that we can't directly
do anything about.

Now, we will have woken kswapd, so something *will* hopefully be done
about them eventually, but at no time do we actually really wait for
it. We'll just busy-loop.

So at a minimum, I think we should yield to kswapd. We do do that
"cond_resched()" in wait_iff_congested(), but I'm not entirely
convinced that is at all enough to wait for kswapd to *do* something.

So before we really decide to see if we should oom, I think we should
have at least one forced io_schedule_timeout(), whether we're
congested or not.

And yes, as Tetsuo Handa said, any kind of short wait might be too
short for IO to really complete, but *something* will have completed.
Unless we're so far up the creek that we really should just oom.

But I suspect we'll have to just try things out and tweak it. This
patch looks like a reasonable starting point to me.

Tetsuo, mind trying it out and maybe tweaking it a bit for the load
you have? Does it seem to improve on your situation?

                   Linus



* Re: Silent hang up caused by pages being not scanned?
  2015-10-15 13:14                 ` Michal Hocko
@ 2015-10-16 15:57                   ` Michal Hocko
  2015-10-16 18:34                     ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-16 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

On Thu 15-10-15 15:14:09, Michal Hocko wrote:
> On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
[...]
> > Now, I realize the above suggestions are big changes, and they'll
> > likely break things and we'll still need to tweak things, but dammit,
> > wouldn't that be better than just randomly tweaking the insane
> > zone_reclaimable logic?
> 
> Yes, zone_reclaimable is subtle and imho it is used even at the
> wrong level. We should decide whether we are really OOM at
> __alloc_pages_slowpath. We definitely need a big-picture logic to tell
> us when it makes sense to drop the ball and trigger the OOM killer or
> fail the allocation request.
> 
> E.g. free + reclaimable + writeback < min_wmark on all usable zones for
> more than X rounds of direct reclaim without any progress is
> a sufficient signal to go OOM. Costly/noretry allocations can fail
> earlier, of course. This is obviously a half-baked idea which needs much
> more consideration; all I am trying to say is that we need a high-level
> metric to tell the OOM condition.

OK so here is what I am playing with currently. It is not complete
yet. Anyway, I have tested it with 2 scenarios on a swapless system with
2G of RAM; both do:

$ cat writer.sh
#!/bin/sh
size=$((1<<30))
block=$((4<<10))

writer()
{
	(
        while true
        do
                dd if=/dev/zero of=/mnt/data/file.$1 bs=$block count=$(($size/$block))
                rm /mnt/data/file.$1
                sync
        done
	) &
}

writer 1
writer 2

sleep 10s # allow to accumulate enough dirty pages

1) massive OOM
Start 100 memeaters, each 80M, running in parallel (anon private
MAP_POPULATE mappings). This will trigger many OOM kills, and the overall
count is what I was interested in. The test is considered finished when we
reach a steady state - writers can make progress and there is no more OOM
killing for some time.

$ grep "invoked oom-killer" base-run-oom.log | wc -l
78
$ grep "invoked oom-killer" test-run-oom.log | wc -l
63

So it looks like we triggered less OOM killing with the patch
applied. I haven't checked those too closely, but it seems that at least
two instances might not have triggered with the current implementation
because the DMA32 zone is considered reclaimable. But this check is
inherently racy, so we cannot be sure.
$ grep "DMA32.*all_unreclaimable? no" test2-run-oom.log | wc -l
2

2) almost OOM situation
Invoke 10 memeaters in parallel and try to fill up all the memory
without triggering the OOM killer. This is quite hard and it required a
lot of tuning. I've ended up with:
#!/bin/sh
pkill mem_eater
sync
echo 3 > /proc/sys/vm/drop_caches
sync
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
sh writer.sh &
sleep 10s
for i in $(seq 10)
do
        memcg_test/tools/mem_eater $size &
done

wait

and this one doesn't hit the OOM killer with the original implementation
while it hits it with the patch applied:
[   32.727001] DMA32 free:5428kB min:5532kB low:6912kB high:8296kB active_anon:1802520kB inactive_anon:204kB active_file:6692kB inactive_file:137184kB unevictable:0kB isolated(anon):136kB isolated(file):32kB present:2080640kB managed:1997880kB mlocked:0kB dirty:0kB writeback:137168kB mapped:6408kB shmem:204kB slab_reclaimable:20472kB slab_unreclaimable:13276kB kernel_stack:1456kB pagetables:4756kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:948764 all_unreclaimable? yes

There is a lot of memory under writeback but all_unreclaimable is yes,
so it may just be a coincidence that we haven't triggered the OOM killer
in the original kernel.

Anyway, the two implementations will be hard to compare because the
workloads are very different, but I think something like below should be
more readable and deterministic than what we have right now. It will need
some more tuning for sure and I will be playing with it some more. I
would just like to hear opinions on whether this approach makes sense.
If yes, I will post it separately in a new thread for a wider discussion.
This email thread seems to be full of detours already.
---


* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37               ` Linus Torvalds
  2015-10-14 12:21                 ` Tetsuo Handa
@ 2015-10-15 13:14                 ` Michal Hocko
  2015-10-16 15:57                   ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-15 13:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina, Mel Gorman, Rik van Riel

[CC Mel and Rik as well - this has diverged from the original thread
 considerably but the current topic started here:
 http://lkml.kernel.org/r/201510130025.EJF21331.FFOQJtVOMLFHSO%40I-love.SAKURA.ne.jp
]

On Tue 13-10-15 09:37:06, Linus Torvalds wrote:
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.

I do agree that zone_reclaimable is a subtle and hackish way to wait for
the writeback/kswapd to clean up pages which cannot be reclaimed from
the direct reclaim.

> I'd suggest something like:
> 
>  - add a "retry count"
> 
>  - if direct reclaim made no progress, or made less progress than the target:
> 
>       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;
> 
>  - regardless of whether we made progress or not:
> 
>       if (retry count < X) goto retry;
> 
>       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
> goto retry

This will certainly cap the reclaim retries, but there are risks with
this approach AFAICS.

First of all, other allocators might piggyback on the current reclaimer
and push it to the OOM killer even when we are not really OOM. Maybe
this is possible currently as well, but it is less likely because
NR_PAGES_SCANNED is reset on a freed page, which allows the reclaimer
another round.

I am also not sure it would help with pathological cases like the
one discussed here. If you have only a small amount of reclaimable
memory on the LRU lists then you scan it quite quickly, which will
consume retries. Maybe a sufficient timeout can help, but I am afraid we
can still hit the OOM prematurely because a large part of the memory
is still under writeback (which might be a slow device - e.g. a USB
stick).

We used to have this kind of problem in the memcg reclaim. We do not
have (resp. didn't have until recently, with CONFIG_CGROUP_WRITEBACK)
dirty memory throttling for memory cgroups, so the LRU could become full
of dirty data really quickly, and that led to the memcg OOM killer.
We are not doing zone_reclaimable and other heuristics there, so we had
to explicitly wait_on_page_writeback in the reclaim to prevent a
premature OOM kill. An ugly hack, but the only thing that worked
reliably. Time-based solutions were tried and failed with different
workloads, quite randomly depending on the load/storage.

>    where 'X" is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).
> 
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
> 
>     unsigned long timeout = jiffies + HZ/4;
> 
> at the top of the function, and making the whole retry logic actually
> say something like
> 
>     if (time_after(timeout, jiffies)) goto noretry;
> 
> (or make *that* trigger the oom logic, or whatever).
> 
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?

Yes, zone_reclaimable is subtle and imho it is used even at the
wrong level. We should decide whether we are really OOM at
__alloc_pages_slowpath. We definitely need a big-picture logic to tell
us when it makes sense to drop the ball and trigger the OOM killer or
fail the allocation request.

E.g. free + reclaimable + writeback < min_wmark on all usable zones for
more than X rounds of direct reclaim without any progress is
a sufficient signal to go OOM. Costly/noretry allocations can fail
earlier, of course. This is obviously a half-baked idea which needs much
more consideration; all I am trying to say is that we need a high-level
metric to tell the OOM condition.

-- 
Michal Hocko
SUSE Labs



* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:59                   ` Michal Hocko
@ 2015-10-14 15:06                     ` Tetsuo Handa
  0 siblings, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 15:06 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > > configuration on that system.
> > 
> > All values are defaults of plain CentOS 7 installation.
> 
> So this is 3.10 kernel, right?

The userland is CentOS 7 but the kernel is linux-next-20151009.



* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 14:38                 ` Tetsuo Handa
@ 2015-10-14 14:59                   ` Michal Hocko
  2015-10-14 15:06                     ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 14:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 23:38:00, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Why hasn't balance_dirty_pages throttled writers and allowed them to
> > make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> > configuration on that system.
> 
> All values are defaults of plain CentOS 7 installation.

So this is 3.10 kernel, right?

> # sysctl -a | grep ^vm.
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 30
[...]

OK, this is nothing unusual. And I _suspect_ that the throttling simply
didn't cope with the writer speed and a large anon memory consumer.
The dirtyable memory was quite high until your anon hammer kicked in
and reduced it, so the file LRU was full of dirty pages by the time
we got under serious memory pressure. Anonymous pages are not
reclaimable on a swapless system, so the whole memory pressure goes to
the file LRUs, and bang.

> > Also, why hasn't throttle_vm_writeout slowed the reclaim down?
> 
> Too difficult question for me.
> 
> > 
> > Anyway this is exactly the case where zone_reclaimable helps us to
> > prevent OOM because we are looping over the remaining LRU pages without
> > making progress... This just shows how subtle all this is :/
> > 
> > I have to think about this much more..
> 
> I'm suspicious about tweaking the current reclaim logic.
> Could you please respond to Linus's comments?

Yes I plan to I just didn't get to finish my email yet.
 
> There are more moles than kernel developers can find. I think that
> what we can do in the short term is to prepare for the moles that
> kernel developers could not find, and in the long term to reform the
> page allocator to prevent such moles from appearing.

This is much easier said than done :/ The current code is full of
heuristics grown over time based on very different requirements from
different kernel subsystems. There is no simple solution for this
problem, I am afraid.
-- 
Michal Hocko
SUSE Labs



* Re: Silent hang up caused by pages being not scanned?
  2015-10-14 13:22               ` Michal Hocko
@ 2015-10-14 14:38                 ` Tetsuo Handa
  2015-10-14 14:59                   ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 14:38 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> The OOM report is really interesting:
> 
> > [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> 
> so your whole file LRUs are either dirty or under writeback and
> reclaimable pages are below min wmark. This alone is quite suspicious.

I did

  $ cat < /dev/zero > /tmp/log

for 10 seconds before starting

  $ ./a.out

Thus, so much memory was waiting for writeback on the XFS filesystem.

> Why hasn't balance_dirty_pages throttled writers and allowed them to
> make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
> configuration on that system.

All values are defaults of plain CentOS 7 installation.

# sysctl -a | grep ^vm.
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.compact_unevictable_allowed = 1
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 30
vm.dirty_writeback_centisecs = 500
vm.dirtytime_expire_seconds = 43200
vm.drop_caches = 0
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256   256     32
vm.max_map_count = 65530
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 45056
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 30
vm.user_reserve_kbytes = 54808
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0

> 
> Also, why hasn't throttle_vm_writeout slowed the reclaim down?

Too difficult a question for me.

> 
> Anyway this is exactly the case where zone_reclaimable helps us to
> prevent OOM because we are looping over the remaining LRU pages without
> making progress... This just shows how subtle all this is :/
> 
> I have to think about this much more..

I'm suspicious about tweaking the current reclaim logic.
Could you please respond to Linus's comments?

There are more moles than kernel developers can find. I think that what
we can do in the short term is prepare for the moles that kernel
developers could not find, and in the long term reform the page
allocator so that such moles cannot live there in the first place.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:19             ` Tetsuo Handa
@ 2015-10-14 13:22               ` Michal Hocko
  2015-10-14 14:38                 ` Tetsuo Handa
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-14 13:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Wed 14-10-15 01:19:09, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I can see two options here. Either we teach zone_reclaimable to be less
> > fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> > them are risky because we have a long history of changes in this area
> > which made other subtle behavior changes but I guess that the first
> > option should be less fragile. What about the following patch? I am not
> > happy about it because the condition is rather rough and a deeper
> > inspection is really needed to check all the call sites but it should be
> > good for testing.
> 
> While zone_reclaimable() for Node 0 DMA32 became false by your patch,
> zone_reclaimable() for Node 0 DMA kept returning true, and as a result
> overall result (i.e. zones_reclaimable) remained true.

Ahh, right you are. ZONE_DMA might have 0 or close to 0 pages on
LRUs while it is still protected from allocations which are not
targeted for this zone. My patch clearly hasn't considered that. The
fix for that would be quite straightforward. We have to consider
lowmem_reserve of the zone wrt. the allocation/reclaim gfp target
zone. But this is getting more and more ugly (see the patch below just
for testing/demonstration purposes).

The OOM report is really interesting:

> [   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

so your whole file LRUs are either dirty or under writeback and
reclaimable pages are below min wmark. This alone is quite suspicious.
Why hasn't balance_dirty_pages throttled the writers, rather than allowing
them to make the whole LRU dirty? What is your dirty{_background}_{ratio,bytes}
configuration on that system?

Also, why hasn't throttle_vm_writeout slowed the reclaim down?

Anyway this is exactly the case where zone_reclaimable helps us to
prevent OOM because we are looping over the remaining LRU pages without
making progress... This just shows how subtle all this is :/

I have to think about this much more..
---

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 16:37               ` Linus Torvalds
@ 2015-10-14 12:21                 ` Tetsuo Handa
  2015-10-15 13:14                 ` Michal Hocko
  1 sibling, 0 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-14 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > If I remove
> >
> >         /* Any of the zones still reclaimable?  Don't OOM. */
> >         if (zones_reclaimable)
> >                 return 1;
> >
> the OOM killer is invoked even when there is so much memory that can be
> reclaimed after being written to disk. This is definitely a premature
> invocation of the OOM killer.
> 
> Right. The rest of the code knows that the return value right now
> means "there is no memory at all" rather than "I made progress".
> 
> > Yes. But we can't simply do
> >
> >         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
> >
> > because we won't be able to call out_of_memory(), can we?
> 
> So I think that whole thing is kind of senseless. Not just that
> particular conditional, but what it *does* too.
> 
> What can easily happen is that we are a blocking allocation, but
> because we're __GFP_FS or something, the code doesn't actually start
> writing anything out. Nor is anything congested. So the thing just
> loops.

congestion_wait() sounds like a source of silent hang-ups.
http://lkml.kernel.org/r/201406052145.CIB35534.OQLVMSJFOHtFOF@I-love.SAKURA.ne.jp

> 
> And looping is stupid, because we may not be able to actually free
> anything exactly because of limitations like __GFP_FS.
> 
> So
> 
>  (a) the looping condition is senseless
> 
>  (b) what we do when looping is senseless
> 
> and we actually do try to wake up kswapd in the loop, but we never
> *wait* for it, so that's largely pointless too.

Aren't we waiting for kswapd forever?
In other words, we never check whether kswapd can make some progress.
http://lkml.kernel.org/r/20150812091104.GA14940@dhcp22.suse.cz

> 
> So *of*course* the direct reclaim code has to set "I made progress",
> because if it doesn't lie and say so, then the code will randomly not
> loop, and will oom, and things go to hell.
> 
> But I hate the "let's tweak the zone_reclaimable" idea, because it
> doesn't actually fix anything. It just perpetuates this "the code
> doesn't make sense, so let's add *more* senseless heuristics to this
> whole loop".

I also don't think that tweaking the current reclaim logic solves the bugs
which bothered me with unexplained hangups / reboots.
To me, the current memory allocator is so puzzling that it is as if

   if (there_is_much_free_memory() == TRUE)
       goto OK;
   if (do_some_heuristic1() == SUCCESS)
       goto OK;
   if (do_some_heuristic2() == SUCCESS)
       goto OK;
   if (do_some_heuristic3() == SUCCESS)
       goto OK;
   (...snipped...)
   if (do_some_heuristicN() == SUCCESS)
       goto OK;
   while (1);

and we don't know how many heuristics we need to add in order to avoid
reaching the "while (1);". (We are reaching the "while (1);" before

   if (out_of_memory() == SUCCESS)
       goto OK;

is called.)

> 
> So instead of that senseless thing, how about trying something
> *sensible*. Make the code do something that we can actually explain as
> making sense.
> 
> I'd suggest something like:
> 
>  - add a "retry count"
> 
>  - if direct reclaim made no progress, or made less progress than the target:
> 
>       if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

Yes.

> 
>  - regardless of whether we made progress or not:
> 
>       if (retry count < X) goto retry;
> 
>       if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
> goto retry

I tried sleeping to reduce CPU usage and to allow reporting via SysRq-w.
http://lkml.kernel.org/r/201411231353.BDE90173.FQOMJtHOLVFOFS@I-love.SAKURA.ne.jp

I complained at http://lkml.kernel.org/r/201502162023.GGE26089.tJOOFQMFFHLOVS@I-love.SAKURA.ne.jp

| Oh, why every thread trying to allocate memory has to repeat
| the loop that might defer somebody who can make progress if CPU time was
| given? I wish only somebody like kswapd repeats the loop on behalf of all
| threads waiting at memory allocation slowpath...

Direct reclaim can defer termination upon SIGKILL if blocked on an
unkillable lock. If performance were not a problem, would direct reclaim
be mandatory?

Of course, performance is the problem. Thus we would try direct reclaim
at least once. But I wish the memory allocation logic were as simple as

  (1) If there are enough free memory, allocate it.

  (2) If there are not enough free memory, join on the
      waitqueue list

        wait_event_timeout(waiter, memory_reclaimed, timeout)

      and wait for reclaiming kernel threads (e.g. kswapd) to wake
      the waiters up. If the caller is willing to give up upon SIGKILL
      (e.g. __GFP_KILLABLE) then

        wait_event_killable_timeout(waiter, memory_reclaimed, timeout)

      and return NULL upon SIGKILL.

  (3) Whenever reclaiming kernel threads reclaimed memory and there are
      waiters, wake the waiters up.

  (4) If reclaiming kernel threads cannot reclaim memory,
      the caller will wake up due to timeout, and invoke the OOM
      killer unless the caller does not want (e.g. __GFP_NO_OOMKILL).

> 
>    where 'X' is something sane that limits our CPU use, but also
> guarantees that we don't end up waiting *too* long (if a single
> allocation takes more than a big fraction of a second, we should
> probably stop trying).

Isn't a second too short for waiting for swapping / writeback?

> 
> The whole time-based thing might even be explicit. There's nothing
> wrong with doing something like
> 
>     unsigned long timeout = jiffies + HZ/4;
> 
> at the top of the function, and making the whole retry logic actually
> say something like
> 
>     if (time_after(jiffies, timeout)) goto noretry;
> 
> (or make *that* trigger the oom logic, or whatever).

I prefer the time-based thing, because my customer's usage (where the
watchdog timeout is configured to 10 seconds) requires kernel messages
(maybe OOM killer messages) to be printed within a few seconds.

> 
> Now, I realize the above suggestions are big changes, and they'll
> likely break things and we'll still need to tweak things, but dammit,
> wouldn't that be better than just randomly tweaking the insane
> zone_reclaimable logic?
> 
>                     Linus

Yes, these will be big changes. But they will be better than living with
"no means are available for understanding what was happening" vs.
"really interesting things are observed when the means are available".

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 12:21             ` Tetsuo Handa
@ 2015-10-13 16:37               ` Linus Torvalds
  2015-10-14 12:21                 ` Tetsuo Handa
  2015-10-15 13:14                 ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Linus Torvalds @ 2015-10-13 16:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Tue, Oct 13, 2015 at 5:21 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> If I remove
>
>         /* Any of the zones still reclaimable?  Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
>
> the OOM killer is invoked even when there is so much memory that can be
> reclaimed after being written to disk. This is definitely a premature
> invocation of the OOM killer.

Right. The rest of the code knows that the return value right now
means "there is no memory at all" rather than "I made progress".

> Yes. But we can't simply do
>
>         if (order <= PAGE_ALLOC_COSTLY_ORDER || ..
>
> because we won't be able to call out_of_memory(), can we?

So I think that whole thing is kind of senseless. Not just that
particular conditional, but what it *does* too.

What can easily happen is that we are a blocking allocation, but
because we're __GFP_FS or something, the code doesn't actually start
writing anything out. Nor is anything congested. So the thing just
loops.

And looping is stupid, because we may not be able to actually free
anything exactly because of limitations like __GFP_FS.

So

 (a) the looping condition is senseless

 (b) what we do when looping is senseless

and we actually do try to wake up kswapd in the loop, but we never
*wait* for it, so that's largely pointless too.

So *of*course* the direct reclaim code has to set "I made progress",
because if it doesn't lie and say so, then the code will randomly not
loop, and will oom, and things go to hell.

But I hate the "let's tweak the zone_reclaimable" idea, because it
doesn't actually fix anything. It just perpetuates this "the code
doesn't make sense, so let's add *more* senseless heuristics to this
whole loop".

So instead of that senseless thing, how about trying something
*sensible*. Make the code do something that we can actually explain as
making sense.

I'd suggest something like:

 - add a "retry count"

 - if direct reclaim made no progress, or made less progress than the target:

      if (order > PAGE_ALLOC_COSTLY_ORDER) goto noretry;

 - regardless of whether we made progress or not:

      if (retry count < X) goto retry;

      if (retry count < 2*X) yield/sleep 10ms/wait-for-kswapd and then
goto retry

   where 'X' is something sane that limits our CPU use, but also
guarantees that we don't end up waiting *too* long (if a single
allocation takes more than a big fraction of a second, we should
probably stop trying).

The whole time-based thing might even be explicit. There's nothing
wrong with doing something like

    unsigned long timeout = jiffies + HZ/4;

at the top of the function, and making the whole retry logic actually
say something like

    if (time_after(jiffies, timeout)) goto noretry;

(or make *that* trigger the oom logic, or whatever).

Now, I realize the above suggestions are big changes, and they'll
likely break things and we'll still need to tweak things, but dammit,
wouldn't that be better than just randomly tweaking the insane
zone_reclaimable logic?

                    Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-13 13:32           ` Michal Hocko
@ 2015-10-13 16:19             ` Tetsuo Handa
  2015-10-14 13:22               ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 16:19 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Michal Hocko wrote:
> I can see two options here. Either we teach zone_reclaimable to be less
> fragile or remove zone_reclaimable from shrink_zones altogether. Both of
> them are risky because we have a long history of changes in this area
> which made other subtle behavior changes but I guess that the first
> option should be less fragile. What about the following patch? I am not
> happy about it because the condition is rather rough and a deeper
> inspection is really needed to check all the call sites but it should be
> good for testing.

While zone_reclaimable() for Node 0 DMA32 became false with your patch,
zone_reclaimable() for Node 0 DMA kept returning true, and as a result
the overall result (i.e. zones_reclaimable) remained true.

  $ ./a.out

---------- When there is no data to write ----------
[  162.942371] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=16
[  162.944541] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  162.946560] zone_reclaimable returned 1 at line 2665
[  162.948722] shrink_zones returned 1 at line 2716
(...snipped...)
[  164.897587] zones_reclaimable=1 at line 2775
[  164.899172] do_try_to_free_pages returned 1 at line 2948
[  167.087119] __perform_reclaim returned 1 at line 2854
[  167.088868] did_some_progress=1 at line 3301
(...snipped...)
[  261.577944] MIN=11163 FREE=11155 (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  261.580093] MIN=100 FREE=1824 (ACTIVE_FILE=3+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  261.582333] zone_reclaimable returned 1 at line 2665
[  261.583841] shrink_zones returned 1 at line 2716
(...snipped...)
[  264.728434] zones_reclaimable=1 at line 2775
[  264.730002] do_try_to_free_pages returned 1 at line 2948
[  268.191368] __perform_reclaim returned 1 at line 2854
[  268.193113] did_some_progress=1 at line 3301
---------- When there is no data to write ----------

Complete log (with your patch inside) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151014.txt.xz .

By the way, the OOM killer seems to be invoked prematurely for a different
load if your patch is applied.

  $ cat < /dev/zero > /tmp/log & sleep 10; ./a.out

---------- When there is a lot of data to write ----------
[   69.019271] Mem-Info:
[   69.019755] active_anon:335006 inactive_anon:2084 isolated_anon:23
[   69.019755]  active_file:12197 inactive_file:65310 isolated_file:31
[   69.019755]  unevictable:0 dirty:533 writeback:51020 unstable:0
[   69.019755]  slab_reclaimable:4753 slab_unreclaimable:4134
[   69.019755]  mapped:9639 shmem:2144 pagetables:2030 bounce:0
[   69.019755]  free:12972 free_pcp:45 free_cma:0
[   69.026260] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5232kB inactive_anon:96kB active_file:424kB inactive_file:1068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:164kB writeback:972kB mapped:416kB shmem:104kB slab_reclaimable:304kB slab_unreclaimable:244kB kernel_stack:96kB pagetables:256kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[   69.037189] lowmem_reserve[]: 0 1729 1729 1729
[   69.039152] Node 0 DMA32 free:74224kB min:44652kB low:55812kB high:66976kB active_anon:1334792kB inactive_anon:8240kB active_file:48364kB inactive_file:230752kB unevictable:0kB isolated(anon):92kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:9328kB writeback:199060kB mapped:38140kB shmem:8472kB slab_reclaimable:17840kB slab_unreclaimable:16292kB kernel_stack:3840kB pagetables:7864kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   69.052017] lowmem_reserve[]: 0 0 0 0
[   69.053818] Node 0 DMA: 17*4kB (UME) 8*8kB (UME) 6*16kB (UME) 2*32kB (UM) 2*64kB (UE) 4*128kB (UME) 1*256kB (U) 2*512kB (UE) 3*1024kB (UME) 1*2048kB (U) 0*4096kB = 7332kB
[   69.059597] Node 0 DMA32: 632*4kB (UME) 454*8kB (UME) 507*16kB (UME) 310*32kB (UME) 177*64kB (UE) 61*128kB (UME) 15*256kB (ME) 19*512kB (M) 10*1024kB (M) 0*2048kB 0*4096kB = 67136kB
[   69.065810] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   69.068305] 72477 total pagecache pages
[   69.069932] 0 pages in swap cache
[   69.071435] Swap cache stats: add 0, delete 0, find 0/0
[   69.073354] Free swap  = 0kB
[   69.074822] Total swap = 0kB
[   69.076660] 524157 pages RAM
[   69.078113] 0 pages HighMem/MovableOnly
[   69.079930] 76615 pages reserved
[   69.081406] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
  2015-10-12 21:23           ` Linus Torvalds
@ 2015-10-13 13:32           ` Michal Hocko
  2015-10-13 16:19             ` Tetsuo Handa
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2015-10-13 13:32 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

On Tue 13-10-15 00:25:53, Tetsuo Handa wrote:
[...]
> What is strange is that the values printed by this debug printk() patch did
> not change as time went by. Thus, I think that this is not a problem of lack
> of CPU time for scanning pages. I suspect that there is a bug where nobody is
> scanning pages.
> 
> ----------
> [   66.821450] zone_reclaimable returned 1 at line 2646
> [   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [   66.824935] shrink_zones returned 1 at line 2706
> [   66.826392] zones_reclaimable=1 at line 2765
> [   66.827865] do_try_to_free_pages returned 1 at line 2938
> [   67.102322] __perform_reclaim returned 1 at line 2854
> [   67.103968] did_some_progress=1 at line 3301
> (...snipped...)
> [  281.439977] zone_reclaimable returned 1 at line 2646
> [  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
> [  281.439978] shrink_zones returned 1 at line 2706
> [  281.439978] zones_reclaimable=1 at line 2765
> [  281.439979] do_try_to_free_pages returned 1 at line 2938
> [  281.439979] __perform_reclaim returned 1 at line 2854
> [  281.439980] did_some_progress=1 at line 3301

This is really interesting because even with reclaimable LRUs this low
we should eventually scan them enough times to convince zone_reclaimable
to fail. PAGES_SCANNED in your logs seems to be constant, though, which
suggests somebody manages to free a page every time before we get down
to priority 0 and finally manage to scan something. This is pretty much
pathological behavior and I have a hard time imagining how that would be
possible, but it clearly shows that the zone_reclaimable heuristic is not
working properly.

I can see two options here. Either we teach zone_reclaimable to be less
fragile or remove zone_reclaimable from shrink_zones altogether. Both of
them are risky because we have a long history of changes in this area
which made other subtle behavior changes but I guess that the first
option should be less fragile. What about the following patch? I am not
happy about it because the condition is rather rough and a deeper
inspection is really needed to check all the call sites but it should be
good for testing.
--- 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 21:23           ` Linus Torvalds
@ 2015-10-13 12:21             ` Tetsuo Handa
  2015-10-13 16:37               ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-13 12:21 UTC (permalink / raw)
  To: torvalds
  Cc: mhocko, rientjes, oleg, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Linus Torvalds wrote:
> On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > I examined this hang up using additional debug printk() patch. And it was
> > observed that when this silent hang up occurs, zone_reclaimable() called from
> > shrink_zones() called from a __GFP_FS memory allocation request is returning
> > true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> > with 100% CPU usage.
> 
> I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.
> 

I compared "hang up after the OOM killer is invoked" and "hang up before
the OOM killer is invoked" by always printing the values.

 			}
 			reclaimable = true;
 		}
+		else if (dump_target_pid == current->pid) {
+			printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+			       zone_page_state(zone, NR_ACTIVE_FILE),
+			       zone_page_state(zone, NR_INACTIVE_FILE));
+			if (get_nr_swap_pages() > 0)
+				printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+				       zone_page_state(zone, NR_ACTIVE_ANON),
+				       zone_page_state(zone, NR_INACTIVE_ANON));
+			printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+			       zone_page_state(zone, NR_PAGES_SCANNED));
+		}
 	}
 
 	/*

For the former case, most trials showed that

  (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0

. Sometimes PAGES_SCANNED > 0 (as grep'ed below), but ACTIVE_FILE and
INACTIVE_FILE seem to always be 0.

----------
[  195.905057] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  195.927430] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.317088] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  206.338007] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.723776] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  216.744618] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.129653] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  227.151238] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.650232] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  237.671343] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[  277.980310] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  278.001481] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.339220] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  288.361908] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.682988] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  298.704055] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=9
[  350.368952] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  350.389770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.724821] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  360.746100] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.231887] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.233770] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  845.253196] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=27
[  845.254910] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[ 1397.628073] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1397.649165] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.207041] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
[ 1408.228762] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=2
----------

For the latter case, most of the output showed that
ACTIVE_FILE + INACTIVE_FILE > 0.

----------
[  142.647201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.648883] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  142.842868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  142.955817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.086363] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.231120] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.359238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.473342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.618103] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.746210] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  143.908162] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.035415] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.161926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.306435] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.434265] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.436099] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  144.643374] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.773239] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  144.902309] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.046154] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.185410] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.317218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.460304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.654212] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.817362] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  145.945136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.086303] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  146.242127] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.489868] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.491593] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  153.674246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  153.839478] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.003234] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=5
[  154.155085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.322187] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.447355] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.653150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.782216] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  154.939439] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.105921] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.278386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.440832] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.623970] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.625766] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.831074] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  155.996903] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.139137] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.318492] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.484300] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.667411] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  156.817246] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.012323] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.159483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.323193] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.488399] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  157.654198] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.339172] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.340896] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.583026] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.797386] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  164.965110] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.124935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.431304] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.700317] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  165.862071] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.029257] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.198312] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.356224] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.559302] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.684486] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.898551] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  166.900496] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.175960] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.324390] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.526150] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.693365] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  167.878407] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.061503] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.225306] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.416398] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.617395] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.783201] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  168.989053] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  169.196126] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.361136] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.362865] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.626817] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  175.797361] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.006389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.211479] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.433890] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.630951] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  176.855509] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.049814] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.258218] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.455404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.665085] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  177.874173] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.057217] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.059056] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.350935] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.559404] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.782483] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  178.982803] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.203930] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.428321] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.611349] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  179.851164] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.034220] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.279197] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.455284] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  180.811445] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.368405] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.370115] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.614733] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  186.845695] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.024274] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.211389] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.427147] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.552333] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.734117] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  187.935811] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.138296] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.354041] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.559245] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.641776] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.716434] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  188.718199] (ACTIVE_FILE=0+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.015952] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.218976] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.440131] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.659238] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  189.882360] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.087342] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.314442] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.408926] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.631240] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  190.850326] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.067488] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  191.283243] (ACTIVE_FILE=16+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
----------

So, something is preventing ACTIVE_FILE and INACTIVE_FILE from becoming 0?

I also tried the change below, but the result was the same. Therefore, this
problem seems to be independent of "!__GFP_FS allocations do not fail".
(The complete log with the change below (uptime > 101) is at
http://I-love.SAKURA.ne.jp/tmp/serial-20151013-2.txt.xz . )

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2736,7 +2736,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			 * and the OOM killer can't be invoked, but
 			 * keep looping as per tradition.
 			 */
-			*did_some_progress = 1;
 			goto out;
 		}
 		if (pm_suspended_storage())
----------

----------
[  102.719555] (ACTIVE_FILE=3+INACTIVE_FILE=3) * 6 > PAGES_SCANNED=19
[  102.721234] (ACTIVE_FILE=1+INACTIVE_FILE=0) * 6 > PAGES_SCANNED=0
[  102.722908] shrink_zones returned 1 at line 2717
----------

> So the do_try_to_free_pages() logic that does that
> 
>         /* Any of the zones still reclaimable?  Don't OOM. */
>         if (zones_reclaimable)
>                 return 1;
> 
> is rather dubious. The history of that odd line is pretty dubious too:
> it used to be that we would return success if "shrink_zones()"
> succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
> logic got rewritten, and I don't think the current situation is all
> that sane.
> 
> And returning 1 there is actively misleading to callers, since it
> makes them think that it made progress.
> 
> So I think you should look at what happens if you just remove that
> illogical and misleading return value.
> 

If I remove

	/* Any of the zones still reclaimable?  Don't OOM. */
	if (zones_reclaimable)
		return 1;

the OOM killer is invoked even when there is a lot of memory that can be
reclaimed after being written to disk. This is definitely a premature
invocation of the OOM killer.

  $ cat < /dev/zero > /tmp/log & sleep 10; ./a.out

---------- When there is a lot of data to write ----------
[  489.952827] Mem-Info:
[  489.953840] active_anon:328227 inactive_anon:3033 isolated_anon:26
[  489.953840]  active_file:2309 inactive_file:80915 isolated_file:0
[  489.953840]  unevictable:0 dirty:53 writeback:80874 unstable:0
[  489.953840]  slab_reclaimable:4975 slab_unreclaimable:4256
[  489.953840]  mapped:2973 shmem:4192 pagetables:1939 bounce:0
[  489.953840]  free:12963 free_pcp:60 free_cma:0
[  489.963395] Node 0 DMA free:7300kB min:400kB low:500kB high:600kB active_anon:5728kB inactive_anon:88kB active_file:140kB inactive_file:1276kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:1300kB mapped:140kB shmem:160kB slab_reclaimable:256kB slab_unreclaimable:180kB kernel_stack:64kB pagetables:180kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9768 all_unreclaimable? yes
[  489.974035] lowmem_reserve[]: 0 1729 1729 1729
[  489.975813] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1307180kB inactive_anon:12044kB active_file:9096kB inactive_file:322384kB unevictable:0kB isolated(anon):104kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:216kB writeback:322196kB mapped:11752kB shmem:16608kB slab_reclaimable:19644kB slab_unreclaimable:16844kB kernel_stack:3584kB pagetables:7576kB unstable:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:2419896 all_unreclaimable? yes
[  489.988452] lowmem_reserve[]: 0 0 0 0
[  489.990043] Node 0 DMA: 2*4kB (UE) 1*8kB (M) 4*16kB (UME) 1*32kB (E) 2*64kB (UE) 3*128kB (UME) 2*256kB (UM) 2*512kB (ME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 7280kB
[  489.995142] Node 0 DMA32: 578*4kB (UME) 726*8kB (UE) 447*16kB (UE) 253*32kB (UME) 155*64kB (UME) 42*128kB (UME) 3*256kB (UME) 2*512kB (UM) 4*1024kB (U) 0*2048kB 0*4096kB = 44552kB
[  490.000511] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  490.002914] 87434 total pagecache pages
[  490.004612] 0 pages in swap cache
[  490.006138] Swap cache stats: add 0, delete 0, find 0/0
[  490.007976] Free swap  = 0kB
[  490.009329] Total swap = 0kB
[  490.011033] 524157 pages RAM
[  490.012352] 0 pages HighMem/MovableOnly
[  490.013903] 76615 pages reserved
[  490.015260] 0 pages hwpoisoned
---------- When there is a lot of data to write ----------

  $ ./a.out

---------- When there is no data to write ----------
[  792.359024] Mem-Info:
[  792.360001] active_anon:413751 inactive_anon:6226 isolated_anon:0
[  792.360001]  active_file:0 inactive_file:0 isolated_file:0
[  792.360001]  unevictable:0 dirty:0 writeback:0 unstable:0
[  792.360001]  slab_reclaimable:1243 slab_unreclaimable:3638
[  792.360001]  mapped:104 shmem:6236 pagetables:1033 bounce:0
[  792.360001]  free:12965 free_pcp:126 free_cma:0
[  792.368559] Node 0 DMA free:7292kB min:400kB low:500kB high:600kB active_anon:7040kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:160kB slab_reclaimable:24kB slab_unreclaimable:172kB kernel_stack:64kB pagetables:460kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[  792.378240] lowmem_reserve[]: 0 1729 1729 1729
[  792.379834] Node 0 DMA32 free:44568kB min:44652kB low:55812kB high:66976kB active_anon:1647964kB inactive_anon:24744kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774264kB mlocked:0kB dirty:0kB writeback:0kB mapped:416kB shmem:24784kB slab_reclaimable:4948kB slab_unreclaimable:14380kB kernel_stack:3104kB pagetables:3672kB unstable:0kB bounce:0kB free_pcp:504kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[  792.390085] lowmem_reserve[]: 0 0 0 0
[  792.391643] Node 0 DMA: 3*4kB (UE) 0*8kB 3*16kB (UE) 24*32kB (ME) 11*64kB (UME) 5*128kB (UM) 2*256kB (ME) 3*512kB (ME) 1*1024kB (E) 1*2048kB (E) 0*4096kB = 7292kB
[  792.396201] Node 0 DMA32: 242*4kB (UME) 386*8kB (UME) 397*16kB (UME) 199*32kB (UE) 105*64kB (UME) 37*128kB (UME) 24*256kB (UME) 20*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 44616kB
[  792.401136] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  792.403356] 6250 total pagecache pages
[  792.404803] 0 pages in swap cache
[  792.406208] Swap cache stats: add 0, delete 0, find 0/0
[  792.407896] Free swap  = 0kB
[  792.409172] Total swap = 0kB
[  792.410460] 524157 pages RAM
[  792.411752] 0 pages HighMem/MovableOnly
[  792.413106] 76615 pages reserved
[  792.414493] 0 pages hwpoisoned
---------- When there is no data to write ----------

> HOWEVER.
> 
> I think that it's very true that we have then tuned all our *other*
> heuristics for taking this thing into account, so I suspect that we'll
> find that we'll need to tweak other places. But this crazy "let's say
> that we made progress even when we didn't" thing looks just wrong.
> 
> In particular, I think that you'll find that you will have to change
> the heuristics in __alloc_pages_slowpath() where we currently do
> 
>         if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..
> 
> when the "did_some_progress" logic changes that radically.
> 

Yes. But we can't simply do

	if (order <= PAGE_ALLOC_COSTLY_ORDER || ..

because we won't be able to call out_of_memory(), can we?

> Because while the current return value looks insane, all the other
> testing and tweaking has been done with that very odd return value in
> place.
> 
>                 Linus
> 

Well, did I encounter a difficult-to-fix problem?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Silent hang up caused by pages being not scanned?
  2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
@ 2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 12:21             ` Tetsuo Handa
  2015-10-13 13:32           ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2015-10-12 21:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Michal Hocko, David Rientjes, Oleg Nesterov, Kyle Walker,
	Christoph Lameter, Andrew Morton, Johannes Weiner,
	Vladimir Davydov, linux-mm, Linux Kernel Mailing List,
	Stanislav Kozina

On Mon, Oct 12, 2015 at 8:25 AM, Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> I examined this hang up using additional debug printk() patch. And it was
> observed that when this silent hang up occurs, zone_reclaimable() called from
> shrink_zones() called from a __GFP_FS memory allocation request is returning
> true forever. Since the __GFP_FS memory allocation request can never call
> out_of_memory() due to did_some_progress > 0, the system will silently hang up
> with 100% CPU usage.

I wouldn't blame the zones_reclaimable() logic itself, but yeah, that looks bad.

So the do_try_to_free_pages() logic that does that

        /* Any of the zones still reclaimable?  Don't OOM. */
        if (zones_reclaimable)
                return 1;

is rather dubious. The history of that odd line is pretty dubious too:
it used to be that we would return success if "shrink_zones()"
succeeded or if "nr_reclaimed" was non-zero, but that "shrink_zones()"
logic got rewritten, and I don't think the current situation is all
that sane.

And returning 1 there is actively misleading to callers, since it
makes them think that it made progress.

So I think you should look at what happens if you just remove that
illogical and misleading return value.

HOWEVER.

I think that it's very true that we have then tuned all our *other*
heuristics for taking this thing into account, so I suspect that we'll
find that we'll need to tweak other places. But this crazy "let's say
that we made progress even when we didn't" thing looks just wrong.

In particular, I think that you'll find that you will have to change
the heuristics in __alloc_pages_slowpath() where we currently do

        if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) || ..

when the "did_some_progress" logic changes that radically.

Because while the current return value looks insane, all the other
testing and tweaking has been done with that very odd return value in
place.

                Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Silent hang up caused by pages being not scanned?
  2015-10-12  6:43       ` Tetsuo Handa
@ 2015-10-12 15:25         ` Tetsuo Handa
  2015-10-12 21:23           ` Linus Torvalds
  2015-10-13 13:32           ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Tetsuo Handa @ 2015-10-12 15:25 UTC (permalink / raw)
  To: mhocko
  Cc: rientjes, oleg, torvalds, kwalker, cl, akpm, hannes, vdavydov,
	linux-mm, linux-kernel, skozina

Tetsuo Handa wrote:
> Uptime between 101 and 300 is a silent hang up (i.e. no OOM killer messages,
> no SIGKILL pending tasks, no TIF_MEMDIE tasks) which I solved using SysRq-f
> at uptime = 289. I don't know the reason for this silent hang up, but the
> memory unzapping kernel thread will not help because there is no OOM victim.
> 
> ----------
> [  101.438951] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  111.817922] MemAlloc-Info: 12 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  122.281828] MemAlloc-Info: 13 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  132.793724] MemAlloc-Info: 14 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  143.336154] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  289.343187] sysrq: SysRq : Manual OOM execution
> (...snipped...)
> [  292.065650] MemAlloc-Info: 16 stalling task, 0 dying task, 0 victim task.
> (...snipped...)
> [  302.590736] kworker/3:2 invoked oom-killer: gfp_mask=0x24000c0, order=-1, oom_score_adj=0
> (...snipped...)
> [  302.690047] MemAlloc-Info: 4 stalling task, 0 dying task, 0 victim task.
> ----------

I examined this hang up using additional debug printk() patch. And it was
observed that when this silent hang up occurs, zone_reclaimable() called from
shrink_zones() called from a __GFP_FS memory allocation request is returning
true forever. Since the __GFP_FS memory allocation request can never call
out_of_memory() due to did_some_progress > 0, the system will silently hang up
with 100% CPU usage.

----------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0473eec..fda0bb5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,6 +2821,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 }
 #endif /* CONFIG_COMPACTION */
 
+pid_t dump_target_pid;
+
 /* Perform direct synchronous page reclaim */
 static int
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -2847,6 +2849,9 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "__perform_reclaim returned %u at line %u\n",
+		       progress, __LINE__);
 	return progress;
 }
 
@@ -3007,6 +3012,7 @@ static int malloc_watchdog(void *unused)
 	unsigned int memdie_pending;
 	unsigned int stalling_tasks;
 	u8 index;
+	pid_t pid;
 
  not_stalling: /* Healty case. */
 	/*
@@ -3025,12 +3031,16 @@ static int malloc_watchdog(void *unused)
 	 * and stop_memalloc_timer() within timeout duration.
 	 */
 	if (likely(!memalloc_counter[index]))
+	{
+		dump_target_pid = 0;
 		goto not_stalling;
+	}
  maybe_stalling: /* Maybe something is wrong. Let's check. */
 	/* First, report whether there are SIGKILL tasks and/or OOM victims. */
 	sigkill_pending = 0;
 	memdie_pending = 0;
 	stalling_tasks = 0;
+	pid = 0;
 	preempt_disable();
 	rcu_read_lock();
 	for_each_process_thread(g, p) {
@@ -3062,8 +3072,11 @@ static int malloc_watchdog(void *unused)
 			(fatal_signal_pending(p) ? "-dying" : ""),
 			p->comm, p->pid, m->gfp, m->order, spent);
 		show_stack(p, NULL);
+		if (!pid && (m->gfp & __GFP_FS))
+			pid = p->pid;
 	}
 	spin_unlock(&memalloc_list_lock);
+	dump_target_pid = -pid;
 	/* Wait until next timeout duration. */
 	schedule_timeout_interruptible(timeout);
 	if (memalloc_counter[index])
@@ -3155,6 +3168,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 retry:
+	if (dump_target_pid == -current->pid)
+		dump_target_pid = -dump_target_pid;
+
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
@@ -3280,6 +3296,11 @@ retry:
 		goto noretry;
 
 	/* Keep reclaiming pages as long as there is reasonable progress */
+	if (dump_target_pid == current->pid) {
+		printk(KERN_INFO "did_some_progress=%lu at line %u\n",
+		       did_some_progress, __LINE__);
+		dump_target_pid = 0;
+	}
 	pages_reclaimed += did_some_progress;
 	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27d580b..cb0c22e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2527,6 +2527,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	return watermark_ok;
 }
 
+extern pid_t dump_target_pid;
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2619,16 +2621,41 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			if (nr_soft_reclaimed)
+			{
+				if (dump_target_pid == current->pid)
+					printk(KERN_INFO "nr_soft_reclaimed=%lu at line %u\n",
+					       nr_soft_reclaimed, __LINE__);
 				reclaimable = true;
+			}
 			/* need some check for avoid more shrink_zone() */
 		}
 
 		if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
+		{
+			if (dump_target_pid == current->pid)
+				printk(KERN_INFO "shrink_zone returned 1 at line %u\n",
+				       __LINE__);
 			reclaimable = true;
+		}
 
 		if (global_reclaim(sc) &&
 		    !reclaimable && zone_reclaimable(zone))
+		{
+			if (dump_target_pid == current->pid) {
+				printk(KERN_INFO "zone_reclaimable returned 1 at line %u\n",
+				       __LINE__);
+				printk(KERN_INFO "(ACTIVE_FILE=%lu+INACTIVE_FILE=%lu",
+				       zone_page_state(zone, NR_ACTIVE_FILE),
+				       zone_page_state(zone, NR_INACTIVE_FILE));
+				if (get_nr_swap_pages() > 0)
+					printk(KERN_CONT "+ACTIVE_ANON=%lu+INACTIVE_ANON=%lu",
+					       zone_page_state(zone, NR_ACTIVE_ANON),
+					       zone_page_state(zone, NR_INACTIVE_ANON));
+				printk(KERN_CONT ") * 6 > PAGES_SCANNED=%lu\n",
+				       zone_page_state(zone, NR_PAGES_SCANNED));
+			}
 			reclaimable = true;
+		}
 	}
 
 	/*
@@ -2674,6 +2701,9 @@ retry:
 				sc->priority);
 		sc->nr_scanned = 0;
 		zones_reclaimable = shrink_zones(zonelist, sc);
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "shrink_zones returned %u at line %u\n",
+			       zones_reclaimable, __LINE__);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2707,11 +2737,21 @@ retry:
 	delayacct_freepages_end();
 
 	if (sc->nr_reclaimed)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->nr_reclaimed=%lu at line %u\n",
+			       sc->nr_reclaimed, __LINE__);
 		return sc->nr_reclaimed;
+	}
 
 	/* Aborted reclaim to try compaction? don't OOM, then */
 	if (sc->compaction_ready)
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "sc->compaction_ready=%u at line %u\n",
+			       sc->compaction_ready, __LINE__);
 		return 1;
+	}
 
 	/* Untapped cgroup reserves?  Don't OOM, retry. */
 	if (!sc->may_thrash) {
@@ -2720,6 +2760,9 @@ retry:
 		goto retry;
 	}
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "zones_reclaimable=%u at line %u\n",
+		       zones_reclaimable, __LINE__);
 	/* Any of the zones still reclaimable?  Don't OOM. */
 	if (zones_reclaimable)
 		return 1;
@@ -2875,7 +2918,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * point.
 	 */
 	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
+	{
+		if (dump_target_pid == current->pid)
+			printk(KERN_INFO "throttle_direct_reclaim returned 1 at line %u\n",
+			       __LINE__);
 		return 1;
+	}
 
 	trace_mm_vmscan_direct_reclaim_begin(order,
 				sc.may_writepage,
@@ -2885,6 +2933,9 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
+	if (dump_target_pid == current->pid)
+		printk(KERN_INFO "do_try_to_free_pages returned %lu at line %u\n",
+		       nr_reclaimed, __LINE__);
 	return nr_reclaimed;
 }
 
----------

What is strange is that the values printed by this debug printk() patch did
not change as time went by. Thus, I think this is not a problem of lack of
CPU time for scanning pages. I suspect there is a bug where nobody is
scanning pages.

----------
[   66.821450] zone_reclaimable returned 1 at line 2646
[   66.823020] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[   66.824935] shrink_zones returned 1 at line 2706
[   66.826392] zones_reclaimable=1 at line 2765
[   66.827865] do_try_to_free_pages returned 1 at line 2938
[   67.102322] __perform_reclaim returned 1 at line 2854
[   67.103968] did_some_progress=1 at line 3301
(...snipped...)
[  281.439977] zone_reclaimable returned 1 at line 2646
[  281.439977] (ACTIVE_FILE=26+INACTIVE_FILE=10) * 6 > PAGES_SCANNED=32
[  281.439978] shrink_zones returned 1 at line 2706
[  281.439978] zones_reclaimable=1 at line 2765
[  281.439979] do_try_to_free_pages returned 1 at line 2938
[  281.439979] __perform_reclaim returned 1 at line 2854
[  281.439980] did_some_progress=1 at line 3301
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151013.txt.xz

^ permalink raw reply related	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-10-19 12:57 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-14  8:03 Silent hang up caused by pages being not scanned? Hillf Danton
  -- strict thread matches above, loose matches on Subject: below --
2015-09-28 16:18 can't oom-kill zap the victim's memory? Tetsuo Handa
2015-10-02 12:36 ` Michal Hocko
2015-10-03  6:02   ` Can't we use timeout based OOM warning/killing? Tetsuo Handa
2015-10-06 14:51     ` Tetsuo Handa
2015-10-12  6:43       ` Tetsuo Handa
2015-10-12 15:25         ` Silent hang up caused by pages being not scanned? Tetsuo Handa
2015-10-12 21:23           ` Linus Torvalds
2015-10-13 12:21             ` Tetsuo Handa
2015-10-13 16:37               ` Linus Torvalds
2015-10-14 12:21                 ` Tetsuo Handa
2015-10-15 13:14                 ` Michal Hocko
2015-10-16 15:57                   ` Michal Hocko
2015-10-16 18:34                     ` Linus Torvalds
2015-10-16 18:49                       ` Tetsuo Handa
2015-10-19 12:57                         ` Michal Hocko
2015-10-19 12:53                       ` Michal Hocko
2015-10-13 13:32           ` Michal Hocko
2015-10-13 16:19             ` Tetsuo Handa
2015-10-14 13:22               ` Michal Hocko
2015-10-14 14:38                 ` Tetsuo Handa
2015-10-14 14:59                   ` Michal Hocko
2015-10-14 15:06                     ` Tetsuo Handa
