* [RFC PATCH 0/2] mempool vs. page allocator interaction
@ 2016-07-18  8:39 ` Michal Hocko
  0 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-18  8:39 UTC (permalink / raw)
  To: linux-mm
  Cc: Mikulas Patocka, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel

Hi,
there have been two issues identified recently when investigating dm-crypt
backed swap [1]. The first one looks like a regression from
f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free
elements") because the swapout path can now deplete all the available memory
reserves. The first patch tries to address that issue by dropping
__GFP_NOMEMALLOC only for TIF_MEMDIE tasks.

The second issue is that the dm writeout path, which relies on the mempool
allocator, gets throttled by direct reclaim in throttle_vm_writeout,
which just makes the whole memory pressure problem even worse. Patch 2
makes sure that mempool users are annotated to be throttled less via the
PF_LESS_THROTTLE flag and are exempted from throttle_vm_writeout on that
path. mempool users are usually on the IO path, and throttling them less
sounds like a reasonable way to go.

I do not have any more complicated dm setup available, so I would
appreciate it if the dm people (CCed) could give these two a try.

Also it would be great to iron out the concerns from David. He has posted
a deadlock stack trace [2] which led to f9054c70d28b; it is a bio
allocation lockup because the TIF_MEMDIE process cannot make forward
progress without access to memory reserves. That case should be fixed by
patch 1 AFAICS. There are other potential cases where a stuck mempool
allocation is called from PF_MEMALLOC context and blocks the oom victim
indirectly (over a lock), but I believe those are much less likely and we
have the oom reaper to make forward progress.

Sorry for pulling the discussion outside of the original email thread,
but there were multiple lines of discussion there and I felt that
discussing a particular solution along with its justification has a
greater chance of moving things forward. I am sending this as an RFC
because it needs a deep review, as there might be other side effects I do
not see (especially with patch 2).

Any comments or suggestions are welcome.

---
[1] http://lkml.kernel.org/r/alpine.LRH.2.02.1607111027080.14327@file01.intranet.prod.int.rdu2.redhat.com
[2] http://lkml.kernel.org/r/alpine.DEB.2.10.1607131644590.92037@chino.kir.corp.google.com

^ permalink raw reply	[flat|nested] 102+ messages in thread


* [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-18  8:39 ` Michal Hocko
@ 2016-07-18  8:41   ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-18  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Mikulas Patocka, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

There has been a report about the OOM killer being invoked when swapping
out to a dm-crypt device. The primary reason seems to be that the swapout
IO managed to completely deplete the memory reserves. Mikulas was able to
bisect the issue and explained it by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly, because
the md-raid layer needs to allocate from the generic_make_request path,
which means it allocates from the PF_MEMALLOC context. The dm layer uses
mempool_alloc in order to guarantee forward progress, which used to
inhibit access to memory reserves when using the page allocator. This
changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements"), which dropped the __GFP_NOMEMALLOC protection
when the memory pool is depleted.

If we are running out of memory and the only way to free memory is to
perform swapout, we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete, up to the moment when memory is depleted completely and there
is no way forward but to invoke the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with OOM situations
where the oom victim depends on a mempool allocation to make forward
progress. We can handle that case in a different way, though. We can
check whether the current task has access to memory reserves as an OOM
victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection only then,
if the pool is empty.

David Rientjes objected that such an approach wouldn't help if the oom
victim was blocked on a lock held by a process doing mempool_alloc. This
is very similar to other oom deadlock situations, and we have the
oom_reaper to deal with them, so it is reasonable to rely on the same
mechanism rather than inventing a different one which has negative side
effects.

Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
Bisected-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempool.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..ea26d75c8adf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
+	/*
+	 * Make sure that the OOM victim will get access to memory reserves
+	 * properly if there are no objects in the pool to prevent from
+	 * livelocks.
+	 */
+	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
+		gfp_temp &= ~__GFP_NOMEMALLOC;
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 102+ messages in thread


* [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-18  8:41   ` Michal Hocko
@ 2016-07-18  8:41     ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-18  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Mikulas Patocka, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Mikulas has reported that swap backed by dm-crypt doesn't work properly
because the swapout cannot make sufficient forward progress: the
writeout path depends on the dm_crypt worker, which has to allocate
memory to perform the encryption. In order to guarantee forward progress
it relies on the mempool allocator. mempool_alloc(), however, prefers to
use the underlying (usually page) allocator before it grabs objects from
the pool. Such an allocation can dive into memory reclaim and
consequently into throttle_vm_writeout. If there are too many dirty
pages or pages under writeback, it will get throttled even though it is
in fact a flusher clearing pending pages.

[  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
[  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
[  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
[  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
[  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
[  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
[  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
[  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
[  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
[  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
[  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
[  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
[  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
[  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
[  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Memory pools are usually used for the writeback paths, and it doesn't
really make much sense to throttle them just because there are too many
dirty/writeback pages. The main purpose of throttle_vm_writeout is to
make sure that the pageout path doesn't generate too much dirty data.
Considering that we are in the mempool path, which performs __GFP_NORETRY
requests, the risk shouldn't really be high.

Fix this by ensuring that mempool users get PF_LESS_THROTTLE and that
such processes are not throttled in throttle_vm_writeout. They can still
get throttled due to current_may_throttle() sleeps, but that should only
happen when the backing device itself is congested, which sounds like a
proper reaction.

Please note that the bonus given by domain_dirty_limits() alone is not
sufficient, because at least dm-crypt has to double buffer each page
under writeback, so that alone won't prevent it from being throttled.

There are other users of the flag but they are in the writeout path so
this looks like a proper thing for them as well.

Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempool.c        | 19 +++++++++++++++----
 mm/page-writeback.c |  3 +++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index ea26d75c8adf..916e95c4192c 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
  */
 void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 {
-	void *element;
+	unsigned int pflags = current->flags;
+	void *element = NULL;
 	unsigned long flags;
 	wait_queue_t wait;
 	gfp_t gfp_temp;
@@ -328,6 +329,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+	/*
+	 * Make sure that the allocation doesn't get throttled during the
+	 * reclaim
+	 */
+	if (gfpflags_allow_blocking(gfp_mask))
+		current->flags |= PF_LESS_THROTTLE;
 repeat_alloc:
 	/*
 	 * Make sure that the OOM victim will get access to memory reserves
@@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
-		return element;
+		goto out;
 
 	spin_lock_irqsave(&pool->lock, flags);
 	if (likely(pool->curr_nr)) {
@@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 		 * for debugging.
 		 */
 		kmemleak_update_trace(element);
-		return element;
+		goto out;
 	}
 
 	/*
@@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		return NULL;
+		goto out;
 	}
 
 	/* Let's wait for someone else to return an element to @pool */
@@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
+out:
+	if (gfpflags_allow_blocking(gfp_mask))
+		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
+	return element;
 }
 EXPORT_SYMBOL(mempool_alloc);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7fbb2d008078..a37661f1a11b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 
+	if (current->flags & PF_LESS_THROTTLE)
+		return;
+
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
-- 
2.8.1

^ permalink raw reply related	[flat|nested] 102+ messages in thread


* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-18  8:41   ` Michal Hocko
@ 2016-07-19  2:00     ` David Rientjes
  -1 siblings, 0 replies; 102+ messages in thread
From: David Rientjes @ 2016-07-19  2:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko

On Mon, 18 Jul 2016, Michal Hocko wrote:

> David Rientjes objected that such an approach wouldn't help if the oom
> victim was blocked on a lock held by a process doing mempool_alloc. This
> is very similar to other oom deadlock situations, and we have the
> oom_reaper to deal with them, so it is reasonable to rely on the same
> mechanism rather than inventing a different one which has negative side
> effects.
> 

Right, this causes an oom livelock as described in the aforementioned
thread: the oom victim is waiting on a mutex that is held by a thread
doing mempool_alloc(). The oom reaper is not guaranteed to free any
memory, so nothing on the system can allocate memory from the page
allocator.

I think the better solution here is to allow mempool_alloc() users to set 
__GFP_NOMEMALLOC if they are in a context which allows them to deplete 
memory reserves.

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19  2:00     ` David Rientjes
@ 2016-07-19  7:49       ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-19  7:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Johannes Weiner

On Mon 18-07-16 19:00:57, David Rientjes wrote:
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> 
> Right, this causes oom livelock as described in the aforementioned thread: 
> the oom victim is waiting on a mutex that is held by a thread doing 
> mempool_alloc().

The backtrace you have provided:
schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

is not PF_MEMALLOC context AFAICS so clearing __GFP_NOMEMALLOC for such
a task will not help unless that task has TIF_MEMDIE. Could you provide
a trace where the PF_MEMALLOC context holding a lock cannot make a
forward progress?

> The oom reaper is not guaranteed to free any memory, so 
> nothing on the system can allocate memory from the page allocator.

Sure, there is no guarantee, but as I've said earlier, 1) oom_reaper will
allow another victim to be selected in many cases and 2) such a deadlock
is no different from any other where the victim cannot continue because
another context is holding a lock while waiting for memory. Tweaking the
mempool allocator to potentially catch such a case in a different way
doesn't sound right in principle, not to mention it has other dangerous
side effects.
 
> I think the better solution here is to allow mempool_alloc() users to set 
> __GFP_NOMEMALLOC if they are in a context which allows them to deplete 
> memory reserves.

I am not really sure about that. I agree with Johannes [1] that this
is bending the mempool allocator in an undesirable direction because
the point of the mempool is to have its own reliably reusable memory
reserves. Now I am not even sure whether the TIF_MEMDIE exception is a
good way forward or whether a plain revert is more appropriate. Let's CC
Johannes. The patch is [2].

[1] http://lkml.kernel.org/r/20160718151445.GB14604@cmpxchg.org
[2] http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mhocko@kernel.org
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-18  8:41   ` Michal Hocko
@ 2016-07-19 13:54     ` Johannes Weiner
  -1 siblings, 0 replies; 102+ messages in thread
From: Johannes Weiner @ 2016-07-19 13:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel, Michal Hocko

On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> The original intention of f9054c70d28b was to help with the OOM
> situations where the oom victim depends on mempool allocation to make a
> forward progress. We can handle that case in a different way, though. We
> can check whether the current task has access to memory reserves as an
> OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> is empty.
> 
> David Rientjes was objecting that such an approach wouldn't help if the
> oom victim was blocked on a lock held by process doing mempool_alloc. This
> is very similar to other oom deadlock situations and we have oom_reaper
> to deal with them so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.

I don't understand how this scenario wouldn't be a flat-out bug.

Mempool guarantees forward progress by having all necessary memory
objects for the guaranteed operation in reserve. Think about it this
way: you should be able to delete the pool->alloc() call entirely and
still make reliable forward progress. It would kill concurrency and be
super slow, but how could it be affected by a system OOM situation?

If our mempool_alloc() is waiting for an object that an OOM victim is
holding, where could that OOM victim get stuck before giving it back?
As I asked in the previous thread, surely you wouldn't do a mempool
allocation first and then rely on an unguarded page allocation to make
forward progress, right? It would defeat the purpose of using mempools
in the first place. And surely the OOM victim wouldn't be waiting for
a lock that somebody doing mempool_alloc() *against the same mempool*
is holding. That'd be an obvious ABBA deadlock.

So maybe I'm just dense, but could somebody please outline the exact
deadlock diagram? Who is doing what, and how are they getting stuck?

cpu0:                     cpu1:
                          mempool_alloc(pool0)
mempool_alloc(pool0)
  wait for cpu1
                          not allocating memory - would defeat mempool
                          not taking locks held by cpu0* - would ABBA
                          ???
                          mempool_free(pool0)

Thanks

* or any other task that does mempool_alloc(pool0) before unlock

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19 13:54     ` Johannes Weiner
@ 2016-07-19 14:19       ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-19 14:19 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Tue 19-07-16 09:54:26, Johannes Weiner wrote:
> On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> > The original intention of f9054c70d28b was to help with the OOM
> > situations where the oom victim depends on mempool allocation to make a
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> > is empty.
> > 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> 
> I don't understand how this scenario wouldn't be a flat-out bug.
> 
> Mempool guarantees forward progress by having all necessary memory
> objects for the guaranteed operation in reserve. Think about it this
> way: you should be able to delete the pool->alloc() call entirely and
> still make reliable forward progress. It would kill concurrency and be
> super slow, but how could it be affected by a system OOM situation?

Yes, this is my understanding of the mempool usage as well. It is much
harder to check whether mempool users are really behaving and do not
request more than the preallocated pool allows them, though. That would
be a bug in the consumer, not in the mempool as such, of course.

My original understanding of f9054c70d28b was that it acts as a
safeguard against situations where the OOM victim loops inside
mempool_alloc without making reasonable progress because those who
should refill the pool are stuck for some reason (i.e. assume that not
all mempool users are behaving, or that they have unexpected
dependencies like a WQ without WQ_MEM_RECLAIM and similar).

My thinking was that the victim has access to memory reserves by
default, so it seemed reasonable to preserve this access also while it
is inside mempool_alloc. Therefore I wanted to preserve that particular
logic and came up with this patch, which should be safer than
f9054c70d28b. But the more I think about it, the more it sounds like
papering over a bug somewhere else.

So I guess we should just go and revert f9054c70d28b and get back to
David's lockup and investigate what exactly went wrong and why. The
current form of f9054c70d28b is simply too dangerous.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19 13:54     ` Johannes Weiner
@ 2016-07-19 20:45       ` David Rientjes
  -1 siblings, 0 replies; 102+ messages in thread
From: David Rientjes @ 2016-07-19 20:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel, Michal Hocko

On Tue, 19 Jul 2016, Johannes Weiner wrote:

> Mempool guarantees forward progress by having all necessary memory
> objects for the guaranteed operation in reserve. Think about it this
> way: you should be able to delete the pool->alloc() call entirely and
> still make reliable forward progress. It would kill concurrency and be
> super slow, but how could it be affected by a system OOM situation?
> 
> If our mempool_alloc() is waiting for an object that an OOM victim is
> holding, where could that OOM victim get stuck before giving it back?
> As I asked in the previous thread, surely you wouldn't do a mempool
> allocation first and then rely on an unguarded page allocation to make
> forward progress, right? It would defeat the purpose of using mempools
> in the first place. And surely the OOM victim wouldn't be waiting for
> a lock that somebody doing mempool_alloc() *against the same mempool*
> is holding. That'd be an obvious ABBA deadlock.
> 
> So maybe I'm just dense, but could somebody please outline the exact
> deadlock diagram? Who is doing what, and how are they getting stuck?
> 
> cpu0:                     cpu1:
>                           mempool_alloc(pool0)
> mempool_alloc(pool0)
>   wait for cpu1
>                           not allocating memory - would defeat mempool
>                           not taking locks held by cpu0* - would ABBA
>                           ???
>                           mempool_free(pool0)
> 
> Thanks
> 
> * or any other task that does mempool_alloc(pool0) before unlock
> 

I'm approaching this from a perspective of any possible mempool usage, not 
with any single current user in mind.

Any mempool_alloc() user that then takes a contended mutex can do this.  
An example:

	taskA		taskB		taskC
	-----		-----		-----
	mempool_alloc(a)
			mutex_lock(b)
	mutex_lock(b)
					mempool_alloc(a)

Imagine the mempool_alloc() done by taskA depleting all free elements,
so we rely on it to do mempool_free() before any other mempool
allocation can be guaranteed to succeed.

If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
mempool_alloc().  This livelocks the page allocator for all processes.

taskB in this case need only stall after taking mutex_lock()
successfully; that could be because of the oom livelock, because it is
contended on another mutex held by an allocator, etc.

Obviously taskB stalling while holding a mutex that is contended by a 
mempool user holding an element is not preferred, but it's possible.  (A 
simplified version is also possible with 0-size mempools, which are also 
allowed.)

My point is that I don't think we should be forcing any behavior wrt 
memory reserves as part of the mempool implementation.  In the above, 
taskC's mempool_alloc() would succeed and not livelock unless 
__GFP_NOMEMALLOC is forced.  The mempool_alloc() user may construct 
their set of gfp flags as appropriate, just like with any other memory 
allocator in the kernel.

The alternative would be to ensure no mempool users ever take a lock that 
another thread can hold while contending another mutex or allocating 
memory itself.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-18  8:41   ` Michal Hocko
@ 2016-07-19 21:50     ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-19 21:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko



On Mon, 18 Jul 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There has been a report about OOM killer invoked when swapping out to
> a dm-crypt device. The primary reason seems to be that the swapout
> IO managed to completely deplete memory reserves. Mikulas was
> able to bisect and explained the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> 
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path
> which means it allocates from the PF_MEMALLOC context. dm layer uses
> mempool_alloc in order to guarantee a forward progress which used to
> inhibit access to memory reserves when using page allocator. This has
> changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> there are free elements") which has dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
> 
> If we are running out of memory and the only way forward to free memory
> is to perform swapout we just keep consuming memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete up to a moment when the memory is depleted completely and there
> is no way forward but invoking the OOM killer. This is less than
> optimal.
> 
> The original intention of f9054c70d28b was to help with the OOM
> situations where the oom victim depends on mempool allocation to make a
> forward progress. We can handle that case in a different way, though. We
> can check whether the current task has access to memory reserves as an
> OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> is empty.
> 
> David Rientjes was objecting that such an approach wouldn't help if the
> oom victim was blocked on a lock held by process doing mempool_alloc. This
> is very similar to other oom deadlock situations and we have oom_reaper
> to deal with them so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.
> 
> Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> Bisected-by: Mikulas Patocka <mpatocka@redhat.com>

Bisect was done by Ondrej Kozina.

> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>

> ---
>  mm/mempool.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..ea26d75c8adf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
> +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
> -	if (likely(pool->curr_nr)) {
> -		/*
> -		 * Don't allocate from emergency reserves if there are
> -		 * elements available.  This check is racy, but it will
> -		 * be rechecked each loop.
> -		 */
> -		gfp_temp |= __GFP_NOMEMALLOC;
> -	}
> +	/*
> +	 * Make sure that the OOM victim will get access to memory reserves
> +	 * properly if there are no objects in the pool to prevent from
> +	 * livelocks.
> +	 */
> +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> +		gfp_temp &= ~__GFP_NOMEMALLOC;
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
> -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		gfp_temp = gfp_mask;
>  		goto repeat_alloc;
> -- 
> 2.8.1
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
@ 2016-07-19 21:50     ` Mikulas Patocka
  0 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-19 21:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko



On Mon, 18 Jul 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> There has been a report about OOM killer invoked when swapping out to
> a dm-crypt device. The primary reason seems to be that the swapout
> out IO managed to completely deplete memory reserves. Mikulas was
> able to bisect and explained the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> 
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path
> which means it allocates from the PF_MEMALLOC context. dm layer uses
> mempool_alloc in order to guarantee a forward progress which used to
> inhibit access to memory reserves when using page allocator. This has
> changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> there are free elements") which has dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
> 
> If we are running out of memory and the only way forward to free memory
> is to perform swapout we just keep consuming memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete up to a moment when the memory is depleted completely and there
> is no way forward but invoking the OOM killer. This is less than
> optimal.
> 
> The original intention of f9054c70d28b was to help with the OOM
> situations where the oom victim depends on mempool allocation to make a
> forward progress. We can handle that case in a different way, though. We
> can check whether the current task has access to memory reserves ad an
> OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> is empty.
> 
> David Rientjes was objecting that such an approach wouldn't help if the
> oom victim was blocked on a lock held by process doing mempool_alloc. This
> is very similar to other oom deadlock situations and we have oom_reaper
> to deal with them so it is reasonable to rely on the same mechanism
> rather than inventing a different one which has negative side effects.
> 
> Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> Bisected-by: Mikulas Patocka <mpatocka@redhat.com>

Bisect was done by Ondrej Kozina.

> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>

> ---
>  mm/mempool.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..ea26d75c8adf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
> +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
> -	if (likely(pool->curr_nr)) {
> -		/*
> -		 * Don't allocate from emergency reserves if there are
> -		 * elements available.  This check is racy, but it will
> -		 * be rechecked each loop.
> -		 */
> -		gfp_temp |= __GFP_NOMEMALLOC;
> -	}
> +	/*
> +	 * Make sure that the OOM victim will get access to memory reserves
> +	 * properly if there are no objects in the pool to prevent from
> +	 * livelocks.
> +	 */
> +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> +		gfp_temp &= ~__GFP_NOMEMALLOC;
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
> -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		gfp_temp = gfp_mask;
>  		goto repeat_alloc;
> -- 
> 2.8.1
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-18  8:41     ` Michal Hocko
@ 2016-07-19 21:50       ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-19 21:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel,
	Michal Hocko



On Mon, 18 Jul 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Mikulas has reported that a swap backed by dm-crypt doesn't work
> properly because the swapout cannot make sufficient forward progress
> as the writeout path depends on the dm_crypt worker which has to allocate
> memory to perform the encryption. In order to guarantee a forward
> progress it relies on the mempool allocator. mempool_alloc(), however,
> prefers to use the underlying (usually page) allocator before it grabs
> objects from the pool. Such an allocation can dive into the memory
> reclaim and consequently into throttle_vm_writeout. If there are too many
> dirty pages or pages under writeback it will get throttled even though it
> is in fact a flusher trying to clear pending pages.
> 
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> 
> Memory pools are usually used for the writeback paths and it doesn't
> really make much sense to throttle them just because there are too many
> dirty/writeback pages. The main purpose of throttle_vm_writeout is to
> make sure that the pageout path doesn't generate too much dirty data.
> Considering that we are in the mempool path, which performs __GFP_NORETRY
> requests, the risk shouldn't really be high.
> 
> Fix this by ensuring that mempool users will get PF_LESS_THROTTLE and
> that such processes are not throttled in throttle_vm_writeout. They can
> still get throttled due to current_may_throttle() sleeps but that should
> happen when the backing device itself is congested which sounds like a
> proper reaction.
> 
> Please note that the bonus given by domain_dirty_limits() alone is not
> sufficient because at least dm-crypt has to double buffer each page
> under writeback, so the bonus alone won't prevent the flusher from being
> throttled.
> 
> There are other users of the flag but they are in the writeout path so
> this looks like a proper thing for them as well.
> 
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>

> ---
>  mm/mempool.c        | 19 +++++++++++++++----
>  mm/page-writeback.c |  3 +++
>  2 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/mempool.c b/mm/mempool.c
> index ea26d75c8adf..916e95c4192c 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
>   */
>  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  {
> -	void *element;
> +	unsigned int pflags = current->flags;
> +	void *element = NULL;
>  	unsigned long flags;
>  	wait_queue_t wait;
>  	gfp_t gfp_temp;
> @@ -328,6 +329,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
> +	/*
> +	 * Make sure that the allocation doesn't get throttled during the
> +	 * reclaim
> +	 */
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		current->flags |= PF_LESS_THROTTLE;
>  repeat_alloc:
>  	/*
>  	 * Make sure that the OOM victim will get access to memory reserves
> @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> -		return element;
> +		goto out;
>  
>  	spin_lock_irqsave(&pool->lock, flags);
>  	if (likely(pool->curr_nr)) {
> @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  		 * for debugging.
>  		 */
>  		kmemleak_update_trace(element);
> -		return element;
> +		goto out;
>  	}
>  
>  	/*
> @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
> -		return NULL;
> +		goto out;
>  	}
>  
>  	/* Let's wait for someone else to return an element to @pool */
> @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	finish_wait(&pool->wait, &wait);
>  	goto repeat_alloc;
> +out:
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> +	return element;
>  }
>  EXPORT_SYMBOL(mempool_alloc);
>  
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7fbb2d008078..a37661f1a11b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  
> +	if (current->flags & PF_LESS_THROTTLE)
> +		return;
> +
>          for ( ; ; ) {
>  		global_dirty_limits(&background_thresh, &dirty_thresh);
>  		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -- 
> 2.8.1
> 


* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19 14:19       ` Michal Hocko
@ 2016-07-19 22:01         ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-19 22:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel



On Tue, 19 Jul 2016, Michal Hocko wrote:

> On Tue 19-07-16 09:54:26, Johannes Weiner wrote:
> > On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> > > The original intention of f9054c70d28b was to help with the OOM
> > > situations where the oom victim depends on mempool allocation to make a
> > > forward progress. We can handle that case in a different way, though. We
> > > can check whether the current task has access to memory reserves as an
> > > OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> > > is empty.
> > > 
> > > David Rientjes was objecting that such an approach wouldn't help if the
> > > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > > is very similar to other oom deadlock situations and we have oom_reaper
> > > to deal with them so it is reasonable to rely on the same mechanism
> > > rather than inventing a different one which has negative side effects.
> > 
> > I don't understand how this scenario wouldn't be a flat-out bug.
> > 
> > Mempool guarantees forward progress by having all necessary memory
> > objects for the guaranteed operation in reserve. Think about it this
> > way: you should be able to delete the pool->alloc() call entirely and
> > still make reliable forward progress. It would kill concurrency and be
> > super slow, but how could it be affected by a system OOM situation?
> 
> Yes this is my understanding of the mempool usage as well. It is much

Yes, that's correct.

> harder to check whether mempool users are really behaving and do
> not request more than the preallocated pool allows them, though. That
> would be a bug in the consumer not the mempool as such of course.
> 
> My original understanding of f9054c70d28b was that it acts as
> a prevention for issues where the OOM victim loops inside the
> mempool_alloc not doing reasonable progress because those who should
> refill the pool are stuck for some reason (aka assume that not all
> mempool users are behaving or they have unexpected dependencies like WQ
> without WQ_MEM_RECLAIM and similar).

David Rientjes didn't tell us the configuration of his servers; we
don't know what dm targets and block device drivers he is using, and we
don't know how they are connected - so it is not really possible to know
what happened in his case.

Mikulas

> My thinking was that the victim has access to memory reserves by default
> so it sounds reasonable to preserve this access also when it is in the
> mempool_alloc. Therefore I wanted to preserve that particular logic and
> came up with this patch which should be safer than f9054c70d28b. But the
> more I am thinking about it the more it sounds like papering over a bug
> somewhere else.
> 
> So I guess we should just go and revert f9054c70d28b and get back to
> David's lockup and investigate what exactly went wrong and why. The
> current form of f9054c70d28b is simply too dangerous.
> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19 21:50     ` Mikulas Patocka
@ 2016-07-20  6:44       ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-20  6:44 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: linux-mm, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Neil Brown, Andrew Morton, LKML, dm-devel

On Tue 19-07-16 17:50:29, Mikulas Patocka wrote:
> 
> 
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > There has been a report about OOM killer invoked when swapping out to
> > a dm-crypt device. The primary reason seems to be that the swapout
> > out IO managed to completely deplete memory reserves. Mikulas was
> > able to bisect and explained the issue by pointing to f9054c70d28b
> > ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> > 
> > The reason is that the swapout path is not throttled properly because
> > the md-raid layer needs to allocate from the generic_make_request path
> > which means it allocates from the PF_MEMALLOC context. dm layer uses
> > mempool_alloc in order to guarantee a forward progress which used to
> > inhibit access to memory reserves when using page allocator. This has
> > changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> > there are free elements") which has dropped the __GFP_NOMEMALLOC
> > protection when the memory pool is depleted.
> > 
> > If we are running out of memory and the only way forward to free memory
> > is to perform swapout we just keep consuming memory reserves rather than
> > throttling the mempool allocations and allowing the pending IO to
> > complete up to a moment when the memory is depleted completely and there
> > is no way forward but invoking the OOM killer. This is less than
> > optimal.
> > 
> > The original intention of f9054c70d28b was to help with the OOM
> > situations where the oom victim depends on mempool allocation to make a
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> > is empty.
> > 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> > Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> > Bisected-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> Bisect was done by Ondrej Kozina.

OK, fixed

> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
> Tested-by: Mikulas Patocka <mpatocka@redhat.com>

Let's see whether we decide to go with this patch or a plain revert. In
any case I will mark the patch for stable so it will end up in both 4.6
and 4.7

Anyway thanks for your and Ondrej's help here!
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-19 20:45       ` David Rientjes
@ 2016-07-20  8:15         ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-20  8:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Tue 19-07-16 13:45:52, David Rientjes wrote:
> On Tue, 19 Jul 2016, Johannes Weiner wrote:
> 
> > Mempool guarantees forward progress by having all necessary memory
> > objects for the guaranteed operation in reserve. Think about it this
> > way: you should be able to delete the pool->alloc() call entirely and
> > still make reliable forward progress. It would kill concurrency and be
> > super slow, but how could it be affected by a system OOM situation?
> > 
> > If our mempool_alloc() is waiting for an object that an OOM victim is
> > holding, where could that OOM victim get stuck before giving it back?
> > As I asked in the previous thread, surely you wouldn't do a mempool
> > allocation first and then rely on an unguarded page allocation to make
> > forward progress, right? It would defeat the purpose of using mempools
> > in the first place. And surely the OOM victim wouldn't be waiting for
> > a lock that somebody doing mempool_alloc() *against the same mempool*
> > is holding. That'd be an obvious ABBA deadlock.
> > 
> > So maybe I'm just dense, but could somebody please outline the exact
> > deadlock diagram? Who is doing what, and how are they getting stuck?
> > 
> > cpu0:                     cpu1:
> >                           mempool_alloc(pool0)
> > mempool_alloc(pool0)
> >   wait for cpu1
> >                           not allocating memory - would defeat mempool
> >                           not taking locks held by cpu0* - would ABBA
> >                           ???
> >                           mempool_free(pool0)
> > 
> > Thanks
> > 
> > * or any other task that does mempool_alloc(pool0) before unlock
> > 
> 
> I'm approaching this from a perspective of any possible mempool usage, not 
> with any single current user in mind.
> 
> Any mempool_alloc() user that then takes a contended mutex can do this.  
> An example:
> 
> 	taskA		taskB		taskC
> 	-----		-----		-----
> 	mempool_alloc(a)
> 			mutex_lock(b)
> 	mutex_lock(b)
> 					mempool_alloc(a)
> 
> Imagine the mempool_alloc() done by taskA depleting all free elements so 
> we rely on it to do mempool_free() before any other mempool allocator can 
> be guaranteed.
> 
> If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> mempool_alloc().  This livelocks the page allocator for all processes.
> 
> taskB in this case need only stall after taking mutex_lock() successfully; 
> that could be because of the oom livelock, it is contended on another 
> mutex held by an allocator, etc.

But that comes down to the deadlock described by Johannes above, because
then the mempool user would _depend_ on an "unguarded page allocation"
via that particular lock, and that is a bug.
 
> Obviously taskB stalling while holding a mutex that is contended by a 
> mempool user holding an element is not preferred, but it's possible.  (A 
> simplified version is also possible with 0-size mempools, which are also 
> allowed.)
> 
> My point is that I don't think we should be forcing any behavior wrt 
> memory reserves as part of the mempool implementation. 

Isn't the reserve management the whole point of the mempool approach?

> In the above, 
> taskC mempool_alloc() would succeed and not livelock unless 
> __GFP_NOMEMALLOC is forced. 

Or it would get stuck because even the page allocator memory reserves got
depleted. Without any way to throttle there is no guarantee of making
further progress. In fact this is not a theoretical situation: it has
been observed with swap over dm-crypt, and there shouldn't be any of the
lock dependences you are describing above AFAIU.

> The mempool_alloc() user may construct their 
> set of gfp flags as appropriate just like any other memory allocator in 
> the kernel.

So which users of mempool_alloc would benefit from not having
__GFP_NOMEMALLOC and why?

> The alternative would be to ensure no mempool users ever take a lock that 
> another thread can hold while contending another mutex or allocating 
> memory itself.

I am not sure how we can enforce that, but surely that would detect a
clear mempool usage bug. Lockdep could probably be extended to do so.

Anyway, I feel we are going in circles. We have a clear regression
caused by your patch. It might solve some oom livelock you are seeing,
but there are only very dim details about it, and the patch might very
well paper over a bug in mempool usage somewhere else. We definitely
need more details to know better.

That being said, f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements") should either be reverted or
http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mhocko@kernel.org
should be applied as a temporary workaround, because it would make a
lockup less likely for now until we find out more about your issue.

Does that sound like a way forward?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-20  8:15         ` Michal Hocko
@ 2016-07-20 21:06           ` David Rientjes
  -1 siblings, 0 replies; 102+ messages in thread
From: David Rientjes @ 2016-07-20 21:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Wed, 20 Jul 2016, Michal Hocko wrote:

> > Any mempool_alloc() user that then takes a contended mutex can do this.  
> > An example:
> > 
> > 	taskA		taskB		taskC
> > 	-----		-----		-----
> > 	mempool_alloc(a)
> > 			mutex_lock(b)
> > 	mutex_lock(b)
> > 					mempool_alloc(a)
> > 
> > Imagine the mempool_alloc() done by taskA depleting all free elements so 
> > we rely on it to do mempool_free() before any other mempool allocator can 
> > be guaranteed.
> > 
> > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> > mempool_alloc().  This livelocks the page allocator for all processes.
> > 
> > taskB in this case need only stall after taking mutex_lock() successfully; 
> > that could be because of the oom livelock, it is contended on another 
> > mutex held by an allocator, etc.
> 
> But that falls down to the deadlock described by Johannes above because
> then the mempool user would _depend_ on an "unguarded page allocation"
> via that particular lock and that is a bug.
>  

It becomes a deadlock because of mempool_alloc(a) forcing 
__GFP_NOMEMALLOC, I agree.

For that not to be the case, it must be required that between 
mempool_alloc() and mempool_free() we take no mutex that may be held 
by any other thread on the system, in any context, that is allocating 
memory.  If that's a caller's bug as you describe it, and it is only 
enabled by mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the 
relevant lockdep detection, which would be trivial to add, so we can 
determine if any users are unsafe and prevent this issue in the future.  
The overwhelming goal here should be to prevent possible problems in the 
future, especially if an API does not allow you to opt out of the behavior.

> > My point is that I don't think we should be forcing any behavior wrt 
> > memory reserves as part of the mempool implementation. 
> 
> Isn't the reserve management the whole point of the mempool approach?
> 

No, the whole point is to maintain the freelist of elements that are 
guaranteed; my suggestion is that we cannot make that guarantee if we are 
blocked from freeing elements.  It's trivial to fix by allowing 
__GFP_NOMEMALLOC from the caller in cases where you cannot possibly be 
blocked by an oom victim.

> Or it would get stuck because even page allocator memory reserves got
> depleted. Without any way to throttle there is no guarantee to make
> further progress. In fact this is not a theoretical situation. It has
> been observed with the swap over dm-crypt and there shouldn't be any
> lock dependeces you are describing above there AFAIU.
> 

They should do mempool_alloc(__GFP_NOMEMALLOC), no argument.

> > The mempool_alloc() user may construct their 
> > set of gfp flags as appropriate just like any other memory allocator in 
> > the kernel.
> 
> So which users of mempool_alloc would benefit from not having
> __GFP_NOMEMALLOC and why?
> 

Any mempool_alloc() user that would be blocked on returning the element 
back to the freelist by an oom condition.  I think the dm-crypt case is 
quite unique on how it is able to deplete memory reserves.

> Anway, I feel we are looping in a circle. We have a clear regression
> caused by your patch. It might solve some oom livelock you are seeing
> but there are only very dim details about it and the patch might very
> well paper over a bug in mempool usage somewhere else. We definitely
> need more details to know that better.
> 

What is the objection to allowing __GFP_NOMEMALLOC from the caller with 
clear documentation on how to use it?  It can be described to not allow 
depletion of memory reserves with the caveat that the caller must ensure 
mempool_free() cannot be blocked in lowmem situations.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-20 21:06           ` David Rientjes
@ 2016-07-21  8:52             ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-21  8:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Wed 20-07-16 14:06:26, David Rientjes wrote:
> On Wed, 20 Jul 2016, Michal Hocko wrote:
> 
> > > Any mempool_alloc() user that then takes a contended mutex can do this.  
> > > An example:
> > > 
> > > 	taskA		taskB		taskC
> > > 	-----		-----		-----
> > > 	mempool_alloc(a)
> > > 			mutex_lock(b)
> > > 	mutex_lock(b)
> > > 					mempool_alloc(a)
> > > 
> > > Imagine the mempool_alloc() done by taskA depleting all free elements so 
> > > we rely on it to do mempool_free() before any other mempool allocator can 
> > > be guaranteed.
> > > 
> > > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> > > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> > > mempool_alloc().  This livelocks the page allocator for all processes.
> > > 
> > > taskB in this case need only stall after taking mutex_lock() successfully; 
> > > that could be because of the oom livelock, it is contended on another 
> > > mutex held by an allocator, etc.
> > 
> > But that falls down to the deadlock described by Johannes above because
> > then the mempool user would _depend_ on an "unguarded page allocation"
> > via that particular lock and that is a bug.
> >  
> 
> It becomes a deadlock because of mempool_alloc(a) forcing 
> __GFP_NOMEMALLOC, I agree.
> 
> For that not to be the case, it must be required that between 
> mempool_alloc() and mempool_free() that we take no mutex that may be held 
> by any other thread on the system, in any context, that is allocating 
> memory.  If that's a caller's bug as you describe it, and only enabled by 
> mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the relevant 
> lockdep detection, which would be trivial to add, so we can determine if 
> any users are unsafe and prevent this issue in the future.

I am sorry, but I am neither familiar with the lockdep internals nor do
I have the time to add this support.

> The 
> overwhelming goal here should be to prevent possible problems in the 
> future especially if an API does not allow you to opt-out of the behavior.

The __GFP_NOMEMALLOC enforcement has been there since b84a35be0285 ("[PATCH]
mempool: NOMEMALLOC and NORETRY"), so for more than 10 years. I think
it is quite reasonable to expect that users are familiar with this fact
and handle it properly in the vast majority of cases. In fact, mempool
deadlocks are really rare.

[...]

> > Or it would get stuck because even page allocator memory reserves got
> > depleted. Without any way to throttle there is no guarantee to make
> > further progress. In fact this is not a theoretical situation. It has
> > been observed with the swap over dm-crypt and there shouldn't be any
> > lock dependeces you are describing above there AFAIU.
> > 
> 
> They should do mempool_alloc(__GFP_NOMEMALLOC), no argument.

How would that be any different from any other mempool user which can be
invoked from the swap-out path - aka any other IO path?

> What is the objection to allowing __GFP_NOMEMALLOC from the caller with 
> clear documentation on how to use it?  It can be described to not allow 
> depletion of memory reserves with the caveat that the caller must ensure 
> mempool_free() cannot be blocked in lowmem situations.

Look, there are
$ git grep mempool_alloc | wc -l
304

many users of this API, and we do not want to flip a default behavior
which has been there for more than 10 years. So far you have been arguing
about potential deadlocks and haven't shown any particular path which
would have a direct or indirect dependency between the mempool and the
normal allocator without it being a bug. As a matter of fact, the change
we are discussing here causes a regression. If you want to change the
semantics of the mempool allocator then you are absolutely free to do so. In
a separate patch which would be discussed with IO people and other
users, though. But we _absolutely_ want to fix the regression first
and have a simple fix for the 4.6 and 4.7 backports. At this moment there
are the revert and patch 1 on the table. The latter should make your
backtrace happy and should serve only as a temporary fix until we find out
what is actually misbehaving on your systems. If you are not interested
in pursuing that route I will simply go with the revert.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-21  8:52             ` Michal Hocko
@ 2016-07-21 12:13               ` Johannes Weiner
  -1 siblings, 0 replies; 102+ messages in thread
From: Johannes Weiner @ 2016-07-21 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
> Look, there are
> $ git grep mempool_alloc | wc -l
> 304
> 
> many users of this API and we do not want to flip the default behavior
> which is there for more than 10 years. So far you have been arguing
> about potential deadlocks and haven't shown any particular path which
> would have a direct or indirect dependency between mempool and normal
> allocator and it wouldn't be a bug. As the matter of fact the change
> we are discussing here causes a regression. If you want to change the
> semantic of mempool allocator then you are absolutely free to do so. In
> a separate patch which would be discussed with IO people and other
> users, though. But we _absolutely_ want to fix the regression first
> and have a simple fix for 4.6 and 4.7 backports. At this moment there
> are revert and patch 1 on the table.  The later one should make your
> backtrace happy and should be only as a temporal fix until we find out
> what is actually misbehaving on your systems. If you are not interested
> to pursue that way I will simply go with the revert.

+1

It's very unlikely that decade-old mempool semantics are suddenly a
fundamental livelock problem, when all the evidence we have is one
hang and vague speculation. Given that the patch causes regressions,
and that the bug is most likely elsewhere anyway, a full revert rather
than merely-less-invasive mempool changes makes the most sense to me.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-21 12:13               ` Johannes Weiner
  (?)
@ 2016-07-21 14:53                 ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-21 14:53 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Thu 21-07-16 08:13:00, Johannes Weiner wrote:
> On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
> > Look, there are
> > $ git grep mempool_alloc | wc -l
> > 304
> > 
> > many users of this API and we do not want to flip the default behavior
> > which has been in place for more than 10 years. So far you have been
> > arguing about potential deadlocks and haven't shown any particular path
> > with a direct or indirect dependency between the mempool and the normal
> > allocator that wouldn't be a bug. As a matter of fact, the change we
> > are discussing here causes a regression. If you want to change the
> > semantics of the mempool allocator then you are absolutely free to do
> > so. In a separate patch which would be discussed with the IO people
> > and other users, though. But we _absolutely_ want to fix the regression
> > first and have a simple fix for 4.6 and 4.7 backports. At this moment
> > the revert and patch 1 are on the table. The latter should make your
> > backtrace happy and should only be a temporary fix until we find out
> > what is actually misbehaving on your systems. If you are not interested
> > in pursuing that route I will simply go with the revert.
> 
> +1
> 
> It's very unlikely that decade-old mempool semantics are suddenly a
> fundamental livelock problem, when all the evidence we have is one
> hang and vague speculation. Given that the patch causes regressions,
> and that the bug is most likely elsewhere anyway, a full revert rather
> than merely-less-invasive mempool changes makes the most sense to me.

OK, fair enough. What do you think about the following then? Mikulas, I
have dropped your Tested-by and Reviewed-by because the patch is
different, but unless you have hit the OOM killer, the testing
results should be the same.
---
From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 21 Jul 2016 16:40:59 +0200
Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
 free elements"

This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.

There has been a report about OOM killer invoked when swapping out to
a dm-crypt device. The primary reason seems to be that the swapout
out IO managed to completely deplete memory reserves. Ondrej was
able to bisect and explained the issue by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path
which means it allocates from the PF_MEMALLOC context. dm layer uses
mempool_alloc in order to guarantee a forward progress which used to
inhibit access to memory reserves when using page allocator. This has
changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements") which has dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way forward to free memory
is to perform swapout we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete up to a moment when the memory is depleted completely and there
is no way forward but invoking the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. David has mentioned the following backtrace:

schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

We do not know more about why the mempool is depleted without being
replenished in time, though. In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools so a forward progress
should be guaranteed. If this is not the case then the dm should be
fixed rather than papering over the problem and postponing it to later
by accessing more memory reserves.

mempools are a mechanism to maintain dedicated memory reserves to guarantee
forward progress. Allowing them unbounded access to the page allocator
memory reserves goes against the whole purpose of this mechanism.

Bisected-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempool.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..5ba6c8b3b814 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -306,36 +306,25 @@ EXPORT_SYMBOL(mempool_resize);
  * returns NULL. Note that due to preallocation, this function
  * *never* fails when called from process contexts. (it might
  * fail if called from an IRQ context.)
- * Note: neither __GFP_NOMEMALLOC nor __GFP_ZERO are supported.
+ * Note: using __GFP_ZERO is not supported.
  */
-void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 {
 	void *element;
 	unsigned long flags;
 	wait_queue_t wait;
 	gfp_t gfp_temp;
 
-	/* If oom killed, memory reserves are essential to prevent livelock */
-	VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
-	/* No element size to zero on allocation */
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
-
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,12 +348,11 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if (gfp_temp != gfp_mask) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;
 	}
-	gfp_temp = gfp_mask;
 
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
-- 
2.8.1


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-21 14:53                 ` Michal Hocko
@ 2016-07-21 15:26                   ` Johannes Weiner
  -1 siblings, 0 replies; 102+ messages in thread
From: Johannes Weiner @ 2016-07-21 15:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Thu, Jul 21, 2016 at 04:53:10PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
> 
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> 
> There has been a report about OOM killer invoked when swapping out to
> a dm-crypt device. The primary reason seems to be that the swapout
> out IO managed to completely deplete memory reserves. Ondrej was

-out

> able to bisect and explained the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> 
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path
> which means it allocates from the PF_MEMALLOC context. dm layer uses
> mempool_alloc in order to guarantee a forward progress which used to
> inhibit access to memory reserves when using page allocator. This has
> changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> there are free elements") which has dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
> 
> If we are running out of memory and the only way forward to free memory
> is to perform swapout we just keep consuming memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete up to a moment when the memory is depleted completely and there
> is no way forward but invoking the OOM killer. This is less than
> optimal.
> 
> The original intention of f9054c70d28b was to help with the OOM
> situations where the oom victim depends on mempool allocation to make a
> forward progress. David has mentioned the following backtrace:
> 
> schedule
> schedule_timeout
> io_schedule_timeout
> mempool_alloc
> __split_and_process_bio
> dm_request
> generic_make_request
> submit_bio
> mpage_readpages
> ext4_readpages
> __do_page_cache_readahead
> ra_submit
> filemap_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault
> 
> We do not know more about why the mempool is depleted without being
> replenished in time, though. In any case the dm layer shouldn't depend
> on any allocations outside of the dedicated pools so a forward progress
> should be guaranteed. If this is not the case then the dm should be
> fixed rather than papering over the problem and postponing it to later
> by accessing more memory reserves.
> 
> mempools are a mechanism to maintain dedicated memory reserves to guarantee
> forward progress. Allowing them unbounded access to the page allocator
> memory reserves goes against the whole purpose of this mechanism.
> 
> Bisected-by: Ondrej Kozina <okozina@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Michal

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-21 14:53                 ` Michal Hocko
                                   ` (2 preceding siblings ...)
  (?)
@ 2016-07-22  1:41                 ` NeilBrown
  -1 siblings, 0 replies; 102+ messages in thread
From: NeilBrown @ 2016-07-22  1:41 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel


On Fri, Jul 22 2016, Michal Hocko wrote:

> On Thu 21-07-16 08:13:00, Johannes Weiner wrote:
>> On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
>> > Look, there are
>> > $ git grep mempool_alloc | wc -l
>> > 304
>> > 
>> > many users of this API and we do not want to flip the default behavior
>> > which has been in place for more than 10 years. So far you have been
>> > arguing about potential deadlocks and haven't shown any particular path
>> > with a direct or indirect dependency between the mempool and the normal
>> > allocator that wouldn't be a bug. As a matter of fact, the change we
>> > are discussing here causes a regression. If you want to change the
>> > semantics of the mempool allocator then you are absolutely free to do
>> > so. In a separate patch which would be discussed with the IO people
>> > and other users, though. But we _absolutely_ want to fix the regression
>> > first and have a simple fix for 4.6 and 4.7 backports. At this moment
>> > the revert and patch 1 are on the table. The latter should make your
>> > backtrace happy and should only be a temporary fix until we find out
>> > what is actually misbehaving on your systems. If you are not interested
>> > in pursuing that route I will simply go with the revert.
>> 
>> +1
>> 
>> It's very unlikely that decade-old mempool semantics are suddenly a
>> fundamental livelock problem, when all the evidence we have is one
>> hang and vague speculation. Given that the patch causes regressions,
>> and that the bug is most likely elsewhere anyway, a full revert rather
>> than merely-less-invasive mempool changes makes the most sense to me.
>
> OK, fair enough. What do you think about the following then? Mikulas, I
> have dropped your Tested-by and Reviewed-by because the patch is
> different, but unless you have hit the OOM killer, the testing
> results should be the same.
> ---
> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
>
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>
> There has been a report about OOM killer invoked when swapping out to
> a dm-crypt device. The primary reason seems to be that the swapout
> out IO managed to completely deplete memory reserves. Ondrej was
> able to bisect and explained the issue by pointing to f9054c70d28b
> ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
>
> The reason is that the swapout path is not throttled properly because
> the md-raid layer needs to allocate from the generic_make_request path
> which means it allocates from the PF_MEMALLOC context. dm layer uses
> mempool_alloc in order to guarantee a forward progress which used to
> inhibit access to memory reserves when using page allocator. This has
> changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> there are free elements") which has dropped the __GFP_NOMEMALLOC
> protection when the memory pool is depleted.
>
> If we are running out of memory and the only way forward to free memory
> is to perform swapout we just keep consuming memory reserves rather than
> throttling the mempool allocations and allowing the pending IO to
> complete up to a moment when the memory is depleted completely and there
> is no way forward but invoking the OOM killer. This is less than
> optimal.
>
> The original intention of f9054c70d28b was to help with the OOM
> situations where the oom victim depends on mempool allocation to make a
> forward progress. David has mentioned the following backtrace:
>
> schedule
> schedule_timeout
> io_schedule_timeout
> mempool_alloc
> __split_and_process_bio
> dm_request
> generic_make_request
> submit_bio
> mpage_readpages
> ext4_readpages
> __do_page_cache_readahead
> ra_submit
> filemap_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault
>
> We do not know more about why the mempool is depleted without being
> replenished in time, though. In any case the dm layer shouldn't depend
> on any allocations outside of the dedicated pools so a forward progress
> should be guaranteed. If this is not the case then the dm should be
> fixed rather than papering over the problem and postponing it to later
> by accessing more memory reserves.
>
> mempools are a mechanism to maintain dedicated memory reserves to guarantee
> forward progress. Allowing them unbounded access to the page allocator
> memory reserves goes against the whole purpose of this mechanism.
>
> Bisected-by: Ondrej Kozina <okozina@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/mempool.c | 20 ++++----------------
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..5ba6c8b3b814 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -306,36 +306,25 @@ EXPORT_SYMBOL(mempool_resize);
>   * returns NULL. Note that due to preallocation, this function
>   * *never* fails when called from process contexts. (it might
>   * fail if called from an IRQ context.)
> - * Note: neither __GFP_NOMEMALLOC nor __GFP_ZERO are supported.
> + * Note: using __GFP_ZERO is not supported.
>   */
> -void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> +void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  {
>  	void *element;
>  	unsigned long flags;
>  	wait_queue_t wait;
>  	gfp_t gfp_temp;
>  
> -	/* If oom killed, memory reserves are essential to prevent livelock */
> -	VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
> -	/* No element size to zero on allocation */
>  	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> -
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
> +	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */

As I was reading through this thread I kept thinking "Surely
mempool_alloc() should never ever allocate from emergency reserves.
Ever."
Then I saw this patch.  It made me happy.

Thanks.

Acked-by: NeilBrown <neilb@suse.com>
(if you want it)

NeilBrown


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-21 14:53                 ` Michal Hocko
@ 2016-07-22  6:37                   ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-22  6:37 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On Thu 21-07-16 16:53:09, Michal Hocko wrote:
> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
> 
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.

I've noticed that Andrew has already picked this one up. Is anybody
against marking it for stable?
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-18  8:41     ` Michal Hocko
@ 2016-07-22  8:46     ` NeilBrown
  2016-07-22  9:04       ` NeilBrown
  2016-07-22  9:15         ` Michal Hocko
  -1 siblings, 2 replies; 102+ messages in thread
From: NeilBrown @ 2016-07-22  8:46 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Mikulas Patocka, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Andrew Morton, LKML, dm-devel, Michal Hocko


On Mon, Jul 18 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
>
> Mikulas has reported that a swap backed by dm-crypt doesn't work
> properly because the swapout cannot make a sufficient forward progress
> as the writeout path depends on dm_crypt worker which has to allocate
> memory to perform the encryption. In order to guarantee a forward
> progress it relies on the mempool allocator. mempool_alloc(), however,
> prefers to use the underlying (usually page) allocator before it grabs
> objects from the pool. Such an allocation can dive into the memory
> reclaim and consequently to throttle_vm_writeout.

That's just broken.
I used to think mempool should always use the pre-allocated reserves
first.  That is surely the most logical course of action.  Otherwise
that memory is just sitting there doing nothing useful.

I spoke to Nick Piggin about this some years ago and he pointed out that
the kmalloc allocation paths are much better optimized for low overhead
when there is plenty of memory.  They can just pluck a free block off a
per-CPU list without taking any locks.   By contrast, accessing the
preallocated pool always requires a spinlock.

So it makes lots of sense to prefer the underlying allocator if it can
provide a quick response.  If it cannot, the sensible thing is to use
the pool, or wait for the pool to be replenished.

So the allocator should never wait at all, never enter reclaim, never
throttle.

Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
time through, but if the pool is empty, direct-reclaim is allowed on the
next attempt.  Presumably this is where the throttling comes in ??  I
suspect that it really shouldn't do that. It should leave kswapd to do
reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
mempool_alloc where pool->wait can wake it up.

If I'm following the code properly, the stack trace below can only
happen if the first pool->alloc() attempt, with direct-reclaim disabled,
fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
and io_schedule_timeout().
I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
gets woken up when there is something in the pool.  It then loops around
and tries pool->alloc() again, even though there is something in the
pool.  This might be justified if that ->alloc would never block, but
obviously it does.

I would very strongly recommend just changing mempool_alloc() to
permanently mask out __GFP_DIRECT_RECLAIM.

Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
It is "LESS" throttle, not "NO" throttle, but you have made
throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
The purpose of that flag is to allow a thread to dirty a page-cache page
as part of cleaning another page-cache page.
So it makes sense for loop and sometimes for nfsd.  It would make sense
for dm-crypt if it was putting the encrypted version in the page cache.
But if dm-crypt is just allocating a transient page (which I think it
is), then a mempool should be sufficient (and we should make sure it is
sufficient) and access to an extra 10% (or whatever) of the page cache
isn't justified.

Thanks,
NeilBrown



> If there are too many
> dirty pages or pages under writeback it will get throttled even though it is
> in fact a flusher to clear pending pages.
>
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]



* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-22  8:46     ` NeilBrown
@ 2016-07-22  9:04       ` NeilBrown
  2016-07-22  9:15         ` Michal Hocko
  1 sibling, 0 replies; 102+ messages in thread
From: NeilBrown @ 2016-07-22  9:04 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Mikulas Patocka, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Andrew Morton, LKML, dm-devel, Michal Hocko


On Fri, Jul 22 2016, NeilBrown wrote:

>
> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
> time through, but if the pool is empty, direct-reclaim is allowed on the
> next attempt.  Presumably this is where the throttling comes in ??  I
> suspect that it really shouldn't do that. It should leave kswapd to do
> reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
> mempool_alloc where pool->wait can wake it up.

Actually, thinking about the kswapd connection, it might make sense
for mempool_alloc() to wait on the relevant pgdat->pfmemalloc_wait as
well as waiting on pool->wait.  That way it should be able to proceed as
soon as any memory is available.  I don't know what the correct 'pgdat'
is though.

Just a thought,
NeilBrown



* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-22  8:46     ` NeilBrown
@ 2016-07-22  9:15         ` Michal Hocko
  2016-07-22  9:15         ` Michal Hocko
  1 sibling, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-22  9:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel

On Fri 22-07-16 18:46:57, Neil Brown wrote:
> On Mon, Jul 18 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Mikulas has reported that a swap backed by dm-crypt doesn't work
> > properly because the swapout cannot make a sufficient forward progress
> > as the writeout path depends on dm_crypt worker which has to allocate
> > memory to perform the encryption. In order to guarantee a forward
> > progress it relies on the mempool allocator. mempool_alloc(), however,
> > prefers to use the underlying (usually page) allocator before it grabs
> > objects from the pool. Such an allocation can dive into the memory
> > reclaim and consequently to throttle_vm_writeout.
> 
> That's just broken.
> I used to think mempool should always use the pre-allocated reserves
> first.  That is surely the most logical course of action.  Otherwise
> that memory is just sitting there doing nothing useful.
> 
> I spoke to Nick Piggin about this some years ago and he pointed out that
> the kmalloc allocation paths are much better optimized for low overhead
> when there is plenty of memory.  They can just pluck a free block off a
> per-CPU list without taking any locks.   By contrast, accessing the
> preallocated pool always requires a spinlock.
> 
> So it makes lots of sense to prefer the underlying allocator if it can
> provide a quick response.  If it cannot, the sensible thing is to use
> the pool, or wait for the pool to be replenished.
> 
> So the allocator should never wait at all, never enter reclaim, never
> throttle.
> 
> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
> time through, but if the pool is empty, direct-reclaim is allowed on the
> next attempt.  Presumably this is where the throttling comes in ??

Yes that is correct.

> I suspect that it really shouldn't do that. It should leave kswapd to
> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
> mempool_alloc where pool->wait can wake it up.

Mikulas was already suggesting that and my concern was that this would
give up prematurely even under mild page cache load when there are many
clean page cache pages. If we just back off and rely on kswapd, which
might get stuck on the writeout, then I believe the IO throughput can be
reduced, which would make the memory pressure even worse. So I am
not sure this is a good idea in general. I completely agree with you
that the mempool request shouldn't be throttled unless there is a strong
reason for that. More on that below.

> If I'm following the code properly, the stack trace below can only
> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> and io_schedule_timeout().

mempool_alloc retries immediately without any sleep after the first
no-reclaim attempt.

> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
> gets woken up when there is something in the pool.  It then loops around
> and tries pool->alloc() again, even though there is something in the
> pool.  This might be justified if that ->alloc would never block, but
> obviously it does.
> 
> I would very strongly recommend just changing mempool_alloc() to
> permanently mask out __GFP_DIRECT_RECLAIM.
> 
> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> It is "LESS" throttle, not "NO" throttle, but you have made
> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.

Yes, that is correct. But it still allows throttling on congestion:
shrink_inactive_list:
	/*
	 * Stall direct reclaim for IO completions if underlying BDIs or zone
	 * is congested. Allow kswapd to continue until it starts encountering
	 * unqueued dirty pages or cycling through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle())
		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

My thinking was that throttle_vm_writeout is there to prevent
dirtying too many pages from the reclaim context.  PF_LESS_THROTTLE
is part of the writeout so throttling it on too many dirty pages is
questionable (well we get some bias but that is not really reliable). It
still makes sense to throttle when the backing device is congested
because the writeout path wouldn't make much progress anyway and we also
do not want to cycle through LRU lists too quickly in that case.

Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
dirtying of memory?

> The purpose of that flag is to allow a thread to dirty a page-cache page
> as part of cleaning another page-cache page.
> So it makes sense for loop and sometimes for nfsd.  It would make sense
> for dm-crypt if it was putting the encrypted version in the page cache.
> But if dm-crypt is just allocating a transient page (which I think it
> is), then a mempool should be sufficient (and we should make sure it is
> sufficient) and access to an extra 10% (or whatever) of the page cache
> isn't justified.

If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
appropriate then would a PF_MEMPOOL be any better?

Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-22  6:37                   ` Michal Hocko
@ 2016-07-22 12:26                     ` Vlastimil Babka
  -1 siblings, 0 replies; 102+ messages in thread
From: Vlastimil Babka @ 2016-07-22 12:26 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner
  Cc: David Rientjes, linux-mm, Mikulas Patocka, Ondrej Kozina,
	Tetsuo Handa, Mel Gorman, Neil Brown, Andrew Morton, LKML,
	dm-devel

On 07/22/2016 08:37 AM, Michal Hocko wrote:
> On Thu 21-07-16 16:53:09, Michal Hocko wrote:
>> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Thu, 21 Jul 2016 16:40:59 +0200
>> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>>  free elements"
>>
>> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>
> I've noticed that Andrew has already picked this one up. Is anybody
> against marking it for stable?

It would be strange to have different behavior, with a known regression,
in the 4.6 and 4.7 stable trees. Actually, there's still time for 4.7 proper?

Vlastimil



* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-22 12:26                     ` Vlastimil Babka
@ 2016-07-22 19:44                       ` Andrew Morton
  -1 siblings, 0 replies; 102+ messages in thread
From: Andrew Morton @ 2016-07-22 19:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Johannes Weiner, David Rientjes, linux-mm,
	Mikulas Patocka, Ondrej Kozina, Tetsuo Handa, Mel Gorman,
	Neil Brown, LKML, dm-devel

On Fri, 22 Jul 2016 14:26:19 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 07/22/2016 08:37 AM, Michal Hocko wrote:
> > On Thu 21-07-16 16:53:09, Michal Hocko wrote:
> >> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> >> From: Michal Hocko <mhocko@suse.com>
> >> Date: Thu, 21 Jul 2016 16:40:59 +0200
> >> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
> >>  free elements"
> >>
> >> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
> >
> > I've noticed that Andrew has already picked this one up. Is anybody
> > against marking it for stable?
> 
> It would be strange to have different behavior with known regression in 
> 4.6 and 4.7 stables. Actually, there's still time for 4.7 proper?
> 

I added the cc:stable.

Do we need to bust a gut to rush it into 4.7?  It sounds safer to let
it bake for a while, fix it in 4.7.1?



* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-22  9:15         ` Michal Hocko
  (?)
@ 2016-07-23  0:12         ` NeilBrown
  2016-07-25  8:32             ` Michal Hocko
  2016-07-25 21:52             ` Mikulas Patocka
  -1 siblings, 2 replies; 102+ messages in thread
From: NeilBrown @ 2016-07-23  0:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel


On Fri, Jul 22 2016, Michal Hocko wrote:

> On Fri 22-07-16 18:46:57, Neil Brown wrote:
>> On Mon, Jul 18 2016, Michal Hocko wrote:
>> 
>> > From: Michal Hocko <mhocko@suse.com>
>> >
>> > Mikulas has reported that a swap backed by dm-crypt doesn't work
>> > properly because the swapout cannot make a sufficient forward progress
>> > as the writeout path depends on dm_crypt worker which has to allocate
>> > memory to perform the encryption. In order to guarantee a forward
>> > progress it relies on the mempool allocator. mempool_alloc(), however,
>> > prefers to use the underlying (usually page) allocator before it grabs
>> > objects from the pool. Such an allocation can dive into the memory
>> > reclaim and consequently to throttle_vm_writeout.
>> 
>> That's just broken.
>> I used to think mempool should always use the pre-allocated reserves
>> first.  That is surely the most logical course of action.  Otherwise
>> that memory is just sitting there doing nothing useful.
>> 
>> I spoke to Nick Piggin about this some years ago and he pointed out that
>> the kmalloc allocation paths are much better optimized for low overhead
>> when there is plenty of memory.  They can just pluck a free block off a
>> per-CPU list without taking any locks.   By contrast, accessing the
>> preallocated pool always requires a spinlock.
>> 
>> So it makes lots of sense to prefer the underlying allocator if it can
>> provide a quick response.  If it cannot, the sensible thing is to use
>> the pool, or wait for the pool to be replenished.
>> 
>> So the allocator should never wait at all, never enter reclaim, never
>> throttle.
>> 
>> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
>> time through, but if the pool is empty, direct-reclaim is allowed on the
>> next attempt.  Presumably this is where the throttling comes in ??
>
> Yes that is correct.
>
>> I suspect that it really shouldn't do that. It should leave kswapd to
>> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
>> mempool_alloc where pool->wait can wake it up.
>
> Mikulas was already suggesting that and my concern was that this would
> give up prematurely even under mild page cache load when there are many
> clean page cache pages.

That's a valid point - freeing up clean pages is a reasonable thing for
a mempool allocator to try to do.

>                          If we just back off and rely on kswapd which
> might get stuck on the writeout then the IO throughput can be reduced

If I were king of MM, I would make a decree to be proclaimed throughout
the land
    kswapd must never sleep except when it explicitly chooses to

Maybe that is impractical, but having firm rules like that would go a
long way to make it possible to actually understand and reason about how
MM works.  As it is, there seems to be a tendency to put bandaids over
bandaids.

> I believe which would make the whole memory pressure just worse. So I am
> not sure this is a good idea in general. I completely agree with you
> that the mempool request shouldn't be throttled unless there is a strong
> reason for that. More on that below.
>
>> If I'm following the code properly, the stack trace below can only
>> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
>> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
>> and io_schedule_timeout().
>
> mempool_alloc retries immediately without any sleep after the first
> no-reclaim attempt.

I missed that ... I see it now... I wonder if anyone has contemplated
using some modern programming techniques like, maybe, a "while" loop in
there..
Something like the below...

>
>> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
>> gets woken up when there is something in the pool.  It then loops around
>> and tries pool->alloc() again, even though there is something in the
>> pool.  This might be justified if that ->alloc would never block, but
>> obviously it does.
>> 
>> I would very strongly recommend just changing mempool_alloc() to
>> permanently mask out __GFP_DIRECT_RECLAIM.
>> 
>> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
>> It is "LESS" throttle, not "NO" throttle, but you have made
>> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
>
> Yes that is correct. But it still allows to throttle on congestion:
> shrink_inactive_list:
> 	/*
> 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
> 	 * is congested. Allow kswapd to continue until it starts encountering
> 	 * unqueued dirty pages or cycling through the LRU too quickly.
> 	 */
> 	if (!sc->hibernation_mode && !current_is_kswapd() &&
> 	    current_may_throttle())
> 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
>
> My thinking was that throttle_vm_writeout is there to prevent from
> dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> is part of the writeout so throttling it on too many dirty pages is
> questionable (well we get some bias but that is not really reliable). It
> still makes sense to throttle when the backing device is congested
> because the writeout path wouldn't make much progress anyway and we also
> do not want to cycle through LRU lists too quickly in that case.

"dirtying ... from the reclaim context" ??? What does that mean?
According to
  Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
From the history tree, the purpose of throttle_vm_writeout() is to
limit the amount of memory that is concurrently under I/O.
That seems strange to me because I thought it was the responsibility of
each backing device to impose a limit - a maximum queue size of some
sort.
I remember when NFS didn't impose a limit and you could end up with lots
of memory in NFS write-back, and very long latencies could result.

So I wonder what throttle_vm_writeout() really achieves these days.  Is
it just a bandaid that no-one is brave enough to remove?

I guess it could play a role in balancing the freeing of clean pages,
which can be done instantly, against dirty pages, which require
writeback.  Without some throttling, might all clean pages be freed
too quickly, just trashing our read caches?

>
> Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
> dirtying of memory?

In most cases, nfsd is just like any other application and needs to be
throttled like any other application when it writes too much data.
The only time nfsd *needs* PF_LESS_THROTTLE is when a loop-back mount
is active.  When the same page cache is the source and destination of
writes.
So nfsd needs to be able to dirty a few more pages when nothing else
can due to high dirty count.  Otherwise it deadlocks.
The main use of PF_LESS_THROTTLE is in zone_dirty_limit() and
domain_dirty_limits() where an extra 25% is allowed to overcome this
deadlock.

The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
avoid a live-lock.  A key premise is that nfsd only allocates unbounded
memory when it is writing to the page cache.  So it only needs to be
throttled when the backing device it is writing to is congested.  It is
particularly important that it *doesn't* get throttled just because an
NFS backing device is congested, because nfsd might be trying to clear
that congestion.

In general, callers of try_to_free_pages() might get throttled when any
backing device is congested.  This is a reasonable default when we don't
know what they are allocating memory for.  When we do know the purpose of
the allocation, we can be more cautious about throttling.

If a thread is allocating just to dirty pages for a given backing
device, we only need to throttle the allocation if the backing device is
congested.  Any further throttling needed happens in
balance_dirty_pages().

If a thread is only making transient allocations, ones which will be
freed shortly afterwards (not, for example, put in a cache), then I
don't think it needs to be throttled at all.  I think this universally
applies to mempools.
In the case of dm_crypt, if it is writing too fast it will eventually be
throttled in generic_make_request when the underlying device has a full
queue: it blocks waiting for requests to complete, which in turn returns
their elements to the mempool.


>
>> The purpose of that flag is to allow a thread to dirty a page-cache page
>> as part of cleaning another page-cache page.
>> So it makes sense for loop and sometimes for nfsd.  It would make sense
>> for dm-crypt if it was putting the encrypted version in the page cache.
>> But if dm-crypt is just allocating a transient page (which I think it
>> is), then a mempool should be sufficient (and we should make sure it is
>> sufficient) and access to an extra 10% (or whatever) of the page cache
>> isn't justified.
>
> If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> appropriate then would a PF_MEMPOOL be any better?

Why a PF rather than a GFP flag?
NFSD uses a PF because there is no GFP interface for filesystem write.
But mempool can pass down a GFP flag, so I think it should.
The meaning of the flag is, in my opinion, that a 'transient' allocation
is being requested.  i.e. an allocation which will be used for a single
purpose for a short amount of time and will then be freed.  In
particular, it will never be placed in a cache, and if it is ever
placed on a queue, that is certain to be a queue with an upper bound on
the size and with guaranteed forward progress in the face of memory
pressure.
Any allocation request for a use case with those properties should be
allowed to set GFP_TRANSIENT (for example) with the effect that the
allocation will not be throttled.
A key point with the name is to identify the purpose of the flag, not a
specific use case (mempool) which we want it for.

At least, that is what I think we should do today...

NeilBrown


>
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs


diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..2dded8c1b9d7 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -313,7 +313,6 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	void *element;
 	unsigned long flags;
 	wait_queue_t wait;
-	gfp_t gfp_temp;
 
 	/* If oom killed, memory reserves are essential to prevent livelock */
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_NOMEMALLOC);
@@ -325,67 +324,47 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
-	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
+	element = pool->alloc(gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO),
+			      pool->pool_data);
 
-repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
+	while (!element) {
+		spin_lock_irqsave(&pool->lock, flags);
+		if (likely(pool->curr_nr)) {
+			element = remove_element(pool, gfp_mask);
+			spin_unlock_irqrestore(&pool->lock, flags);
+			/* paired with rmb in mempool_free(), read comment there */
+			smp_wmb();
+			/*
+			 * Update the allocation stack trace as this is more useful
+			 * for debugging.
+			 */
+			kmemleak_update_trace(element);
+			break;
+		}
+
+		/* We must not sleep if !__GFP_DIRECT_RECLAIM */
+		if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
+			spin_unlock_irqrestore(&pool->lock, flags);
+			break;
+		}
 
-	element = pool->alloc(gfp_temp, pool->pool_data);
-	if (likely(element != NULL))
-		return element;
+		/* Let's wait for someone else to return an element to @pool */
+		init_wait(&wait);
+		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
 
-	spin_lock_irqsave(&pool->lock, flags);
-	if (likely(pool->curr_nr)) {
-		element = remove_element(pool, gfp_temp);
 		spin_unlock_irqrestore(&pool->lock, flags);
-		/* paired with rmb in mempool_free(), read comment there */
-		smp_wmb();
+
 		/*
-		 * Update the allocation stack trace as this is more useful
-		 * for debugging.
+		 * FIXME: this should be io_schedule().  The timeout is there as a
+		 * workaround for some DM problems in 2.6.18.
 		 */
-		kmemleak_update_trace(element);
-		return element;
-	}
+		io_schedule_timeout(5*HZ);
 
-	/*
-	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
-	 * alloc failed with that and @pool was empty, retry immediately.
-	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
-		spin_unlock_irqrestore(&pool->lock, flags);
-		gfp_temp = gfp_mask;
-		goto repeat_alloc;
-	}
-	gfp_temp = gfp_mask;
+		finish_wait(&pool->wait, &wait);
 
-	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
-	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
-		spin_unlock_irqrestore(&pool->lock, flags);
-		return NULL;
+		element = pool->alloc(gfp_mask, pool->pool_data);
 	}
-
-	/* Let's wait for someone else to return an element to @pool */
-	init_wait(&wait);
-	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
-
-	spin_unlock_irqrestore(&pool->lock, flags);
-
-	/*
-	 * FIXME: this should be io_schedule().  The timeout is there as a
-	 * workaround for some DM problems in 2.6.18.
-	 */
-	io_schedule_timeout(5*HZ);
-
-	finish_wait(&pool->wait, &wait);
-	goto repeat_alloc;
+	return element;
 }
 EXPORT_SYMBOL(mempool_alloc);
 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path
  2016-07-22 19:44                       ` Andrew Morton
@ 2016-07-23 18:52                         ` Vlastimil Babka
  -1 siblings, 0 replies; 102+ messages in thread
From: Vlastimil Babka @ 2016-07-23 18:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, David Rientjes, linux-mm,
	Mikulas Patocka, Ondrej Kozina, Tetsuo Handa, Mel Gorman,
	Neil Brown, LKML, dm-devel

On 07/22/2016 09:44 PM, Andrew Morton wrote:
> On Fri, 22 Jul 2016 14:26:19 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> On 07/22/2016 08:37 AM, Michal Hocko wrote:
>>> On Thu 21-07-16 16:53:09, Michal Hocko wrote:
>>>> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
>>>> From: Michal Hocko <mhocko@suse.com>
>>>> Date: Thu, 21 Jul 2016 16:40:59 +0200
>>>> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>>>>  free elements"
>>>>
>>>> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.
>>>
>>> I've noticed that Andrew has already picked this one up. Is anybody
>>> against marking it for stable?
>>
>> It would be strange to have different behavior with known regression in 
>> 4.6 and 4.7 stables. Actually, there's still time for 4.7 proper?
>>
> 
> I added the cc:stable.
> 
> Do we need to bust a gut to rush it into 4.7?  It sounds safer to let
> it bake for a while, fix it in 4.7.1?

Yeah, I guess it's safer to wait now. Would be different if the reverted
commit went in the same cycle.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-23  0:12         ` NeilBrown
@ 2016-07-25  8:32             ` Michal Hocko
  2016-07-25 21:52             ` Mikulas Patocka
  1 sibling, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-25  8:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel

On Sat 23-07-16 10:12:24, NeilBrown wrote:
> On Fri, Jul 22 2016, Michal Hocko wrote:
[...]
> >                          If we just back off and rely on kswapd which
> > might get stuck on the writeout then the IO throughput can be reduced
> 
> If I were king of MM, I would make a decree to be proclaimed throughout
> the land
>     kswapd must never sleep except when it explicitly chooses to
> 
> Maybe that is impractical, but having firm rules like that would go a
> long way to make it possible to actually understand and reason about how
> MM works.  As it is, there seems to be a tendency to put bandaids over
> bandaids.

Ohh, I would definitely wish for this to be more clear but as it turned
out over time there are quite some interdependencies between MM/FS/IO
layers which make the picture really blurry. If there is a brave soul to
make that more clear without breaking any of that it would be really
cool ;)

> > I believe which would make the whole memory pressure just worse. So I am
> > not sure this is a good idea in general. I completely agree with you
> > that the mempool request shouldn't be throttled unless there is a strong
> > reason for that. More on that below.
> >
> >> If I'm following the code properly, the stack trace below can only
> >> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> >> and io_schedule_timeout().
> >
> > mempool_alloc retries immediately without any sleep after the first
> > no-reclaim attempt.
> 
> I missed that ... I see it now... I wonder if anyone has contemplated
> using some modern programming techniques like, maybe, a "while" loop in
> there..
> Something like the below...

Heh, why not, the code could definitely see some more love. Care to send
a proper patch so that we are not mixing two different things here?

> >> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
> >> gets woken up when there is something in the pool.  It then loops around
> >> and tries pool->alloc() again, even though there is something in the
> >> pool.  This might be justified if that ->alloc would never block, but
> >> obviously it does.
> >> 
> >> I would very strongly recommend just changing mempool_alloc() to
> >> permanently mask out __GFP_DIRECT_RECLAIM.
> >> 
> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> >> It is "LESS" throttle, not "NO" throttle, but you have made
> >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
> >
> > Yes that is correct. But it still allows to throttle on congestion:
> > shrink_inactive_list:
> > 	/*
> > 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
> > 	 * is congested. Allow kswapd to continue until it starts encountering
> > 	 * unqueued dirty pages or cycling through the LRU too quickly.
> > 	 */
> > 	if (!sc->hibernation_mode && !current_is_kswapd() &&
> > 	    current_may_throttle())
> > 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> >
> > My thinking was that throttle_vm_writeout is there to prevent from
> > dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> > is part of the writeout so throttling it on too many dirty pages is
> > questionable (well we get some bias but that is not really reliable). It
> > still makes sense to throttle when the backing device is congested
> > because the writeout path wouldn't make much progress anyway and we also
> > do not want to cycle through LRU lists too quickly in that case.
> 
> "dirtying ... from the reclaim context" ??? What does that mean?

Say you would cause a swapout from the reclaim context. You would
effectively dirty that anon page until it gets written down to the
storage.

> According to
>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> From the history tree, the purpose of throttle_vm_writeout() is to
> limit the amount of memory that is concurrently under I/O.
> That seems strange to me because I thought it was the responsibility of
> each backing device to impose a limit - a maximum queue size of some
> sort.

We do throttle on the congestion during the reclaim so in some
sense this is already implemented but I am not really sure that is
sufficient. Maybe this is something to re-evaluate because
wait_iff_congested came in much later than throttle_vm_writeout. Let me
think about it some more.

> I remember when NFS didn't impose a limit and you could end up with lots
> of memory in NFS write-back, and very long latencies could result.
> 
> So I wonder what throttle_vm_writeout() really achieves these days.  Is
> it just a bandaid that no-one is brave enough to remove?

Maybe yes. It is sitting there quietly and you do not know about it
until it bites. Like in this particular case.
 
> I guess it could play a role in balancing the freeing of clean pages,
> which can be done instantly, against dirty pages, which require
> writeback.  Without some throttling, might all clean pages be freed
> too quickly, just trashing our read caches?

I do not see how that would happen. kswapd has its reclaim targets
depending on watermarks and direct reclaim has SWAP_CLUSTER_MAX. So none
of them should go too wild and reclaim way too many clean pages.

> > Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
> > dirtying of memory?
> 
> In most cases, nfsd is just like any other application and needs to be
> throttled like any other application when it writes too much data.
> The only time nfsd *needs* PF_LESS_THROTTLE is when a loop-back mount
> is active.  When the same page cache is the source and destination of
> writes.
> So nfsd needs to be able to dirty a few more pages when nothing else
> can due to high dirty count.  Otherwise it deadlocks.
> The main use of PF_LESS_THROTTLE is in zone_dirty_limit() and
> domain_dirty_limits() where an extra 25% is allowed to overcome this
> deadlock.
> 
> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> memory when it is writing to the page cache.  So it only needs to be
> throttled when the backing device it is writing to is congested.  It is
> particularly important that it *doesn't* get throttled just because an
> NFS backing device is congested, because nfsd might be trying to clear
> that congestion.

Thanks for the clarification. IIUC then removing throttle_vm_writeout
for the nfsd writeout should be harmless as well, right?

> In general, callers of try_to_free_pages() might get throttled when any
> backing device is congested.  This is a reasonable default when we don't
> know what they are allocating memory for.  When we do know the purpose of
> the allocation, we can be more cautious about throttling.
> 
> If a thread is allocating just to dirty pages for a given backing
> device, we only need to throttle the allocation if the backing device is
> congested.  Any further throttling needed happens in
> balance_dirty_pages().
> 
> If a thread is only making transient allocations, ones which will be
> freed shortly afterwards (not, for example, put in a cache), then I
> don't think it needs to be throttled at all.  I think this universally
> applies to mempools.
> In the case of dm_crypt, if it is writing too fast it will eventually be
> throttled in generic_make_request when the underlying device has a full
> queue and so blocks waiting for requests to be completed, and thus parts
> of them returned to the mempool.

Makes sense to me.

> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> as part of cleaning another page-cache page.
> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> But if dm-crypt is just allocating a transient page (which I think it
> >> is), then a mempool should be sufficient (and we should make sure it is
> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> isn't justified.
> >
> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> > appropriate then would a PF_MEMPOOL be any better?
> 
> Why a PF rather than a GFP flag?

Well, the short answer is that gfp flag bits are almost depleted.

> NFSD uses a PF because there is no GFP interface for filesystem write.
> But mempool can pass down a GFP flag, so I think it should.
> The meaning of the flag is, in my opinion, that a 'transient' allocation
> is being requested.  i.e. an allocation which will be used for a single
> purpose for a short amount of time and will then be freed.  In
> particular, it will never be placed in a cache, and if it is ever
> placed on a queue, that is certain to be a queue with an upper bound on
> the size and with guaranteed forward progress in the face of memory
> pressure.
> Any allocation request for a use case with those properties should be
> allowed to set GFP_TRANSIENT (for example) with the effect that the
> allocation will not be throttled.
> A key point with the name is to identify the purpose of the flag, not a
> specific use case (mempool) which we want it for.

Agreed. But let's first explore throttle_vm_writeout and its potential
removal.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
@ 2016-07-25  8:32             ` Michal Hocko
  0 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-25  8:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel

On Sat 23-07-16 10:12:24, NeilBrown wrote:
> On Fri, Jul 22 2016, Michal Hocko wrote:
[...]
> >                          If we just back off and rely on kswapd which
> > might get stuck on the writeout then the IO throughput can be reduced
> 
> If I were king of MM, I would make a decree to be proclaimed throughout
> the land
>     kswapd must never sleep except when it explicitly chooses to
> 
> Maybe that is impractical, but having firm rules like that would go a
> long way to make it possible to actually understand and reason about how
> MM works.  As it is, there seems to be a tendency to put bandaids over
> bandaids.

Ohh, I would definitely wish for this to be more clear but as it turned
out over time there are quite some interdependencies between MM/FS/IO
layers which make the picture really blur. If there is a brave soul to
make that more clear without breaking any of that it would be really
cool ;)

> > I believe which would make the whole memory pressure just worse. So I am
> > not sure this is a good idea in general. I completely agree with you
> > that the mempool request shouldn't be throttled unless there is a strong
> > reason for that. More on that below.
> >
> >> If I'm following the code properly, the stack trace below can only
> >> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> >> and io_schedule_timeout().
> >
> > mempool_alloc retries immediatelly without any sleep after the first
> > no-reclaim attempt.
> 
> I missed that ... I see it now... I wonder if anyone has contemplated
> using some modern programming techniques like, maybe, a "while" loop in
> there..
> Something like the below...

Heh, why not, the code could definitely see some more love. Care to send
a proper patch so that we are not mixing two different things here.

> >> I suspect the timeout *doesn't* fire (5 seconds is along time) so it
> >> gets woken up when there is something in the pool.  It then loops around
> >> and tries pool->alloc() again, even though there is something in the
> >> pool.  This might be justified if that ->alloc would never block, but
> >> obviously it does.
> >> 
> >> I would very strongly recommend just changing mempool_alloc() to
> >> permanently mask out __GFP_DIRECT_RECLAIM.
> >> 
> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> >> It is "LESS" throttle, not "NO" throttle, but you have made
> >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
> >
> > Yes that is correct. But it still allows to throttle on congestion:
> > shrink_inactive_list:
> > 	/*
> > 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
> > 	 * is congested. Allow kswapd to continue until it starts encountering
> > 	 * unqueued dirty pages or cycling through the LRU too quickly.
> > 	 */
> > 	if (!sc->hibernation_mode && !current_is_kswapd() &&
> > 	    current_may_throttle())
> > 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> >
> > My thinking was that throttle_vm_writeout is there to prevent from
> > dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> > is part of the writeout so throttling it on too many dirty pages is
> > questionable (well we get some bias but that is not really reliable). It
> > still makes sense to throttle when the backing device is congested
> > because the writeout path wouldn't make much progress anyway and we also
> > do not want to cycle through LRU lists too quickly in that case.
> 
> "dirtying ... from the reclaim context" ??? What does that mean?

Say you would cause a swapout from the reclaim context. You would
effectively dirty that anon page until it gets written down to the
storage.

> According to
>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> From the history tree, the purpose of throttle_vm_writeout() is to
> limit the amount of memory that is concurrently under I/O.
> That seems strange to me because I thought it was the responsibility of
> each backing device to impose a limit - a maximum queue size of some
> sort.

We do throttle on the congestion during the reclaim so in some
sense this is already implemented but I am not really sure that is
sufficient. Maybe this is something to re-evaluate because
wait_iff_congested came in much later after throttle_vm_writeout. Let me
think about it some more.

> I remember when NFS didn't impose a limit and you could end up with lots
> of memory in NFS write-back, and very long latencies could result.
> 
> So I wonder what throttle_vm_writeout() really achieves these days.  Is
> it just a bandaid that no-one is brave enough to remove?

Maybe yes. It is sitting there quietly and you do not know about it
until it bites. Like in this particular case.
 
> I guess it could play a role in balancing the freeing of clean pages,
> which can be done instantly, against dirty pages, which require
> writeback.  Without some throttling, might all clean pages being cleaned
> too quickly, just trashing our read caches?

I do not see how that would happen. kswapd has its reclaim targets
depending on watermarks and direct reclaim has SWAP_CLUSTER_MAX. So none
of them should go too wild and reclaim way too many clean pages.

> > Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
> > dirtying of memory?
> 
> In most cases, nfsd it just like any other application and needs to be
> throttled like any other application when it writes too much data.
> The only time nfsd *needs* PF_LESS_THROTTLE when when a loop-back mount
> is active.  When the same page cache is the source and destination of
> writes.
> So nfsd needs to be able to dirty a few more pages when nothing else
> can due to high dirty count.  Otherwise it deadlocks.
> The main use of PF_LESS_THROTTLE is in zone_dirty_limit() and
> domain_dirty_limits() where an extra 25% is allowed to overcome this
> deadlock.
> 
> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> memory when it is writing to the page cache.  So it only needs to be
> throttled when the backing device it is writing to is congested.  It is
> particularly important that it *doesn't* get throttled just because an
> NFS backing device is congested, because nfsd might be trying to clear
> that congestion.

Thanks for the clarification. IIUC then removing throttle_vm_writeout
for the nfsd writeout should be harmless as well, right?

> In general, callers of try_to_free_pages() might get throttled when any
> backing device is congested.  This is a reasonable default when we don't
> know what they are allocating memory for.  When we do know the purpose of
> the allocation, we can be more cautious about throttling.
> 
> If a thread is allocating just to dirty pages for a given backing
> device, we only need to throttle the allocation if the backing device is
> congested.  Any further throttling needed happens in
> balance_dirty_pages().
> 
> If a thread is only making transient allocations, ones which will be
> freed shortly afterwards (not, for example, put in a cache), then I
> don't think it needs to be throttled at all.  I think this universally
> applies to mempools.
> In the case of dm_crypt, if it is writing too fast it will eventually be
> throttled in generic_make_request when the underlying device has a full
> queue and so blocks waiting for requests to be completed, and thus parts
> of them returned to the mempool.

Makes sense to me.

> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> as part of cleaning another page-cache page.
> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> But if dm-crypt is just allocating a transient page (which I think it
> >> is), then a mempool should be sufficient (and we should make sure it is
> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> isn't justified.
> >
> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> > appropriate then would a PF_MEMPOOL be any better?
> 
> Why a PF rather than a GFP flag?

Well, short answer is that gfp masks are almost depleted.

> NFSD uses a PF because there is no GFP interface for filesystem write.
> But mempool can pass down a GFP flag, so I think it should.
> The meaning of the flag is, in my opinion, that a 'transient' allocation
> is being requested.  i.e. an allocation which will be used for a single
> purpose for a short amount of time and will then be freed.  In
> particular it will never be placed in a cache, and if it is ever
> placed on a queue, that is certain to be a queue with an upper bound on
> the size and with guaranteed forward progress in the face of memory
> pressure.
> Any allocation request for a use case with those properties should be
> allowed to set GFP_TRANSIENT (for example) with the effect that the
> allocation will not be throttled.
> A key point with the name is to identify the purpose of the flag, not a
> specific use case (mempool) which we want it for.

Agreed. But let's first explore throttle_vm_writeout and its potential
removal.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-25  8:32             ` Michal Hocko
  (?)
@ 2016-07-25 19:23               ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-25 19:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel,
	Marcelo Tosatti

[CC Marcelo who might remember other details for the loads which made
 him add this code - see the patch changelog for more context]

On Mon 25-07-16 10:32:47, Michal Hocko wrote:
> On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> > So I wonder what throttle_vm_writeout() really achieves these days.  Is
> > it just a bandaid that no-one is brave enough to remove?
> 
> Maybe yes. It is sitting there quietly and you do not know about it
> until it bites. Like in this particular case.

So I was playing with this today and tried to provoke throttle_vm_writeout
but couldn't hit that path with my pretty much default IO stack. I
probably need a more complex IO setup like dm-crypt or something else
that basically has to double-buffer every page in the writeout path for
some time.

Anyway I believe that throttle_vm_writeout is just a relic from the
past which has survived many other changes in the reclaim path. I
fully realize my testing is quite poor and I would really appreciate it
if Mikulas could retest with his more complex IO setups, but let me
post a patch with the changelog so that we can at least reason about the
justification. In principle the reclaim path should have sufficient
throttling already, and if that is not the case then we should
consolidate the remaining throttling points rather than have yet
another one.

Thoughts?
---
>From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 25 Jul 2016 14:18:54 +0200
Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout

throttle_vm_writeout was introduced back in 2005 to fix OOMs caused
by excessive pageout activity during reclaim. Too many pages could
be put under writeback, so the LRUs would be full of unreclaimable pages
until the IO completed, and in turn the OOM killer could be invoked.

There have been some important changes introduced since then in the
reclaim path though. Writers are throttled by balance_dirty_pages
when initiating buffered IO and, later under memory pressure, the
direct reclaim is throttled by wait_iff_congested if the node is
considered congested by dirty pages on LRUs and the underlying bdi
is congested by the queued IO. The kswapd is throttled as well if it
encounters pages marked for immediate reclaim or under writeback, which
signals that there are too many pages under writeback already.
Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. Under a heavy parallel load this could queue a
lot of IO which would be very scattered and thus inefficient, which
would just make the problem worse.

These three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure, so yet another
throttling point doesn't really seem helpful. Quite the contrary, Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on the dm_crypt worker, which has to allocate memory to
perform the encryption. In order to guarantee forward progress it relies
on the mempool allocator. mempool_alloc(), however, prefers to use
the underlying (usually page) allocator before it grabs objects from
the pool. Such an allocation can dive into the memory reclaim and
consequently into throttle_vm_writeout. If there are too many dirty
pages or pages under writeback it will get throttled even though it is
in fact a flusher meant to clear pending pages.

[  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
[  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
[  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
[  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
[  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
[  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
[  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
[  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
[  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
[  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
[  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
[  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
[  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
[  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
[  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Let's just drop throttle_vm_writeout altogether. It is not very
helpful anymore.

I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which introduced it [1]: a small
virtual machine (512MB RAM, 4 CPUs, 2G of swap space and a disk image on
a rather slow NFS mount in sync mode on the host) with 8 parallel
writers, each writing 1G worth of data. As soon as the pagecache fills
up and the direct reclaim hits, I start an anon memory consumer in a
loop (allocating 300M and exiting after populating it) in the background
to make the memory pressure even stronger as well as to disrupt the
steady state for the IO. The direct reclaim is throttled because of the
congestion, as is kswapd which hits congestion_wait due to nr_immediate,
but throttle_vm_writeout never triggers its sleep throughout the test.
Dirty+writeback stay close to nr_dirty_threshold with some fluctuations
caused by the anon consumer.

[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/writeback.h |  1 -
 mm/page-writeback.c       | 30 ------------------------------
 mm/vmscan.c               |  2 --
 3 files changed, 33 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 44b4422ae57f..f67a992cdf89 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -319,7 +319,6 @@ void laptop_mode_timer_fn(unsigned long data);
 #else
 static inline void laptop_sync_completion(void) { }
 #endif
-void throttle_vm_writeout(gfp_t gfp_mask);
 bool node_dirty_ok(struct pglist_data *pgdat);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 #ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b82303a9e67d..2828d6ca1451 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1962,36 +1962,6 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	return false;
 }
 
-void throttle_vm_writeout(gfp_t gfp_mask)
-{
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
-
-        for ( ; ; ) {
-		global_dirty_limits(&background_thresh, &dirty_thresh);
-		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
-
-                /*
-                 * Boost the allowable dirty threshold a bit for page
-                 * allocators so they don't get DoS'ed by heavy writers
-                 */
-                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
-
-                if (global_node_page_state(NR_UNSTABLE_NFS) +
-			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-		/*
-		 * The caller might hold locks which can prevent IO completion
-		 * or progress in the filesystem.  So we cannot just sit here
-		 * waiting for IO to complete.
-		 */
-		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
-			break;
-        }
-}
-
 /*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0294ab34f475..0f35ed30e35b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2410,8 +2410,6 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 	if (inactive_list_is_low(lruvec, false, sc))
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
-
-	throttle_vm_writeout(sc->gfp_mask);
 }
 
 /* Use reclaim/compaction for costly allocs or under memory pressure */
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 102+ messages in thread


* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-23  0:12         ` NeilBrown
@ 2016-07-25 21:52             ` Mikulas Patocka
  2016-07-25 21:52             ` Mikulas Patocka
  1 sibling, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-25 21:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Michal Hocko, linux-mm, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel



On Sat, 23 Jul 2016, NeilBrown wrote:

> "dirtying ... from the reclaim context" ??? What does that mean?
> According to
>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> From the history tree, the purpose of throttle_vm_writeout() is to
> limit the amount of memory that is concurrently under I/O.
> That seems strange to me because I thought it was the responsibility of
> each backing device to impose a limit - a maximum queue size of some
> sort.

Device mapper doesn't impose any limit on in-flight bios.

Some simple device mapper targets (such as linear or stripe) pass bios 
directly to the underlying device with generic_make_request, so if the 
underlying device's request limit is reached, the target's request routine 
waits.

However, complex dm targets (such as dm-crypt, dm-mirror, dm-thin) pass 
bios to a workqueue that processes them. And since there is no limit on 
the number of workqueue entries, there is no limit on the number of 
in-flight bios.

I've seen a case where I had an HPFS filesystem on dm-crypt. I wrote to 
the filesystem and there was about 2GB of dirty data. The HPFS filesystem 
used 512-byte bios, and dm-crypt allocates one temporary page for each 
incoming bio. So there were 4M bios in flight, and each bio allocated a 
4k temporary page - that is an attempted 16GB allocation. It didn't 
trigger the OOM condition (because mempool allocations never trigger it), 
but it temporarily exhausted all of the computer's memory.

I've made some patches that limit in-flight bios for device mapper in the 
past, but they were never integrated upstream.

> If a thread is only making transient allocations, ones which will be
> freed shortly afterwards (not, for example, put in a cache), then I
> don't think it needs to be throttled at all.  I think this universally
> applies to mempools.
> In the case of dm_crypt, if it is writing too fast it will eventually be
> throttled in generic_make_request when the underlying device has a full
> queue and so blocks waiting for requests to be completed, and thus parts
> of them returned to the mempool.

No, it won't be throttled.

dm-crypt does:
1. pass the bio to the encryption workqueue
2. allocate the outgoing bio and allocate temporary pages for the 
   encrypted data
3. do the encryption
4. pass the bio to the writer thread
5. submit the write request with generic_make_request

So, if the underlying block device is throttled, it stalls the writer 
thread, but it doesn't stall the encryption threads and it doesn't stall 
the caller that submits the bios to dm-crypt.

There can be a really high number of in-flight bios for dm-crypt.

Mikulas

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-25 19:23               ` Michal Hocko
@ 2016-07-26  7:07                 ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-26  7:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm, Mikulas Patocka, Ondrej Kozina, David Rientjes,
	Tetsuo Handa, Mel Gorman, Andrew Morton, LKML, dm-devel,
	Marcelo Tosatti

On Mon 25-07-16 21:23:44, Michal Hocko wrote:
> [CC Marcelo who might remember other details for the loads which made
>  him add this code - see the patch changelog for more context]
> 
> On Mon 25-07-16 10:32:47, Michal Hocko wrote:
[...]
> From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 25 Jul 2016 14:18:54 +0200
> Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout
> 
> throttle_vm_writeout was introduced back in 2005 to fix OOMs caused
> by excessive pageout activity during reclaim. Too many pages could
> be put under writeback, so the LRUs would be full of unreclaimable pages
> until the IO completed, and in turn the OOM killer could be invoked.
> 
> There have been some important changes introduced since then in the
> reclaim path though. Writers are throttled by balance_dirty_pages
> when initiating buffered IO and, later under memory pressure, the
> direct reclaim is throttled by wait_iff_congested if the node is
> considered congested by dirty pages on LRUs and the underlying bdi
> is congested by the queued IO. The kswapd is throttled as well if it
> encounters pages marked for immediate reclaim or under writeback, which
> signals that there are too many pages under writeback already.
> Another important aspect is that we do not issue any IO from the direct
> reclaim context anymore. Under a heavy parallel load this could queue a
> lot of IO which would be very scattered and thus inefficient, which
> would just make the problem worse.

And I forgot another throttling point. should_reclaim_retry which is the
main logic to decide whether we go OOM or not has a congestion_wait if
there are too many dirty/writeback pages. That should give the IO
subsystem some time to finish the IO.
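
For readers outside mm, the shape of that decision point can be modelled
in userspace roughly like this; the one-half threshold and all names are
invented for illustration, and the kernel's actual heuristic in
should_reclaim_retry differs in detail:

```c
#include <assert.h>
#include <stdbool.h>

/* Loose model: decide whether an allocation retry should sleep and
 * give the IO subsystem time, based on how much of the reclaimable
 * memory is tied up in dirty/writeback pages.  The one-half threshold
 * is an arbitrary stand-in, not the kernel's actual heuristic. */
static bool should_wait_for_io(unsigned long dirty_writeback,
			       unsigned long reclaimable)
{
	if (reclaimable == 0)
		return false;
	/* Sleep only when most of what we could reclaim is stuck in IO. */
	return 2 * dirty_writeback > reclaimable;
}
```

The point is only that the retry path itself already backs off when IO
dominates the reclaimable set, which is yet another throttling point on
top of the three listed above.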

> These three mechanisms should throttle and keep the amount of IO in a
> steady state even under heavy IO and memory pressure, so yet another
> throttling point doesn't really seem helpful. Quite the contrary: Mikulas
> Patocka has reported that swap backed by dm-crypt doesn't work properly
> because the swapout IO cannot make sufficient progress as the writeout
> path depends on dm_crypt worker which has to allocate memory to perform
> the encryption. In order to guarantee a forward progress it relies
> on the mempool allocator. mempool_alloc(), however, prefers to use
> the underlying (usually page) allocator before it grabs objects from
> the pool. Such an allocation can dive into the memory reclaim and
> consequently to throttle_vm_writeout. If there are too many dirty or
> pages under writeback it will get throttled even though it is in fact a
> flusher to clear pending pages.
> 
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> 
> Let's just drop throttle_vm_writeout altogether. It is not very
> helpful anymore.
> 
> I have tried to test a potential writeback IO runaway similar to the
> one described in the original patch which introduced it [1]: a small
> virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
> rather slow NFS in a sync mode on the host) with 8 parallel writers each
> writing 1G worth of data. As soon as the pagecache fills up and the
> direct reclaim kicks in, I start an anon memory consumer in a loop
> (allocating 300M and exiting after populating it) in the background
> to make the memory pressure even stronger as well as to disrupt the
> steady state for the IO. The direct reclaim is throttled because of the
> congestion as well as kswapd hitting congestion_wait due to nr_immediate
> but throttle_vm_writeout doesn't ever trigger the sleep throughout
> the test. Dirty+writeback are close to nr_dirty_threshold with some
> fluctuations caused by the anon consumer.
> 
> [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
> Cc: Marcelo Tosatti <mtosatti@redhat.com>
> Reported-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/writeback.h |  1 -
>  mm/page-writeback.c       | 30 ------------------------------
>  mm/vmscan.c               |  2 --
>  3 files changed, 33 deletions(-)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 44b4422ae57f..f67a992cdf89 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -319,7 +319,6 @@ void laptop_mode_timer_fn(unsigned long data);
>  #else
>  static inline void laptop_sync_completion(void) { }
>  #endif
> -void throttle_vm_writeout(gfp_t gfp_mask);
>  bool node_dirty_ok(struct pglist_data *pgdat);
>  int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
>  #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b82303a9e67d..2828d6ca1451 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,36 +1962,6 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  	return false;
>  }
>  
> -void throttle_vm_writeout(gfp_t gfp_mask)
> -{
> -	unsigned long background_thresh;
> -	unsigned long dirty_thresh;
> -
> -        for ( ; ; ) {
> -		global_dirty_limits(&background_thresh, &dirty_thresh);
> -		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -
> -                /*
> -                 * Boost the allowable dirty threshold a bit for page
> -                 * allocators so they don't get DoS'ed by heavy writers
> -                 */
> -                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> -
> -                if (global_node_page_state(NR_UNSTABLE_NFS) +
> -			global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> -
> -		/*
> -		 * The caller might hold locks which can prevent IO completion
> -		 * or progress in the filesystem.  So we cannot just sit here
> -		 * waiting for IO to complete.
> -		 */
> -		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
> -			break;
> -        }
> -}
> -
>  /*
>   * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
>   */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0294ab34f475..0f35ed30e35b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2410,8 +2410,6 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  	if (inactive_list_is_low(lruvec, false, sc))
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
> -
> -	throttle_vm_writeout(sc->gfp_mask);
>  }
>  
>  /* Use reclaim/compaction for costly allocs or under memory pressure */
> -- 
> 2.8.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-25 21:52             ` Mikulas Patocka
@ 2016-07-26  7:25               ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-26  7:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, linux-mm, Ondrej Kozina, David Rientjes, Tetsuo Handa,
	Mel Gorman, Andrew Morton, LKML, dm-devel

On Mon 25-07-16 17:52:17, Mikulas Patocka wrote:
> 
> 
> On Sat, 23 Jul 2016, NeilBrown wrote:
> 
> > "dirtying ... from the reclaim context" ??? What does that mean?
> > According to
> >   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> > From the history tree, the purpose of throttle_vm_writeout() is to
> > limit the amount of memory that is concurrently under I/O.
> > That seems strange to me because I thought it was the responsibility of
> > each backing device to impose a limit - a maximum queue size of some
> > sort.
> 
> Device mapper doesn't impose any limit for in-flight bios.
> 
> Some simple device mapper targets (such as linear or stripe) pass bio 
> directly to the underlying device with generic_make_request, so if the 
> underlying device's request limit is reached, the target's request routine 
> waits.
> 
> However, complex dm targets (such as dm-crypt, dm-mirror, dm-thin) pass 
> bios to a workqueue that processes them. And since there is no limit on 
> the number of workqueue entries, there is no limit on the number of 
> in-flight bios.
> 
> I've seen a case when I had a HPFS filesystem on dm-crypt. I wrote to the 
> filesystem, there was about 2GB dirty data. The HPFS filesystem used 
> 512-byte bios. dm-crypt allocates one temporary page for each incoming 
> bio. So, there were 4M bios in flight, each bio allocated 4k temporary 
> page - that is attempted 16GB allocation. It didn't trigger OOM condition 
> (because mempool allocations don't ever trigger it), but it temporarily 
> exhausted all computer's memory.

OK, that is certainly not good and something that throttle_vm_writeout
aimed to protect against. It is rather weak protection, though, because
it might fire much earlier than necessary. Shouldn't those workers
simply backoff when the underlying bdi is congested? It wouldn't help
to queue more IO when the bdi is hammered already.
 
> I've made some patches that limit in-flight bios for device mapper in the 
> past, but they were not integrated upstream.

Care to revive them? I am not an expert in dm but unbounded amount of
inflight IO doesn't really sound good.
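
For illustration, the kind of per-target in-flight limit being discussed
could look roughly like this in userspace terms. All names are invented;
a kernel version would sleep on a waitqueue rather than fail the
submission:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of a per-target in-flight bio limit:
 * submissions above the cap are refused (a real implementation
 * would block the submitter until a bio completes). */
struct inflight_limit {
	unsigned int inflight;
	unsigned int max_inflight;
};

static bool bio_start(struct inflight_limit *l)
{
	if (l->inflight >= l->max_inflight)
		return false;		/* caller would wait here */
	l->inflight++;
	return true;
}

static void bio_end(struct inflight_limit *l)
{
	assert(l->inflight > 0);
	l->inflight--;
}
```

With such a cap in place, the 4M-bios-in-flight scenario above could not
build up, at the cost of the submitting worker blocking sooner.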

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-25  8:32             ` Michal Hocko
  (?)
  (?)
@ 2016-07-27  3:43             ` NeilBrown
  2016-07-27 18:24                 ` Michal Hocko
  -1 siblings, 1 reply; 102+ messages in thread
From: NeilBrown @ 2016-07-27  3:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, LKML, linux-mm, dm-devel, Mikulas Patocka,
	Mel Gorman, David Rientjes, Ondrej Kozina, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 4267 bytes --]

On Mon, Jul 25 2016, Michal Hocko wrote:

> On Sat 23-07-16 10:12:24, NeilBrown wrote:

>> Maybe that is impractical, but having firm rules like that would go a
>> long way to make it possible to actually understand and reason about how
>> MM works.  As it is, there seems to be a tendency to put bandaids over
>> bandaids.
>
> Ohh, I would definitely wish for this to be more clear but as it turned
> out over time there are quite some interdependencies between MM/FS/IO
> layers which make the picture really blur. If there is a brave soul to
> make that more clear without breaking any of that it would be really
> cool ;)

Just need that comprehensive regression-test-suite and off we go....


>> > My thinking was that throttle_vm_writeout is there to prevent from
>> > dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
>> > is part of the writeout so throttling it on too many dirty pages is
>> > questionable (well we get some bias but that is not really reliable). It
>> > still makes sense to throttle when the backing device is congested
>> > because the writeout path wouldn't make much progress anyway and we also
>> > do not want to cycle through LRU lists too quickly in that case.
>> 
>> "dirtying ... from the reclaim context" ??? What does that mean?
>
> Say you would cause a swapout from the reclaim context. You would
> effectively dirty that anon page until it gets written down to the
> storage.

I should probably figure out how swap really works.  I have vague ideas
which are probably missing important details...
Isn't the first step that the page gets moved into the swap-cache - and
marked dirty I guess.  Then it gets written out and the page is marked
'clean'.
Then further memory pressure might push it out of the cache, or an early
re-use would pull it back from the cache.
If so, then "dirtying in reclaim context" could also be described as
"moving into the swap cache" - yes?  So should there be a limit on dirty
pages in the swap cache just like there is for dirty pages in any
filesystem (the max_dirty_ratio thing) ??
Maybe there is?
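
The lifecycle sketched above can be written down as a small state model.
The states and transitions here are one reading of the description in
this mail, not kernel structures:

```c
#include <assert.h>

/* A loose model of the page lifecycle described above: anon page ->
 * swap cache (dirty) -> written out (clean) -> either reclaimed to
 * disk or pulled back on access.  Illustrative only. */
enum swap_state { ANON, SWAPCACHE_DIRTY, SWAPCACHE_CLEAN, ON_DISK };

static enum swap_state add_to_swap(enum swap_state s)
{
	return s == ANON ? SWAPCACHE_DIRTY : s;
}

static enum swap_state writeback_done(enum swap_state s)
{
	return s == SWAPCACHE_DIRTY ? SWAPCACHE_CLEAN : s;
}

static enum swap_state reclaim(enum swap_state s)
{
	return s == SWAPCACHE_CLEAN ? ON_DISK : s;
}

static enum swap_state fault_back(enum swap_state s)
{
	return (s == SWAPCACHE_CLEAN || s == ON_DISK) ? ANON : s;
}
```

In this picture, "dirtying in reclaim context" is exactly the
ANON -> SWAPCACHE_DIRTY transition, which is what a dirty-ratio style
limit on the swap cache would constrain.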

>> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
>> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
>> memory when it is writing to the page cache.  So it only needs to be
>> throttled when the backing device it is writing to is congested.  It is
>> particularly important that it *doesn't* get throttled just because an
>> NFS backing device is congested, because nfsd might be trying to clear
>> that congestion.
>
> Thanks for the clarification. IIUC then removing throttle_vm_writeout
> for the nfsd writeout should be harmless as well, right?

Certainly shouldn't hurt from the perspective of nfsd.

>> >> The purpose of that flag is to allow a thread to dirty a page-cache page
>> >> as part of cleaning another page-cache page.
>> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
>> >> for dm-crypt if it was putting the encrypted version in the page cache.
>> >> But if dm-crypt is just allocating a transient page (which I think it
>> >> is), then a mempool should be sufficient (and we should make sure it is
>> >> sufficient) and access to an extra 10% (or whatever) of the page cache
>> >> isn't justified.
>> >
>> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
>> > appropriate then would a PF_MEMPOOL be any better?
>> 
>> Why a PF rather than a GFP flag?
>
> Well, short answer is that gfp masks are almost depleted.

Really?  We have 26.

pagemap has a cute hack to store both GFP flags and other flag bits in
one 32-bit number per address_space.  'struct address_space' could
afford an extra 32-bit number I think.

radix_tree_root adds 3 'tag' flags to the gfp_mask.
There is 16bits of free space in radix_tree_node (between 'offset' and
'count').  That space on the root node could store a record of which tags
are set anywhere.  Or would that extra memory de-ref be a killer?

I think we'd end up with cleaner code if we removed the cute-hacks.  And
we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
need all those 26).
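
The packing trick being referred to amounts to something like this; the
bit positions are illustrative only, not the kernel's actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the "cute hack" style packing: 26 gfp bits in the low
 * part of a 32-bit word, with 3 tag bits stored above them. */
#define GFP_BITS	26
#define GFP_MASK	((1u << GFP_BITS) - 1)
#define TAG_SHIFT	GFP_BITS

static uint32_t pack(uint32_t gfp, uint32_t tags)
{
	return (gfp & GFP_MASK) | (tags << TAG_SHIFT);
}

static uint32_t get_gfp(uint32_t w)  { return w & GFP_MASK; }
static uint32_t get_tags(uint32_t w) { return w >> TAG_SHIFT; }
```

Removing the hack would free those upper bit positions for new GFP
flags, at the cost of finding the tag bits a home elsewhere.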

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-25 21:52             ` Mikulas Patocka
  (?)
  (?)
@ 2016-07-27  4:02             ` NeilBrown
  2016-07-27 14:28                 ` Mikulas Patocka
  -1 siblings, 1 reply; 102+ messages in thread
From: NeilBrown @ 2016-07-27  4:02 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Tetsuo Handa, LKML, Michal Hocko, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1715 bytes --]

On Tue, Jul 26 2016, Mikulas Patocka wrote:

> On Sat, 23 Jul 2016, NeilBrown wrote:
>
>> "dirtying ... from the reclaim context" ??? What does that mean?
>> According to
>>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
>> From the history tree, the purpose of throttle_vm_writeout() is to
>> limit the amount of memory that is concurrently under I/O.
>> That seems strange to me because I thought it was the responsibility of
>> each backing device to impose a limit - a maximum queue size of some
>> sort.
>
> Device mapper doesn't impose any limit for in-flight bios.

I would suggest that it probably should. At least it should
"set_wb_congested()" when the number of in-flight bios reaches some
arbitrary threshold.

The write-back throttling needs this to get an estimate of how fast the
backing device is, so it can share the dirty_threshold space fairly
among the different backing devices.

I added an arbitrary limit to raid1 back in 2011 (34db0cd60f8a1f)
because the lack of a limit was causing problems.
Specifically the write queue would get so long that ext3 would block for
an extended period when trying to flush a transaction, and that blocked
lots of other things, like atime updates.

Maybe there have been other fixes since then to other parts of the
puzzle, but the congestion tracking still seems to be an important part
of the picture and I think it would be best if every bdi would admit to
being congested well before it has consumed a significant fraction of
memory in its output queue.
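
As a sketch of that suggestion: set_wb_congested()/clear_wb_congested()
are real kernel helpers, but everything else below is an invented
userspace model with an arbitrary threshold:

```c
#include <assert.h>

/* Model of marking a backing device congested once in-flight bios
 * cross a threshold, and clearing it when they drain below it.
 * A real version would likely want some hysteresis between the
 * set and clear points. */
struct bdi_model {
	unsigned int inflight;
	unsigned int congest_threshold;
	int congested;
};

static void model_bio_submit(struct bdi_model *b)
{
	if (++b->inflight >= b->congest_threshold)
		b->congested = 1;	/* set_wb_congested() in the kernel */
}

static void model_bio_complete(struct bdi_model *b)
{
	if (--b->inflight < b->congest_threshold)
		b->congested = 0;	/* clear_wb_congested() */
}
```

This gives write-back throttling the congestion signal it needs without
imposing a hard cap on submissions.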

> I've made some patches that limit in-flight bios for device mapper in
> the past, but they were not integrated upstream.

I second the motion to resurrect these.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27  4:02             ` [dm-devel] " NeilBrown
@ 2016-07-27 14:28                 ` Mikulas Patocka
  0 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-07-27 14:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tetsuo Handa, LKML, Michal Hocko, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton



On Wed, 27 Jul 2016, NeilBrown wrote:

> On Tue, Jul 26 2016, Mikulas Patocka wrote:
> 
> > On Sat, 23 Jul 2016, NeilBrown wrote:
> >
> >> "dirtying ... from the reclaim context" ??? What does that mean?
> >> According to
> >>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> >> From the history tree, the purpose of throttle_vm_writeout() is to
> >> limit the amount of memory that is concurrently under I/O.
> >> That seems strange to me because I thought it was the responsibility of
> >> each backing device to impose a limit - a maximum queue size of some
> >> sort.
> >
> > Device mapper doesn't impose any limit for in-flight bios.
> 
> I would suggest that it probably should. At least it should
> "set_wb_congested()" when the number of in-flight bios reaches some
> arbitrary threshold.

If we set the device mapper device as congested, it can again trigger that 
mempool alloc throttling bug.

I.e. suppose that we swap to a dm-crypt device. The dm-crypt device 
becomes clogged and sets its state as congested. The underlying block 
device is not congested.

The mempool_alloc function in the dm-crypt workqueue sets the 
PF_LESS_THROTTLE flag, and tries to allocate memory, but according to 
Michal's patches, processes with PF_LESS_THROTTLE may still get throttled.

So if we set the dm-crypt device as congested, it can incorrectly throttle 
the dm-crypt workqueue that does allocations of temporary pages and 
encryption.

I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
and that mempool allocations should never be throttled.

> > I've made some patches that limit in-flight bios for device mapper in
> > the past, but they were not integrated into upstream.
> 
> I second the motion to resurrect these.

I uploaded those patches here:

http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

Mikulas

> Thanks,
> NeilBrown
> 


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27  3:43             ` [dm-devel] " NeilBrown
@ 2016-07-27 18:24                 ` Michal Hocko
  0 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-27 18:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tetsuo Handa, LKML, linux-mm, dm-devel, Mikulas Patocka,
	Mel Gorman, David Rientjes, Ondrej Kozina, Andrew Morton

On Wed 27-07-16 13:43:35, NeilBrown wrote:
> On Mon, Jul 25 2016, Michal Hocko wrote:
> 
> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> >> > My thinking was that throttle_vm_writeout is there to prevent from
> >> > dirtying too many pages from the reclaim the context.  PF_LESS_THROTTLE
> >> > is part of the writeout so throttling it on too many dirty pages is
> >> > questionable (well we get some bias but that is not really reliable). It
> >> > still makes sense to throttle when the backing device is congested
> >> > because the writeout path wouldn't make much progress anyway and we also
> >> > do not want to cycle through LRU lists too quickly in that case.
> >> 
> >> "dirtying ... from the reclaim context" ??? What does that mean?
> >
> > Say you would cause a swapout from the reclaim context. You would
> > effectively dirty that anon page until it gets written down to the
> > storage.
> 
> I should probably figure out how swap really works.  I have vague ideas
> which are probably missing important details...
> Isn't the first step that the page gets moved into the swap-cache - and
> marked dirty I guess.  Then it gets written out and the page is marked
> 'clean'.
> Then further memory pressure might push it out of the cache, or an early
> re-use would pull it back from the cache.
> If so, then "dirtying in reclaim context" could also be described as
> "moving into the swap cache" - yes?

Yes, that is basically correct.

> So should there be a limit on dirty
> pages in the swap cache just like there is for dirty pages in any
> filesystem (the max_dirty_ratio thing) ??
> Maybe there is?

There is no limit AFAIK. We rely on the reclaim being throttled when
necessary.
 
> >> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> >> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> >> memory when it is writing to the page cache.  So it only needs to be
> >> throttled when the backing device it is writing to is congested.  It is
> >> particularly important that it *doesn't* get throttled just because an
> >> NFS backing device is congested, because nfsd might be trying to clear
> >> that congestion.
> >
> > Thanks for the clarification. IIUC then removing throttle_vm_writeout
> > for the nfsd writeout should be harmless as well, right?
> 
> Certainly shouldn't hurt from the perspective of nfsd.
> 
> >> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> >> as part of cleaning another page-cache page.
> >> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> >> But if dm-crypt is just allocating a transient page (which I think it
> >> >> is), then a mempool should be sufficient (and we should make sure it is
> >> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> >> isn't justified.
> >> >
> >> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> >> > appropriate then would a PF_MEMPOOL be any better?
> >> 
> >> Why a PF rather than a GFP flag?
> >
> > Well, short answer is that gfp masks are almost depleted.
> 
> Really?  We have 26.
> 
> pagemap has a cute hack to store both GFP flags and other flag bits in
> the one 32 bit number per address_space.  'struct address_space' could
> afford an extra 32 bit number I think.
> 
> radix_tree_root adds 3 'tag' flags to the gfp_mask.
> There is 16bits of free space in radix_tree_node (between 'offset' and
> 'count').  That space on the root node could store a record of which tags
> are set anywhere.  Or would that extra memory de-ref be a killer?

Yes these are reasons why adding new gfp flags is more complicated.

> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> need all those 26).

Well, maybe we are able to remove those hacks; I definitely wouldn't
be opposed.  But right now I am not even convinced that a mempool
specific gfp flag is the right way to go.

-- 
Michal Hocko
SUSE Labs


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27 14:28                 ` Mikulas Patocka
@ 2016-07-27 18:40                   ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-27 18:40 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton

On Wed 27-07-16 10:28:40, Mikulas Patocka wrote:
> 
> 
> On Wed, 27 Jul 2016, NeilBrown wrote:
> 
> > On Tue, Jul 26 2016, Mikulas Patocka wrote:
> > 
> > > On Sat, 23 Jul 2016, NeilBrown wrote:
> > >
> > >> "dirtying ... from the reclaim context" ??? What does that mean?
> > >> According to
> > >>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> > >> From the history tree, the purpose of throttle_vm_writeout() is to
> > >> limit the amount of memory that is concurrently under I/O.
> > >> That seems strange to me because I thought it was the responsibility of
> > >> each backing device to impose a limit - a maximum queue size of some
> > >> sort.
> > >
> > > Device mapper doesn't impose any limit for in-flight bios.
> > 
> > I would suggest that it probably should. At least it should
> > "set_wb_congested()" when the number of in-flight bios reaches some
> > arbitrary threshold.
> 
> If we set the device mapper device as congested, it can again trigger that 
> mempool alloc throttling bug.
> 
> I.e. suppose that we swap to a dm-crypt device. The dm-crypt device 
> becomes clogged and sets its state as congested. The underlying block 
> device is not congested.
> 
> The mempool_alloc function in the dm-crypt workqueue sets the 
> PF_LESS_THROTTLE flag, and tries to allocate memory, but according to 
> Michal's patches, processes with PF_LESS_THROTTLE may still get throttled.
> 
> So if we set the dm-crypt device as congested, it can incorrectly throttle 
> the dm-crypt workqueue that does allocations of temporary pages and 
> encryption.
> 
> I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> and that mempool allocations should never be throttled.

I'm not really sure this is the right approach. If a particular mempool
user cannot ever be throttled by the page allocator then it should use
GFP_NOWAIT. Even for mempool allocations, reclaim shouldn't be allowed
to scan pages too quickly when the LRU lists are full of dirty pages. But
as I've said, that would restrict the success rates even under light page
cache load. Throttling on wait_iff_congested should be quite rare.

Anyway do you see an excessive throttling with the patch posted
http://lkml.kernel.org/r/20160725192344.GD2166@dhcp22.suse.cz ? Or from
another side. Do you see an excessive number of dirty/writeback pages
wrt. the dirty threshold or any other undesirable side effects?
-- 
Michal Hocko
SUSE Labs


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27 18:24                 ` Michal Hocko
@ 2016-07-27 21:33                 ` NeilBrown
  2016-07-28  7:17                     ` Michal Hocko
  -1 siblings, 1 reply; 102+ messages in thread
From: NeilBrown @ 2016-07-27 21:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, LKML, linux-mm, dm-devel, Mikulas Patocka,
	Mel Gorman, David Rientjes, Ondrej Kozina, Andrew Morton

On Thu, Jul 28 2016, Michal Hocko wrote:

> On Wed 27-07-16 13:43:35, NeilBrown wrote:
>> On Mon, Jul 25 2016, Michal Hocko wrote:
>> 
>> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
> [...]
>> So should there be a limit on dirty
>> pages in the swap cache just like there is for dirty pages in any
>> filesystem (the max_dirty_ratio thing) ??
>> Maybe there is?
>
> There is no limit AFAIK. We rely on the reclaim being throttled
> when necessary.

Is that a bit indirect?  It is hard to tell without a clear big-picture.
Something to keep in mind anyway.

>
>> I think we'd end up with cleaner code if we removed the cute-hacks.  And
>> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
>> need all those 26).
>
> Well, maybe we are able to remove those hacks, I wouldn't definitely
> be opposed.  But right now I am not even convinced that the mempool
> specific gfp flags is the right way to go.

I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
transient-allocation gfp flag, which would be quite useful for mempool.

Can you give more details on why using a gfp flag isn't your first choice
for guiding what happens when the system is trying to get a free page
:-?

Thanks,
NeilBrown


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27 14:28                 ` Mikulas Patocka
@ 2016-07-27 21:36                 ` NeilBrown
  -1 siblings, 0 replies; 102+ messages in thread
From: NeilBrown @ 2016-07-27 21:36 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Tetsuo Handa, LKML, Michal Hocko, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton

On Thu, Jul 28 2016, Mikulas Patocka wrote:

> On Wed, 27 Jul 2016, NeilBrown wrote:
>
>> On Tue, Jul 26 2016, Mikulas Patocka wrote:
>> 
>> > On Sat, 23 Jul 2016, NeilBrown wrote:
>> >
>> >> "dirtying ... from the reclaim context" ??? What does that mean?
>> >> According to
>> >>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
>> >> From the history tree, the purpose of throttle_vm_writeout() is to
>> >> limit the amount of memory that is concurrently under I/O.
>> >> That seems strange to me because I thought it was the responsibility of
>> >> each backing device to impose a limit - a maximum queue size of some
>> >> sort.
>> >
>> > Device mapper doesn't impose any limit for in-flight bios.
>> 
>> I would suggest that it probably should. At least it should
>> "set_wb_congested()" when the number of in-flight bios reaches some
>> arbitrary threshold.
>
> If we set the device mapper device as congested, it can again trigger that 
> mempool alloc throttling bug.
>
> I.e. suppose that we swap to a dm-crypt device. The dm-crypt device 
> becomes clogged and sets its state as congested. The underlying block 
> device is not congested.
>
> The mempool_alloc function in the dm-crypt workqueue sets the 
> PF_LESS_THROTTLE flag, and tries to allocate memory, but according to 
> Michal's patches, processes with PF_LESS_THROTTLE may still get throttled.
>
> So if we set the dm-crypt device as congested, it can incorrectly throttle 
> the dm-crypt workqueue that does allocations of temporary pages and 
> encryption.
>
> I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> and that mempool allocations should never be throttled.

I very much agree with that last statement!  It may be that to get to
that point we will need all backing devices to signal congestion
correctly.

>
>> > I've made some patches that limit in-flight bios for device mapper in
>> > the past, but they were not integrated into upstream.
>> 
>> I second the motion to resurrect these.
>
> I uploaded those patches here:
>
> http://people.redhat.com/~mpatocka/patches/kernel/dm-limit-outstanding-bios/

Thanks!  I'll have a look.

NeilBrown


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27 21:33                 ` NeilBrown
@ 2016-07-28  7:17                     ` Michal Hocko
  0 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-07-28  7:17 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tetsuo Handa, LKML, linux-mm, dm-devel, Mikulas Patocka,
	Mel Gorman, David Rientjes, Ondrej Kozina, Andrew Morton

On Thu 28-07-16 07:33:19, NeilBrown wrote:
> On Thu, Jul 28 2016, Michal Hocko wrote:
> 
> > On Wed 27-07-16 13:43:35, NeilBrown wrote:
> >> On Mon, Jul 25 2016, Michal Hocko wrote:
> >> 
> >> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
> > [...]
> >> So should there be a limit on dirty
> >> pages in the swap cache just like there is for dirty pages in any
> >> filesystem (the max_dirty_ratio thing) ??
> >> Maybe there is?
> >
> > There is no limit AFAIK. We are relying that the reclaim is throttled
> > when necessary.
> 
> Is that a bit indirect?

Yes it is. Dunno how much of a problem that is, though.

> It is hard to tell without a clear big-picture.
> Something to keep in mind anyway.
> 
> >
> >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> >> need all those 26).
> >
> > Well, maybe we are able to remove those hacks, I wouldn't definitely
> > be opposed.  But right now I am not even convinced that the mempool
> > specific gfp flags is the right way to go.
> 
> I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> transient-allocation gfp flag, which would be quite useful for mempool.
> 
> Can you give more details on why using a gfp flag isn't your first choice
> for guiding what happens when the system is trying to get a free page
> :-?

If we get rid of throttle_vm_writeout then I guess it might turn out to
be unnecessary. There are other places which will still throttle but I
believe those should be kept regardless of who is doing the allocation
because they help keep the LRU scanning sane. I might be wrong here
and bailing out from the reclaim rather than waiting would turn out
better for some users but I would like to see whether the first approach
works reasonably well.
-- 
Michal Hocko
SUSE Labs


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-28  7:17                     ` Michal Hocko
@ 2016-08-03 12:53                       ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-03 12:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton



On Thu, 28 Jul 2016, Michal Hocko wrote:

> > >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> > >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> > >> need all those 26).
> > >
> > > Well, maybe we are able to remove those hacks, I wouldn't definitely
> > > be opposed.  But right now I am not even convinced that the mempool
> > > specific gfp flags is the right way to go.
> > 
> > I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> > transient-allocation gfp flag, which would be quite useful for mempool.
> > 
> > Can you give more details on why using a gfp flag isn't your first choice
> > for guiding what happens when the system is trying to get a free page
> > :-?
> 
> If we get rid of throttle_vm_writeout then I guess it might turn out to
> be unnecessary. There are other places which will still throttle but I
> believe those should be kept regardless of who is doing the allocation
> because they help keep the LRU scanning sane. I might be wrong here
> and bailing out from the reclaim rather than waiting would turn out
> better for some users but I would like to see whether the first approach
> works reasonably well.

If we are swapping to a dm-crypt device, the dm-crypt device is congested,
and the underlying block device is not congested, then we should not
throttle mempool allocations made from the dm-crypt workqueue. Not even a
little bit.

So, I think, mempool_alloc should set PF_NO_THROTTLE (or 
__GFP_NO_THROTTLE).

Mikulas

> -- 
> Michal Hocko
> SUSE Labs
> 


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-07-27 18:40                   ` Michal Hocko
@ 2016-08-03 13:59                     ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-03 13:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton



On Wed, 27 Jul 2016, Michal Hocko wrote:

> On Wed 27-07-16 10:28:40, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 27 Jul 2016, NeilBrown wrote:
> > 
> > > On Tue, Jul 26 2016, Mikulas Patocka wrote:
> > > 
> > > > On Sat, 23 Jul 2016, NeilBrown wrote:
> > > >
> > > >> "dirtying ... from the reclaim context" ??? What does that mean?
> > > >> According to
> > > >>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> > > >> From the history tree, the purpose of throttle_vm_writeout() is to
> > > >> limit the amount of memory that is concurrently under I/O.
> > > >> That seems strange to me because I thought it was the responsibility of
> > > >> each backing device to impose a limit - a maximum queue size of some
> > > >> sort.
> > > >
> > > > Device mapper doesn't impose any limit for in-flight bios.
> > > 
> > > I would suggest that it probably should. At least it should
> > > "set_wb_congested()" when the number of in-flight bios reaches some
> > > arbitrary threshold.
> > 
> > If we set the device mapper device as congested, it can again trigger that 
> > mempool alloc throttling bug.
> > 
> > I.e. suppose that we swap to a dm-crypt device. The dm-crypt device 
> > becomes clogged and sets its state as congested. The underlying block 
> > device is not congested.
> > 
> > The mempool_alloc function in the dm-crypt workqueue sets the 
> > PF_LESS_THROTTLE flag, and tries to allocate memory, but according to 
> > Michal's patches, processes with PF_LESS_THROTTLE may still get throttled.
> > 
> > So if we set the dm-crypt device as congested, it can incorrectly throttle 
> > the dm-crypt workqueue that does allocations of temporary pages and 
> > encryption.
> > 
> > I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> > and that mempool allocations should never be throttled.
> 
> I'm not really sure this is the right approach. If a particular mempool
> user cannot ever be throttled by the page allocator then it should
> perform GFP_NOWAIT.

Then, all block device drivers should have GFP_NOWAIT - which means that
we might as well make it the default.

But GFP_NOWAIT also disables direct reclaim. We really want direct reclaim 
when allocating from mempool - we just don't want to throttle due to block 
device congestion.

We could use __GFP_NORETRY as an indication that we don't want to sleep - 
or make a new flag __GFP_NO_THROTTLE.

> Even mempool allocations shouldn't allow reclaim to
> scan pages too quickly even when LRU lists are full of dirty pages. But
> as I've said that would restrict the success rates even under light page
> cache load. Throttling on the wait_iff_congested should be quite rare.
> 
> Anyway do you see an excessive throttling with the patch posted
> http://lkml.kernel.org/r/20160725192344.GD2166@dhcp22.suse.cz ? Or from

It didn't have much effect.

Since the patch 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb (revert of the 
limitless mempool allocations), swapping to dm-crypt works in the simple 
example.

> another side. Do you see an excessive number of dirty/writeback pages
> wrt. the dirty threshold or any other undesirable side effects?
> -- 
> Michal Hocko
> SUSE Labs

I also got dmcrypt stalled in bt_get when submitting I/Os to the
underlying virtio device. I don't know what could be done about it.

[   30.441074] dmcrypt_write   D ffff88003de7bba8     0  2155      2 0x00080000
[   30.441956]  ffff88003de7bba8 ffff88003de7be70 ffff88003de7c000 ffff88003fc34740
[   30.442934]  7fffffffffffffff ffff88003fc3a680 ffff880037a911f8 ffff88003de7bbc0
[   30.443969]  ffffffff812770df 7fffffffffffffff ffff88003de7bc10 ffffffff81278ca7
[   30.444926] Call Trace:
[   30.445232]  [<ffffffff812770df>] schedule+0x83/0x98
[   30.445825]  [<ffffffff81278ca7>] schedule_timeout+0x2f/0xcf
[   30.446506]  [<ffffffff81276c84>] io_schedule_timeout+0x64/0x90
[   30.447235]  [<ffffffff81276c84>] ? io_schedule_timeout+0x64/0x90
[   30.448088]  [<ffffffff8115787a>] bt_get+0x11a/0x1bc
[   30.448688]  [<ffffffff8105ef86>] ? wake_up_atomic_t+0x25/0x25
[   30.449392]  [<ffffffff81157abb>] blk_mq_get_tag+0x7e/0x9b
[   30.450041]  [<ffffffff81155066>] __blk_mq_alloc_request+0x1b/0x1e0
[   30.450805]  [<ffffffff81155ee8>] blk_mq_map_request+0xf6/0x136
[   30.451516]  [<ffffffff81156866>] blk_sq_make_request+0xac/0x173
[   30.452322]  [<ffffffff8114db56>] generic_make_request+0xb8/0x15b
[   30.453038]  [<ffffffffa012ba65>] dmcrypt_write+0x13b/0x174 [dm_crypt]
[   30.453852]  [<ffffffff81052779>] ? wake_up_q+0x42/0x42
[   30.454508]  [<ffffffffa012b92a>] ? crypt_iv_tcw_dtr+0x62/0x62 [dm_crypt]
[   30.455369]  [<ffffffff8104dc6a>] kthread+0xa0/0xa8
[   30.456041]  [<ffffffff8104dc6a>] ? kthread+0xa0/0xa8
[   30.456688]  [<ffffffff8127999f>] ret_from_fork+0x1f/0x40
[   30.457396]  [<ffffffff8104dbca>] ? init_completion+0x24/0x24

Mikulas

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-03 12:53                       ` Mikulas Patocka
@ 2016-08-03 14:34                         ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-08-03 14:34 UTC (permalink / raw)
  To: Mikulas Patocka, Mel Gorman
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton

On Wed 03-08-16 08:53:25, Mikulas Patocka wrote:
> 
> 
> On Thu, 28 Jul 2016, Michal Hocko wrote:
> 
> > > >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> > > >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> > > >> need all those 26).
> > > >
> > > > Well, maybe we are able to remove those hacks, I definitely wouldn't
> > > > be opposed.  But right now I am not even convinced that the mempool
> > > > specific gfp flags is the right way to go.
> > > 
> > > I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> > > transient-allocation gfp flag, which would be quite useful for mempool.
> > > 
> > > Can you give more details on why using a gfp flag isn't your first choice
> > > for guiding what happens when the system is trying to get a free page
> > > :-?
> > 
> > If we get rid of throttle_vm_writeout then I guess it might turn out to
> > be unnecessary. There are other places which will still throttle but I
> > believe those should be kept regardless of who is doing the allocation
> > > because they are helping keep the LRU scanning sane. I might be wrong here
> > and bailing out from the reclaim rather than waiting would turn out
> > better for some users but I would like to see whether the first approach
> > works reasonably well.
> 
> If we are swapping to a dm-crypt device, the dm-crypt device is congested 
> and the underlying block device is not congested, we should not throttle 
> mempool allocations made from the dm-crypt workqueue. Not even a little 
> bit.

But the device congestion is not the only condition required for the
throttling. The pgdat also has to be marked congested, which means that
the LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
tail of the LRU. That should only happen if we are rotating LRUs too
quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
situation.

> So, I think, mempool_alloc should set PF_NO_THROTTLE (or 
> __GFP_NO_THROTTLE).

As I've said earlier, that would probably require bailing out from the
reclaim if we detect a potential pgdat congestion. What do you think
Mel?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-03 13:59                     ` Mikulas Patocka
@ 2016-08-03 14:42                       ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-08-03 14:42 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton

On Wed 03-08-16 09:59:11, Mikulas Patocka wrote:
> 
> 
> On Wed, 27 Jul 2016, Michal Hocko wrote:
> 
> > On Wed 27-07-16 10:28:40, Mikulas Patocka wrote:
[...]
> > > I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> > > and that mempool allocations should never be throttled.
> > 
> > I'm not really sure this is the right approach. If a particular mempool
> > user cannot ever be throttled by the page allocator then it should
> > perform GFP_NOWAIT.
> 
> Then, all block device drivers should have GFP_NOWAIT - which means that 
> we can as well make it default.
> 
> But GFP_NOWAIT also disables direct reclaim. We really want direct reclaim 
> when allocating from mempool - we just don't want to throttle due to block 
> device congestion.
> 
> We could use __GFP_NORETRY as an indication that we don't want to sleep - 
> or make a new flag __GFP_NO_THROTTLE.

__GFP_NORETRY is used for other contexts so it is not suitable.
__GFP_NO_THROTTLE would be possible but I would still prefer if we
didn't go that way unless really necessary.

> > Even mempool allocations shouldn't allow reclaim to
> > scan pages too quickly even when LRU lists are full of dirty pages. But
> > as I've said that would restrict the success rates even under light page
> > cache load. Throttling on the wait_iff_congested should be quite rare.
> > 
> > Anyway do you see an excessive throttling with the patch posted
> > http://lkml.kernel.org/r/20160725192344.GD2166@dhcp22.suse.cz ? Or from
> 
> It didn't have much effect.
> 
> Since the patch 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb (revert of the 
> limitless mempool allocations), swapping to dm-crypt works in the simple 
> example.

OK. Do you see any throttling due to wait_iff_congested? The
writeback_wait_iff_congested trace point should help here. If not, maybe
we should start with the above patch and see how it works in practice.
If there is still excessive and unexpected throttling, then we
should move on to a solution specific to mempool/block layer users.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-03 14:42                       ` Michal Hocko
@ 2016-08-04 18:46                         ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-04 18:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: NeilBrown, Tetsuo Handa, LKML, linux-mm, dm-devel, Mel Gorman,
	David Rientjes, Ondrej Kozina, Andrew Morton



On Wed, 3 Aug 2016, Michal Hocko wrote:

> > > Even mempool allocations shouldn't allow reclaim to
> > > scan pages too quickly even when LRU lists are full of dirty pages. But
> > > as I've said that would restrict the success rates even under light page
> > > cache load. Throttling on the wait_iff_congested should be quite rare.
> > > 
> > > Anyway do you see an excessive throttling with the patch posted
> > > http://lkml.kernel.org/r/20160725192344.GD2166@dhcp22.suse.cz ? Or from
> > 
> > It didn't have much effect.
> > 
> > Since the patch 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb (revert of the 
> > limitless mempool allocations), swapping to dm-crypt works in the simple 
> > example.
> 
> OK. Do you see any throttling due to wait_iff_congested?

No, but I've seen occasional stalls of mempool allocations in 
throttle_vm_writeout - but the patch that removed throttle_vm_writeout 
didn't improve overall speed, so the stalls were only minor.

> writeback_wait_iff_congested trace point should help here. If not maybe
> we should start with the above patch and see how it works in practise.
> If the there is still an excessive and unexpected throttling then we
> should move on to a more mempool/block layer users specific solution.

Currently, dm-crypt reports the device congested only if the underlying 
block device is congested.

But as others suggested, dm-crypt should report congested status if it is
clogged due to slow encryption progress - and in that case you should not
throttle mempool allocations (because such throttling would decrease
encryption speed even more).

Mikulas

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread


* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-03 14:34                         ` Michal Hocko
@ 2016-08-04 18:49                           ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-04 18:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton



On Wed, 3 Aug 2016, Michal Hocko wrote:

> On Wed 03-08-16 08:53:25, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 28 Jul 2016, Michal Hocko wrote:
> > 
> > > > >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> > > > >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> > > > >> need all those 26).
> > > > >
> > > > > Well, maybe we are able to remove those hacks, I definitely wouldn't
> > > > > be opposed.  But right now I am not even convinced that the mempool
> > > > > specific gfp flags is the right way to go.
> > > > 
> > > > I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> > > > transient-allocation gfp flag, which would be quite useful for mempool.
> > > > 
> > > > Can you give more details on why using a gfp flag isn't your first choice
> > > > for guiding what happens when the system is trying to get a free page
> > > > :-?
> > > 
> > > If we get rid of throttle_vm_writeout then I guess it might turn out to
> > > be unnecessary. There are other places which will still throttle but I
> > > believe those should be kept regardless of who is doing the allocation
> > > because they are helping keep the LRU scanning sane. I might be wrong here
> > > and bailing out from the reclaim rather than waiting would turn out
> > > better for some users but I would like to see whether the first approach
> > > works reasonably well.
> > 
> > If we are swapping to a dm-crypt device, the dm-crypt device is congested 
> > and the underlying block device is not congested, we should not throttle 
> > mempool allocations made from the dm-crypt workqueue. Not even a little 
> > bit.
> 
> But the device congestion is not the only condition required for the
> throttling. The pgdat has also be marked congested which means that the
> LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> tail of the LRU. That should only happen if we are rotating LRUs too
> quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> situation.

The obvious problem here is that mempool allocations should sleep in
mempool_alloc() on &pool->wait (until someone returns some entries into
the mempool); they should not sleep inside the page allocator.

Mikulas

> > So, I think, mempool_alloc should set PF_NO_THROTTLE (or 
> > __GFP_NO_THROTTLE).
> 
> As I've said earlier that would probably require to bail out from the
> reclaim if we detect a potential pgdat congestion. What do you think
> Mel?
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
@ 2016-08-04 18:49                           ` Mikulas Patocka
  0 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-04 18:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com David Rientjes, Ondrej Kozina, Andrew Morton



On Wed, 3 Aug 2016, Michal Hocko wrote:

> On Wed 03-08-16 08:53:25, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 28 Jul 2016, Michal Hocko wrote:
> > 
> > > > >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> > > > >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> > > > >> need all those 26).
> > > > >
> > > > > Well, maybe we are able to remove those hacks, I wouldn't definitely
> > > > > be opposed.  But right now I am not even convinced that the mempool
> > > > > specific gfp flags is the right way to go.
> > > > 
> > > > I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> > > > transient-allocation gfp flag, which would be quite useful for mempool.
> > > > 
> > > > Can you give more details on why using a gfp flag isn't your first choice
> > > > for guiding what happens when the system is trying to get a free page
> > > > :-?
> > > 
> > > If we get rid of throttle_vm_writeout then I guess it might turn out to
> > > be unnecessary. There are other places which will still throttle but I
> > > believe those should be kept regardless of who is doing the allocation
> > > because they are helping the LRU scanning sane. I might be wrong here
> > > and bailing out from the reclaim rather than waiting would turn out
> > > better for some users but I would like to see whether the first approach
> > > works reasonably well.
> > 
> > If we are swapping to a dm-crypt device, the dm-crypt device is congested 
> > and the underlying block device is not congested, we should not throttle 
> > mempool allocations made from the dm-crypt workqueue. Not even a little 
> > bit.
> 
> But the device congestion is not the only condition required for the
> throttling. The pgdat has also be marked congested which means that the
> LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> tail of the LRU. That should only happen if we are rotating LRUs too
> quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> situation.

The obvious problem here is that mempool allocations should sleep in 
mempool_alloc() on &pool->wait (until someone returns some entries into 
the mempool); they should not sleep inside the page allocator.
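To make the intended order concrete, here is a minimal single-threaded userspace model of it (all names are invented for the sketch; the real implementation is mempool_alloc() in mm/mempool.c, which sleeps on &pool->wait at the point where this sketch returns NULL):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 4

struct mini_pool {
	void *elements[POOL_SIZE];	/* preallocated reserve */
	int curr_nr;			/* reserve entries still available */
};

static int reserve_objs[POOL_SIZE];

static void mini_pool_init(struct mini_pool *pool)
{
	pool->curr_nr = POOL_SIZE;
	for (int i = 0; i < POOL_SIZE; i++)
		pool->elements[i] = &reserve_objs[i];
}

/* Stand-in for the underlying allocator; NULL simulates an opportunistic
 * (__GFP_NORETRY-style) allocation failing under memory pressure. */
static void *alloc_fn(bool memory_tight)
{
	static int heap_obj;

	return memory_tight ? NULL : (void *)&heap_obj;
}

/*
 * Fallback order only: try the allocator first, then the reserve.  Where
 * this sketch returns NULL, the real mempool_alloc() would sleep on
 * &pool->wait until mempool_free() refills the reserve -- it should not
 * stall inside the page allocator.
 */
static void *mini_pool_alloc(struct mini_pool *pool, bool memory_tight)
{
	void *elem = alloc_fn(memory_tight);

	if (elem)
		return elem;
	if (pool->curr_nr > 0)
		return pool->elements[--pool->curr_nr];
	return NULL;			/* caller would wait here */
}
```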

Mikulas

> > So, I think, mempool_alloc should set PF_NO_THROTTLE (or 
> > __GFP_NO_THROTTLE).
> 
> As I've said earlier that would probably require to bail out from the
> reclaim if we detect a potential pgdat congestion. What do you think
> Mel?
> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-04 18:49                           ` Mikulas Patocka
@ 2016-08-12 12:32                             ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-08-12 12:32 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton

On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> 
> 
> On Wed, 3 Aug 2016, Michal Hocko wrote:
> 
> > On Wed 03-08-16 08:53:25, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Thu, 28 Jul 2016, Michal Hocko wrote:
> > > 
> > > > > >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> > > > > >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> > > > > >> need all those 26).
> > > > > >
> > > > > > Well, maybe we are able to remove those hacks, I wouldn't definitely
> > > > > > be opposed.  But right now I am not even convinced that the mempool
> > > > > > specific gfp flags is the right way to go.
> > > > > 
> > > > > I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> > > > > transient-allocation gfp flag, which would be quite useful for mempool.
> > > > > 
> > > > > Can you give more details on why using a gfp flag isn't your first choice
> > > > > for guiding what happens when the system is trying to get a free page
> > > > > :-?
> > > > 
> > > > If we get rid of throttle_vm_writeout then I guess it might turn out to
> > > > be unnecessary. There are other places which will still throttle but I
> > > > believe those should be kept regardless of who is doing the allocation
> > > > because they are helping the LRU scanning sane. I might be wrong here
> > > > and bailing out from the reclaim rather than waiting would turn out
> > > > better for some users but I would like to see whether the first approach
> > > > works reasonably well.
> > > 
> > > If we are swapping to a dm-crypt device, the dm-crypt device is congested 
> > > and the underlying block device is not congested, we should not throttle 
> > > mempool allocations made from the dm-crypt workqueue. Not even a little 
> > > bit.
> > 
> > But the device congestion is not the only condition required for the
> > throttling. The pgdat has also be marked congested which means that the
> > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > tail of the LRU. That should only happen if we are rotating LRUs too
> > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > situation.
> 
> The obvious problem here is that mempool allocations should sleep in 
> mempool_alloc() on &pool->wait (until someone returns some entries into 
> the mempool), they should not sleep inside the page allocator.

I agree that mempool_alloc should _primarily_ sleep on its own
throttling mechanism. I am not questioning that. I am just saying that
the page allocator has its own throttling which it relies on and which
cannot simply be ignored, because that might have other undesirable side
effects. So if the right approach really is to never throttle certain
requests, then we have to bail out from congested nodes/zones as soon
as the congestion is detected.

Now, I would like to see that something like that is _really_ necessary.
I believe we should simply start with the easier part and get rid of
throttle_vm_writeout, because that seems like a leftover from the past.
If that turns out to be unsatisfactory and we have a clear picture of
when the throttling is harmful/suboptimal, then we can move on to a more
complex solution. Does this sound like a way forward?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-12 12:32                             ` Michal Hocko
@ 2016-08-13 17:34                               ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-13 17:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton



On Fri, 12 Aug 2016, Michal Hocko wrote:

> On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> 
> > On Wed, 3 Aug 2016, Michal Hocko wrote:
> > 
> > > But the device congestion is not the only condition required for the
> > > throttling. The pgdat has also be marked congested which means that the
> > > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > > tail of the LRU. That should only happen if we are rotating LRUs too
> > > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > > situation.
> > 
> > The obvious problem here is that mempool allocations should sleep in 
> > mempool_alloc() on &pool->wait (until someone returns some entries into 
> > the mempool), they should not sleep inside the page allocator.
> 
> I agree that mempool_alloc should _primarily_ sleep on their own
> throttling mechanism. I am not questioning that. I am just saying that
> the page allocator has its own throttling which it relies on and that
> cannot be just ignored because that might have other undesirable side
> effects. So if the right approach is really to never throttle certain
> requests then we have to bail out from a congested nodes/zones as soon
> as the congestion is detected.
> 
> Now, I would like to see that something like that is _really_ necessary.

Currently, it is not a problem - device mapper reports the device as 
congested only if the underlying physical disks are congested.

But once we change it so that device mapper reports congested state on its 
own (when it has too many bios in progress), this becomes a problem.

I would add PF_NO_THROTTLE or __GFP_NO_THROTTLE to mempool_alloc.

Or - we can prevent the memory reclaim from throttling if we see both 
__GFP_NOMEMALLOC and __GFP_NORETRY - that would be sufficient to detect 
mempool_alloc usage and it wouldn't hurt other __GFP_NORETRY users.
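A sketch of that detection heuristic (the flag bit values below are placeholders for illustration only; the real definitions live in include/linux/gfp.h, and where reclaim would consult such a check is exactly what is under discussion):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Placeholder bit values; real definitions are in include/linux/gfp.h. */
#define __GFP_NOMEMALLOC	0x1u
#define __GFP_NORETRY		0x2u
#define __GFP_NOWARN		0x4u

/*
 * The heuristic proposed above: an allocation carrying both
 * __GFP_NOMEMALLOC and __GFP_NORETRY looks like a mempool-style
 * opportunistic request, so reclaim could skip throttling for it and
 * fail fast instead of sleeping.
 */
static bool gfp_mempool_like(gfp_t gfp_mask)
{
	const gfp_t mask = __GFP_NOMEMALLOC | __GFP_NORETRY;

	return (gfp_mask & mask) == mask;
}
```

A lone __GFP_NORETRY (common for plain opportunistic allocations) deliberately does not match, so other __GFP_NORETRY users keep their current behaviour.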

Mikulas

> I believe that we should simply start with easier part and get rid of
> throttle_vm_writeout because that seems like a left over from the past.
> If that turns out unsatisfactory and we have clear picture when the
> throttling is harmful/suboptimal then we can move on with a more complex
> solution. Does this sound like a way forward?
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-13 17:34                               ` Mikulas Patocka
@ 2016-08-14 10:34                                 ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-08-14 10:34 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton

On Sat 13-08-16 13:34:29, Mikulas Patocka wrote:
> 
> 
> On Fri, 12 Aug 2016, Michal Hocko wrote:
> 
> > On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> > 
> > > On Wed, 3 Aug 2016, Michal Hocko wrote:
> > > 
> > > > But the device congestion is not the only condition required for the
> > > > throttling. The pgdat has also be marked congested which means that the
> > > > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > > > tail of the LRU. That should only happen if we are rotating LRUs too
> > > > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > > > situation.
> > > 
> > > The obvious problem here is that mempool allocations should sleep in 
> > > mempool_alloc() on &pool->wait (until someone returns some entries into 
> > > the mempool), they should not sleep inside the page allocator.
> > 
> > I agree that mempool_alloc should _primarily_ sleep on their own
> > throttling mechanism. I am not questioning that. I am just saying that
> > the page allocator has its own throttling which it relies on and that
> > cannot be just ignored because that might have other undesirable side
> > effects. So if the right approach is really to never throttle certain
> > requests then we have to bail out from a congested nodes/zones as soon
> > as the congestion is detected.
> > 
> > Now, I would like to see that something like that is _really_ necessary.
> 
> Currently, it is not a problem - device mapper reports the device as 
> congested only if the underlying physical disks are congested.
> 
> But once we change it so that device mapper reports congested state on its 
> own (when it has too many bios in progress), this starts being a problem.

OK, can we wait until it becomes a real problem and solve it
appropriately then?

In the meantime I will repost the patch which removes
throttle_vm_writeout, as it doesn't seem to be needed anymore.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-14 10:34                                 ` Michal Hocko
@ 2016-08-15 16:15                                   ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-08-15 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton



On Sun, 14 Aug 2016, Michal Hocko wrote:

> On Sat 13-08-16 13:34:29, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 12 Aug 2016, Michal Hocko wrote:
> > 
> > > On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> > > 
> > > > On Wed, 3 Aug 2016, Michal Hocko wrote:
> > > > 
> > > > > But the device congestion is not the only condition required for the
> > > > > throttling. The pgdat has also be marked congested which means that the
> > > > > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > > > > tail of the LRU. That should only happen if we are rotating LRUs too
> > > > > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > > > > situation.
> > > > 
> > > > The obvious problem here is that mempool allocations should sleep in 
> > > > mempool_alloc() on &pool->wait (until someone returns some entries into 
> > > > the mempool), they should not sleep inside the page allocator.
> > > 
> > > I agree that mempool_alloc should _primarily_ sleep on their own
> > > throttling mechanism. I am not questioning that. I am just saying that
> > > the page allocator has its own throttling which it relies on and that
> > > cannot be just ignored because that might have other undesirable side
> > > effects. So if the right approach is really to never throttle certain
> > > requests then we have to bail out from a congested nodes/zones as soon
> > > as the congestion is detected.
> > > 
> > > Now, I would like to see that something like that is _really_ necessary.
> > 
> > Currently, it is not a problem - device mapper reports the device as 
> > congested only if the underlying physical disks are congested.
> > 
> > But once we change it so that device mapper reports congested state on its 
> > own (when it has too many bios in progress), this starts being a problem.
> 
> OK, can we wait until it starts becoming a real problem and solve it
> appropriately then?

I don't like the idea of deliberately introducing code into device mapper 
that triggers this bug, waiting until some user hits it, and only then 
fixing it.

If the VM throttles mempool allocations when the swap device is congested, 
then I won't report the device as congested in the device mapper.

Mikulas

> I will repost the patch which removes thottle_vm_pageout in the meantime
> as it doesn't seem to be needed anymore.
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-08-14 10:34                                 ` Michal Hocko
@ 2016-11-23 21:11                                   ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-11-23 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton,
	Douglas Anderson, shli, Dmitry Torokhov



On Sun, 14 Aug 2016, Michal Hocko wrote:

> On Sat 13-08-16 13:34:29, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 12 Aug 2016, Michal Hocko wrote:
> > 
> > > On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> > > 
> > > > On Wed, 3 Aug 2016, Michal Hocko wrote:
> > > > 
> > > > > But the device congestion is not the only condition required for the
> > > > > throttling. The pgdat has also be marked congested which means that the
> > > > > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > > > > tail of the LRU. That should only happen if we are rotating LRUs too
> > > > > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > > > > situation.
> > > > 
> > > > The obvious problem here is that mempool allocations should sleep in 
> > > > mempool_alloc() on &pool->wait (until someone returns some entries into 
> > > > the mempool), they should not sleep inside the page allocator.
> > > 
> > > I agree that mempool_alloc should _primarily_ sleep on their own
> > > throttling mechanism. I am not questioning that. I am just saying that
> > > the page allocator has its own throttling which it relies on and that
> > > cannot be just ignored because that might have other undesirable side
> > > effects. So if the right approach is really to never throttle certain
> > > requests then we have to bail out from a congested nodes/zones as soon
> > > as the congestion is detected.
> > > 
> > > Now, I would like to see that something like that is _really_ necessary.
> > 
> > Currently, it is not a problem - device mapper reports the device as 
> > congested only if the underlying physical disks are congested.
> > 
> > But once we change it so that device mapper reports congested state on its 
> > own (when it has too many bios in progress), this starts being a problem.
> 
> OK, can we wait until it starts becoming a real problem and solve it
> appropriately then?
> 
> I will repost the patch which removes thottle_vm_pageout in the meantime
> as it doesn't seem to be needed anymore.
> 
> -- 
> Michal Hocko
> SUSE Labs

Hi Michal

So, here Google developers hit a stack trace where a block device driver is 
being throttled in memory management:

https://www.redhat.com/archives/dm-devel/2016-November/msg00158.html

The dm-bufio layer is something like a buffer cache, used by block device 
drivers. Unlike the real buffer cache, dm-bufio guarantees forward 
progress even if there is no free memory.

dm-bufio does something similar to a mempool allocation: it tries an 
allocation with GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN 
(just like a mempool) and, if that fails, it reuses some existing buffer.
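That try-then-reuse order can be modelled in a few lines of userspace C (all names here are invented for the sketch; the real logic lives in drivers/md/dm-bufio.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUF 2

struct buffer {
	bool in_use;
};

static struct buffer cached[NBUF];	/* buffers dm-bufio already holds */

/* Stand-in for the opportunistic GFP_NOIO | __GFP_NORETRY |
 * __GFP_NOMEMALLOC | __GFP_NOWARN allocation: may fail, never blocks. */
static struct buffer *try_alloc_buffer(bool memory_tight)
{
	static struct buffer fresh;

	return memory_tight ? NULL : &fresh;
}

/* Reuse one of our own cached buffers: forward progress is guaranteed
 * without asking the page allocator for anything. */
static struct buffer *reuse_cached_buffer(void)
{
	for (int i = 0; i < NBUF; i++) {
		if (!cached[i].in_use) {
			cached[i].in_use = true;
			return &cached[i];
		}
	}
	return NULL;
}

static struct buffer *bufio_new(bool memory_tight)
{
	struct buffer *b = try_alloc_buffer(memory_tight);

	return b ? b : reuse_cached_buffer();
}
```

The point of the trace below is that the opportunistic allocation step itself got throttled in reclaim, defeating the never-blocks intent of this scheme.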

Here, they caught it being throttled in memory management:

   Workqueue: kverityd verity_prefetch_io
   __switch_to+0x9c/0xa8
   __schedule+0x440/0x6d8
   schedule+0x94/0xb4
   schedule_timeout+0x204/0x27c
   schedule_timeout_uninterruptible+0x44/0x50
   wait_iff_congested+0x9c/0x1f0
   shrink_inactive_list+0x3a0/0x4cc
   shrink_lruvec+0x418/0x5cc
   shrink_zone+0x88/0x198
   try_to_free_pages+0x51c/0x588
   __alloc_pages_nodemask+0x648/0xa88
   __get_free_pages+0x34/0x7c
   alloc_buffer+0xa4/0x144
   __bufio_new+0x84/0x278
   dm_bufio_prefetch+0x9c/0x154
   verity_prefetch_io+0xe8/0x10c
   process_one_work+0x240/0x424
   worker_thread+0x2fc/0x424
   kthread+0x10c/0x114

Will you consider removing VM throttling for __GFP_NORETRY allocations?

Mikulas

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
@ 2016-11-23 21:11                                   ` Mikulas Patocka
  0 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-11-23 21:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com David Rientjes, Ondrej Kozina, Andrew Morton,
	Douglas Anderson, shli, Dmitry Torokhov



On Sun, 14 Aug 2016, Michal Hocko wrote:

> On Sat 13-08-16 13:34:29, Mikulas Patocka wrote:
> > 
> > 
> > On Fri, 12 Aug 2016, Michal Hocko wrote:
> > 
> > > On Thu 04-08-16 14:49:41, Mikulas Patocka wrote:
> > > 
> > > > On Wed, 3 Aug 2016, Michal Hocko wrote:
> > > > 
> > > > > But the device congestion is not the only condition required for the
> > > > > throttling. The pgdat has also be marked congested which means that the
> > > > > LRU page scanner bumped into dirty/writeback/pg_reclaim pages at the
> > > > > tail of the LRU. That should only happen if we are rotating LRUs too
> > > > > quickly. AFAIU the reclaim shouldn't allow free ticket scanning in that
> > > > > situation.
> > > > 
> > > > The obvious problem here is that mempool allocations should sleep in 
> > > > mempool_alloc() on &pool->wait (until someone returns some entries into 
> > > > the mempool), they should not sleep inside the page allocator.
> > > 
> > > I agree that mempool_alloc should _primarily_ sleep on their own
> > > throttling mechanism. I am not questioning that. I am just saying that
> > > the page allocator has its own throttling which it relies on and that
> > > cannot be just ignored because that might have other undesirable side
> > > effects. So if the right approach is really to never throttle certain
> > > requests then we have to bail out from a congested nodes/zones as soon
> > > as the congestion is detected.
> > > 
> > > Now, I would like to see that something like that is _really_ necessary.
> > 
> > Currently, it is not a problem - device mapper reports the device as 
> > congested only if the underlying physical disks are congested.
> > 
> > But once we change it so that device mapper reports congested state on its 
> > own (when it has too many bios in progress), this starts being a problem.
> 
> OK, can we wait until it starts becoming a real problem and solve it
> appropriately then?
> 
> I will repost the patch which removes throttle_vm_writeout in the meantime
> as it doesn't seem to be needed anymore.
> 
> -- 
> Michal Hocko
> SUSE Labs

Hi Michal

So, here Google developers hit a stack trace where a block device driver is 
being throttled by memory management:

https://www.redhat.com/archives/dm-devel/2016-November/msg00158.html

The dm-bufio layer is something like a buffer cache, used by block device 
drivers. Unlike the real buffer cache, dm-bufio guarantees forward 
progress even if there is no free memory.

dm-bufio does something similar to a mempool allocation: it tries an 
allocation with GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN 
(just like a mempool) and, if that fails, it reuses some existing buffer.

Here, they caught it being throttled in memory management:

   Workqueue: kverityd verity_prefetch_io
   __switch_to+0x9c/0xa8
   __schedule+0x440/0x6d8
   schedule+0x94/0xb4
   schedule_timeout+0x204/0x27c
   schedule_timeout_uninterruptible+0x44/0x50
   wait_iff_congested+0x9c/0x1f0
   shrink_inactive_list+0x3a0/0x4cc
   shrink_lruvec+0x418/0x5cc
   shrink_zone+0x88/0x198
   try_to_free_pages+0x51c/0x588
   __alloc_pages_nodemask+0x648/0xa88
   __get_free_pages+0x34/0x7c
   alloc_buffer+0xa4/0x144
   __bufio_new+0x84/0x278
   dm_bufio_prefetch+0x9c/0x154
   verity_prefetch_io+0xe8/0x10c
   process_one_work+0x240/0x424
   worker_thread+0x2fc/0x424
   kthread+0x10c/0x114

Will you consider removing vm throttling for __GFP_NORETRY allocations?

Mikulas

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-11-23 21:11                                   ` Mikulas Patocka
@ 2016-11-24 13:29                                     ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-11-24 13:29 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton,
	Douglas Anderson, shli, Dmitry Torokhov

On Wed 23-11-16 16:11:59, Mikulas Patocka wrote:
[...]
> Hi Michal
> 
> So, here Google developers hit a stacktrace where a block device driver is 
> being throttled in the memory management:
> 
> https://www.redhat.com/archives/dm-devel/2016-November/msg00158.html
> 
> dm-bufio layer is something like a buffer cache, used by block device 
> drivers. Unlike the real buffer cache, dm-bufio guarantees forward 
> progress even if there is no memory free.
> 
> dm-bufio does something similar like a mempool allocation, it tries an 
> allocation with GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN 
> (just like a mempool) and if it fails, it will reuse some existing buffer.
> 
> Here, they caught it being throttled in the memory management:
> 
>    Workqueue: kverityd verity_prefetch_io
>    __switch_to+0x9c/0xa8
>    __schedule+0x440/0x6d8
>    schedule+0x94/0xb4
>    schedule_timeout+0x204/0x27c
>    schedule_timeout_uninterruptible+0x44/0x50
>    wait_iff_congested+0x9c/0x1f0
>    shrink_inactive_list+0x3a0/0x4cc
>    shrink_lruvec+0x418/0x5cc
>    shrink_zone+0x88/0x198
>    try_to_free_pages+0x51c/0x588
>    __alloc_pages_nodemask+0x648/0xa88
>    __get_free_pages+0x34/0x7c
>    alloc_buffer+0xa4/0x144
>    __bufio_new+0x84/0x278
>    dm_bufio_prefetch+0x9c/0x154
>    verity_prefetch_io+0xe8/0x10c
>    process_one_work+0x240/0x424
>    worker_thread+0x2fc/0x424
>    kthread+0x10c/0x114
> 
> Will you consider removing vm throttling for __GFP_NORETRY allocations?

As I've already said before, I do not think that tweaking __GFP_NORETRY
is the right approach. The whole point of the flag is to not loop in the
_allocator_; it has nothing to do with the reclaim and the way it does
throttling.

On the other hand I perfectly understand your point, and the lack of
anything between GFP_NOWAIT and ___GFP_DIRECT_RECLAIM can be a bit
frustrating. It would be nice to have some middle ground - only a
light reclaim involved and a quick back off if the memory is harder to
reclaim. That is a hard thing to do, though, because all the reclaimers
(including slab shrinkers) would have to be aware of this concept to
work properly.

I have read the report from the link above and I am really wondering why
s@GFP_NOIO@GFP_NOWAIT@ is not the right way to go there. You have argued
that a clean page cache would force buffer reuse. That might be true
to some extent, but is it a real problem? Please note that even
GFP_NOWAIT allocations will wake up kswapd, which should clean up that
clean page cache in the background. I would even expect kswapd to be
active at the time when NOWAIT requests hit the min watermark. If that
is not the case then we should probably think about why kswapd is not
proactive enough rather than tweaking the __GFP_NORETRY semantics.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-11-24 13:29                                     ` Michal Hocko
@ 2016-11-24 17:10                                       ` Mikulas Patocka
  -1 siblings, 0 replies; 102+ messages in thread
From: Mikulas Patocka @ 2016-11-24 17:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton,
	Douglas Anderson, shli, Dmitry Torokhov



On Thu, 24 Nov 2016, Michal Hocko wrote:

> On Wed 23-11-16 16:11:59, Mikulas Patocka wrote:
> [...]
> > Hi Michal
> > 
> > So, here Google developers hit a stacktrace where a block device driver is 
> > being throttled in the memory management:
> > 
> > https://www.redhat.com/archives/dm-devel/2016-November/msg00158.html
> > 
> > dm-bufio layer is something like a buffer cache, used by block device 
> > drivers. Unlike the real buffer cache, dm-bufio guarantees forward 
> > progress even if there is no memory free.
> > 
> > dm-bufio does something similar like a mempool allocation, it tries an 
> > allocation with GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN 
> > (just like a mempool) and if it fails, it will reuse some existing buffer.
> > 
> > Here, they caught it being throttled in the memory management:
> > 
> >    Workqueue: kverityd verity_prefetch_io
> >    __switch_to+0x9c/0xa8
> >    __schedule+0x440/0x6d8
> >    schedule+0x94/0xb4
> >    schedule_timeout+0x204/0x27c
> >    schedule_timeout_uninterruptible+0x44/0x50
> >    wait_iff_congested+0x9c/0x1f0
> >    shrink_inactive_list+0x3a0/0x4cc
> >    shrink_lruvec+0x418/0x5cc
> >    shrink_zone+0x88/0x198
> >    try_to_free_pages+0x51c/0x588
> >    __alloc_pages_nodemask+0x648/0xa88
> >    __get_free_pages+0x34/0x7c
> >    alloc_buffer+0xa4/0x144
> >    __bufio_new+0x84/0x278
> >    dm_bufio_prefetch+0x9c/0x154
> >    verity_prefetch_io+0xe8/0x10c
> >    process_one_work+0x240/0x424
> >    worker_thread+0x2fc/0x424
> >    kthread+0x10c/0x114
> > 
> > Will you consider removing vm throttling for __GFP_NORETRY allocations?
> 
> As I've already said before, I do not think that tweaking __GFP_NORETRY
> is the right approach. The whole point of the flag is to not loop in
> the _allocator_; it has nothing to do with the reclaim and the way it
> does throttling.
> 
> On the other hand I perfectly understand your point and a lack of
> anything between GFP_NOWAIT and ___GFP_DIRECT_RECLAIM can be a bit
> frustrating. It would be nice to have some middle ground - only a
> light reclaim involved and a quick back off if the memory is harder to
> reclaim. That is a hard thing to do, though because all the reclaimers
> (including slab shrinkers) would have to be aware of this concept to
> work properly.
> 
> I have read the report from the link above and I am really wondering why
> s@GFP_NOIO@GFP_NOWAIT@ is not the right way to go there. You have argued
> about a clean page cache would force buffer reuse. That might be true
> to some extent but is it a real problem?

The dm-bufio cache is limited by default to 2% of all memory, and the 
buffers are freed after 5 minutes of not being used.

It is unfair to reclaim the small dm-bufio cache (which was used recently) 
instead of the big page cache (which could be arbitrarily old).

> Please note that even
> GFP_NOWAIT allocations will wake up kswapd, which should clean up that

The mempool is also using GFP_NOIO allocations - so are you claiming that it 
should not use GFP_NOIO either?

You should provide a clear API that block device drivers should use to 
allocate memory, rather than applying band-aids to vm throttling problems as 
they are discovered.

> clean page cache in the background. I would even expect kswapd being
> active at the time when NOWAIT requests hit the min watermark. If that
> is not the case then we should probably think about why kswapd is not
> proactive enough rather than tweaking __GFP_NORETRY semantic.
> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs

Mikulas

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks
  2016-11-24 17:10                                       ` Mikulas Patocka
@ 2016-11-28 14:06                                         ` Michal Hocko
  -1 siblings, 0 replies; 102+ messages in thread
From: Michal Hocko @ 2016-11-28 14:06 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mel Gorman, NeilBrown, Tetsuo Handa, LKML, linux-mm,
	dm-devel@redhat.com, David Rientjes, Ondrej Kozina, Andrew Morton,
	Douglas Anderson, shli, Dmitry Torokhov

On Thu 24-11-16 12:10:08, Mikulas Patocka wrote:
> 
> 
> On Thu, 24 Nov 2016, Michal Hocko wrote:
[...]
> > Please note that even
> > GFP_NOWAIT allocations will wake up kspwad which should clean up that
> 
> The mempool is also using GFP_NOIO allocations - so do you claim that it 
> should not use GFP_NOIO too?

No, I am not claiming that. The last time I asked, the throttling didn't
seem serious enough to cause any problems. If the memory reclaim
throttling really is serious enough then let's measure and evaluate it.

> You should provide a clear API that the block device drivers should use to 
> allocate memory - not to apply band aid to vm throttling problems as they 
> are being discovered.

This is easier said than done, I am afraid. We have been using GFP_NOIO
in mempool for years and there were no major complaints.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2016-11-28 14:06 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-18  8:39 [RFC PATCH 0/2] mempool vs. page allocator interaction Michal Hocko
2016-07-18  8:39 ` Michal Hocko
2016-07-18  8:41 ` [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path Michal Hocko
2016-07-18  8:41   ` Michal Hocko
2016-07-18  8:41   ` [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks Michal Hocko
2016-07-18  8:41     ` Michal Hocko
2016-07-19 21:50     ` Mikulas Patocka
2016-07-19 21:50       ` Mikulas Patocka
2016-07-22  8:46     ` NeilBrown
2016-07-22  9:04       ` NeilBrown
2016-07-22  9:15       ` Michal Hocko
2016-07-22  9:15         ` Michal Hocko
2016-07-23  0:12         ` NeilBrown
2016-07-25  8:32           ` Michal Hocko
2016-07-25  8:32             ` Michal Hocko
2016-07-25 19:23             ` Michal Hocko
2016-07-25 19:23               ` Michal Hocko
2016-07-25 19:23               ` Michal Hocko
2016-07-26  7:07               ` Michal Hocko
2016-07-26  7:07                 ` Michal Hocko
2016-07-27  3:43             ` [dm-devel] " NeilBrown
2016-07-27 18:24               ` Michal Hocko
2016-07-27 18:24                 ` Michal Hocko
2016-07-27 21:33                 ` NeilBrown
2016-07-28  7:17                   ` Michal Hocko
2016-07-28  7:17                     ` Michal Hocko
2016-08-03 12:53                     ` Mikulas Patocka
2016-08-03 12:53                       ` Mikulas Patocka
2016-08-03 14:34                       ` Michal Hocko
2016-08-03 14:34                         ` Michal Hocko
2016-08-04 18:49                         ` Mikulas Patocka
2016-08-04 18:49                           ` Mikulas Patocka
2016-08-12 12:32                           ` Michal Hocko
2016-08-12 12:32                             ` Michal Hocko
2016-08-13 17:34                             ` Mikulas Patocka
2016-08-13 17:34                               ` Mikulas Patocka
2016-08-14 10:34                               ` Michal Hocko
2016-08-14 10:34                                 ` Michal Hocko
2016-08-15 16:15                                 ` Mikulas Patocka
2016-08-15 16:15                                   ` Mikulas Patocka
2016-11-23 21:11                                 ` Mikulas Patocka
2016-11-23 21:11                                   ` Mikulas Patocka
2016-11-24 13:29                                   ` Michal Hocko
2016-11-24 13:29                                     ` Michal Hocko
2016-11-24 17:10                                     ` Mikulas Patocka
2016-11-24 17:10                                       ` Mikulas Patocka
2016-11-28 14:06                                       ` Michal Hocko
2016-11-28 14:06                                         ` Michal Hocko
2016-07-25 21:52           ` Mikulas Patocka
2016-07-25 21:52             ` Mikulas Patocka
2016-07-26  7:25             ` Michal Hocko
2016-07-26  7:25               ` Michal Hocko
2016-07-27  4:02             ` [dm-devel] " NeilBrown
2016-07-27 14:28               ` Mikulas Patocka
2016-07-27 14:28                 ` Mikulas Patocka
2016-07-27 18:40                 ` Michal Hocko
2016-07-27 18:40                   ` Michal Hocko
2016-08-03 13:59                   ` Mikulas Patocka
2016-08-03 13:59                     ` Mikulas Patocka
2016-08-03 14:42                     ` Michal Hocko
2016-08-03 14:42                       ` Michal Hocko
2016-08-04 18:46                       ` Mikulas Patocka
2016-08-04 18:46                         ` Mikulas Patocka
2016-07-27 21:36                 ` NeilBrown
2016-07-19  2:00   ` [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path David Rientjes
2016-07-19  2:00     ` David Rientjes
2016-07-19  7:49     ` Michal Hocko
2016-07-19  7:49       ` Michal Hocko
2016-07-19 13:54   ` Johannes Weiner
2016-07-19 13:54     ` Johannes Weiner
2016-07-19 14:19     ` Michal Hocko
2016-07-19 14:19       ` Michal Hocko
2016-07-19 22:01       ` Mikulas Patocka
2016-07-19 22:01         ` Mikulas Patocka
2016-07-19 20:45     ` David Rientjes
2016-07-19 20:45       ` David Rientjes
2016-07-20  8:15       ` Michal Hocko
2016-07-20  8:15         ` Michal Hocko
2016-07-20 21:06         ` David Rientjes
2016-07-20 21:06           ` David Rientjes
2016-07-21  8:52           ` Michal Hocko
2016-07-21  8:52             ` Michal Hocko
2016-07-21 12:13             ` Johannes Weiner
2016-07-21 12:13               ` Johannes Weiner
2016-07-21 14:53               ` Michal Hocko
2016-07-21 14:53                 ` Michal Hocko
2016-07-21 14:53                 ` Michal Hocko
2016-07-21 15:26                 ` Johannes Weiner
2016-07-21 15:26                   ` Johannes Weiner
2016-07-22  1:41                 ` NeilBrown
2016-07-22  6:37                 ` Michal Hocko
2016-07-22  6:37                   ` Michal Hocko
2016-07-22 12:26                   ` Vlastimil Babka
2016-07-22 12:26                     ` Vlastimil Babka
2016-07-22 19:44                     ` Andrew Morton
2016-07-22 19:44                       ` Andrew Morton
2016-07-23 18:52                       ` Vlastimil Babka
2016-07-23 18:52                         ` Vlastimil Babka
2016-07-19 21:50   ` Mikulas Patocka
2016-07-19 21:50     ` Mikulas Patocka
2016-07-20  6:44     ` Michal Hocko
2016-07-20  6:44       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.