linux-mm.kvack.org archive mirror
* OOM detection regressions since 4.7
@ 2016-08-22  9:32 Michal Hocko
  2016-08-22  9:37 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-22  9:32 UTC (permalink / raw)
  To: Andrew Morton, greg, Linus Torvalds
  Cc: Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm,
	LKML

Hi, 
there have been multiple reports [1][2][3][4][5] about premature OOM
killer invocations since 4.7, which contains the OOM detection rework. All of
them were for order-2 (kernel stack) allocation requests failing because
of high fragmentation and compaction failing to make any forward
progress. While investigating this we have found out that compaction
just gives up too early. Vlastimil has been working on compaction
improvements for quite some time and his series [6] is already sitting
in the mmotm tree. This already helps a lot because it drops some heuristics
which are more aimed at lower latencies for high orders than at
reliability. Joonsoo has then identified a further problem with too many
blocks being marked as unmovable [7] and Vlastimil has prepared a patch
on top of his series [8] which is also in the mmotm tree now.

That being said, the regression is real and should be fixed for 4.7
stable users. [6] and [8] were reported to help and OOMs are no longer
reproducible. I know we are quite late (rc3) in the 4.8 cycle but I would
vote for merging those patches and having them in 4.8. For 4.7 I would go
with a partial revert of the detection rework for high order requests
(see the patch below). This patch is really trivial. If those compaction
improvements are just too large for 4.8 then we can use the same patch
as for 4.7 stable for now and revert it in 4.9 after the compaction
changes are merged.

Thoughts?

[1] http://lkml.kernel.org/r/20160731051121.GB307@x4
[2] http://lkml.kernel.org/r/201608120901.41463.a.miskiewicz@gmail.com
[3] http://lkml.kernel.org/r/20160801192620.GD31957@dhcp22.suse.cz
[4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
[5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
[6] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
[8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

---

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22  9:32 OOM detection regressions since 4.7 Michal Hocko
@ 2016-08-22  9:37 ` Michal Hocko
  2016-08-22 10:05   ` Greg KH
  2016-08-22 10:16 ` Markus Trippelsdorf
  2016-08-23  4:52 ` Joonsoo Kim
  2 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-22  9:37 UTC (permalink / raw)
  To: Andrew Morton, Greg KH, Linus Torvalds
  Cc: Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm,
	LKML

[oops, fixing up Greg's email]

On Mon 22-08-16 11:32:49, Michal Hocko wrote:
> Hi, 
> there have been multiple reports [1][2][3][4][5] about premature OOM
> killer invocations since 4.7, which contains the OOM detection rework. All of
> them were for order-2 (kernel stack) allocation requests failing because
> of high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that compaction
> just gives up too early. Vlastimil has been working on compaction
> improvements for quite some time and his series [6] is already sitting
> in the mmotm tree. This already helps a lot because it drops some heuristics
> which are more aimed at lower latencies for high orders than at
> reliability. Joonsoo has then identified a further problem with too many
> blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6] and [8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in the 4.8 cycle but I would
> vote for merging those patches and having them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high order requests
> (see the patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after the compaction
> changes are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiewicz@gmail.com
> [3] http://lkml.kernel.org/r/20160801192620.GD31957@dhcp22.suse.cz
> [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> [6] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> 
> ---
> From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 22 Aug 2016 10:52:06 +0200
> Subject: [PATCH] mm, oom: prevent premature OOM killer invocation for high
>  order request
> 
> There have been several reports about premature OOM killer invocation
> in the 4.7 kernel when an order-2 allocation request (for the kernel stack)
> invoked the OOM killer even during basic workloads (light IO or even a
> kernel compile on some filesystems). In all reported cases the memory is
> fragmented and there are no order-2+ pages available. There is usually
> a large amount of slab memory (usually dentries/inodes) and further
> debugging has shown that there are way too many unmovable blocks which
> are skipped during compaction. Multiple reporters have confirmed that
> the current linux-next, which includes [1] and [2], helped and OOMs are
> not reproducible anymore. A simpler fix for stable is to ignore the
> compaction feedback and retry as long as there is reclaim progress for
> high order requests, which we used to do before. We already do that
> for CONFIG_COMPACTION=n so let's reuse the same code when compaction
> is enabled as well.
> 
> [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> 
> Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 50 ++------------------------------------------------
>  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b3e1341b754..6e354199151b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3254,53 +3254,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	return NULL;
>  }
>  
> -static inline bool
> -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> -		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
> -		     int compaction_retries)
> -{
> -	int max_retries = MAX_COMPACT_RETRIES;
> -
> -	if (!order)
> -		return false;
> -
> -	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by weak migration mode.
> -	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*migrate_mode == MIGRATE_ASYNC) {
> -			*migrate_mode = MIGRATE_SYNC_LIGHT;
> -			return true;
> -		}
> -		return false;
> -	}
> -
> -	/*
> -	 * make sure the compaction wasn't deferred or didn't bail out early
> -	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> -	 */
> -	if (compaction_withdrawn(compact_result))
> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
> -
> -	/*
> -	 * !costly requests are much more important than __GFP_REPEAT
> -	 * costly ones because they are de facto nofail and invoke OOM
> -	 * killer to move on while costly can fail and users are ready
> -	 * to cope with that. 1/4 retries is rather arbitrary but we
> -	 * would need much more detailed feedback from compaction to
> -	 * make a better decision.
> -	 */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		max_retries /= 4;
> -	if (compaction_retries <= max_retries)
> -		return true;
> -
> -	return false;
> -}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3311,6 +3264,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	return NULL;
>  }
>  
> +#endif /* CONFIG_COMPACTION */
> +
>  static inline bool
>  should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
>  		     enum compact_result compact_result,
> @@ -3337,7 +3292,6 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	}
>  	return false;
>  }
> -#endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
>  static int
> -- 
> 2.8.1
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22  9:37 ` Michal Hocko
@ 2016-08-22 10:05   ` Greg KH
  2016-08-22 10:54     ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2016-08-22 10:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> [oops, fixing up Greg's email]
> 
> On Mon 22-08-16 11:32:49, Michal Hocko wrote:
> > Hi, 
> > there have been multiple reports [1][2][3][4][5] about premature OOM
> > killer invocations since 4.7, which contains the OOM detection rework. All of
> > them were for order-2 (kernel stack) allocation requests failing because
> > of high fragmentation and compaction failing to make any forward
> > progress. While investigating this we have found out that compaction
> > just gives up too early. Vlastimil has been working on compaction
> > improvements for quite some time and his series [6] is already sitting
> > in the mmotm tree. This already helps a lot because it drops some heuristics
> > which are more aimed at lower latencies for high orders than at
> > reliability. Joonsoo has then identified a further problem with too many
> > blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> > on top of his series [8] which is also in the mmotm tree now.
> > 
> > That being said, the regression is real and should be fixed for 4.7
> > stable users. [6] and [8] were reported to help and OOMs are no longer
> > reproducible. I know we are quite late (rc3) in the 4.8 cycle but I would
> > vote for merging those patches and having them in 4.8. For 4.7 I would go
> > with a partial revert of the detection rework for high order requests
> > (see the patch below). This patch is really trivial. If those compaction
> > improvements are just too large for 4.8 then we can use the same patch
> > as for 4.7 stable for now and revert it in 4.9 after the compaction
> > changes are merged.
> > 
> > Thoughts?
> > 
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiewicz@gmail.com
> > [3] http://lkml.kernel.org/r/20160801192620.GD31957@dhcp22.suse.cz
> > [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> > [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> > [6] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> > [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > 
> > ---
> > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > Subject: [PATCH] mm, oom: prevent premature OOM killer invocation for high
> >  order request
> > 
> > There have been several reports about premature OOM killer invocation
> > in the 4.7 kernel when an order-2 allocation request (for the kernel stack)
> > invoked the OOM killer even during basic workloads (light IO or even a
> > kernel compile on some filesystems). In all reported cases the memory is
> > fragmented and there are no order-2+ pages available. There is usually
> > a large amount of slab memory (usually dentries/inodes) and further
> > debugging has shown that there are way too many unmovable blocks which
> > are skipped during compaction. Multiple reporters have confirmed that
> > the current linux-next, which includes [1] and [2], helped and OOMs are
> > not reproducible anymore. A simpler fix for stable is to ignore the
> > compaction feedback and retry as long as there is reclaim progress for
> > high order requests, which we used to do before. We already do that
> > for CONFIG_COMPACTION=n so let's reuse the same code when compaction
> > is enabled as well.
> > 
> > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > 
> > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 50 ++------------------------------------------------
> >  1 file changed, 2 insertions(+), 48 deletions(-)

So, if this goes into Linus's tree, can you let stable@vger.kernel.org
know about it so we can add it to the 4.7-stable tree?  Otherwise
there's not much I can do here now, right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22  9:32 OOM detection regressions since 4.7 Michal Hocko
  2016-08-22  9:37 ` Michal Hocko
@ 2016-08-22 10:16 ` Markus Trippelsdorf
  2016-08-22 10:56   ` Michal Hocko
  2016-08-23  4:52 ` Joonsoo Kim
  2 siblings, 1 reply; 35+ messages in thread
From: Markus Trippelsdorf @ 2016-08-22 10:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, greg, Linus Torvalds, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Joonsoo Kim, linux-mm, LKML

On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> there have been multiple reports [1][2][3][4][5] about premature OOM
> killer invocations since 4.7, which contains the OOM detection rework. All of
> them were for order-2 (kernel stack) allocation requests failing because
> of high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that compaction
> just gives up too early. Vlastimil has been working on compaction
> improvements for quite some time and his series [6] is already sitting
> in the mmotm tree. This already helps a lot because it drops some heuristics
> which are more aimed at lower latencies for high orders than at
> reliability. Joonsoo has then identified a further problem with too many
> blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6] and [8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in the 4.8 cycle but I would
> vote for merging those patches and having them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high order requests
> (see the patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after the compaction
> changes are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4

For the report [1] above:

markus@x4 linux % cat .config | grep CONFIG_COMPACTION
# CONFIG_COMPACTION is not set

-- 
Markus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 10:05   ` Greg KH
@ 2016-08-22 10:54     ` Michal Hocko
  2016-08-22 13:31       ` Greg KH
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-22 10:54 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon 22-08-16 06:05:28, Greg KH wrote:
> On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
[...]
> > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > Subject: [PATCH] mm, oom: prevent premature OOM killer invocation for high
> > >  order request
> > > 
> > > There have been several reports about premature OOM killer invocation
> > > in the 4.7 kernel when an order-2 allocation request (for the kernel stack)
> > > invoked the OOM killer even during basic workloads (light IO or even a
> > > kernel compile on some filesystems). In all reported cases the memory is
> > > fragmented and there are no order-2+ pages available. There is usually
> > > a large amount of slab memory (usually dentries/inodes) and further
> > > debugging has shown that there are way too many unmovable blocks which
> > > are skipped during compaction. Multiple reporters have confirmed that
> > > the current linux-next, which includes [1] and [2], helped and OOMs are
> > > not reproducible anymore. A simpler fix for stable is to ignore the
> > > compaction feedback and retry as long as there is reclaim progress for
> > > high order requests, which we used to do before. We already do that
> > > for CONFIG_COMPACTION=n so let's reuse the same code when compaction
> > > is enabled as well.
> > > 
> > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > > 
> > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/page_alloc.c | 50 ++------------------------------------------------
> > >  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> So, if this goes into Linus's tree, can you let stable@vger.kernel.org
> know about it so we can add it to the 4.7-stable tree?  Otherwise
> there's not much I can do here now, right?

My plan would actually be not to push this to Linus because we have a
proper fix for Linus's tree. It is just that the proper fix is quite
large and I felt that stable should get the simplest fix possible, which
is this partial revert. So what I am trying to say is: push a patch to
stable that is not in Linus's tree, as it is simpler.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 10:16 ` Markus Trippelsdorf
@ 2016-08-22 10:56   ` Michal Hocko
  2016-08-22 11:01     ` Markus Trippelsdorf
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-22 10:56 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Andrew Morton, greg, Linus Torvalds, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Joonsoo Kim, linux-mm, LKML

On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > there have been multiple reports [1][2][3][4][5] about premature OOM
> > killer invocations since 4.7, which contains the OOM detection rework. All of
> > them were for order-2 (kernel stack) allocation requests failing because
> > of high fragmentation and compaction failing to make any forward
> > progress. While investigating this we have found out that compaction
> > just gives up too early. Vlastimil has been working on compaction
> > improvements for quite some time and his series [6] is already sitting
> > in the mmotm tree. This already helps a lot because it drops some heuristics
> > which are more aimed at lower latencies for high orders than at
> > reliability. Joonsoo has then identified a further problem with too many
> > blocks being marked as unmovable [7] and Vlastimil has prepared a patch
> > on top of his series [8] which is also in the mmotm tree now.
> > 
> > That being said, the regression is real and should be fixed for 4.7
> > stable users. [6] and [8] were reported to help and OOMs are no longer
> > reproducible. I know we are quite late (rc3) in the 4.8 cycle but I would
> > vote for merging those patches and having them in 4.8. For 4.7 I would go
> > with a partial revert of the detection rework for high order requests
> > (see the patch below). This patch is really trivial. If those compaction
> > improvements are just too large for 4.8 then we can use the same patch
> > as for 4.7 stable for now and revert it in 4.9 after the compaction
> > changes are merged.
> > 
> > Thoughts?
> > 
> > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> 
> For the report [1] above:
> 
> markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> # CONFIG_COMPACTION is not set

Hmm, without compaction and with heavy fragmentation I am afraid we
cannot really do much. What is the reason to disable compaction in the
first place?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 10:56   ` Michal Hocko
@ 2016-08-22 11:01     ` Markus Trippelsdorf
  2016-08-22 11:13       ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Markus Trippelsdorf @ 2016-08-22 11:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, greg, Linus Torvalds, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Joonsoo Kim, linux-mm, LKML

On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > 
> > For the report [1] above:
> > 
> > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > # CONFIG_COMPACTION is not set
> 
> Hmm, without compaction and with heavy fragmentation I am afraid we
> cannot really do much. What is the reason to disable compaction in the
> first place?

I don't recall. Must have been some issue in the past. I will re-enable
the option.

-- 
Markus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 11:01     ` Markus Trippelsdorf
@ 2016-08-22 11:13       ` Michal Hocko
  2016-08-22 11:20         ` Markus Trippelsdorf
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-22 11:13 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Andrew Morton, greg, Linus Torvalds, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Joonsoo Kim, linux-mm, LKML

On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote:
> On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > > 
> > > For the report [1] above:
> > > 
> > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > > # CONFIG_COMPACTION is not set
> > 
> > Hmm, without compaction and with heavy fragmentation I am afraid we
> > cannot really do much. What is the reason to disable compaction in the
> > first place?
> 
> I don't recall. Must have been some issue in the past. I will re-enable
> the option.

Well, without compaction there is no source of high order pages at
all. You can only reclaim and hope that some of the reclaimed pages will
find their buddies on the free lists and form higher order pages. This
can take forever. We used to have lumpy reclaim, which could help here,
but that is long gone.

I do not think we can sanely optimize for high-order heavy loads
without COMPACTION. At least not without reintroducing lumpy reclaim or
something similar. To be honest I am not even sure which configurations
should disable compaction - except for really highly controlled !mmu or
other single-purpose systems.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 11:13       ` Michal Hocko
@ 2016-08-22 11:20         ` Markus Trippelsdorf
  0 siblings, 0 replies; 35+ messages in thread
From: Markus Trippelsdorf @ 2016-08-22 11:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, greg, Linus Torvalds, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Joonsoo Kim, linux-mm, LKML

On 2016.08.22 at 13:13 +0200, Michal Hocko wrote:
> On Mon 22-08-16 13:01:13, Markus Trippelsdorf wrote:
> > On 2016.08.22 at 12:56 +0200, Michal Hocko wrote:
> > > On Mon 22-08-16 12:16:14, Markus Trippelsdorf wrote:
> > > > On 2016.08.22 at 11:32 +0200, Michal Hocko wrote:
> > > > > [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> > > > 
> > > > For the report [1] above:
> > > > 
> > > > markus@x4 linux % cat .config | grep CONFIG_COMPACTION
> > > > # CONFIG_COMPACTION is not set
> > > 
> > > Hmm, without compaction and with heavy fragmentation I am afraid we
> > > cannot really do much. What is the reason to disable compaction in the
> > > first place?
> > 
> > I don't recall. Must have been some issue in the past. I will re-enable
> > the option.
> 
> Well, without compaction there is no source of high order pages at
> all. You can only reclaim and hope that some of the reclaimed pages will
> find their buddies on the free lists and form higher order pages. This
> can take forever. We used to have lumpy reclaim, which could help here,
> but that is long gone.
> 
> I do not think we can sanely optimize for high-order heavy loads
> without COMPACTION. At least not without reintroducing lumpy reclaim or
> something similar. To be honest I am not even sure which configurations
> should disable compaction - except for really highly controlled !mmu or
> other single-purpose systems.

I now recall. It was an issue with CONFIG_TRANSPARENT_HUGEPAGE, so I
disabled that option. This then de-selected CONFIG_COMPACTION...
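For context, the coupling described here matches TRANSPARENT_HUGEPAGE
selecting COMPACTION in mm/Kconfig of that era, so switching THP off in
menuconfig can silently drop compaction too (this reading of the Kconfig
dependency is my assumption; worth verifying against the tree in
question). Compaction can be kept on independently of THP, e.g.:

```
# .config fragment (illustrative): keep compaction without THP
CONFIG_COMPACTION=y
# CONFIG_TRANSPARENT_HUGEPAGE is not set
```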

-- 
Markus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 10:54     ` Michal Hocko
@ 2016-08-22 13:31       ` Greg KH
  2016-08-22 13:42         ` Michal Hocko
  0 siblings, 1 reply; 35+ messages in thread
From: Greg KH @ 2016-08-22 13:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> On Mon 22-08-16 06:05:28, Greg KH wrote:
> > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> [...]
> > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > Subject: [PATCH] mm, oom: prevent premature OOM killer invocation for high
> > > >  order request
> > > > 
> > > > There have been several reports about premature OOM killer invocation
> > > > in the 4.7 kernel when an order-2 allocation request (for the kernel stack)
> > > > invoked the OOM killer even during basic workloads (light IO or even a
> > > > kernel compile on some filesystems). In all reported cases the memory is
> > > > fragmented and there are no order-2+ pages available. There is usually
> > > > a large amount of slab memory (usually dentries/inodes) and further
> > > > debugging has shown that there are way too many unmovable blocks which
> > > > are skipped during compaction. Multiple reporters have confirmed that
> > > > the current linux-next, which includes [1] and [2], helped and OOMs are
> > > > not reproducible anymore. A simpler fix for stable is to ignore the
> > > > compaction feedback and retry as long as there is reclaim progress for
> > > > high order requests, which we used to do before. We already do that
> > > > for CONFIG_COMPACTION=n so let's reuse the same code when compaction
> > > > is enabled as well.
> > > > 
> > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > > > 
> > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/page_alloc.c | 50 ++------------------------------------------------
> > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > 
> > So, if this goes into Linus's tree, can you let stable@vger.kernel.org
> > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > there's not much I can do here now, right?
> 
> My plan would actually be not to push this to Linus because we have a
> proper fix for Linus's tree. It is just that the proper fix is quite
> large and I felt that stable should get the simplest fix possible, which
> is this partial revert. So what I am trying to say is: push a patch to
> stable that is not in Linus's tree, as it is simpler.

I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
of the time (well, almost always) it ends up being wrong and hurting us
in the end.

What exactly are the commits that are in Linus's tree that resolve this
issue?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-22 13:31       ` Greg KH
@ 2016-08-22 13:42         ` Michal Hocko
  2016-08-22 14:02           ` Greg KH
  2016-08-22 22:05           ` Andrew Morton
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-22 13:42 UTC (permalink / raw)
  To: Greg KH
  Cc: Andrew Morton, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon 22-08-16 09:31:14, Greg KH wrote:
> On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> > On Mon 22-08-16 06:05:28, Greg KH wrote:
> > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> > [...]
> > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
> > > > >  order request
> > > > > 
> > > > > There have been several reports of pre-mature OOM killer invocation
> > > > > in the 4.7 kernel, where an order-2 allocation request (for the kernel
> > > > > stack) invoked the OOM killer even during basic workloads (light IO or
> > > > > even a kernel compile on some filesystems). In all reported cases the
> > > > > memory is fragmented and there are no order-2+ pages available. There
> > > > > is usually a large amount of slab memory (usually dentries/inodes), and
> > > > > further debugging has shown that there are way too many unmovable
> > > > > blocks which are skipped during compaction. Multiple reporters have
> > > > > confirmed that the current linux-next, which includes [1] and [2],
> > > > > helped and OOMs are no longer reproducible. A simpler fix for stable is
> > > > > to ignore the compaction feedback and retry as long as there is reclaim
> > > > > progress for high-order requests, as we used to do before. We already
> > > > > do that for CONFIG_COMPACTION=n, so let's reuse the same code when
> > > > > compaction is enabled as well.
> > > > > 
> > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > > > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > > > > 
> > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > ---
> > > > >  mm/page_alloc.c | 50 ++------------------------------------------------
> > > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > > 
> > > So, if this goes into Linus's tree, can you let stable@vger.kernel.org
> > > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > > there's not much I can do here now, right?
> > 
> > My plan would actually be not to push this to Linus, because we have a
> > proper fix for Linus's tree. It is just that the fix is quite large and
> > I felt that stable should get the simplest fix possible, which is this
> > partial revert. So what I am suggesting is to push to stable a patch
> > that is not in Linus's tree, as it is simpler.
> 
> I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
> of the time (well, almost always), it ends up being wrong and hurting us
> in the end.

I do not like it either but if there is a simple and straightforward
workaround for stable while the upstream can go with the _proper_ fix
from the longer POV then I think this is perfectly justified. Stable
should be always about the simplest fix for the problem IMHO.

Of course, if Linus/Andrew doesn't like to take those compaction
improvements this late then I will ask to merge the partial revert to
Linus tree as well and then there is not much to discuss.

> What exactly are the commits that are in Linus's tree that resolve this
> issue?

The initial email in this thread has pointed to those patches. Please
note that some of its dependeces (mostly code cleanups) are already
merged and that backporting without them would make the backport harder
and more risky.
-- 
Michal Hocko
SUSE Labs



* Re: OOM detection regressions since 4.7
  2016-08-22 13:42         ` Michal Hocko
@ 2016-08-22 14:02           ` Greg KH
  2016-08-22 22:05           ` Andrew Morton
  1 sibling, 0 replies; 35+ messages in thread
From: Greg KH @ 2016-08-22 14:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon, Aug 22, 2016 at 03:42:28PM +0200, Michal Hocko wrote:
> On Mon 22-08-16 09:31:14, Greg KH wrote:
> > On Mon, Aug 22, 2016 at 12:54:41PM +0200, Michal Hocko wrote:
> > > On Mon 22-08-16 06:05:28, Greg KH wrote:
> > > > On Mon, Aug 22, 2016 at 11:37:07AM +0200, Michal Hocko wrote:
> > > [...]
> > > > > > From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > Date: Mon, 22 Aug 2016 10:52:06 +0200
> > > > > > Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
> > > > > >  order request
> > > > > > 
> > > > > > There have been several reports of pre-mature OOM killer invocation
> > > > > > in the 4.7 kernel, where an order-2 allocation request (for the kernel
> > > > > > stack) invoked the OOM killer even during basic workloads (light IO or
> > > > > > even a kernel compile on some filesystems). In all reported cases the
> > > > > > memory is fragmented and there are no order-2+ pages available. There
> > > > > > is usually a large amount of slab memory (usually dentries/inodes), and
> > > > > > further debugging has shown that there are way too many unmovable
> > > > > > blocks which are skipped during compaction. Multiple reporters have
> > > > > > confirmed that the current linux-next, which includes [1] and [2],
> > > > > > helped and OOMs are no longer reproducible. A simpler fix for stable is
> > > > > > to ignore the compaction feedback and retry as long as there is reclaim
> > > > > > progress for high-order requests, as we used to do before. We already
> > > > > > do that for CONFIG_COMPACTION=n, so let's reuse the same code when
> > > > > > compaction is enabled as well.
> > > > > > 
> > > > > > [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> > > > > > [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> > > > > > 
> > > > > > Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > > ---
> > > > > >  mm/page_alloc.c | 50 ++------------------------------------------------
> > > > > >  1 file changed, 2 insertions(+), 48 deletions(-)
> > > > 
> > > > So, if this goes into Linus's tree, can you let stable@vger.kernel.org
> > > > know about it so we can add it to the 4.7-stable tree?  Otherwise
> > > > there's not much I can do here now, right?
> > > 
> > > My plan would actually be not to push this to Linus, because we have a
> > > proper fix for Linus's tree. It is just that the fix is quite large and
> > > I felt that stable should get the simplest fix possible, which is this
> > > partial revert. So what I am suggesting is to push to stable a patch
> > > that is not in Linus's tree, as it is simpler.
> > 
> > I _REALLY_ hate taking any patches that are not in Linus's tree as 90%
> > of the time (well, almost always), it ends up being wrong and hurting us
> > in the end.
> 
> I do not like it either but if there is a simple and straightforward
> workaround for stable while the upstream can go with the _proper_ fix
> from the longer POV then I think this is perfectly justified. Stable
> should be always about the simplest fix for the problem IMHO.

No, stable should always be "what is in Linus's tree to get it fixed."

Again, almost every time we try to "just do this simple thing instead"
in a stable tree, it ends up being broken somehow.  We have the history
to back this up; look at our archives.

I'll gladly take 10+ patches to resolve something, _if_ it actually
resolves something.

But, if we argue about it for a month or so, then we don't have to worry
about it as everyone will be using 4.8 :)

> Of course, if Linus/Andrew doesn't like to take those compaction
> improvements this late then I will ask to merge the partial revert to
> Linus tree as well and then there is not much to discuss.

Ok, let me know how it goes and we can see what to do.

thanks.

greg k-h



* Re: OOM detection regressions since 4.7
  2016-08-22 13:42         ` Michal Hocko
  2016-08-22 14:02           ` Greg KH
@ 2016-08-22 22:05           ` Andrew Morton
  2016-08-23  7:43             ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Andrew Morton @ 2016-08-22 22:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Greg KH, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> Of course, if Linus/Andrew doesn't like to take those compaction
> improvements this late then I will ask to merge the partial revert to
> Linus tree as well and then there is not much to discuss.

This sounds like the prudent option.  Can we get 4.8 working
well-enough, backport that into 4.7.x and worry about the fancier stuff
for 4.9?



* Re: OOM detection regressions since 4.7
  2016-08-22  9:32 OOM detection regressions since 4.7 Michal Hocko
  2016-08-22  9:37 ` Michal Hocko
  2016-08-22 10:16 ` Markus Trippelsdorf
@ 2016-08-23  4:52 ` Joonsoo Kim
  2016-08-23  7:33   ` Michal Hocko
  2 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2016-08-23  4:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, greg, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Mon, Aug 22, 2016 at 11:32:49AM +0200, Michal Hocko wrote:
> Hi, 
> there have been multiple reports [1][2][3][4][5] about pre-mature OOM
> killer invocations since 4.7, which contains the OOM detection rework.
> All of them were for order-2 (kernel stack) allocation requests failing
> because of high fragmentation and compaction failing to make any forward
> progress. While investigating this we have found out that compaction
> just gives up too early. Vlastimil has been working on compaction
> improvements for quite some time and his series [6] is already sitting
> in the mmotm tree. This already helps a lot because it drops some
> heuristics which are more aimed at lower latencies for high orders than
> at reliability. Joonsoo has then identified a further problem with too
> many blocks being marked as unmovable [7] and Vlastimil has prepared a
> patch on top of his series [8] which is also in the mmotm tree now.
> 
> That being said, the regression is real and should be fixed for 4.7
> stable users. [6][8] were reported to help and OOMs are no longer
> reproducible. I know we are quite late (rc3) in 4.8 but I would vote
> for merging those patches and having them in 4.8. For 4.7 I would go
> with a partial revert of the detection rework for high-order requests
> (see the patch below). This patch is really trivial. If those compaction
> improvements are just too large for 4.8 then we can use the same patch
> as for 4.7 stable for now and revert it in 4.9 after the compaction
> changes are merged.
> 
> Thoughts?
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] http://lkml.kernel.org/r/201608120901.41463.a.miskiewicz@gmail.com
> [3] http://lkml.kernel.org/r/20160801192620.GD31957@dhcp22.suse.cz
> [4] https://lists.opensuse.org/opensuse-kernel/2016-08/msg00021.html
> [5] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
> [6] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> [7] http://lkml.kernel.org/r/20160816031222.GC16913@js1304-P5Q-DELUXE
> [8] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> 
> ---
> >From 899b738538de41295839dca2090a774bdd17acd2 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 22 Aug 2016 10:52:06 +0200
> Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
>  order request
> 
> There have been several reports of pre-mature OOM killer invocation
> in the 4.7 kernel, where an order-2 allocation request (for the kernel
> stack) invoked the OOM killer even during basic workloads (light IO or
> even a kernel compile on some filesystems). In all reported cases the
> memory is fragmented and there are no order-2+ pages available. There
> is usually a large amount of slab memory (usually dentries/inodes), and
> further debugging has shown that there are way too many unmovable
> blocks which are skipped during compaction. Multiple reporters have
> confirmed that the current linux-next, which includes [1] and [2],
> helped and OOMs are no longer reproducible. A simpler fix for stable is
> to ignore the compaction feedback and retry as long as there is reclaim
> progress for high-order requests, as we used to do before. We already
> do that for CONFIG_COMPACTION=n, so let's reuse the same code when
> compaction is enabled as well.

Hello, Michal.

I agree with a partial revert, but the revert should take a different
form. The change below tries to reuse the should_compact_retry() version
for !CONFIG_COMPACTION, but it turned out that this also causes a
regression in Markus' report [1].

The theoretical reason for this regression is that it would stop
retrying even when there are enough LRU pages, because it only checks
whether free pages exceed the min watermark. To prevent a pre-mature OOM
kill we need to keep the allocation loop going while there are enough
LRU pages. So the logic should be something like this:

should_compact_retry()
{
        for_each_zone_zonelist_nodemask {
                available = zone_reclaimable_pages(zone);
                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
                if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
                        ac_classzone_idx(ac), alloc_flags, available))
                        return true;
        }

        return false;
}

I suggested this before, and the current situation shows that it is
indeed needed.

And I still think that your OOM detection rework has some flaws.

1) It doesn't consider freeable objects that can be freed by shrink_slab().
There are many subsystems that cache many objects which will be freed
via the shrink_slab() interface, but you don't account for them when
making the OOM decision.

Think about the following situation: we are trying to find an order-2
free page and some subsystem holds an order-2 page that shrink_slab()
could free. Your logic doesn't guarantee that shrink_slab() is invoked
to free that order-2 page in that subsystem, so OOM would be triggered
when compaction fails even though an order-2 freeable page exists. I
think that if the decision is made before the whole LRU list is scanned
and shrink_slab() is invoked for all freeable objects, it can cause a
pre-mature OOM.

It seems that you already know about this issue [2].

2) The 'OOM detection rework' depends on compaction too much. The
compaction algorithm is racy and has some limitations. Its failure
doesn't mean we are in an OOM situation. Even with Vlastimil's patchset
and mine applied, it is still possible that the compaction scanner
cannot find enough free pages due to race conditions and returns a
pre-mature failure. To reduce this race effect, I would like to give
more chances to retry even after full compaction has failed. We can
remove this heuristic once we are sure that compaction is stable enough.

As you know, I have said these things several times, but they weren't
accepted. Please consider them more deeply this time.

Thanks.

[1] http://lkml.kernel.org/r/20160731051121.GB307@x4
[2] https://bugzilla.opensuse.org/show_bug.cgi?id=994066


> 
> [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
> 
> Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 50 ++------------------------------------------------
>  1 file changed, 2 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8b3e1341b754..6e354199151b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3254,53 +3254,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	return NULL;
>  }
>  
> -static inline bool
> -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> -		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
> -		     int compaction_retries)
> -{
> -	int max_retries = MAX_COMPACT_RETRIES;
> -
> -	if (!order)
> -		return false;
> -
> -	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by weak migration mode.
> -	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*migrate_mode == MIGRATE_ASYNC) {
> -			*migrate_mode = MIGRATE_SYNC_LIGHT;
> -			return true;
> -		}
> -		return false;
> -	}
> -
> -	/*
> -	 * make sure the compaction wasn't deferred or didn't bail out early
> -	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> -	 */
> -	if (compaction_withdrawn(compact_result))
> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
> -
> -	/*
> -	 * !costly requests are much more important than __GFP_REPEAT
> -	 * costly ones because they are de facto nofail and invoke OOM
> -	 * killer to move on while costly can fail and users are ready
> -	 * to cope with that. 1/4 retries is rather arbitrary but we
> -	 * would need much more detailed feedback from compaction to
> -	 * make a better decision.
> -	 */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		max_retries /= 4;
> -	if (compaction_retries <= max_retries)
> -		return true;
> -
> -	return false;
> -}
>  #else
>  static inline struct page *
>  __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3311,6 +3264,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	return NULL;
>  }
>  
> +#endif /* CONFIG_COMPACTION */
> +
>  static inline bool
>  should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
>  		     enum compact_result compact_result,
> @@ -3337,7 +3292,6 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>  	}
>  	return false;
>  }
> -#endif /* CONFIG_COMPACTION */
>  
>  /* Perform direct synchronous page reclaim */
>  static int
> -- 
> 2.8.1
> 
> -- 
> Michal Hocko
> SUSE Labs
> 



* Re: OOM detection regressions since 4.7
  2016-08-23  4:52 ` Joonsoo Kim
@ 2016-08-23  7:33   ` Michal Hocko
  2016-08-23  7:40     ` Markus Trippelsdorf
                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-23  7:33 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, greg, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
[...]
> Hello, Michal.
> 
> I agree with a partial revert, but the revert should take a different
> form. The change below tries to reuse the should_compact_retry()
> version for !CONFIG_COMPACTION, but it turned out that this also causes
> a regression in Markus' report [1].

I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for
high-order workloads that calling any change in that behavior a
regression is a bit exaggerated. Disabling compaction should have a very
strong reason. I haven't heard any so far. I am even wondering whether
there is a legitimate reason for it these days.

> The theoretical reason for this regression is that it would stop
> retrying even when there are enough LRU pages, because it only checks
> whether free pages exceed the min watermark. To prevent a pre-mature
> OOM kill we need to keep the allocation loop going while there are
> enough LRU pages. So the logic should be something like this:
> 
> should_compact_retry()
> {
>         for_each_zone_zonelist_nodemask {
>                 available = zone_reclaimable_pages(zone);
>                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>                 if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
>                         ac_classzone_idx(ac), alloc_flags, available))
>                         return true;
>         }
> 
>         return false;
> }
> 
> I suggested this before, and the current situation shows that it is
> indeed needed.

This just opens the door to unbounded reclaim/thrashing, because you can
reclaim as much as you like with no guarantee of forward progress. The
reason the !COMPACTION should_compact_retry only checks the min_wmark
without the reclaimable bias is that this guarantees a retry when we are
failing due to the high-order watermark check rather than a lack of
memory. That condition is guaranteed to converge, and the probability of
unbounded reclaim is much reduced.

> And, I still think that your OOM detection rework has some flaws.
>
> 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> There are many subsystems that cache many objects and they will be
> freed via the shrink_slab() interface. But you don't account for them
> when making the OOM decision.

I fully rely on the reclaim and compaction feedback, and that is the
place where we should strive for improvements. So if we are growing way
too many slab objects, we should take care of that in the slab reclaim,
which is tightly coupled with the LRU reclaim, rather than a layer up in
the page allocator.
 
> Think about the following situation where we are trying to find an order-2
> freepage and some subsystem has order-2 freepage. It can be freed by
> shrink_slab(). Your logic doesn't guarantee that shrink_slab() is
> invoked to free this order-2 freepage in that subsystem. OOM would be
> triggered when compaction fails even if there is a order-2 freeable
> page. I think that if decision is made before whole lru list is
> scanned and then shrink_slab() is invoked for whole freeable objects,
> it would cause pre-mature OOM.

I do not see why we would need to scan through the whole LRU list when
we are under high-order pressure. It is true, though, that slab
shrinkers can and should be more sensitive to the requested order so
that they preferentially release higher-order pages.

> It seems that you already know about this issue [2].
> 
> 2) 'OOM detection rework' depends on compaction too much. Compaction
> algorithm is racy and has some limitations. Its failure doesn't mean
> we are in an OOM situation.

As long as compaction is the only reliable source of higher-order pages,
we do not have any other choice if we want deterministic behavior.

> Even if Vlastimil's patchset and mine are
> applied, it is still possible that the compaction scanner cannot find
> enough freepages due to race conditions and returns a pre-mature
> failure. To reduce this race effect, I hope to give more chances to
> retry even after full compaction has failed.

Then we can improve the compaction_failed() heuristic and not declare
the end of the day after a single attempt to get a high-order page from
scanning the whole memory. But to me this all sounds like an internal
implementation detail of compaction, and the OOM detection in the page
allocator should be as independent of it as possible, just as it is
independent of the internal reclaim decisions. That was the whole point
of my rework: to melt "do something as long as at least a single page is
reclaimed" into an actual algorithm which can be measured and reasoned
about.

> We can remove this heuristic when we make sure that compaction is
> stable enough.

How do we know that, though, if we do not rely on it? Artificial tests
do not exhibit those corner cases. I was bashing my testing systems to
cause as much fragmentation as possible, yet I wasn't able to trigger
the issues reported recently by real-world workloads. Do not take me
wrong, I understand your concerns, but OOM detection will never be
perfect. We can easily get to one extreme or the other. We should strive
to make it work for most workloads. So far it seems that there have been
no regressions for order-0 pressure, and we can improve compaction to
cover higher orders. I am willing to reconsider this once we hit a cliff
where we cannot do much more in compaction proper and still see
pre-mature OOM killer invocations in not-so-insane workloads, though.

I believe that Vlastimil's patches show the path to take long-term: get
rid of the latency heuristics for allocations where reliability matters
as the first step, then try to squeeze out as much reliability for
!costly orders as possible.

I also believe that these issues will be less of a problem once we
switch to vmalloc'ed stacks, because kernel stacks are the primary
source of high-order allocations these days. Most others are more an
optimization than a reliability thing.

> As you know, I have said these things several times, but they weren't
> accepted. Please consider them more deeply this time.
> 
> Thanks.
> 
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
-- 
Michal Hocko
SUSE Labs



* Re: OOM detection regressions since 4.7
  2016-08-23  7:33   ` Michal Hocko
@ 2016-08-23  7:40     ` Markus Trippelsdorf
  2016-08-23  7:48       ` Michal Hocko
  2016-08-23 19:08     ` Linus Torvalds
  2016-08-24  5:01     ` Joonsoo Kim
  2 siblings, 1 reply; 35+ messages in thread
From: Markus Trippelsdorf @ 2016-08-23  7:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, greg, Linus Torvalds,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On 2016.08.23 at 09:33 +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> > 
> > I agree with partial revert but revert should be a different form.
> > Below change try to reuse should_compact_retry() version for
> > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > Markus report [1].
> 
> I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for
> high-order workloads that calling any change in that behavior a
> regression is a bit exaggerated. Disabling compaction should have a
> very strong reason. I haven't heard any so far. I am even wondering
> whether there is a legitimate reason for it these days.

BTW, the current config description:

  CONFIG_COMPACTION:
  Allows the compaction of memory for the allocation of huge pages. 

doesn't make it clear to the user that this is an essential feature.

-- 
Markus



* Re: OOM detection regressions since 4.7
  2016-08-22 22:05           ` Andrew Morton
@ 2016-08-23  7:43             ` Michal Hocko
  2016-08-25  7:11               ` Michal Hocko
  2016-08-25 20:30               ` Ralf-Peter Rohbeck
  0 siblings, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-23  7:43 UTC (permalink / raw)
  To: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering
  Cc: Greg KH, Linus Torvalds, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > Of course, if Linus/Andrew doesn't like to take those compaction
> > improvements this late then I will ask to merge the partial revert to
> > Linus tree as well and then there is not much to discuss.
> 
> This sounds like the prudent option.  Can we get 4.8 working
> well-enough, backport that into 4.7.x and worry about the fancier stuff
> for 4.9?

OK, fair enough.

I would really appreciate it if the original reporters could retest with
this patch on top of the current Linus tree. The stable backport posted
earlier doesn't apply cleanly to the current master, but the change is
essentially the same. The mmotm tree can then revert this patch before
Vlastimil's series is applied, because that series touches the code
removed here.
---


* Re: OOM detection regressions since 4.7
  2016-08-23  7:40     ` Markus Trippelsdorf
@ 2016-08-23  7:48       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-23  7:48 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Joonsoo Kim, Andrew Morton, greg, Linus Torvalds,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Tue 23-08-16 09:40:14, Markus Trippelsdorf wrote:
> On 2016.08.23 at 09:33 +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > > 
> > > I agree with partial revert but revert should be a different form.
> > > Below change try to reuse should_compact_retry() version for
> > > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > > Markus report [1].
> > 
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrarily for
> > high-order workloads that calling any change in that behavior a
> > regression is a bit exaggerated. Disabling compaction should have a
> > very strong reason. I haven't heard any so far. I am even wondering
> > whether there is a legitimate reason for it these days.
> 
> BTW, the current config description:
> 
>   CONFIG_COMPACTION:
>   Allows the compaction of memory for the allocation of huge pages. 
> 
> doesn't make it clear to the user that this is an essential feature.

Yes I plan to send a clarification patch.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-23  7:33   ` Michal Hocko
  2016-08-23  7:40     ` Markus Trippelsdorf
@ 2016-08-23 19:08     ` Linus Torvalds
  2016-08-24  6:32       ` Michal Hocko
  2016-08-24  5:01     ` Joonsoo Kim
  2 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2016-08-23 19:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, greg, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> order workloads that calling any change in that behavior a regression
> is little bit exaggerated.

Well, the thread info allocations certainly haven't been big problems
before. So regressing those would seem to be a real regression.

What happened? We've done the order-2 allocation for the stack since
May 2014, so that isn't new. Did we cut off retries for low orders?

So I would not say that it's an exaggeration to say that order-2
allocations failing is a regression.

Yes, yes, for 4.9 we may well end up using vmalloc for the kernel
stack, but there are certainly other things that want low-order
(non-hugepage) allocations. Like kmalloc(), which often ends up using
small orders just to pack data more efficiently (allocating a single
page can be hugely wasteful even if the individual allocations are
smaller than that - so allocating a few pages and packing more
allocations into it helps fight internal fragmentation)

So this definitely needs to be fixed for 4.7 (and apparently there's a
few patches still pending even for 4.8)

                 Linus

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-23  7:33   ` Michal Hocko
  2016-08-23  7:40     ` Markus Trippelsdorf
  2016-08-23 19:08     ` Linus Torvalds
@ 2016-08-24  5:01     ` Joonsoo Kim
  2016-08-24  7:04       ` Michal Hocko
  2 siblings, 1 reply; 35+ messages in thread
From: Joonsoo Kim @ 2016-08-24  5:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, greg, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

Looks like my mail client ate my reply, so I am resending it.

On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> [...]
> > Hello, Michal.
> > 
> > I agree with partial revert but revert should be a different form.
> > Below change try to reuse should_compact_retry() version for
> > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > Markus report [1].
> 
> I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> order workloads that calling any change in that behavior a regression
> is little bit exaggerated. Disabling compaction should have a very
> strong reason. I haven't heard any so far. I am even wondering whether
> there is a legitimate reason for that these days.
> 
> > Theoretical reason for this regression is that it would stop retry
> > even if there are enough lru pages. It only checks if freepage
> > excesses min watermark or not for retry decision. To prevent
> > pre-mature OOM killer, we need to keep allocation loop when there are
> > enough lru pages. So, logic should be something like that.
> > 
> > should_compact_retry()
> > {
> >         for_each_zone_zonelist_nodemask {
> >                 available = zone_reclaimable_pages(zone);
> >                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> >                 if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> >                         ac_classzone_idx(ac), alloc_flags, available))
> >                         return true;
> > 
> >         }
> > }
> > 
> > I suggested it before and current situation looks like it is indeed
> > needed.
> 
> this just opens doors for an unbounded reclaim/threshing becacause
> you can reclaim as much as you like and there is no guarantee of a
> forward progress. The reason why !COMPACTION should_compact_retry only
> checks for the min_wmark without the reclaimable bias is that this will
> guarantee a retry if we are failing due to high order wmark check rather
> than a lack of memory. This condition is guaranteed to converge and the
> probability of the unbounded reclaim is much more reduced.

In the case of a lack of memory with a lot of reclaimable lru pages,
why should we stop reclaim/compaction?

With your partial revert patch, the allocation logic would work as
follows.

Assume the following situation:
o a lot of reclaimable lru pages
o no order-2 freepage
o not enough order-0 freepage for min watermark
o order-2 allocation

1. order-2 allocation failed due to min watermark
2. go to reclaim/compaction
3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
min watermark isn't met for order-0
4. compaction is skipped due to not enough freepage
5. should_reclaim_retry() returns false because min watermark for
order-2 page isn't met
6. should_compact_retry() returns false because min watermark for
order-0 page isn't met
7. allocation fails without any retry and the OOM killer is invoked.

Is it what you want?

And, please elaborate more on how your logic guarantees convergence.
After the order-0 free pages exceed the min watermark, there is no way
to stop the reclaim/thrashing. The number of free pages just increases
monotonically and the retries cannot be stopped until the order-2
allocation succeeds. Am I missing something?


> > And, I still think that your OOM detection rework has some flaws.
> >
> > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> > There are many subsystems that cache many objects and they will be
> > freed by shrink_slab() interface. But, you don't account them when
> > making the OOM decision.
> 
> I fully rely on the reclaim and compaction feedback. And that is the
> place where we should strive for improvements. So if we are growing way
> too many slab objects we should take care about that in the slab reclaim
> which is tightly coupled with the LRU reclaim rather than up the layer
> in the page allocator.

No. The slab shrink logic being tightly coupled with the LRU reclaim
totally makes sense. What doesn't make sense is the way your OOM
detection rework uses this functionality and its feedback.

For example, compaction will do its best with the current resources.
But, as I said before, compaction would be more powerful if the system
had more free memory. Your logic only guarantees the minimum amount of
free memory for it to run, so I don't think its result is reliable
enough to determine whether we are in OOM or not.

And, your logic doesn't consider how many pages can be freed by a slab
shrink. As I said before, there could be high-order reclaimable pages,
or an actual free could directly produce a high-order free page.

Most importantly, I think that it is fundamentally impossible to
anticipate whether we can produce a high-order free page from a
snapshot of the number of freeable pages. So, your logic relies on
compaction, but there are many types of pages that cannot be migrated
by compaction yet can be reclaimed. Therefore, fully relying on the
compaction result for the OOM decision can cause problems.

I know that there is a trade-off. But your logic makes me worry that
we lose too much accuracy for the sake of deterministic behaviour.

>  
> > Think about following situation that we are trying to find order-2
> > freepage and some subsystem has order-2 freepage. It can be freed by
> > shrink_slab(). Your logic doesn't guarantee that shrink_slab() is
> > invoked to free this order-2 freepage in that subsystem. OOM would be
> > triggered when compaction fails even if there is a order-2 freeable
> > page. I think that if decision is made before whole lru list is
> > scanned and then shrink_slab() is invoked for whole freeable objects,
> > it would cause pre-mature OOM.
> 
> I do not see why we would need to scan through the whole LRU list when
> we are under a high order pressure. It is true, though, that slab
> shrinkers can and should be more sensitive to the requested order to
> help release higher order pages preferably.
> 
> > It seems that you already knows this issue [2].
> > 
> > 2) 'OOM detection rework' depends on compaction too much. Compaction
> > algorithm is racy and has some limitation. It's failure doesn't mean we
> > are in OOM situation.
> 
> As long as this is the only reliable source of higher order pages then
> we do not have any other choice in order to have deterministic behavior.
> 
> > Even if Vlastimil's patchset and mine is
> > applied, it is still possible that compaction scanner cannot find enough
> > freepage due to race condition and return pre-mature failure. To
> > reduce this race effect, I hope to give more chances to retry even if
> > full compaction is failed.
> 
> Than we can improve compaction_failed() heuristic and do not call it the
> end of the day after a single attempt to get a high order page after
> scanning the whole memory. But to me this all sounds like an internal
> implementation detail of the compaction and the OOM detection in the
> page allocator should be as much independent on it as possible - same as
> it is independent on the internal reclaim decisions. That was the whole
> point of my rework. To actually melt "do something as long as at least a
> single page is reclaimed" into an actual algorithm which can be measured
> and reason about.

As you said before, your logic cannot be independent of this
feedback.

 "I fully rely on the reclaim and compaction feedback"

Your logic needs to consider implementation details.

> 
> > We can remove this heuristic when we make sure that compaction is
> > stable enough.
> 
> How do we know that, though, if we do not rely on it? Artificial tests
> do not exhibit those corner cases. I was bashing my testing systems to
> cause as much fragmentation as possible, yet I wasn't able to trigger
> issues reported recently by real world workloads. Do not take me wrong,
> I understand your concerns but OOM detection will never be perfect. We
> can easily get to one or other extremes. We should strive to make it
> work in most workloads. So far it seems that there were no regressions
> for order-0 pressure and we can improve compaction to cover higher
> orders. I am willing to reconsider this after we hit a cliff where we

As I said before, I fully agree that your work will work well for
order-0 pressure.

> cannot do much more in the compaction proper and still hit pre-mature
> oom killer invocations in not-so-insane workloads, though.

If you understand my concerns, wouldn't it be better to prevent the
known possible problems in advance? You cannot know every real
workload in the world. Your logic has some limitations, at least
theoretically, and has already caused a lot of regressions. Why do you
keep insisting on "let's wait for other reports from real workloads"?
A bug may be reported a long time later, when it is no longer an
appropriate time to fix the issue.

> 
> I believe that Vlastimil's patches show the path to go longterm. Get rid
> of the latency heuristics for allocations where that matters in the
> first step. Then try to squeeze as much for reliability for !costly
> orders as possible.
> 
> I also believe that these issues will be less of the problem once we
> switch to vmalloc stacks because this is the primary source of high
> order allocations these days. Most others are more an optimization than
> a reliability thing.

Even if the vmalloc stack patches are applied, there are other
cases. For example, ARM uses an order-2 allocation for the page table
(pgd) allocation. That is just one allocation per process rather than
per thread, so it is less frequent, but it has caused the problem in
our system before.

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-23 19:08     ` Linus Torvalds
@ 2016-08-24  6:32       ` Michal Hocko
  0 siblings, 0 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-24  6:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Joonsoo Kim, Andrew Morton, greg, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Tue 23-08-16 15:08:05, Linus Torvalds wrote:
> On Tue, Aug 23, 2016 at 3:33 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> > order workloads that calling any change in that behavior a regression
> > is little bit exaggerated.
> 
> Well, the thread info allocations certainly haven't been big problems
> before. So regressing those would seem to be a real regression.
> 
> What happened? We've done the order-2 allocation for the stack since
> May 2014, so that isn't new. Did we cut off retries for low orders?

Yes, with the original implementation the number of reclaim retries
was basically unbounded as long as we made any reclaim progress. This
has changed to a bounded process. Without compaction this means that
we used to reclaim until an order-2 page was formed.

> So I would not say that it's an exaggeration to say that order-2
> allocations failing is a regression.

I would agree with you with COMPACTION enabled, but with compaction
disabled, which should really be limited to !MMU configurations, I
think there is not much we can do. Well, we could simply retry forever
without invoking the OOM killer for higher-order requests with this
config option and rely on order-0 to hit the OOM. Do we want that,
though? I do not remember anybody with !MMU complaining. Markus had
COMPACTION disabled accidentally.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-24  5:01     ` Joonsoo Kim
@ 2016-08-24  7:04       ` Michal Hocko
  2016-08-24  7:29         ` Joonsoo Kim
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-24  7:04 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, greg, Linus Torvalds, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Vlastimil Babka, linux-mm, LKML

On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
> Looks like my mail client eat my reply so I resend.
> 
> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
> > [...]
> > > Hello, Michal.
> > > 
> > > I agree with partial revert but revert should be a different form.
> > > Below change try to reuse should_compact_retry() version for
> > > !CONFIG_COMPACTION but it turned out that it also causes regression in
> > > Markus report [1].
> > 
> > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
> > order workloads that calling any change in that behavior a regression
> > is little bit exaggerated. Disabling compaction should have a very
> > strong reason. I haven't heard any so far. I am even wondering whether
> > there is a legitimate reason for that these days.
> > 
> > > Theoretical reason for this regression is that it would stop retry
> > > even if there are enough lru pages. It only checks if freepage
> > > excesses min watermark or not for retry decision. To prevent
> > > pre-mature OOM killer, we need to keep allocation loop when there are
> > > enough lru pages. So, logic should be something like that.
> > > 
> > > should_compact_retry()
> > > {
> > >         for_each_zone_zonelist_nodemask {
> > >                 available = zone_reclaimable_pages(zone);
> > >                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> > >                 if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> > >                         ac_classzone_idx(ac), alloc_flags, available))
> > >                         return true;
> > > 
> > >         }
> > > }
> > > 
> > > I suggested it before and current situation looks like it is indeed
> > > needed.
> > 
> > this just opens doors for an unbounded reclaim/threshing becacause
> > you can reclaim as much as you like and there is no guarantee of a
> > forward progress. The reason why !COMPACTION should_compact_retry only
> > checks for the min_wmark without the reclaimable bias is that this will
> > guarantee a retry if we are failing due to high order wmark check rather
> > than a lack of memory. This condition is guaranteed to converge and the
> > probability of the unbounded reclaim is much more reduced.
> 
> In case of a lack of memory with a lot of reclaimable lru pages, why 
> do we stop reclaim/compaction?
> 
> With your partial reverting patch, allocation logic would be like as
> following.
> 
> Assume following situation:
> o a lot of reclaimable lru pages
> o no order-2 freepage
> o not enough order-0 freepage for min watermark
> o order-2 allocation
> 
> 1. order-2 allocation failed due to min watermark
> 2. go to reclaim/compaction
> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
> min watermark isn't met for order-0
> 4. compaction is skipped due to not enough freepage
> 5. should_reclaim_retry() returns false because min watermark for
> order-2 page isn't met
> 6. should_compact_retry() returns false because min watermark for
> order-0 page isn't met
> 7. allocation fails without any retry and the OOM killer is invoked.

If the direct reclaim is not able to get us over min wmark for order-0
then we would be likely to hit the oom even for order-0 requests.

> Is it what you want?
> 
> And, please elaborate more on how your logic guarantee to converge.
> After order-0 freepage exceed min watermark, there is no way to stop
> reclaim/threshing. Number of freepage just increase monotonically and
> retry cannot be stopped until order-2 allocation succeed. Am I missing
> something?

My statement was imprecise at best. You are right that there is no
guarantee of fulfilling an order-2 request. What I meant to say is
that we should converge when we are getting out of memory (aka even
order-0 would have a hard time succeeding). should_reclaim_retry does
that by the back-off scaling of the reclaimable pages.
should_compact_retry would have to do the same thing, which would
effectively turn it into should_reclaim_retry.

> > > And, I still think that your OOM detection rework has some flaws.
> > >
> > > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> > > There are many subsystems that cache many objects and they will be
> > > freed by shrink_slab() interface. But, you don't account them when
> > > making the OOM decision.
> > 
> > I fully rely on the reclaim and compaction feedback. And that is the
> > place where we should strive for improvements. So if we are growing way
> > too many slab objects we should take care about that in the slab reclaim
> > which is tightly coupled with the LRU reclaim rather than up the layer
> > in the page allocator.
> 
> No. slab shrink logic which is tightly coupled with the LRU reclaim
> totally makes sense.

Once the number of slab objects is much larger than the number of LRU
pages (which we have seen in some oom reports), the way they are
coupled just stops making sense because the current approach no longer
scales. We might not have cared before because we used to retry
blindly. At least that is my understanding.

I am sorry to skip large parts of your email but I believe those
things have been discussed and we would just repeat ourselves here. I
fully understand there are some disagreements between our views, but I
still maintain that, as long as we can handle not-so-crazy workloads,
I prefer determinism over blind retrying. It is to be expected that
there will be some regressions. It would be just too ideal to replace
one heuristic by another and expect nobody to notice. But as long as
we are able to fix those issues without adding hacks on top of hacks,
I think it is worth pursuing this path. And so far the compaction
changes which helped to cover the recent regressions are not hacks but
rather a long-term way to move it from best-effort to reliable
behavior. As I've said before, if this proves to be insufficient then
I will definitely not insist on the current approach and will replace
the compaction feedback by something else. I do not have much idea by
what because, yet again, this is a heuristic and there is clearly no
right thing to do (tm).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-24  7:04       ` Michal Hocko
@ 2016-08-24  7:29         ` Joonsoo Kim
  0 siblings, 0 replies; 35+ messages in thread
From: Joonsoo Kim @ 2016-08-24  7:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Joonsoo Kim, Andrew Morton, greg, Linus Torvalds,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Vlastimil Babka,
	Linux Memory Management List, LKML

2016-08-24 16:04 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> On Wed 24-08-16 14:01:57, Joonsoo Kim wrote:
>> Looks like my mail client eat my reply so I resend.
>>
>> On Tue, Aug 23, 2016 at 09:33:18AM +0200, Michal Hocko wrote:
>> > On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
>> > [...]
>> > > Hello, Michal.
>> > >
>> > > I agree with partial revert but revert should be a different form.
>> > > Below change try to reuse should_compact_retry() version for
>> > > !CONFIG_COMPACTION but it turned out that it also causes regression in
>> > > Markus report [1].
>> >
>> > I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
>> > order workloads that calling any change in that behavior a regression
>> > is little bit exaggerated. Disabling compaction should have a very
>> > strong reason. I haven't heard any so far. I am even wondering whether
>> > there is a legitimate reason for that these days.
>> >
>> > > Theoretical reason for this regression is that it would stop retry
>> > > even if there are enough lru pages. It only checks if freepage
>> > > excesses min watermark or not for retry decision. To prevent
>> > > pre-mature OOM killer, we need to keep allocation loop when there are
>> > > enough lru pages. So, logic should be something like that.
>> > >
>> > > should_compact_retry()
>> > > {
>> > >         for_each_zone_zonelist_nodemask {
>> > >                 available = zone_reclaimable_pages(zone);
>> > >                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>> > >                 if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
>> > >                         ac_classzone_idx(ac), alloc_flags, available))
>> > >                         return true;
>> > >
>> > >         }
>> > > }
>> > >
>> > > I suggested it before and current situation looks like it is indeed
>> > > needed.
>> >
>> > this just opens doors for an unbounded reclaim/threshing becacause
>> > you can reclaim as much as you like and there is no guarantee of a
>> > forward progress. The reason why !COMPACTION should_compact_retry only
>> > checks for the min_wmark without the reclaimable bias is that this will
>> > guarantee a retry if we are failing due to high order wmark check rather
>> > than a lack of memory. This condition is guaranteed to converge and the
>> > probability of the unbounded reclaim is much more reduced.
>>
>> In case of a lack of memory with a lot of reclaimable lru pages, why
>> do we stop reclaim/compaction?
>>
>> With your partial reverting patch, allocation logic would be like as
>> following.
>>
>> Assume following situation:
>> o a lot of reclaimable lru pages
>> o no order-2 freepage
>> o not enough order-0 freepage for min watermark
>> o order-2 allocation
>>
>> 1. order-2 allocation failed due to min watermark
>> 2. go to reclaim/compaction
>> 3. reclaim some pages (maybe SWAP_CLUSTER_MAX (32) pages) but still
>> min watermark isn't met for order-0
>> 4. compaction is skipped due to not enough freepage
>> 5. should_reclaim_retry() returns false because min watermark for
>> order-2 page isn't met
>> 6. should_compact_retry() returns false because min watermark for
>> order-0 page isn't met
>> 7. allocation fails without any retry and the OOM killer is invoked.
>
> If the direct reclaim is not able to get us over min wmark for order-0
> then we would be likely to hit the oom even for order-0 requests.

No, the situation here is that direct reclaim can get us over the min
wmark for order-0, but it needs retries. IIUC, direct reclaim does not
reclaim enough memory at once. It tries to reclaim a small number of
lru pages and breaks out to check the watermark.

>> Is it what you want?
>>
>> And, please elaborate more on how your logic guarantee to converge.
>> After order-0 freepage exceed min watermark, there is no way to stop
>> reclaim/threshing. Number of freepage just increase monotonically and
>> retry cannot be stopped until order-2 allocation succeed. Am I missing
>> something?
>
> My statement was imprecise at best. You are right that there is no
> guarantee to fullfil order-2 request. What I meant to say is that we
> should converge when we are getting out of memory (aka even order-0
> would have hard time to succeed). should_reclaim_retry does that by
> the back off scaling of the reclaimable pages. should_compact_retry
> would have to do the same thing which would effectively turn it into
> should_reclaim_retry.

That is why I suggested changing should_reclaim_retry() for high-order
requests before.

>> > > And, I still think that your OOM detection rework has some flaws.
>> > >
>> > > 1) It doesn't consider freeable objects that can be freed by shrink_slab().
>> > > There are many subsystems that cache many objects and they will be
>> > > freed by shrink_slab() interface. But, you don't account them when
>> > > making the OOM decision.
>> >
>> > I fully rely on the reclaim and compaction feedback. And that is the
>> > place where we should strive for improvements. So if we are growing way
>> > too many slab objects we should take care about that in the slab reclaim
>> > which is tightly coupled with the LRU reclaim rather than up the layer
>> > in the page allocator.
>>
>> No. slab shrink logic which is tightly coupled with the LRU reclaim
>> totally makes sense.
>
> Once the number of slab object is much larger than LRU pages (what we
> have seen in some oom reports) then the way how they are coupled just
> stops making a sense because the current approach no longer scales.  We
> might not have cared before because we used to retry blindly.  At least
> that is my understanding.

If your logic guaranteed retrying until the whole number of lru pages
has been scanned, it would work well. It's not a problem of the slab
shrink.

> I am sorry to skip large parts of your email but I believe those things
> have been discussed and we would just repeat here. I full understand

Okay. We discussed it several times and I'm also tired of discussing this topic.

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-23  7:43             ` Michal Hocko
@ 2016-08-25  7:11               ` Michal Hocko
  2016-08-25  7:17                 ` Olaf Hering
  2016-08-28  5:50                 ` Arkadiusz Miskiewicz
  2016-08-25 20:30               ` Ralf-Peter Rohbeck
  1 sibling, 2 replies; 35+ messages in thread
From: Michal Hocko @ 2016-08-25  7:11 UTC (permalink / raw)
  To: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering
  Cc: Greg KH, Linus Torvalds, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > improvements this late then I will ask to merge the partial revert to
> > > Linus tree as well and then there is not much to discuss.
> > 
> > This sounds like the prudent option.  Can we get 4.8 working
> > well-enough, backport that into 4.7.x and worry about the fancier stuff
> > for 4.9?
> 
> OK, fair enough.
> 
> I would really appreciate if the original reporters could retest with
> this patch on top of the current Linus tree.

Any luck with the testing of this patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-25  7:11               ` Michal Hocko
@ 2016-08-25  7:17                 ` Olaf Hering
  2016-08-29 14:52                   ` Olaf Hering
  2016-08-28  5:50                 ` Arkadiusz Miskiewicz
  1 sibling, 1 reply; 35+ messages in thread
From: Olaf Hering @ 2016-08-25  7:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Thu, Aug 25, Michal Hocko wrote:

> Any luck with the testing of this patch?

Not this week, sorry.

Olaf


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-23  7:43             ` Michal Hocko
  2016-08-25  7:11               ` Michal Hocko
@ 2016-08-25 20:30               ` Ralf-Peter Rohbeck
  2016-08-26  6:26                 ` Michal Hocko
  1 sibling, 1 reply; 35+ messages in thread
From: Ralf-Peter Rohbeck @ 2016-08-25 20:30 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Jiri Slaby, Olaf Hering
  Cc: Greg KH, Linus Torvalds, Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On 23.08.2016 00:43, Michal Hocko wrote:
> OK, fair enough.
> I would really appreciate if the original reporters could retest with
> this patch on top of the current Linus tree. The stable backport posted
> earlier doesn't apply on the current master cleanly but the change is
> essentially same. mmotm tree then can revert this patch before Vlastimil
> series is applied because that code is touching the currently removed
> code.
> ---
>  From 90b6b282bede7966fb6c830a6d012d2239ac40e4 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Mon, 22 Aug 2016 10:52:06 +0200
> Subject: [PATCH] mm, oom: prevent pre-mature OOM killer invocation for high
>   order request
>
> There have been several reports about pre-mature OOM killer invocation
> in 4.7 kernel when order-2 allocation request (for the kernel stack)
> invoked OOM killer even during basic workloads (light IO or even kernel
> compile on some filesystems). In all reported cases the memory is
> fragmented and there are no order-2+ pages available. There is usually
> a large amount of slab memory (usually dentries/inodes) and further
> debugging has shown that there are way too many unmovable blocks which
> are skipped during the compaction. Multiple reporters have confirmed that
> the current linux-next which includes [1] and [2] helped and OOMs are
> not reproducible anymore.
>
> A simpler fix for the late rc and stable is to simply ignore the
> compaction feedback and retry as long as there is a reclaim progress
> and we are not getting OOM for order-0 pages. We already do that for
> CONFIG_COMPACTION=n so let's reuse the same code when compaction is
> enabled as well.
>
> [1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
> [2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz
>
> Fixes: 0a0337e0d1d1 ("mm, oom: rework oom detection")
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   mm/page_alloc.c | 51 ++-------------------------------------------------
>   1 file changed, 2 insertions(+), 49 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3fbe73a6fe4b..7791a03f8deb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3137,54 +3137,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   	return NULL;
>   }
>   
> -static inline bool
> -should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> -		     enum compact_result compact_result,
> -		     enum compact_priority *compact_priority,
> -		     int compaction_retries)
> -{
> -	int max_retries = MAX_COMPACT_RETRIES;
> -
> -	if (!order)
> -		return false;
> -
> -	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by insufficient priority
> -	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*compact_priority > MIN_COMPACT_PRIORITY) {
> -			(*compact_priority)--;
> -			return true;
> -		}
> -		return false;
> -	}
> -
> -	/*
> -	 * make sure the compaction wasn't deferred or didn't bail out early
> -	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> -	 */
> -	if (compaction_withdrawn(compact_result))
> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
> -
> -	/*
> -	 * !costly requests are much more important than __GFP_REPEAT
> -	 * costly ones because they are de facto nofail and invoke OOM
> -	 * killer to move on while costly can fail and users are ready
> -	 * to cope with that. 1/4 retries is rather arbitrary but we
> -	 * would need much more detailed feedback from compaction to
> -	 * make a better decision.
> -	 */
> -	if (order > PAGE_ALLOC_COSTLY_ORDER)
> -		max_retries /= 4;
> -	if (compaction_retries <= max_retries)
> -		return true;
> -
> -	return false;
> -}
>   #else
>   static inline struct page *
>   __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3195,6 +3147,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>   	return NULL;
>   }
>   
> +#endif /* CONFIG_COMPACTION */
> +
>   static inline bool
>   should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
>   		     enum compact_result compact_result,
> @@ -3221,7 +3175,6 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
>   	}
>   	return false;
>   }
> -#endif /* CONFIG_COMPACTION */
>   
>   /* Perform direct synchronous page reclaim */
>   static int

This worked for me for about 12 hours of my torture test. Logs are at 
https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.


Ralf-Peter


----------------------------------------------------------------------
The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-25 20:30               ` Ralf-Peter Rohbeck
@ 2016-08-26  6:26                 ` Michal Hocko
  2016-08-26 20:17                   ` Ralf-Peter Rohbeck
  0 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-26  6:26 UTC (permalink / raw)
  To: Ralf-Peter Rohbeck
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Jiri Slaby, Olaf Hering, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
[...]
> This worked for me for about 12 hours of my torture test. Logs are at
> https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.

Thanks! Can we add your
Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>

to the patch?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-26  6:26                 ` Michal Hocko
@ 2016-08-26 20:17                   ` Ralf-Peter Rohbeck
  0 siblings, 0 replies; 35+ messages in thread
From: Ralf-Peter Rohbeck @ 2016-08-26 20:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Jiri Slaby, Olaf Hering, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On 25.08.2016 23:26, Michal Hocko wrote:
> On Thu 25-08-16 13:30:23, Ralf-Peter Rohbeck wrote:
> [...]
>> This worked for me for about 12 hours of my torture test. Logs are at
>> https://filebin.net/2rfah407nbhzs69e/OOM_4.8.0-rc2_p1.tar.bz2.
> Thanks! Can we add your
> Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
>
> to the patch?

Sure.


Ralf-Peter




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-25  7:11               ` Michal Hocko
  2016-08-25  7:17                 ` Olaf Hering
@ 2016-08-28  5:50                 ` Arkadiusz Miskiewicz
  1 sibling, 0 replies; 35+ messages in thread
From: Arkadiusz Miskiewicz @ 2016-08-28  5:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Thursday 25 of August 2016, Michal Hocko wrote:
> On Tue 23-08-16 09:43:39, Michal Hocko wrote:
> > On Mon 22-08-16 15:05:17, Andrew Morton wrote:
> > > On Mon, 22 Aug 2016 15:42:28 +0200 Michal Hocko <mhocko@kernel.org> 
wrote:
> > > > Of course, if Linus/Andrew doesn't like to take those compaction
> > > > improvements this late then I will ask to merge the partial revert to
> > > > Linus tree as well and then there is not much to discuss.
> > > 
> > > This sounds like the prudent option.  Can we get 4.8 working
> > > well-enough, backport that into 4.7.x and worry about the fancier stuff
> > > for 4.9?
> > 
> > OK, fair enough.
> > 
> > I would really appreciate if the original reporters could retest with
> > this patch on top of the current Linus tree.
> 
> Any luck with the testing of this patch?

Here my "rm -rf && cp -al" 10x in parallel test finished without OOM, so

Tested-by: Arkadiusz Miśkiewicz <arekm@maven.pl>

-- 
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-25  7:17                 ` Olaf Hering
@ 2016-08-29 14:52                   ` Olaf Hering
  2016-08-29 14:54                     ` Olaf Hering
                                       ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Olaf Hering @ 2016-08-29 14:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6115 bytes --]

On Thu, Aug 25, Olaf Hering wrote:

> On Thu, Aug 25, Michal Hocko wrote:
> 
> > Any luck with the testing of this patch?

I ran rc3 for a few hours on Friday and Firefox was not killed.
Now rc3 is running for a day with the usual workload and Firefox is
still running.

Today I noticed the nfsserver was disabled, probably for months already.
Starting it gives an OOM; not sure if this is new with 4.7+.
Full dmesg attached.


[    0.000000] Linux version 4.8.0-rc3-3.bug994066-default (geeko@buildhost) (gcc version 6.1.1 20160815 [gcc-6-branch revision 239479] (SUSE Linux) ) #1 SMP PREEMPT Mon Aug 22 14:52:18 UTC 2016 (c0d2ef5)

[64378.582489] tun: Universal TUN/TAP device driver, 1.6
[64378.582493] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
[93347.645123] RPC: Registered named UNIX socket transport module.
[93347.645128] RPC: Registered udp transport module.
[93347.645130] RPC: Registered tcp transport module.
[93347.645132] RPC: Registered tcp NFSv4.1 backchannel transport module.
[93348.227828] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[93348.306369] modprobe: page allocation failure: order:4, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
[93348.306379] CPU: 2 PID: 30467 Comm: modprobe Not tainted 4.8.0-rc3-3.bug994066-default #1
[93348.306382] Hardware name: Hewlett-Packard HP ProBook 6555b/1455, BIOS 68DTM Ver. F.21 06/14/2012
[93348.306386]  0000000000000000 ffffffff813a2952 0000000000000004 ffff88003fb6ba30
[93348.306394]  ffffffff81198a4b 026040c00000000f 026040c000000001 ffff88003fb6c000
[93348.306400]  0000000000000004 ffff88003fb6baac 00000000026040c0 0000000000000040
[93348.306406] Call Trace:
[93348.306437]  [<ffffffff8102eefe>] dump_trace+0x5e/0x310
[93348.306449]  [<ffffffff8102f2cb>] show_stack_log_lvl+0x11b/0x1a0
[93348.306459]  [<ffffffff81030001>] show_stack+0x21/0x40
[93348.306468]  [<ffffffff813a2952>] dump_stack+0x5c/0x7a
[93348.306478]  [<ffffffff81198a4b>] warn_alloc_failed+0xdb/0x150
[93348.306490]  [<ffffffff81198cef>] __alloc_pages_slowpath+0x1af/0xa10
[93348.306501]  [<ffffffff811997a0>] __alloc_pages_nodemask+0x250/0x290
[93348.306511]  [<ffffffff811f1c3d>] cache_grow_begin+0x8d/0x540
[93348.306520]  [<ffffffff811f23d1>] fallback_alloc+0x161/0x200
[93348.306530]  [<ffffffff811f43f2>] __kmalloc+0x1d2/0x570
[93348.306589]  [<ffffffffa08f025a>] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
[93348.306649]  [<ffffffffa093f1b6>] init_nfsd+0x56/0xea0 [nfsd]
[93348.306664]  [<ffffffff8100218b>] do_one_initcall+0x4b/0x180
[93348.306674]  [<ffffffff8118e119>] do_init_module+0x5b/0x1fe
[93348.306684]  [<ffffffff81105395>] load_module+0x1a75/0x1d00
[93348.306695]  [<ffffffff81105804>] SYSC_finit_module+0xa4/0xe0
[93348.306705]  [<ffffffff816d2cb6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[93348.313626] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xa8

[93348.313629] Leftover inexact backtrace:

[93348.313691] Mem-Info:
[93348.313704] active_anon:467209 inactive_anon:125491 isolated_anon:0
                active_file:264880 inactive_file:166389 isolated_file:0
                unevictable:8 dirty:250 writeback:0 unstable:0
                slab_reclaimable:796425 slab_unreclaimable:34803
                mapped:54783 shmem:24119 pagetables:9083 bounce:0
                free:51321 free_pcp:68 free_cma:0
[93348.313717] Node 0 active_anon:1868836kB inactive_anon:501964kB active_file:1059520kB inactive_file:665556kB unevictable:32kB isolated(anon):0kB isolated(file):0kB mapped:219132kB dirty:1000kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 749568kB anon_thp: 96476kB writeback_tmp:0kB unstable:0kB pages_scanned:24 all_unreclaimable? no
[93348.313719] Node 0 DMA free:15908kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[93348.313729] lowmem_reserve[]: 0 2626 7621 7621 7621
[93348.313745] Node 0 DMA32 free:133192kB min:23244kB low:29052kB high:34860kB active_anon:642152kB inactive_anon:119848kB active_file:257900kB inactive_file:116560kB unevictable:0kB writepending:292kB present:2847412kB managed:2766832kB mlocked:0kB slab_reclaimable:1418576kB slab_unreclaimable:39004kB kernel_stack:256kB pagetables:1448kB bounce:0kB free_pcp:128kB local_pcp:0kB free_cma:0kB
[93348.313755] lowmem_reserve[]: 0 0 4994 4994 4994
[93348.313762] Node 0 Normal free:56184kB min:44200kB low:55248kB high:66296kB active_anon:1226576kB inactive_anon:382200kB active_file:801508kB inactive_file:548992kB unevictable:32kB writepending:536kB present:5242880kB managed:5114880kB mlocked:32kB slab_reclaimable:1767124kB slab_unreclaimable:100208kB kernel_stack:9104kB pagetables:34884kB bounce:0kB free_pcp:144kB local_pcp:0kB free_cma:0kB
[93348.313771] lowmem_reserve[]: 0 0 0 0 0
[93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133372kB
[93348.313822] Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB
[93348.313843] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[93348.313846] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[93348.313848] 457622 total pagecache pages
[93348.313850] 2194 pages in swap cache
[93348.313853] Swap cache stats: add 60025, delete 57831, find 17283/19516
[93348.313854] Free swap  = 8170356kB
[93348.313856] Total swap = 8384508kB
[93348.313858] 2026571 pages RAM
[93348.313859] 0 pages HighMem/MovableOnly
[93348.313860] 52166 pages reserved
[93348.313861] 0 pages hwpoisoned
[93348.313865] nfsd: failed to allocate reply cache

Olaf

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-29 14:52                   ` Olaf Hering
@ 2016-08-29 14:54                     ` Olaf Hering
  2016-08-29 15:07                     ` Michal Hocko
  2016-08-29 17:28                     ` Linus Torvalds
  2 siblings, 0 replies; 35+ messages in thread
From: Olaf Hering @ 2016-08-29 14:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML


[-- Attachment #1.1: Type: text/plain, Size: 66 bytes --]

On Mon, Aug 29, Olaf Hering wrote:

> Full dmesg attached.

Now..

[-- Attachment #1.2: dmesg-4.8.0-rc3-3.bug994066-default.txt.gz --]
[-- Type: application/x-gzip, Size: 22707 bytes --]

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-29 14:52                   ` Olaf Hering
  2016-08-29 14:54                     ` Olaf Hering
@ 2016-08-29 15:07                     ` Michal Hocko
  2016-08-29 15:59                       ` Olaf Hering
  2016-08-29 17:28                     ` Linus Torvalds
  2 siblings, 1 reply; 35+ messages in thread
From: Michal Hocko @ 2016-08-29 15:07 UTC (permalink / raw)
  To: Olaf Hering
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> On Thu, Aug 25, Olaf Hering wrote:
> 
> > On Thu, Aug 25, Michal Hocko wrote:
> > 
> > > Any luck with the testing of this patch?
> 
> I ran rc3 for a few hours on Friday amd FireFox was not killed.
> Now rc3 is running for a day with the usual workload and FireFox is
> still running.

Is the patch
(http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz) applied?

> Today I noticed the nfsserver was disabled, probably since months already.
> Starting it gives a OOM, not sure if this is new with 4.7+.
> Full dmesg attached.
> [93348.306369] modprobe: page allocation failure: order:4, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)

ok so order-4 (COSTLY allocation) has failed because

[...]
> [93348.313778] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> [93348.313803] Node 0 DMA32: 13633*4kB (UME) 8035*8kB (UME) 890*16kB (UME) 10*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 133372kB
> [93348.313822] Node 0 Normal: 14003*4kB (UME) 25*8kB (UME) 2*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56244kB

the memory is too fragmented for such a large allocation. Failing
order-4 requests is not so severe because we do not invoke the oom
killer if they fail. Especially without __GFP_REPEAT we do not even try
too hard. Recent oom detection changes shouldn't change this behavior.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-29 15:07                     ` Michal Hocko
@ 2016-08-29 15:59                       ` Olaf Hering
  0 siblings, 0 replies; 35+ messages in thread
From: Olaf Hering @ 2016-08-29 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Greg KH, Linus Torvalds,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 387 bytes --]

On Mon, Aug 29, Michal Hocko wrote:

> On Mon 29-08-16 16:52:03, Olaf Hering wrote:
> > I ran rc3 for a few hours on Friday amd FireFox was not killed.
> > Now rc3 is running for a day with the usual workload and FireFox is
> > still running.
> Is the patch
> (http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz) applied?

Yes.

Tested-by: Olaf Hering <olaf@aepfle.de>

Olaf

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-29 14:52                   ` Olaf Hering
  2016-08-29 14:54                     ` Olaf Hering
  2016-08-29 15:07                     ` Michal Hocko
@ 2016-08-29 17:28                     ` Linus Torvalds
  2016-08-29 17:52                       ` Jeff Layton
  2 siblings, 1 reply; 35+ messages in thread
From: Linus Torvalds @ 2016-08-29 17:28 UTC (permalink / raw)
  To: Olaf Hering, Bruce Fields, Jeff Layton
  Cc: Michal Hocko, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby, Greg KH,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML,
	Linux NFS Mailing List

On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering <olaf@aepfle.de> wrote:
>
> Today I noticed the nfsserver was disabled, probably since months already.
> Starting it gives a OOM, not sure if this is new with 4.7+.

That's not an oom, that's just an allocation failure.

And with order-4, that's actually pretty normal. Nobody should use
order-4 (that's 16 contiguous pages, fragmentation can easily make
that hard - *much* harder than the small order-1 or order-2 cases that
we should largely be able to rely on).

In fact, people who do multi-order allocations should always have a
fallback, and use __GFP_NOWARN.

> [93348.306406] Call Trace:
> [93348.306490]  [<ffffffff81198cef>] __alloc_pages_slowpath+0x1af/0xa10
> [93348.306501]  [<ffffffff811997a0>] __alloc_pages_nodemask+0x250/0x290
> [93348.306511]  [<ffffffff811f1c3d>] cache_grow_begin+0x8d/0x540
> [93348.306520]  [<ffffffff811f23d1>] fallback_alloc+0x161/0x200
> [93348.306530]  [<ffffffff811f43f2>] __kmalloc+0x1d2/0x570
> [93348.306589]  [<ffffffffa08f025a>] nfsd_reply_cache_init+0xaa/0x110 [nfsd]

Hmm. That's kmalloc itself falling back after already failing to grow
the slab cache earlier (the earlier allocations *were* done with
NOWARN afaik).

It does look like nfsd starts out by allocating the hash table with one
single fairly big allocation, and has no fallback position.

I suspect the code expects to be started at boot time, when this just
isn't an issue. The fact that you loaded the nfsd kernel module with
memory already fragmented after heavy use is likely why nobody else
has seen this.

Adding the nfsd people to the cc, because just from a robustness
standpoint I suspect it would be better if the code did something like

 (a) shrink the hash table if the allocation fails (we've got some
examples of that elsewhere)

or

 (b) fall back on a vmalloc allocation (that's certainly the simpler model)

We do have a "kvfree()" helper function for the "free either a kmalloc
or vmalloc allocation" but we don't actually have a good helper
pattern for the allocation side. People just do it by hand, at least
partly because we have so many different ways to allocate things -
zeroing, non-zeroing, node-specific or not, atomic or not (atomic
cannot fall back to vmalloc, obviously) etc etc.

Bruce, Jeff, comments?

             Linus


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: OOM detection regressions since 4.7
  2016-08-29 17:28                     ` Linus Torvalds
@ 2016-08-29 17:52                       ` Jeff Layton
  0 siblings, 0 replies; 35+ messages in thread
From: Jeff Layton @ 2016-08-29 17:52 UTC (permalink / raw)
  To: Linus Torvalds, Olaf Hering, Bruce Fields
  Cc: Michal Hocko, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby, Greg KH,
	Vlastimil Babka, Joonsoo Kim, linux-mm, LKML,
	Linux NFS Mailing List

On Mon, 2016-08-29 at 10:28 -0700, Linus Torvalds wrote:
> > On Mon, Aug 29, 2016 at 7:52 AM, Olaf Hering <olaf@aepfle.de> wrote:
> > 
> > 
> > Today I noticed the nfsserver was disabled, probably since months already.
> > Starting it gives a OOM, not sure if this is new with 4.7+.
> 
> That's not an oom, that's just an allocation failure.
> 
> And with order-4, that's actually pretty normal. Nobody should use
> order-4 (that's 16 contiguous pages, fragmentation can easily make
> that hard - *much* harder than the small order-1 or order-2 cases that
> we should largely be able to rely on).
> 
> In fact, people who do multi-order allocations should always have a
> fallback, and use __GFP_NOWARN.
> 
> > 
> > [93348.306406] Call Trace:
> > [93348.306490]A A [<ffffffff81198cef>] __alloc_pages_slowpath+0x1af/0xa10
> > [93348.306501]A A [<ffffffff811997a0>] __alloc_pages_nodemask+0x250/0x290
> > [93348.306511]A A [<ffffffff811f1c3d>] cache_grow_begin+0x8d/0x540
> > [93348.306520]A A [<ffffffff811f23d1>] fallback_alloc+0x161/0x200
> > [93348.306530]A A [<ffffffff811f43f2>] __kmalloc+0x1d2/0x570
> > [93348.306589]A A [<ffffffffa08f025a>] nfsd_reply_cache_init+0xaa/0x110 [nfsd]
> 
> Hmm. That's kmalloc itself falling back after already failing to grow
> the slab cache earlier (the earlier allocations *were* done with
> NOWARN afaik).
> 
> It does look like nfsd starts out by allocating the hash table with one
> single fairly big allocation, and has no fallback position.
> 
> I suspect the code expects to be started at boot time, when this just
> isn't an issue. The fact that you loaded the nfsd kernel module with
> memory already fragmented after heavy use is likely why nobody else
> has seen this.
> 
> Adding the nfsd people to the cc, because just from a robustness
> standpoint I suspect it would be better if the code did something like
> 
>  (a) shrink the hash table if the allocation fails (we've got some
> examples of that elsewhere)
> 
> or
> 
>  (b) fall back on a vmalloc allocation (that's certainly the simpler model)
> 
> We do have a "kvfree()" helper function for the "free either a kmalloc
> or vmalloc allocation" but we don't actually have a good helper
> pattern for the allocation side. People just do it by hand, at least
> partly because we have so many different ways to allocate things -
> zeroing, non-zeroing, node-specific or not, atomic or not (atomic
> cannot fall back to vmalloc, obviously) etc etc.
> 
> Bruce, Jeff, comments?
> 
>              Linus

Yeah, that makes total sense.

Hmm...we _do_ already auto-size the hash at init time already, so
shrinking it downward and retrying if the allocation fails wouldn't be
hard to do. Maybe I can just cut it in half and throw a pr_warn to tell
the admin in that case.

In any case...I'll take a look at how we can improve it.

Thanks for the heads-up!
-- 
Jeff Layton <jlayton@poochiereds.net>


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-08-29 17:52 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-22  9:32 OOM detection regressions since 4.7 Michal Hocko
2016-08-22  9:37 ` Michal Hocko
2016-08-22 10:05   ` Greg KH
2016-08-22 10:54     ` Michal Hocko
2016-08-22 13:31       ` Greg KH
2016-08-22 13:42         ` Michal Hocko
2016-08-22 14:02           ` Greg KH
2016-08-22 22:05           ` Andrew Morton
2016-08-23  7:43             ` Michal Hocko
2016-08-25  7:11               ` Michal Hocko
2016-08-25  7:17                 ` Olaf Hering
2016-08-29 14:52                   ` Olaf Hering
2016-08-29 14:54                     ` Olaf Hering
2016-08-29 15:07                     ` Michal Hocko
2016-08-29 15:59                       ` Olaf Hering
2016-08-29 17:28                     ` Linus Torvalds
2016-08-29 17:52                       ` Jeff Layton
2016-08-28  5:50                 ` Arkadiusz Miskiewicz
2016-08-25 20:30               ` Ralf-Peter Rohbeck
2016-08-26  6:26                 ` Michal Hocko
2016-08-26 20:17                   ` Ralf-Peter Rohbeck
2016-08-22 10:16 ` Markus Trippelsdorf
2016-08-22 10:56   ` Michal Hocko
2016-08-22 11:01     ` Markus Trippelsdorf
2016-08-22 11:13       ` Michal Hocko
2016-08-22 11:20         ` Markus Trippelsdorf
2016-08-23  4:52 ` Joonsoo Kim
2016-08-23  7:33   ` Michal Hocko
2016-08-23  7:40     ` Markus Trippelsdorf
2016-08-23  7:48       ` Michal Hocko
2016-08-23 19:08     ` Linus Torvalds
2016-08-24  6:32       ` Michal Hocko
2016-08-24  5:01     ` Joonsoo Kim
2016-08-24  7:04       ` Michal Hocko
2016-08-24  7:29         ` Joonsoo Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).