* [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: David Rientjes @ 2013-11-13  2:02 UTC
  To: Andrew Morton
  Cc: Mel Gorman, Rik van Riel, Johannes Weiner, linux-kernel, linux-mm

The oom killer is only invoked when reclaim has already failed and it
only kills processes if the victim is also oom.  In other words, the oom
killer does not select victims when a process tries to allocate from a
disjoint cpuset or allocate DMA memory, for example.

Therefore, it's pointless for an oom killed process to continue
attempting to reclaim memory in a loop when it has been granted access to
memory reserves.  It can simply return to the page allocator and allocate
memory.
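
As a rough illustration (simplified, and details vary by kernel version),
the reserve access being relied on here comes from the allocator's
gfp_to_alloc_flags() path, roughly:

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		/* an oom kill victim may dip into the memory reserves */
		if (!in_interrupt() &&
		    ((current->flags & PF_MEMALLOC) ||
		     unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}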

If there is a very large number of processes trying to reclaim memory,
the cond_resched() in shrink_slab() becomes troublesome since it always
forces a schedule to other processes also trying to reclaim memory.
Compounded by many reclaim loops, it is possible for a process to sit in
do_try_to_free_pages() for a very long time when reclaim is pointless and
it could allocate if it just returned to the page allocator.

This patch checks if current has been oom killed and, if so, aborts
futile reclaim immediately.  We're not concerned with complete depletion
of memory reserves since there's nothing else we can do.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/vmscan.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2428,6 +2428,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			goto out;
 
 		/*
+		 * If we've been oom killed, reclaim has already failed.  We've
+		 * been given access to memory reserves so that we can allocate
+		 * and quickly die, so just abort futile efforts.
+		 */
+		if (unlikely(test_thread_flag(TIF_MEMDIE)))
+			aborted_reclaim = true;
+
+		/*
 		 * If we're getting trouble reclaiming, start doing
 		 * writepage even in laptop mode.
 		 */

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: Johannes Weiner @ 2013-11-13 15:24 UTC
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Tue, Nov 12, 2013 at 06:02:18PM -0800, David Rientjes wrote:
> The oom killer is only invoked when reclaim has already failed and it
> only kills processes if the victim is also oom.  In other words, the oom
> killer does not select victims when a process tries to allocate from a
> disjoint cpuset or allocate DMA memory, for example.
> 
> Therefore, it's pointless for an oom killed process to continue
> attempting to reclaim memory in a loop when it has been granted access to
> memory reserves.  It can simply return to the page allocator and allocate
> memory.

On the other hand, finishing reclaim of 32 pages should not be a
problem.
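
The 32 pages here is SWAP_CLUSTER_MAX, the batch each direct reclaim
invocation targets; try_to_free_pages() sets up its scan_control roughly
like this simplified sketch:

	struct scan_control sc = {
		.gfp_mask	= gfp_mask,
		.nr_to_reclaim	= SWAP_CLUSTER_MAX,	/* 32 pages */
		.may_writepage	= !laptop_mode,
		.may_unmap	= 1,
		.may_swap	= 1,
		.order		= order,
		.priority	= DEF_PRIORITY,
		.nodemask	= nodemask,
	};

so a single do_try_to_free_pages() call only aims for 32 reclaimed pages
before returning to the allocator.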

> If there is a very large number of processes trying to reclaim memory,
> the cond_resched() in shrink_slab() becomes troublesome since it always
> forces a schedule to other processes also trying to reclaim memory.
> Compounded by many reclaim loops, it is possible for a process to sit in
> do_try_to_free_pages() for a very long time when reclaim is pointless and
> it could allocate if it just returned to the page allocator.

"Very large number of processes"

"sit in do_try_to_free_pages() for a very long time"

Can you quantify this a bit more?

And how common are OOM kills on your setups that you need to optimize
them on this level?

It sounds like your problem could be solved by having cond_resched()
not schedule away from TIF_MEMDIE processes, which would be much
preferable to oom-killed checks in random places.
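
Roughly, that alternative would amount to something like the following
sketch in _cond_resched() (illustrative only, not a tested patch, and the
exact helpers differ between kernel versions):

	int __sched _cond_resched(void)
	{
		/* let an oom kill victim keep the CPU so it can exit */
		if (should_resched() && !test_thread_flag(TIF_MEMDIE)) {
			__cond_resched();
			return 1;
		}
		return 0;
	}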

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: David Rientjes @ 2013-11-13 22:16 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, 13 Nov 2013, Johannes Weiner wrote:

> > The oom killer is only invoked when reclaim has already failed and it
> > only kills processes if the victim is also oom.  In other words, the oom
> > killer does not select victims when a process tries to allocate from a
> > disjoint cpuset or allocate DMA memory, for example.
> > 
> > Therefore, it's pointless for an oom killed process to continue
> > attempting to reclaim memory in a loop when it has been granted access to
> > memory reserves.  It can simply return to the page allocator and allocate
> > memory.
> 
> On the other hand, finishing reclaim of 32 pages should not be a
> problem.
> 

The reclaim will fail, the only reason current has TIF_MEMDIE set is 
because reclaim has completely failed.

> > If there is a very large number of processes trying to reclaim memory,
> > the cond_resched() in shrink_slab() becomes troublesome since it always
> > forces a schedule to other processes also trying to reclaim memory.
> > Compounded by many reclaim loops, it is possible for a process to sit in
> > do_try_to_free_pages() for a very long time when reclaim is pointless and
> > it could allocate if it just returned to the page allocator.
> 
> "Very large number of processes"
> 
> "sit in do_try_to_free_pages() for a very long time"
> 
> Can you quantify this a bit more?
> 

I have seen kernel logs where ~700 processes are stuck in direct reclaim 
simultaneously or scanning the tasklist in the oom killer only to defer 
because it finds a process that has already been oom killed and is stuck in 
do_try_to_free_pages() making very slow progress because of the number of 
processes trying to reclaim.

I haven't quantified how long the oom killed process sits in 
do_try_to_free_pages() as a result of needlessly looping trying to reclaim 
memory that will ultimately fail.

When the kernel oom kills something in a system oom condition, we hope 
that it will exit quickly because otherwise every other memory allocator 
comes to a grinding halt for as long as it takes to free memory.

> And how common are OOM kills on your setups that you need to optimize
> them on this level?
> 

Very common: the sum of our top-level memcg hardlimits exceeds the amount 
of memory on the system, and we encounter system oom conditions as a 
regular event.

> It sounds like your problem could be solved by having cond_resched()
> not schedule away from TIF_MEMDIE processes, which would be much
> preferable to oom-killed checks in random places.
> 

I don't know of any other "random places" other than when the oom killed 
process is sitting in reclaim before it is selected as the victim.  Once 
it returns to the page allocator, it will immediately allocate and then be 
able to handle its pending SIGKILL.  The one spot identified where it is 
absolutely pointless to spin is in reclaim since it is virtually 
guaranteed to fail.  This patch fixes that issue.

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: Johannes Weiner @ 2013-11-14  0:00 UTC
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, Nov 13, 2013 at 02:16:04PM -0800, David Rientjes wrote:
> On Wed, 13 Nov 2013, Johannes Weiner wrote:
> 
> > > The oom killer is only invoked when reclaim has already failed and it
> > > only kills processes if the victim is also oom.  In other words, the oom
> > > killer does not select victims when a process tries to allocate from a
> > > disjoint cpuset or allocate DMA memory, for example.
> > > 
> > > Therefore, it's pointless for an oom killed process to continue
> > > attempting to reclaim memory in a loop when it has been granted access to
> > > memory reserves.  It can simply return to the page allocator and allocate
> > > memory.
> > 
> > On the other hand, finishing reclaim of 32 pages should not be a
> > problem.
> > 
> 
> The reclaim will fail, the only reason current has TIF_MEMDIE set is 
> because reclaim has completely failed.

...for somebody else.

> > > If there is a very large number of processes trying to reclaim memory,
> > > the cond_resched() in shrink_slab() becomes troublesome since it always
> > > forces a schedule to other processes also trying to reclaim memory.
> > > Compounded by many reclaim loops, it is possible for a process to sit in
> > > do_try_to_free_pages() for a very long time when reclaim is pointless and
> > > it could allocate if it just returned to the page allocator.
> > 
> > "Very large number of processes"
> > 
> > "sit in do_try_to_free_pages() for a very long time"
> > 
> > Can you quantify this a bit more?
> > 
> 
> I have seen kernel logs where ~700 processes are stuck in direct reclaim 
> simultaneously or scanning the tasklist in the oom killer only to defer 
> because it finds a process that has already been oom killed and is stuck in 
> do_try_to_free_pages() making very slow progress because of the number of 
> processes trying to reclaim.
>
> I haven't quantified how long the oom killed process sits in 
> do_try_to_free_pages() as a result of needlessly looping trying to reclaim 
> memory that will ultimately fail.
> 
> When the kernel oom kills something in a system oom condition, we hope 
> that it will exit quickly because otherwise every other memory allocator 
> comes to a grinding halt for as long as it takes to free memory.
> 
> > And how common are OOM kills on your setups that you need to optimize
> > them on this level?
> > 
> 
> Very common: the sum of our top-level memcg hardlimits exceeds the amount 
> of memory on the system, and we encounter system oom conditions as a 
> regular event.
> 
> > It sounds like your problem could be solved by having cond_resched()
> > not schedule away from TIF_MEMDIE processes, which would be much
> > preferable to oom-killed checks in random places.
> > 
> 
> I don't know of any other "random places" other than when the oom killed 
> process is sitting in reclaim before it is selected as the victim.  Once 
> it returns to the page allocator, it will immediately allocate and then be 
> able to handle its pending SIGKILL.  The one spot identified where it is 
> absolutely pointless to spin is in reclaim since it is virtually 
> guaranteed to fail.  This patch fixes that issue.

No, this applies to every other operation that does not immediately
lead to the task exiting or which creates more system load.  Readahead
would be another example.  They're all pointless and you could do
without all of them at this point, but I'm not okay with putting these
checks in random places that happen to bother you right now.  It's not
a proper solution to the problem.

Is it a good idea to let ~700 processes simultaneously go into direct
global reclaim?

The victim aborting reclaim still leaves you with ~699 processes
spinning in reclaim that should instead just retry the allocation as
well.  What about them?

The situation your setups seem to get in frequently is bananas, don't
micro optimize this.

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: David Rientjes @ 2013-11-14  0:48 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, 13 Nov 2013, Johannes Weiner wrote:

> > The reclaim will fail, the only reason current has TIF_MEMDIE set is 
> > because reclaim has completely failed.
> 
> ...for somebody else.
> 

That process is in the same allocating context as current, otherwise 
current would not have been selected as a victim.  The oom killer tries to 
only kill processes that will lead to future memory freeing where reclaim 
has failed.

> > I don't know of any other "random places" other than when the oom killed 
> > process is sitting in reclaim before it is selected as the victim.  Once 
> > it returns to the page allocator, it will immediately allocate and then be 
> > able to handle its pending SIGKILL.  The one spot identified where it is 
> > absolutely pointless to spin is in reclaim since it is virtually 
> > guaranteed to fail.  This patch fixes that issue.
> 
> No, this applies to every other operation that does not immediately
> lead to the task exiting or which creates more system load.  Readahead
> would be another example.  They're all pointless and you could do
> without all of them at this point, but I'm not okay with putting these
> checks in random places that happen to bother you right now.  It's not
> a proper solution to the problem.
> 

If you have an alternative solution, please feel free to propose it and 
I'll try it out.

This isn't only about the cond_resched() in shrink_slab(); the reclaim is 
going to fail.  There should be no instances where an oom killed process 
can go and start magically reclaiming memory that would have prevented it 
from becoming oom in the first place.  I have seen the oom killer trigger 
and the victim stall for several seconds before actually allocating memory, 
and that stall is pointless, especially since we're not touching a hotpath 
here; we're already in direct reclaim.

> Is it a good idea to let ~700 processes simultaneously go into direct
> global reclaim?
> 
> The victim aborting reclaim still leaves you with ~699 processes
> spinning in reclaim that should instead just retry the allocation as
> well.  What about them?
> 

Um, no, those processes are going through a repeated loop of direct 
reclaim, calling the oom killer, iterating the tasklist, finding an 
existing oom killed process that has yet to exit, and looping.  They 
wouldn't loop for too long if we can reduce the amount of time that it 
takes for that oom killed process to exit.

> The situation your setups seem to get in frequently is bananas, don't
> micro optimize this.
> 

Unless you propose an alternative solution, this is the patch that fixes 
the problem when an oom killed process gets killed and then stalls for 
seconds before it actually retries allocating memory.

Thanks for your thoughts.

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: Johannes Weiner @ 2013-11-18 16:41 UTC
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, Nov 13, 2013 at 04:48:32PM -0800, David Rientjes wrote:
> On Wed, 13 Nov 2013, Johannes Weiner wrote:
> 
> > > The reclaim will fail, the only reason current has TIF_MEMDIE set is 
> > > because reclaim has completely failed.
> > 
> > ...for somebody else.
> > 
> 
> That process is in the same allocating context as current, otherwise 
> current would not have been selected as a victim.  The oom killer tries to 
> only kill processes that will lead to future memory freeing where reclaim 
> has failed.
> 
> > > I don't know of any other "random places" other than when the oom killed 
> > > process is sitting in reclaim before it is selected as the victim.  Once 
> > > it returns to the page allocator, it will immediately allocate and then be 
> > > able to handle its pending SIGKILL.  The one spot identified where it is 
> > > absolutely pointless to spin is in reclaim since it is virtually 
> > > guaranteed to fail.  This patch fixes that issue.
> > 
> > No, this applies to every other operation that does not immediately
> > lead to the task exiting or which creates more system load.  Readahead
> > would be another example.  They're all pointless and you could do
> > without all of them at this point, but I'm not okay with putting these
> > checks in random places that happen to bother you right now.  It's not
> > a proper solution to the problem.
> > 
> 
> If you have an alternative solution, please feel free to propose it and 
> I'll try it out.
> 
> This isn't only about the cond_resched() in shrink_slab(); the reclaim is 
> going to fail.  There should be no instances where an oom killed process 
> can go and start magically reclaiming memory that would have prevented it 
> from becoming oom in the first place.  I have seen the oom killer trigger 
> and the victim stall for several seconds before actually allocating memory, 
> and that stall is pointless, especially since we're not touching a hotpath 
> here; we're already in direct reclaim.
> 
> > Is it a good idea to let ~700 processes simultaneously go into direct
> > global reclaim?
> > 
> > The victim aborting reclaim still leaves you with ~699 processes
> > spinning in reclaim that should instead just retry the allocation as
> > well.  What about them?
> > 
> 
> Um, no, those processes are going through a repeated loop of direct 
> reclaim, calling the oom killer, iterating the tasklist, finding an 
> existing oom killed process that has yet to exit, and looping.  They 
> wouldn't loop for too long if we can reduce the amount of time that it 
> takes for that oom killed process to exit.

I'm not talking about the big loop in the page allocator.  The victim
is going through the same loop.  This patch is about the victim being
in a pointless direct reclaim cycle when it could be exiting, all I'm
saying is that the other tasks doing direct reclaim at that moment
should also be quitting and retrying the allocation.

Because the victim exiting will put memory on the allocator free
lists, not the LRU lists, it will not allow the other direct
reclaimers to make progress any faster.

> > The situation your setups seem to get in frequently is bananas, don't
> > micro optimize this.
> > 
> 
> Unless you propose an alternative solution, this is the patch that fixes 
> the problem when an oom killed process gets killed and then stalls for 
> seconds before it actually retries allocating memory.

If we have multi-second stalls in direct reclaim then it should be
fixed for all direct reclaimers.  The problem is not only OOM kill
victims getting stuck, it's every direct reclaimer being stuck trying
to do way too much work before retrying the allocation.

Kswapd checks the system state after every priority cycle.  Direct
reclaim should probably do the same and retry the allocation after
every priority cycle or every X pages scanned, where X is something
reasonable and not "up to every LRU page in the system".
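
Roughly, that would mean a bail-out in the do_try_to_free_pages() loop
along these lines (a sketch only; the threshold and its name are
placeholders, not existing kernel code):

	do {
		sc->nr_scanned = 0;
		shrink_zones(zonelist, sc);
		total_scanned += sc->nr_scanned;

		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;

		/* bound the work per invocation, then retry the allocation */
		if (total_scanned > DIRECT_RECLAIM_SCAN_LIMIT)
			break;
	} while (--sc->priority >= 0);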

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: David Rientjes @ 2013-11-19  1:17 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Mon, 18 Nov 2013, Johannes Weiner wrote:

> > Um, no, those processes are going through a repeated loop of direct 
> > reclaim, calling the oom killer, iterating the tasklist, finding an 
> > existing oom killed process that has yet to exit, and looping.  They 
> > wouldn't loop for too long if we can reduce the amount of time that it 
> > takes for that oom killed process to exit.
> 
> I'm not talking about the big loop in the page allocator.  The victim
> is going through the same loop.  This patch is about the victim being
> in a pointless direct reclaim cycle when it could be exiting, all I'm
> saying is that the other tasks doing direct reclaim at that moment
> should also be quitting and retrying the allocation.
> 

"All other tasks" would be defined as though sharing the same mempolicy 
context as the oom kill victim or the same set of cpuset mems, I'm not 
sure what type of method for determining reclaim eligiblity you're 
proposing to avoid pointlessly spinning without making progress.  Until an 
alternative exists, my patch avoids the needless spinning and expedites 
the exit, so I'll ask that it be merged.

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: Johannes Weiner @ 2013-11-20 16:07 UTC
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Mon, Nov 18, 2013 at 05:17:31PM -0800, David Rientjes wrote:
> On Mon, 18 Nov 2013, Johannes Weiner wrote:
> 
> > > Um, no, those processes are going through a repeated loop of direct 
> > > reclaim, calling the oom killer, iterating the tasklist, finding an 
> > > existing oom killed process that has yet to exit, and looping.  They 
> > > wouldn't loop for too long if we can reduce the amount of time that it 
> > > takes for that oom killed process to exit.
> > 
> > I'm not talking about the big loop in the page allocator.  The victim
> > is going through the same loop.  This patch is about the victim being
> > in a pointless direct reclaim cycle when it could be exiting, all I'm
> > saying is that the other tasks doing direct reclaim at that moment
> > should also be quitting and retrying the allocation.
> > 
> 
> "All other tasks" would be defined as though sharing the same mempolicy 
> context as the oom kill victim or the same set of cpuset mems, I'm not 
> sure what type of method for determining reclaim eligiblity you're 
> proposing to avoid pointlessly spinning without making progress.  Until an 
> alternative exists, my patch avoids the needless spinning and expedites 
> the exit, so I'll ask that it be merged.

I laid this out in the second half of my email, which you apparently
did not read:

"If we have multi-second stalls in direct reclaim then it should be
 fixed for all direct reclaimers.  The problem is not only OOM kill
 victims getting stuck, it's every direct reclaimer being stuck trying
 to do way too much work before retrying the allocation.

 Kswapd checks the system state after every priority cycle.  Direct
 reclaim should probably do the same and retry the allocation after
 every priority cycle or every X pages scanned, where X is something
 reasonable and not "up to every LRU page in the system"."

NAK to this incomplete drive-by fix.

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
From: David Rientjes @ 2013-11-21  3:08 UTC
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, 20 Nov 2013, Johannes Weiner wrote:

> > "All other tasks" would be defined as though sharing the same mempolicy 
> > context as the oom kill victim or the same set of cpuset mems, I'm not 
> > sure what type of method for determining reclaim eligiblity you're 
> > proposing to avoid pointlessly spinning without making progress.  Until an 
> > alternative exists, my patch avoids the needless spinning and expedites 
> > the exit, so I'll ask that it be merged.
> 
> I laid this out in the second half of my email, which you apparently
> did not read:
> 

I read it, but your proposal is incomplete, see below.

> "If we have multi-second stalls in direct reclaim then it should be
>  fixed for all direct reclaimers.  The problem is not only OOM kill
>  victims getting stuck, it's every direct reclaimer being stuck trying
>  to do way too much work before retrying the allocation.
> 

I'm addressing oom conditions here, including system, mempolicy, and 
cpuset ooms.

I'm not addressing a large number of processes that are doing direct 
reclaim in parallel over the same set of zones.  That would be a more 
invasive change and could potentially cause regressions because reclaim 
would be stopped prematurely before reclaiming the given threshold.  I do 
not have a bug report in front of me that suggests this is an issue 
outside of oom conditions, and the current behavior is actually intuitive: 
if there are a large number of processes attempting reclaim, the demand 
for memory from those zones is higher and it is natural to have reclaim 
proceed in parallel up to a threshold.

>  Kswapd checks the system state after every priority cycle.  Direct
>  reclaim should probably do the same and retry the allocation after
>  every priority cycle or every X pages scanned, where X is something
>  reasonable and not "up to every LRU page in the system"."
> 
> NAK to this incomplete drive-by fix.
> 

This is a fix for a real-world situation that currently exists in reclaim: 
specifically, preventing unnecessary stalls in reclaim in oom conditions 
that is known to be futile.  There is no possibility that reclaim itself 
will be successful because of the oom condition, and that condition is the 
only condition where reclaim is guaranteed to not be successful.  I'm sure 
we both agree that there is no use in an oom killed process continually 
looping in reclaim and yielding the cpu back to another process which just 
prolongs the duration before the oom killed process can free its memory.

You're advocating that the allocation is retried after every priority 
cycle as an alternative and that seems potentially racy and incomplete: if 
32 processes enter reclaim all doing order-0 allocations and one process 
reclaims a page, they would all terminate reclaim and retry the 
allocation.  31 processes would then loop the page allocator again, and 
reenter reclaim again at the starting priority.  Much better, in my 
opinion, would be to reclaim up to a threshold for each and then return to 
the page allocator since all 32 processes have demand for memory; that 
threshold is debatable, but SWAP_CLUSTER_MAX is reasonable.

So I would be nervous to carry the classzone_idx into direct reclaim, do 
an alloc_flags |= ALLOC_NO_WATERMARKS iff TIF_MEMDIE, iterate the 
zonelist, and do a __zone_watermark_ok_safe() for some watermark that's 
greater than the low watermark to avoid finding ourselves oom again upon 
returning to the page allocator without causing regressions in reclaim.
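
For reference, the check being described would look something like this
sketch (the function name is made up for illustration, and the
classzone_idx plumbing is assumed rather than existing):

	static bool reclaim_target_met(struct zonelist *zonelist,
				       gfp_t gfp_mask, int order,
				       int classzone_idx)
	{
		struct zoneref *z;
		struct zone *zone;

		/* an oom kill victim can use the reserves anyway */
		if (unlikely(test_thread_flag(TIF_MEMDIE)))
			return true;

		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
			/*
			 * Check a mark above the low watermark so we do not
			 * return to the allocator only to find ourselves oom
			 * again.
			 */
			if (zone_watermark_ok(zone, order,
					      high_wmark_pages(zone),
					      classzone_idx, 0))
				return true;
		}
		return false;
	}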

The page allocator already tries to allocate memory between direct reclaim 
and calling the oom killer specifically for cases where reclaim was 
unsuccessful for a single process because memory was freed externally.

The situation I'm addressing occurs when reclaim will never be successful 
and nothing external to it will reclaim anything that the oom kill victim 
can use.  The non-victim processes will continue to loop through the oom 
killer and get put to sleep since they aren't victims themselves, but in 
the case described there are 700 processes competing for cpu all doing 
memory allocations so that doesn't help as it normally would.  Older 
kernels used to increase the timeslice that oom kill victims have so they 
exit as quickly as possible, but that was removed since 341aea2bc48b 
("oom-kill: remove boost_dying_task_prio()").

My patch is not in a fastpath, it has extremely minimal overhead, and it 
allows an oom killed victim to exit much quicker instead of incurring 
O(seconds) stalls because of 700 other allocators grabbing the cpu in a 
futile effort to reclaim memory themselves.

Andrew, this fixes a real-world issue that exists and I'm asking that it 
be merged so that oom killed processes can quickly allocate and exit to 
free their memory.  If a more invasive future patch causes it to no longer 
be necessary, that's what we call kernel development.  Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
@ 2013-11-21  3:08                 ` David Rientjes
  0 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2013-11-21  3:08 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, 20 Nov 2013, Johannes Weiner wrote:

> > "All other tasks" would be defined as though sharing the same mempolicy 
> > context as the oom kill victim or the same set of cpuset mems, I'm not 
> > sure what type of method for determining reclaim eligiblity you're 
> > proposing to avoid pointlessly spinning without making progress.  Until an 
> > alternative exists, my patch avoids the needless spinning and expedites 
> > the exit, so I'll ask that it be merged.
> 
> I laid this out in the second half of my email, which you apparently
> did not read:
> 

I read it, but your proposal is incomplete, see below.

> "If we have multi-second stalls in direct reclaim then it should be
>  fixed for all direct reclaimers.  The problem is not only OOM kill
>  victims getting stuck, it's every direct reclaimer being stuck trying
>  to do way too much work before retrying the allocation.
> 

I'm addressing oom conditions here, including system, mempolicy, and 
cpuset ooms.

I'm not addressing a large number of processes that are doing direct 
reclaim in parallel over the same set of zones.  That would be a more 
invasive change and could potentially cause regressions because reclaim 
would be stopped prematurely before reclaiming the given threshold.  I do 
not have a bug report in front of me that suggests this is an issue 
outside of oom conditions and the current behavior is actually intuitive: 
if there are a large number of processes attempting reclaim, the demand 
for memory from those zones is higher and it is intuitive to have reclaim 
done in parallel up to a threshold.

>  Kswapd checks the system state after every priority cycle.  Direct
>  reclaim should probably do the same and retry the allocation after
>  every priority cycle or every X pages scanned, where X is something
>  reasonable and not "up to every LRU page in the system"."
> 
> NAK to this incomplete drive-by fix.
> 

This is a fix for a real-world situation that current exists in reclaim: 
specifically, preventing unnecessary stalls in reclaim in oom conditions 
that is known to be futile.  There is no possibility that reclaim itself 
will be successful because of the oom condition, and that condition is the 
only condition where reclaim is guaranteed to not be successful.  I'm sure 
we both agree that there is no use in an oom killed process continually 
looping in reclaim and yielding the cpu back to another process which just 
prolongs the duration before the oom killed process can free its memory.

You're advocating that the allocation is retried after every priority 
cycle as an alternative and that seems potentially racy and incomplete: if 
32 processes enter reclaim all doing order-0 allocations and one process 
reclaims a page, they would all terminate reclaim and retry the 
allocation.  31 processes would then loop the page allocator again, and 
reenter reclaim again at the starting priority.  Much better, in my 
opinion, would be to reclaim up to a threshold for each and then return to 
the page allocator since all 32 processes have demand for memory; that 
threshold is debatable, but SWAP_CLUSTER_MAX is reasonable.

So I would be nervous about carrying the classzone_idx into direct reclaim, 
doing alloc_flags |= ALLOC_NO_WATERMARKS iff TIF_MEMDIE, iterating the 
zonelist, and doing a __zone_watermark_ok_safe() check against some 
watermark greater than the low watermark (to avoid finding ourselves oom 
again upon returning to the page allocator) without causing regressions in 
reclaim.
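
To make that concrete, here is a minimal sketch of the kind of check being 
described; the helper name, the chosen watermark, and the 
zone_watermark_ok_safe() signature are assumptions for illustration only, 
not a proposed patch:

	/*
	 * Hypothetical: would the pending allocation now succeed, so that
	 * direct reclaim could bail out early?  Illustrative sketch only.
	 */
	static bool reclaim_would_satisfy_alloc(struct zonelist *zonelist,
						struct scan_control *sc,
						int classzone_idx)
	{
		struct zoneref *z;
		struct zone *zone;

		/* An oom kill victim has access to reserves; just go allocate. */
		if (unlikely(test_thread_flag(TIF_MEMDIE)))
			return true;

		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
			/*
			 * Check against a mark above the low watermark so we
			 * don't return to the allocator only to go oom again.
			 */
			unsigned long mark = low_wmark_pages(zone) +
					     (1UL << sc->order);

			if (zone_watermark_ok_safe(zone, sc->order, mark,
						   classzone_idx, 0))
				return true;
		}
		return false;
	}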

The page allocator already tries to allocate memory between direct reclaim 
and calling the oom killer specifically for cases where reclaim was 
unsuccessful for a single process because memory was freed externally.
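
That retry is the high-watermark attempt in __alloc_pages_may_oom(); roughly 
(recalled from that era's mm/page_alloc.c, so treat the exact flags and 
argument list as approximate):

	/*
	 * Before invoking the oom killer, try the freelists one more time
	 * against a very high watermark in case memory was freed by someone
	 * else in the meantime.
	 */
	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, nodemask,
				      order, zonelist, high_zoneidx,
				      ALLOC_WMARK_HIGH | ALLOC_CPUSET,
				      preferred_zone, migratetype);
	if (page)
		goto out;	/* someone freed memory; no oom kill needed */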

The situation I'm addressing occurs when reclaim will never be successful 
and nothing external to it will reclaim anything that the oom kill victim 
can use.  The non-victim processes will continue to loop through the oom 
killer and get put to sleep since they aren't victims themselves, but in 
the case described there are 700 processes competing for cpu all doing 
memory allocations so that doesn't help as it normally would.  Older 
kernels used to increase the timeslice that oom kill victims have so they 
exit as quickly as possible, but that was removed since 341aea2bc48b 
("oom-kill: remove boost_dying_task_prio()").

My patch is not in a fastpath, it has extremely minimal overhead, and it 
allows an oom killed victim to exit much quicker instead of incurring 
O(seconds) stalls because of 700 other allocators grabbing the cpu in a 
futile effort to reclaim memory themselves.

Andrew, this fixes a real-world issue and I'm asking that it be merged so 
that oom killed processes can quickly allocate and exit to free their 
memory.  If a more invasive future patch causes it to no longer be 
necessary, that's what we call kernel development.  Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
  2013-11-21  3:08                 ` David Rientjes
@ 2013-11-21 14:51                   ` Johannes Weiner
  -1 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2013-11-21 14:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, Nov 20, 2013 at 07:08:50PM -0800, David Rientjes wrote:
> On Wed, 20 Nov 2013, Johannes Weiner wrote:
> 
> > > "All other tasks" would be defined as though sharing the same mempolicy 
> > > context as the oom kill victim or the same set of cpuset mems, I'm not 
> > > sure what type of method for determining reclaim eligiblity you're 
> > > proposing to avoid pointlessly spinning without making progress.  Until an 
> > > alternative exists, my patch avoids the needless spinning and expedites 
> > > the exit, so I'll ask that it be merged.
> > 
> > I laid this out in the second half of my email, which you apparently
> > did not read:
> > 
> 
> I read it, but your proposal is incomplete, see below.
> 
> > "If we have multi-second stalls in direct reclaim then it should be
> >  fixed for all direct reclaimers.  The problem is not only OOM kill
> >  victims getting stuck, it's every direct reclaimer being stuck trying
> >  to do way too much work before retrying the allocation.
> > 
> 
> I'm addressing oom conditions here, including system, mempolicy, and 
> cpuset ooms.
> 
> I'm not addressing a large number of processes that are doing direct 
> reclaim in parallel over the same set of zones.  That would be a more 
> invasive change and could potentially cause regressions because reclaim 
> would be stopped prematurely before reclaiming the given threshold.  I do 
> not have a bug report in front of me that suggests this is an issue 
> outside of oom conditions and the current behavior is actually intuitive: 
> if there are a large number of processes attempting reclaim, the demand 
> for memory from those zones is higher and it is intuitive to have reclaim 
> done in parallel up to a threshold.
> 
> >  Kswapd checks the system state after every priority cycle.  Direct
> >  reclaim should probably do the same and retry the allocation after
> >  every priority cycle or every X pages scanned, where X is something
> >  reasonable and not "up to every LRU page in the system"."
> > 
> > NAK to this incomplete drive-by fix.
> > 
> 
> This is a fix for a real-world situation that currently exists in reclaim: 
> specifically, preventing unnecessary stalls in reclaim under oom conditions 
> where reclaim is known to be futile.  There is no possibility that reclaim 
> itself will be successful because of the oom condition, and that is the only 
> condition where reclaim is guaranteed not to be successful.  I'm sure we 
> both agree that there is no use in an oom killed process continually 
> looping in reclaim and yielding the cpu back to another process, which just 
> prolongs the duration before the oom killed process can free its memory.

Yes, but you are actively avoiding the only question I have asked from
the beginning:

If there is an OOM kill, direct reclaim for a given memory context has
failed and every continuation of direct reclaim in this context is
just pointless burning of cpu cycles.  You said you have 700 processes
in direct reclaim.  Your patch allows ONE of them to exit prematurely.
What about the other 699?  There is still nothing to reclaim for them.

My proposal was: if they all checked back with the allocator more
frequently, this would properly resolve the problem.  You added an
OOM victim check after every priority cycle to solve this problem for
one task, but if we checked the watermarks as well after each priority
cycle, all 700 tasks could stop uselessly burning CPU cycles once
direct reclaim has failed.
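
Schematically, that would be a bailout after each priority cycle; an
abbreviated sketch, where reclaim_would_satisfy_alloc() stands in for
whatever watermark/allocation check we settle on (it is hypothetical, not
existing code):

	/* Abbreviated direct reclaim priority loop with a per-cycle bailout. */
	do {
		shrink_zones(zonelist, sc);		/* one priority cycle */

		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
			break;				/* met the usual target */

		/*
		 * Hypothetical: if the allocation would now succeed (or we
		 * are an oom kill victim), stop burning cpu and let the
		 * caller retry the allocation.
		 */
		if (reclaim_would_satisfy_alloc(zonelist, sc, classzone_idx))
			break;
	} while (--sc->priority >= 0);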

> You're advocating that the allocation is retried after every priority 
> cycle as an alternative and that seems potentially racy and incomplete: if 
> 32 processes enter reclaim all doing order-0 allocations and one process 
> reclaims a page, they would all terminate reclaim and retry the 
> allocation.  31 processes would then loop the page allocator again, and 
> reenter reclaim again at the starting priority.  Much better, in my 
> opinion, would be to reclaim up to a threshold for each and then return to 
> the page allocator since all 32 processes have demand for memory; that 
> threshold is debatable, but SWAP_CLUSTER_MAX is reasonable.

They would re-enter at the next priority of course, not the same one
again.  All I suggested is retrying the allocation in between reclaim
priority cycles; I'm not sure where you are getting the rest from.

> So I would be nervous about carrying the classzone_idx into direct reclaim, 
> doing alloc_flags |= ALLOC_NO_WATERMARKS iff TIF_MEMDIE, iterating the 
> zonelist, and doing a __zone_watermark_ok_safe() check against some 
> watermark greater than the low watermark (to avoid finding ourselves oom 
> again upon returning to the page allocator) without causing regressions in 
> reclaim.
> 
> The page allocator already tries to allocate memory between direct reclaim 
> and calling the oom killer specifically for cases where reclaim was 
> unsuccessful for a single process because memory was freed externally.

Lol, but you complain that the direct reclaim part of this cycle is
too long!  I propose shortening it.  Why are we still having this
argument?!

> The situation I'm addressing occurs when reclaim will never be successful 
> and nothing external to it will reclaim anything that the oom kill victim 
> can use.  The non-victim processes will continue to loop through the oom 
> killer and get put to sleep since they aren't victims themselves, but in 
> the case described there are 700 processes competing for cpu all doing 
> memory allocations so that doesn't help as it normally would.  Older 
> kernels used to increase the timeslice that oom kill victims have so they 
> exit as quickly as possible, but that was removed since 341aea2bc48b 
> ("oom-kill: remove boost_dying_task_prio()").

Yes, the OOM victim should exit direct reclaim.  But so should ALL
OTHER 699 TASKS, given that there is nothing to reclaim from this
context!

I just don't get why you are so focussed on this one victim task.  The
OOM situation is not a property of this one task, it's a context-wide
state.  If a context goes OOM, direct reclaim has failed and should be
stopped as soon as possible.  It does not make sense for the victim to
continue, and it does not make sense for anybody else to continue.

12 priority cycles is just too long a stretch before checking back
with the overall state of affairs; that's the real problem in my view.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
  2013-11-21  3:08                 ` David Rientjes
@ 2013-11-21 16:40                   ` Johannes Weiner
  -1 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2013-11-21 16:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Wed, Nov 20, 2013 at 07:08:50PM -0800, David Rientjes wrote:
> My patch is not in a fastpath, it has extremely minimal overhead, and it 
> allows an oom killed victim to exit much quicker instead of incurring 
> O(seconds) stalls because of 700 other allocators grabbing the cpu in a 
> futile effort to reclaim memory themselves.
> 
> Andrew, this fixes a real-world issue and I'm asking that it be merged so 
> that oom killed processes can quickly allocate and exit to free their 
> memory.  If a more invasive future patch causes it to no longer be 
> necessary, that's what we call kernel development.  Thanks.

All I'm trying to do is find the broader root cause for the problem
you are experiencing and find a solution that will leave us with
maintainable code.  It does not matter how few instructions your fix
adds, it changes the outcome of the algorithm and makes every
developer trying to grasp the complexity of page reclaim think about
yet another special condition.

The more specific the code is, the harder it will be to understand in
the future.  Yes, it's a one-liner, but we've had death by a thousand
cuts before, many times.  A few cycles ago, kswapd was blowing up left
and right simply because it was trying to meet too many specific
objectives from facilitating order-0 allocators, maintaining zone
health, enabling compaction for higher order allocation, writing back
dirty pages.  Ultimately, it just got stuck in endless loops because
of conflicting conditionals.  We've had similar problems in the scan
count calculation etc where all the checks and special cases left us
with code that was impossible to reason about.  There really is a
history of "low overhead one-liner fixes" eating us alive in the VM.

The solution was always to take a step back and integrate all
requirements properly.  Not only did this fix the problems, the code
ended up being much more robust and easier to understand and modify as
well.

If shortening the direct reclaim cycle is an adequate solution to your
problem, it would be much preferable.  Because

  "checking at a reasonable interval if the work I'm doing is still
   necessary"

is a much more approachable, generic, and intuitive concept than

  "the OOM killer has gone off, direct reclaim is futile, I should
   exit quickly to release memory so that not more tasks get caught
   doing direct reclaim".

and the fix would benefit a much wider audience.

Lastly, as far as I know, you are the only reporter that noticed an
issue with this loooooong-standing behavior, and you don't even run
upstream kernels.  There really is no excuse to put up with a quick &
dirty fix.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
  2013-11-21 16:40                   ` Johannes Weiner
@ 2013-11-27  0:47                     ` David Rientjes
  -1 siblings, 0 replies; 26+ messages in thread
From: David Rientjes @ 2013-11-27  0:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Thu, 21 Nov 2013, Johannes Weiner wrote:

> All I'm trying to do is find the broader root cause for the problem
> you are experiencing and find a solution that will leave us with
> maintainable code.  It does not matter how few instructions your fix
> adds, it changes the outcome of the algorithm and makes every
> developer trying to grasp the complexity of page reclaim think about
> yet another special condition.
> 
> The more specific the code is, the harder it will be to understand in
> the future.  Yes, it's a one-liner, but we've had death by a thousand
> cuts before, many times.  A few cycles ago, kswapd was blowing up left
> and right simply because it was trying to meet too many specific
> objectives from facilitating order-0 allocators, maintaining zone
> health, enabling compaction for higher order allocation, writing back
> dirty pages.  Ultimately, it just got stuck in endless loops because
> of conflicting conditionals.  We've had similar problems in the scan
> count calculation etc where all the checks and special cases left us
> with code that was impossible to reason about.  There really is a
> history of "low overhead one-liner fixes" eating us alive in the VM.
> 

Your objection is that the added code is obscure and will require kernel 
hackers to think about why it's there?  I could certainly add a comment:

	/*
	 * The oom killer only kills processes when reclaim has already
	 * failed for its allocation context, continuously trying won't
	 * help.
	 */

to the patch? 

> The solution was always to take a step back and integrate all
> requirements properly.  Not only did this fix the problems, the code
> ended up being much more robust and easier to understand and modify as
> well.
> 
> If shortening the direct reclaim cycle is an adequate solution to your
> problem, it would be much preferable.  Because
> 
>   "checking at a reasonable interval if the work I'm doing is still
>    necessary"
> 
> is a much more approachable, generic, and intuitive concept than
> 
>   "the OOM killer has gone off, direct reclaim is futile, I should
>    exit quickly to release memory so that not more tasks get caught
>    doing direct reclaim".
> 
> and the fix would benefit a much wider audience.
> 

I agree with your point that obscure random fixes do obfuscate the VM in 
many different ways and I'd like to support you in any way that I can to 
make sure that my fix doesn't do that for anybody in the future; adding the 
comment may be one way of doing that.

I disagree with changing the "reasonable interval" to determine if reclaim 
is still necessary because many parallel reclaimers indicate a higher 
demand for free memory, and the current behavior prevents short-circuiting 
direct reclaim only to return to the page allocator and find that the 
memory you've reclaimed has been stolen by another allocator.

That race to allocate the reclaimed memory will always exist if we check 
zone watermarks from direct reclaim to determine whether we should 
terminate, as you suggested; the alternative would be to actually do 
get_page_from_freelist() and allocate on every iteration.  For the vast 
majority of reclaimers, I think this would terminate prematurely before 
we've hit the SWAP_CLUSTER_MAX threshold, and given the removal of lumpy 
reclaim and the reliance on synchronous memory compaction following one of 
these oom events to defragment memory, I would be hesitant to implement 
such a solution.

> Lastly, as far as I know, you are the only reporter that noticed an
> issue with this loooooong-standing behavior, and you don't even run
> upstream kernels.  There really is no excuse to put up with a quick &
> dirty fix.
> 

I certainly do run upstream kernels, and I would agree that I may be one of 
the only reporters of such an issue because Google probably hits more out 
of memory conditions to enforce memory isolation than anybody else, so 
stalls like this are more noticeable.

I would certainly add the comment if you feel it's necessary, though; just 
let me know.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [patch] mm, vmscan: abort futile reclaim if we've been oom killed
  2013-11-27  0:47                     ` David Rientjes
@ 2013-11-27 16:09                       ` Johannes Weiner
  -1 siblings, 0 replies; 26+ messages in thread
From: Johannes Weiner @ 2013-11-27 16:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Rik van Riel, linux-kernel, linux-mm

On Tue, Nov 26, 2013 at 04:47:56PM -0800, David Rientjes wrote:
> On Thu, 21 Nov 2013, Johannes Weiner wrote:
> 
> > All I'm trying to do is find the broader root cause for the problem
> > you are experiencing and find a solution that will leave us with
> > maintainable code.  It does not matter how few instructions your fix
> > adds, it changes the outcome of the algorithm and makes every
> > developer trying to grasp the complexity of page reclaim think about
> > yet another special condition.
> > 
> > The more specific the code is, the harder it will be to understand in
> > the future.  Yes, it's a one-liner, but we've had death by a thousand
> > cuts before, many times.  A few cycles ago, kswapd was blowing up left
> > and right simply because it was trying to meet too many specific
> > objectives from facilitating order-0 allocators, maintaining zone
> > health, enabling compaction for higher order allocation, writing back
> > dirty pages.  Ultimately, it just got stuck in endless loops because
> > of conflicting conditionals.  We've had similar problems in the scan
> > count calculation etc where all the checks and special cases left us
> > with code that was impossible to reason about.  There really is a
> > history of "low overhead one-liner fixes" eating us alive in the VM.
> > 
> 
> Your objection is that the added code is obscure and will require kernel 
> hackers to think about why it's there?  I could certainly add a comment:
> 
> 	/*
> 	 * The oom killer only kills processes when reclaim has already
> 	 * failed for its allocation context, continuously trying won't
> 	 * help.
> 	 */
> 
> to the patch? 
> 
> > The solution was always to take a step back and integrate all
> > requirements properly.  Not only did this fix the problems, the code
> > ended up being much more robust and easier to understand and modify as
> > well.
> > 
> > If shortening the direct reclaim cycle is an adequate solution to your
> > problem, it would be much preferable.  Because
> > 
> >   "checking at a reasonable interval if the work I'm doing is still
> >    necessary"
> > 
> > is a much more approachable, generic, and intuitive concept than
> > 
> >   "the OOM killer has gone off, direct reclaim is futile, I should
> >    exit quickly to release memory so that not more tasks get caught
> >    doing direct reclaim".
> > 
> > and the fix would benefit a much wider audience.
> > 
> 
> I agree with your point that obscure random fixes do obfuscate the VM in 
> many different ways and I'd like to support you in any way that I can to 
> make sure that my fix doesn't do that for anybody in the future; adding the 
> comment may be one way of doing that.
> 
> I disagree with changing the "reasonable interval" to determine if reclaim 
> is still necessary because many parallel reclaimers indicate a higher 
> demand for free memory, and the current behavior prevents short-circuiting 
> direct reclaim only to return to the page allocator and find that the 
> memory you've reclaimed has been stolen by another allocator.

I can't reach the same conclusion as you.  There will always be
concurrent allocators while some tasks are in direct reclaim.  But the
longer you reclaim without checking free pages, the higher the
opportunity for another task to steal your work.  Shortening direct
reclaim cycles would actually mean shrinking that window where other
tasks can steal the pages you reclaimed.

> That race to allocate the reclaimed memory will always exist if we check 
> zone watermarks from direct reclaim to determine whether we should 
> terminate, as you suggested; the alternative would be to actually do 
> get_page_from_freelist() and allocate on every iteration.  For the vast 
> majority of reclaimers, I think this would terminate prematurely before 
> we've hit the SWAP_CLUSTER_MAX threshold, and given the removal of lumpy 
> reclaim and the reliance on synchronous memory compaction following one of 
> these oom events to defragment memory, I would be hesitant to implement 
> such a solution.

Yes, I really meant get_page_from_freelist() after every priority
cycle.  What does "prematurely" mean?  If task A is currently doing
reclaim and task B steals the pages, A will have to go back to direct
reclaim.  If A reclaims one cycle and then allocates, there is a
reduced chance of B stealing the pages, and so B will have to do
direct reclaim and pull its own weight.  If A did enough for both,
there wouldn't have been any point in continuing reclaim.
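
As a sketch of that shape, with reclaim driven one priority cycle at a time
from the allocator slowpath; reclaim_one_priority() is hypothetical, and the
get_page_from_freelist() argument list is recalled from that era, so it may
not match exactly:

	/*
	 * Schematic only: alternate one reclaim priority cycle with an
	 * allocation attempt, so task A gets first shot at the pages it
	 * just reclaimed instead of handing them to task B.
	 */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		reclaim_one_priority(zonelist, nodemask, order, priority);

		page = get_page_from_freelist(gfp_mask, nodemask, order,
					      zonelist, high_zoneidx,
					      alloc_flags, preferred_zone,
					      migratetype);
		if (page)
			break;		/* allocation satisfied; stop reclaiming */
	}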

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2013-11-27 16:09 UTC | newest]

Thread overview: 26+ messages
2013-11-13  2:02 [patch] mm, vmscan: abort futile reclaim if we've been oom killed David Rientjes
2013-11-13  2:02 ` David Rientjes
2013-11-13 15:24 ` Johannes Weiner
2013-11-13 15:24   ` Johannes Weiner
2013-11-13 22:16   ` David Rientjes
2013-11-13 22:16     ` David Rientjes
2013-11-14  0:00     ` Johannes Weiner
2013-11-14  0:00       ` Johannes Weiner
2013-11-14  0:48       ` David Rientjes
2013-11-14  0:48         ` David Rientjes
2013-11-18 16:41         ` Johannes Weiner
2013-11-18 16:41           ` Johannes Weiner
2013-11-19  1:17           ` David Rientjes
2013-11-19  1:17             ` David Rientjes
2013-11-20 16:07             ` Johannes Weiner
2013-11-20 16:07               ` Johannes Weiner
2013-11-21  3:08               ` David Rientjes
2013-11-21  3:08                 ` David Rientjes
2013-11-21 14:51                 ` Johannes Weiner
2013-11-21 14:51                   ` Johannes Weiner
2013-11-21 16:40                 ` Johannes Weiner
2013-11-21 16:40                   ` Johannes Weiner
2013-11-27  0:47                   ` David Rientjes
2013-11-27  0:47                     ` David Rientjes
2013-11-27 16:09                     ` Johannes Weiner
2013-11-27 16:09                       ` Johannes Weiner
