* [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
@ 2011-05-26 19:52 Vaibhav Nagarnaik
  2011-05-26 20:04 ` Steven Rostedt
  2011-05-27 17:58 ` Vaibhav Nagarnaik
  0 siblings, 2 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-26 19:52 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker
  Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik

The tracing ring buffer is allocated from kernel memory. While
allocating the memory, if OOM happens, the allocating process might not
be the one that gets killed, since the ring-buffer memory is not
allocated as process memory. Thus random processes might get killed
during the allocation.

This patch makes sure that the allocating process is considered the most
likely oom-kill-able process while the allocation is going on. Thus, if
the oom-killer is invoked because of the ring-buffer allocation, it is
easier for the ring buffer memory to be freed, saving important system
processes from being killed.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 kernel/trace/trace.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b926578..15a667a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -37,6 +37,7 @@
 #include <linux/init.h>
 #include <linux/poll.h>
 #include <linux/fs.h>
+#include <linux/oom.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 	unsigned long val;
 	char buf[64];
 	int ret;
+	int oom_score_adj;
 
 	if (cnt >= sizeof(buf))
 		return -EINVAL;
@@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 	/* value is in KB */
 	val <<= 10;
 
+	/*
+	 * make sure this process is picked over others to be killed in OOM
+	 * condition
+	 */
+	oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
 	ret = tracing_resize_ring_buffer(val);
+	/* restore the original oom_score_adj value */
+	test_set_oom_score_adj(oom_score_adj);
 	if (ret < 0)
 		return ret;
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik
@ 2011-05-26 20:04 ` Steven Rostedt
  2011-05-26 20:22   ` David Rientjes
  2011-05-26 20:23   ` Vaibhav Nagarnaik
  2011-05-27 17:58 ` Vaibhav Nagarnaik
  1 sibling, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-05-26 20:04 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp,
	linux-kernel

On Thu, 2011-05-26 at 12:52 -0700, Vaibhav Nagarnaik wrote:
> The tracing ring buffer is allocated from kernel memory. While
> allocating the memory, if OOM happens, the allocating process might not
> be the one that gets killed, since the ring-buffer memory is not
> allocated as process memory. Thus random processes might get killed
> during the allocation.
> 
> This patch makes sure that the allocating process is considered the most
> likely oom-kill-able process while the allocating is going on. Thus if
> oom-killer is invoked because of ring-buffer allocation, it is easier
> for the ring buffer memory to be freed and save important system
> processes from being killed.

Hmm, have you tried this in practice? Yes, we may kill the "echo"
command, but it doesn't stop the ring buffer from being allocated; thus
killing the echo command may not be enough, and those critical processes
that you are trying to protect will be killed next.

Perhaps we should change the allocation of the ring buffer or detect OOM
triggering. Maybe make all the allocations ATOMIC, so the memory is
either available or not, and the allocation fails instead of trying to
swap out other memory.

-- Steve

> 
> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
> ---
>  kernel/trace/trace.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index b926578..15a667a 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -37,6 +37,7 @@
>  #include <linux/init.h>
>  #include <linux/poll.h>
>  #include <linux/fs.h>
> +#include <linux/oom.h>
>  
>  #include "trace.h"
>  #include "trace_output.h"
> @@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
>  	unsigned long val;
>  	char buf[64];
>  	int ret;
> +	int oom_score_adj;
>  
>  	if (cnt >= sizeof(buf))
>  		return -EINVAL;
> @@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
>  	/* value is in KB */
>  	val <<= 10;
>  
> +	/*
> +	 * make sure this process is picked over others to be killed in OOM
> +	 * condition
> +	 */
> +	oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
>  	ret = tracing_resize_ring_buffer(val);
> +	/* restore the original oom_score_adj value */
> +	test_set_oom_score_adj(oom_score_adj);
>  	if (ret < 0)
>  		return ret;
>  




* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 20:04 ` Steven Rostedt
@ 2011-05-26 20:22   ` David Rientjes
  2011-05-26 20:23   ` Vaibhav Nagarnaik
  1 sibling, 0 replies; 30+ messages in thread
From: David Rientjes @ 2011-05-26 20:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker,
	Michael Rubin, David Sharp, linux-kernel

On Thu, 26 May 2011, Steven Rostedt wrote:

> Hmm, have you tried this in practice? Yes we may kill the "echo" command
> but it doesn't stop the ring buffer from being allocated, and thus
> killing the echo command may not be enough, and those critical processes
> that you are trying to protect will be killed next.
> 
> Perhaps we should change the allocation of the ring buffer or detect OOM
> triggering. Maybe make all the allocations ATOMIC, thus it will be
> either available or not, and fail instead of trying to swap out other
> memory for the allocation.
> 

My impression of this was that it was attempting to avoid killing a 
different process by means of the oom killer rather than avoiding swap.  I 
don't think there's anything wrong with using GFP_KERNEL, but I'd suggest 
also using __GFP_NORETRY so the allocation calls into direct reclaim but 
avoids oom killing anything and simply fails instead.


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 20:04 ` Steven Rostedt
  2011-05-26 20:22   ` David Rientjes
@ 2011-05-26 20:23   ` Vaibhav Nagarnaik
  2011-05-26 20:33     ` David Rientjes
  1 sibling, 1 reply; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-26 20:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp,
	linux-kernel

On Thu, May 26, 2011 at 1:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 2011-05-26 at 12:52 -0700, Vaibhav Nagarnaik wrote:
>> The tracing ring buffer is allocated from kernel memory. While
>> allocating the memory, if OOM happens, the allocating process might not
>> be the one that gets killed, since the ring-buffer memory is not
>> allocated as process memory. Thus random processes might get killed
>> during the allocation.
>>
>> This patch makes sure that the allocating process is considered the most
>> likely oom-kill-able process while the allocating is going on. Thus if
>> oom-killer is invoked because of ring-buffer allocation, it is easier
>> for the ring buffer memory to be freed and save important system
>> processes from being killed.
>
> Hmm, have you tried this in practice? Yes we may kill the "echo" command
> but it doesn't stop the ring buffer from being allocated, and thus
> killing the echo command may not be enough, and those critical processes
> that you are trying to protect will be killed next.
>

Yes I did try this and found that it works as we intend it to. When
oom-killer is invoked, it picks the process which has lowest
oom_score_adj and kills it or one of its children. When the process is
getting killed, any memory allocation from it returns -ENOMEM, which
gets handled in our allocation code, and we free up previously
allocated memory.

This API is now being used in other parts of the kernel too, where it is
known that the allocation could cause an OOM.

> Perhaps we should change the allocation of the ring buffer or detect OOM
> triggering. Maybe make all the allocations ATOMIC, thus it will be
> either available or not, and fail instead of trying to swap out other
> memory for the allocation.
>
> -- Steve
>

Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 20:23   ` Vaibhav Nagarnaik
@ 2011-05-26 20:33     ` David Rientjes
  2011-05-26 21:00       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2011-05-26 20:33 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Thu, 26 May 2011, Vaibhav Nagarnaik wrote:

> > Hmm, have you tried this in practice? Yes we may kill the "echo" command
> > but it doesn't stop the ring buffer from being allocated, and thus
> > killing the echo command may not be enough, and those critical processes
> > that you are trying to protect will be killed next.
> >
> 
> Yes I did try this and found that it works as we intend it to. When
> oom-killer is invoked, it picks the process which has lowest
> oom_score_adj and kills it or one of its children.

s/lowest/highest/

> When the process is
> getting killed, any memory allocation from it would be returned -ENOMEM,
> which gets handled in our allocation process and we free up previously
> allocated memory.
> 

Not sure that's true; this is allocating with kzalloc_node(GFP_KERNEL), 
correct?  If current is oom killed, it will have access to all memory 
reserves, which will increase the likelihood that the allocation will 
succeed before handling the SIGKILL.

> This API is now being used in other parts of kernel too, where it knows
> that the allocation could cause OOM.
> 

What's wrong with using __GFP_NORETRY to avoid oom killing entirely and 
then failing the ring buffer memory allocation?  Seems like a better 
solution than relying on the oom killer, since there may be other threads 
with a max oom_score_adj as well that would appear in the tasklist first 
and get killed unnecessarily.  Is there some ring buffer code that can't 
handle failing allocations appropriately?


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 20:33     ` David Rientjes
@ 2011-05-26 21:00       ` Steven Rostedt
  2011-05-26 22:28         ` Vaibhav Nagarnaik
  2011-05-26 23:23         ` Vaibhav Nagarnaik
  0 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-05-26 21:00 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker,
	Michael Rubin, David Sharp, linux-kernel

On Thu, 2011-05-26 at 13:33 -0700, David Rientjes wrote:
> On Thu, 26 May 2011, Vaibhav Nagarnaik wrote:
> 

> 
> Not sure that's true, this is allocating with kzalloc_node(GFP_KERNEL), 
> correct?  If current is oom killed, it will have access to all memory 
> reserves which will increase the likelihood that the allocation will 
> succeed before handling the SIGKILL.

Actually it uses get_free_page()

> 
> > This API is now being used in other parts of kernel too, where it knows
> > that the allocation could cause OOM.
> > 
> 
> What's wrong with using __GFP_NORETRY to avoid oom killing entirely and 

I have no problem with NORETRY.

> then failing the ring buffer memory allocation?  Seems like a better 
> solution than relying on the oom killer, since there may be other threads 
> with a max oom_score_adj as well that would appear in the tasklist first 
> and get killed unnecessarily.  Is there some ring buffer code that can't 
> handle failing allocations appropriately?

The ring buffer code can handle failed allocations just fine, and will
free up the pages it allocated before a full success. It allocates the
pages one at a time and adds it to a list. After all pages it wants to
allocate has successfully been allocated, it then applies them to the
ring buffer. If it fails an allocation, all pages are freed that were
not added to the ring buffer yet.

But the issue is, if the process increasing the size of the ring buffer
causes the oom, it will not handle the SIGKILL until after the ring
buffer has finished allocating. Now, if it failed to allocate, then we
are fine, but if it does not fail, but now we start killing processes,
then we may be in trouble.

I like the NORETRY better. But then, would this mean that if we have a
lot of cached filesystem pages, we won't be able to extend the ring
buffer?

I'm thinking the oom killer used here got lucky. As it killed this task,
we were still out of memory, and the ring buffer failed to get the
memory it needed and freed up everything that it previously allocated,
and returned. Then the process calling this function would be killed by
the OOM. Ideally, the process shouldn't be killed and the ring buffer
just returned -ENOMEM to the user.

-- Steve




* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 21:00       ` Steven Rostedt
@ 2011-05-26 22:28         ` Vaibhav Nagarnaik
  2011-05-26 23:38           ` Steven Rostedt
  2011-05-26 23:23         ` Vaibhav Nagarnaik
  1 sibling, 1 reply; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-26 22:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> But the issue is, if the process increasing the size of the ring buffer
> causes the oom, it will not handle the SIGKILL until after the ring
> buffer has finished allocating. Now, if it failed to allocate, then we
> are fine, but if it does not fail, but now we start killing processes,
> then we may be in trouble.
>

If I understand correctly, if a fatal signal is pending on a process
while an allocation is attempted, the allocation fails. Then we handle
freeing up the memory correctly, though the echo gets killed once we
return from the allocation process.

> I like the NORETRY better. But then, would this mean that if we have a
> lot of cached filesystems, we won't be able to extend the ring buffer?

It doesn't seem so. I talked with the mm team and I understand that
even if NORETRY is set, cached pages will be flushed out and the
allocation will succeed. But it still does not address the situation
where the ring buffer allocation is going on and another process invokes
the oom-killer. If the oom_score_adj is not set to maximum, then random
processes will still be killed before the ring buffer allocation fails.

>
> I'm thinking the oom killer used here got lucky. As it killed this task,
> we were still out of memory, and the ring buffer failed to get the
> memory it needed and freed up everything that it previously allocated,
> and returned. Then the process calling this function would be killed by
> the OOM. Ideally, the process shouldn't be killed and the ring buffer
> just returned -ENOMEM to the user.

What do you think of this?

test_set_oom_score_adj(MAXIMUM);
allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
test_set_oom_score_adj(original);

This makes sure that the allocation fails much sooner and more
gracefully. If oom-killer is invoked in any circumstance, then the ring
buffer allocation process gives up memory and is killed.


Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 21:00       ` Steven Rostedt
  2011-05-26 22:28         ` Vaibhav Nagarnaik
@ 2011-05-26 23:23         ` Vaibhav Nagarnaik
  1 sibling, 0 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-26 23:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> I'm thinking the oom killer used here got lucky. As it killed this task,
> we were still out of memory, and the ring buffer failed to get the
> memory it needed and freed up everything that it previously allocated,
> and returned. Then the process calling this function would be killed by

I have tested this multiple times and I don't see any other process
being killed, regardless of who invoked oom-killer. The oom-killer
always selects the "echo" process to kill because of its maximum
oom_score_adj and sets TIF_MEMDIE. Since the ring buffer is still
getting allocated, any further allocation from this process fails
because of this flag and the fatal signal pending. This causes the ring
buffer memory to be freed.

I don't see the oom-killer getting lucky in my test cases.


Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 22:28         ` Vaibhav Nagarnaik
@ 2011-05-26 23:38           ` Steven Rostedt
  2011-05-27  9:43             ` David Rientjes
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2011-05-26 23:38 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel, Peter Zijlstra, Mel Gorman,
	Rik Van Riel, David Rientjes, Andrew Morton

[ I added to the Cc people who understand MM better than I do ]

On Thu, 2011-05-26 at 15:28 -0700, Vaibhav Nagarnaik wrote:
> On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > But the issue is, if the process increasing the size of the ring buffer
> > causes the oom, it will not handle the SIGKILL until after the ring
> > buffer has finished allocating. Now, if it failed to allocate, then we
> > are fine, but if it does not fail, but now we start killing processes,
> > then we may be in trouble.
> >
> 
> If I understand correctly, if a fatal signal is pending on a process
> while allocation is called, the allocation fails. Then we handle the
> freeing up memory correctly, though the echo gets killed once we return
> from the allocation process.
> 
> > I like the NORETRY better. But then, would this mean that if we have a
> > lot of cached filesystems, we won't be able to extend the ring buffer?
> 
> It doesn't seem so. I talked with the mm- team and I understand that
> even if NORETRY is set, cached pages will be flushed out and allocation
> will succeed. But it still does not address the situation when the ring
> buffer allocation is going on and another process invokes OOM. If the
> oom_score_adj is not set to maximum, then random processes will still be
> killed before ring buffer allocation fails.
> 
> >
> > I'm thinking the oom killer used here got lucky. As it killed this task,
> > we were still out of memory, and the ring buffer failed to get the
> > memory it needed and freed up everything that it previously allocated,
> > and returned. Then the process calling this function would be killed by
> > the OOM. Ideally, the process shouldn't be killed and the ring buffer
> > just returned -ENOMEM to the user.
> 
> What do you think of this?
> 
> test_set_oom_score_adj(MAXIMUM);
> allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
> test_set_oom_score_adj(original);
> 
> This makes sure that the allocation fails much sooner and more
> gracefully. If oom-killer is invoked in any circumstance, then the ring
> buffer allocation process gives up memory and is killed.

I don't know. But as I'd never seen this function before, I went and took
a look. This test_set_oom_score_adj() is new, and coincidentally written
by another Google developer ;)

As there's not really a precedent for this, if those that I added to the
Cc give their acks, I'm happy to apply this for the next merge window.

-- Steve
 



* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 23:38           ` Steven Rostedt
@ 2011-05-27  9:43             ` David Rientjes
  2011-05-27 12:48               ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: David Rientjes @ 2011-05-27  9:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker,
	Michael Rubin, David Sharp, linux-kernel, Peter Zijlstra,
	Mel Gorman, Rik Van Riel, Andrew Morton

On Thu, 26 May 2011, Steven Rostedt wrote:

> > What do you think of this?
> > 
> > test_set_oom_score_adj(MAXIMUM);
> > allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
> > test_set_oom_score_adj(original);
> > 
> > This makes sure that the allocation fails much sooner and more
> > gracefully. If oom-killer is invoked in any circumstance, then the ring
> > buffer allocation process gives up memory and is killed.
> 
> I don't know. But as I'd never seen this function before, I went and took
> a look. This test_set_oom_score_adj() is new, and coincidentally written
> by another google developer ;)
> 

Ignore the history of the function; it simply duplicates the old 
PF_OOM_ORIGIN flag, which has now been removed.

> As there's not really a precedent for this, if those that I added to the
> Cc, give their acks, I'm happy to apply this for the next merge window.
> 

This problem isn't new; it's always been possible that an allocation that 
is higher order, using GFP_ATOMIC or GFP_NOWAIT, or utilizing 
__GFP_NORETRY as I suggested here, would deplete memory at the same time 
that a GFP_FS allocation on another cpu would invoke the oom killer.

If that happens between the time when tracing_resize_ring_buffer() goes 
oom and its nicely written error handling starts freeing memory, then it's 
possible that another task will be unfairly oom killed.  Note that the 
suggested solution of test_set_oom_score_adj(OOM_SCORE_ADJ_MAX) doesn't 
prevent that in all cases: it's possible that another thread on the system 
also has an oom_score_adj of OOM_SCORE_ADJ_MAX and it would be killed in 
its place just because it appeared in the tasklist first (which is 
guaranteed if this is simply an echo command).

Relying on the oom killer to kill this task for parallel blockable 
allocations doesn't seem like the best solution for the sole reason that 
the program that wrote to buffer_size_kb may count on its return value.  
It may be able to handle an -ENOMEM return value and, perhaps, try to 
write a smaller value?

I think what this patch really wants to do is utilize __GFP_NORETRY as 
previously suggested and, if we're really concerned about parallel 
allocations in this instance even though the same situation exists all 
over the kernel, also create an oom notifier with register_oom_notifier() 
that may be called in oom conditions that would free memory when 
buffer->record_disabled is non-zero and prevent the oom.  That would 
increase the size of the ring buffer as large as possible up until oom 
even though it may not be to what the user requested.

Otherwise, you'll just want to use oom_killer_disable() to prevent the 
oom killer altogether.


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-27  9:43             ` David Rientjes
@ 2011-05-27 12:48               ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-05-27 12:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker,
	Michael Rubin, David Sharp, linux-kernel, Peter Zijlstra,
	Mel Gorman, Rik Van Riel, Andrew Morton

On Fri, 2011-05-27 at 02:43 -0700, David Rientjes wrote:

> This problem isn't new, it's always been possible that an allocation that 
> is higher order, using GFP_ATOMIC or GFP_NOWAIT, or utilizing 
> __GFP_NORETRY as I suggested here, would deplete memory at the same time 
> that a GFP_FS allocation on another cpu would invoke the oom killer.
> 
> If that happens between the time when tracing_resize_ring_buffer() goes 
> oom and its nicely written error handling starts freeing memory, then it's 
> possible that another task will be unfairly oom killed.  Note that the 
> suggested solution of test_set_oom_score_adj(OOM_SCORE_ADJ_MAX) doesn't 
> prevent that in all cases: it's possible that another thread on the system 
> also has an oom_score_adj of OOM_SCORE_ADJ_MAX and it would be killed in 
> its place just because it appeared in the tasklist first (which is 
> guaranteed if this is simply an echo command).
> 
> Relying on the oom killer to kill this task for parallel blockable 
> allocations doesn't seem like the best solution for the sole reason that 
> the program that wrote to buffer_size_kb may count on its return value.  
> It may be able to handle an -ENOMEM return value and, perhaps, try to 
> write a smaller value?
> 
> I think what this patch really wants to do is utilize __GFP_NORETRY as 
> previously suggested and, if we're really concerned about parallel 
> allocations in this instance even though the same situation exists all 
> over the kernel, also create an oom notifier with register_oom_notifier() 
> that may be called in oom conditions that would free memory when 
> buffer->record_disabled is non-zero and prevent the oom.  That would 
> increase the size of the ring buffer as large as possible up until oom 
> even though it may not be to what the user requested.
> 
> Otherwise, you'll just want to use oom_killer_disable() to prevent the oom 
> killer altogether.

Thanks for the detailed explanation. OK, I'm convinced. The proper
solution looks to be both the use of __GFP_NORETRY and the use of
test_set_oom_score_adj(). I don't think it is necessary to worry about
multiple users of this score adj, as when we are in an OOM situation,
things are just bad to begin with. But Google has to deal with bad
situations more than others, so if it becomes an issue for you, then we
can discuss those changes later.

Vaibhav, can you send an updated patch?

Thanks,

-- Steve




* [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik
  2011-05-26 20:04 ` Steven Rostedt
@ 2011-05-27 17:58 ` Vaibhav Nagarnaik
  2011-05-27 23:22   ` David Rientjes
  2011-06-07 23:41   ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik
  1 sibling, 2 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-27 17:58 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker
  Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik

The tracing ring buffer is allocated from kernel memory. While
allocating the memory, if OOM happens, the allocating process might not
be the one that gets killed, since the ring-buffer memory is not
allocated as process memory. Thus random processes might get killed
during the allocation.

This patch makes sure that the allocating process is considered the most
likely oom-kill-able process while the allocation is going on. Thus, if
the oom-killer is invoked because of the ring-buffer allocation, it is
easier for the ring buffer memory to be freed, saving important system
processes from being killed.

This patch also adds the __GFP_NORETRY flag to the ring buffer
allocation calls so that they fail more gracefully if the system cannot
complete the allocation request.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 kernel/trace/ring_buffer.c |   15 ++++++++++-----
 kernel/trace/trace.c       |    9 +++++++++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 02b7896..0339f95 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+				    GFP_KERNEL | __GFP_NORETRY,
+				    cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
 
@@ -1014,7 +1015,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, &pages);
 
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					GFP_KERNEL, 0);
+					GFP_KERNEL | __GFP_NORETRY, 0);
 		if (!page)
 			goto free_pages;
 		bpage->page = page_address(page);
@@ -1378,11 +1379,13 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
 			struct page *page;
 			bpage = kzalloc_node(ALIGN(sizeof(*bpage),
 						  cache_line_size()),
-					    GFP_KERNEL, cpu_to_node(cpu));
+					    GFP_KERNEL | __GFP_NORETRY,
+					    cpu_to_node(cpu));
 			if (!bpage)
 				goto free_pages;
 			list_add(&bpage->list, &pages);
-			page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL,
+			page = alloc_pages_node(cpu_to_node(cpu),
+						GFP_KERNEL | __GFP_NORETRY,
 						0);
 			if (!page)
 				goto free_pages;
@@ -3737,7 +3740,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
 	struct buffer_data_page *bpage;
 	struct page *page;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY,
+				0);
 	if (!page)
 		return NULL;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b926578..15a667a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -37,6 +37,7 @@
 #include <linux/init.h>
 #include <linux/poll.h>
 #include <linux/fs.h>
+#include <linux/oom.h>
 
 #include "trace.h"
 #include "trace_output.h"
@@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 	unsigned long val;
 	char buf[64];
 	int ret;
+	int oom_score_adj;
 
 	if (cnt >= sizeof(buf))
 		return -EINVAL;
@@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 	/* value is in KB */
 	val <<= 10;
 
+	/*
+	 * make sure this process is picked over others to be killed in OOM
+	 * condition
+	 */
+	oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
 	ret = tracing_resize_ring_buffer(val);
+	/* restore the original oom_score_adj value */
+	test_set_oom_score_adj(oom_score_adj);
 	if (ret < 0)
 		return ret;
 
-- 
1.7.3.1



* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-27 17:58 ` Vaibhav Nagarnaik
@ 2011-05-27 23:22   ` David Rientjes
  2011-05-28  0:44     ` Vaibhav Nagarnaik
  2011-06-07 23:41   ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik
  1 sibling, 1 reply; 30+ messages in thread
From: David Rientjes @ 2011-05-27 23:22 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Fri, 27 May 2011, Vaibhav Nagarnaik wrote:

> The tracing ring buffer is allocated from kernel memory. While
> allocating the memory, if OOM happens, the allocating process might not
> be the one that gets killed, since the ring-buffer memory is not
> allocated as process memory. Thus random processes might get killed
> during the allocation.
> 
> This patch makes sure that the allocating process is considered the most
> likely oom-kill-able process while the allocating is going on. Thus if
> oom-killer is invoked because of ring-buffer allocation, it is easier
> for the ring buffer memory to be freed and save important system
> processes from being killed.
> 
> This patch also adds __GFP_NORETRY flag to the ring buffer allocation
> calls to make it fail more gracefully if the system will not be able to
> complete the allocation request.
> 
> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>

Still not sure this is what we want, I'm afraid.

I like the addition of __GFP_NORETRY, but I don't understand the use of 
test_set_oom_score_adj() here.  Why can't we use oom_killer_disable(), 
allocate with __GFP_NORETRY, and then do oom_killer_enable()?

This prevents other tasks from getting oom killed themselves if they have 
oom_score_adj of OOM_SCORE_ADJ_MAX and allows the write to fail with 
-ENOMEM rather then being oom killed out from under us.

So why is test_set_oom_score_adj() better?

The alternative would be to setup an oom notifier for the ring buffer and 
stop allocating prior to killing a task and return a size that was smaller 
than what the user requested.


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-27 23:22   ` David Rientjes
@ 2011-05-28  0:44     ` Vaibhav Nagarnaik
  2011-05-28  1:50       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-28  0:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Fri, May 27, 2011 at 4:22 PM, David Rientjes <rientjes@google.com> wrote:
> On Fri, 27 May 2011, Vaibhav Nagarnaik wrote:
>
>> The tracing ring buffer is allocated from kernel memory. While
>> allocating the memory, if OOM happens, the allocating process might not
>> be the one that gets killed, since the ring-buffer memory is not
>> allocated as process memory. Thus random processes might get killed
>> during the allocation.
>>
>> This patch makes sure that the allocating process is considered the most
>> likely oom-kill-able process while the allocating is going on. Thus if
>> oom-killer is invoked because of ring-buffer allocation, it is easier
>> for the ring buffer memory to be freed and save important system
>> processes from being killed.
>>
>> This patch also adds __GFP_NORETRY flag to the ring buffer allocation
>> calls to make it fail more gracefully if the system will not be able to
>> complete the allocation request.
>>
>> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
>
> Still not sure this is what we want, I'm afraid.
>
> I like the addition of __GFP_NORETRY, but I don't understand the use of
> test_set_oom_score_adj() here.  Why can't we use oom_killer_disable(),
> allocate with __GFP_NORETRY, and then do oom_killer_enable()?
>
> This prevents other tasks from getting oom killed themselves if they have
> oom_score_adj of OOM_SCORE_ADJ_MAX and allows the write to fail with
> -ENOMEM rather then being oom killed out from under us.
>
> So why is test_set_oom_score_adj() better?
>

When I tested this change, I saw that the oom-killer can be invoked by any
process trying to allocate. If oom_score_adj is not set to the maximum,
the oom-killer will not pick the "echo" process allocating the ring buffer
memory, since the memory is not counted against its RSS. This was the
main reason to use the test_set_oom_score_adj() API.

I saw this API being used in swapoff() to allocate RAM for the swapped-out
pages, and as implemented it seemed like a similar use case.

This change is not for the case where memory is already constrained. It
is for the case where the user has specified an invalid amount of memory
to be allocated while running in a constrained container. We don't want
to kill random processes while the ring buffer allocation is going on.

By using oom_killer_{enable|disable}, I think you mean setting the
global variable oom_killer_disabled. Using it will upset the system's
expectation of memory behavior. When __GFP_NORETRY is set for our ring
buffer allocating process, oom_killer_disabled is never checked. But
disabling it would cause NULLs to be returned to other allocating
processes, which might destabilize them and the system. The oom-killer
has a much better chance of reclaiming memory by killing the ring buffer
allocating process at that point.

As for the other tasks with OOM_SCORE_ADJ_MAX set, one can argue that
they expected to be killed even if they themselves were not the direct
cause of the OOM situation.

> The alternative would be to setup an oom notifier for the ring buffer and
> stop allocating prior to killing a task and return a size that was smaller
> than what the user requested.
>

I think this gets really complex for the ring buffer allocation
scenario. I agree that using __GFP_NORETRY is great for a graceful
failure. In the rare case where the allocation is in progress and some
other process invokes the oom-killer, stopping the ring buffer
allocation should be a priority to make sure that the system doesn't get
destabilized. As far as I can see, setting oom_score_adj to the maximum
is the best way to solve it.

That said, I am open to changing it if Steven and you think using
oom_killer_disabled is a better solution.



Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-28  0:44     ` Vaibhav Nagarnaik
@ 2011-05-28  1:50       ` Steven Rostedt
  2011-05-30  6:23         ` KOSAKI Motohiro
  2011-05-30 23:46         ` Vaibhav Nagarnaik
  0 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-05-28  1:50 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote:

> That said, I am open to changing it if Steven and you think using
> oom_killer_disabled is a better solution.

My biggest concern is that we are setting policy in the kernel. If you
are concerned about this, why not just have the process that is going to
increase the size of the ring buffer adjust its own oom policy
with /proc/<pid>/oom_score_adj? Only a privileged process can increase
the size of the ring buffer, so it's not like we are worried about a
normal user task upping the ring buffer size to kill other processes.

-- Steve




* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-28  1:50       ` Steven Rostedt
@ 2011-05-30  6:23         ` KOSAKI Motohiro
  2011-05-30 23:54           ` Vaibhav Nagarnaik
  2011-05-30 23:46         ` Vaibhav Nagarnaik
  1 sibling, 1 reply; 30+ messages in thread
From: KOSAKI Motohiro @ 2011-05-30  6:23 UTC (permalink / raw)
  To: rostedt
  Cc: vnagarnaik, rientjes, mingo, fweisbec, mrubin, dhsharp, linux-kernel

(2011/05/28 10:50), Steven Rostedt wrote:
> On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote:
> 
>> That said, I am open to changing it if Steven and you think using
>> oom_killer_disabled is a better solution.
> 
> My biggest concern is that we are setting policy in the kernel. If you
> are concerned about this, why not just have the process that is going to
> increase the size of the ring buffer adjust its own oom policy
> with /proc/<pid>/oom_score_adj ? Only a privilege process can increase
> the size of the ring buffer so it's not like we are worried about any
> normal user task upping the ring buffer to kill other processes.

I like Steven's approach.

Because even if we apply Vaibhav's patch, we still have an OOM issue:
when the oom-killer kills the echo command, the ring buffer is not
shrunk; the command just dies. So the kernel is still under extreme
memory shortage, and any admin operation may invoke the oom-killer
again.

And -personally- I think any tracing user should know the system RAM
size and a proper ring buffer size. ;)



* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-28  1:50       ` Steven Rostedt
  2011-05-30  6:23         ` KOSAKI Motohiro
@ 2011-05-30 23:46         ` Vaibhav Nagarnaik
  2011-06-07 23:07           ` Steven Rostedt
  1 sibling, 1 reply; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-30 23:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Fri, May 27, 2011 at 6:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote:
>
>> That said, I am open to changing it if Steven and you think using
>> oom_killer_disabled is a better solution.
>
> My biggest concern is that we are setting policy in the kernel. If you
> are concerned about this, why not just have the process that is going to
> increase the size of the ring buffer adjust its own oom policy
> with /proc/<pid>/oom_score_adj ? Only a privilege process can increase
> the size of the ring buffer so it's not like we are worried about any
> normal user task upping the ring buffer to kill other processes.
>
> -- Steve

You are right. Changing the value of oom_killer_disabled is setting
kernel policy and I don't like that solution either.

It would be great if the user always set oom_score_adj while
allocating ring buffer memory, but there are different use cases where
that might not be feasible. One is the number of different
applications which use ftrace and set the ring buffer memory size
(trace-cmd, perf, powertop, etc.). The other case is where a developer
is setting the size for his own tracing. In all these cases
oom_score_adj *should* be set to the maximum to be safe, but a
developer might forget this step and destabilize the system. This
is why I think that setting oom_score_adj to the maximum in the kernel
while allocating the buffer, as well as using __GFP_NORETRY, should
suffice for the end goal of being a good memory citizen.


Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-30  6:23         ` KOSAKI Motohiro
@ 2011-05-30 23:54           ` Vaibhav Nagarnaik
  0 siblings, 0 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-05-30 23:54 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Steven Rostedt, David Rientjes, Ingo Molnar, Frederic Weisbecker,
	Michael Rubin, David Sharp, linux-kernel

On Sun, May 29, 2011 at 11:23 PM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> (2011/05/28 10:50), Steven Rostedt wrote:
>> On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote:
>>
>>> That said, I am open to changing it if Steven and you think using
>>> oom_killer_disabled is a better solution.
>>
>> My biggest concern is that we are setting policy in the kernel. If you
>> are concerned about this, why not just have the process that is going to
>> increase the size of the ring buffer adjust its own oom policy
>> with /proc/<pid>/oom_score_adj ? Only a privilege process can increase
>> the size of the ring buffer so it's not like we are worried about any
>> normal user task upping the ring buffer to kill other processes.
>
> I like Steven's approach.
>
> Because even if we apply Vaibhav's patch, we still have a oom issue.
> because when oom-killer killed echo commands, it doesn't shrink ring
> buffer. it only just die. So, the kernel is still under extreme memory
> shortage. Any admins operation may invoke next oom-killer.

The ring buffer allocation is done synchronously with the echo
command. When a process is picked by the OOM killer, it has a fatal
signal pending. Any further allocations by the process in that
scenario will return with a failure. The ring buffer allocation code
can handle this scenario very well and give up all the previously
allocated memory.



Vaibhav Nagarnaik


* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-05-30 23:46         ` Vaibhav Nagarnaik
@ 2011-06-07 23:07           ` Steven Rostedt
  2011-06-07 23:30             ` Vaibhav Nagarnaik
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2011-06-07 23:07 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Mon, 2011-05-30 at 16:46 -0700, Vaibhav Nagarnaik wrote:
> On Fri, May 27, 2011 at 6:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote:
> >
> >> That said, I am open to changing it if Steven and you think using
> >> oom_killer_disabled is a better solution.
> >
> > My biggest concern is that we are setting policy in the kernel. If you
> > are concerned about this, why not just have the process that is going to
> > increase the size of the ring buffer adjust its own oom policy
> > with /proc/<pid>/oom_score_adj ? Only a privilege process can increase
> > the size of the ring buffer so it's not like we are worried about any
> > normal user task upping the ring buffer to kill other processes.
> >
> > -- Steve
> 
> You are right. Changing the value of oom_killer_disabled is setting
> kernel policy and I don't like the solution either.
> 
> It would be great if the user always sets oom_score_adj while
> allocating ring buffer memory, but there are different use cases I see
> where it might not be feasible. One is the number of different
> applications which use ftrace and set the ring buffer memory size
> (trace-cmd, perf, powertop, etc). The other case is where a developer
> is setting the size for his own tracing. In all these cases,
> oom_score_adj *should* be set to maximum to be safe, but it a
> developer might forget this step and can de-stabilize the system. This
> is why I think that setting oom_score_adj to maximum in the kernel
> while allocating buffer as well as using __GFP_NORETRY should suffice
> for the end goal of being a good memory citizen.


Hi Vaibhav,

I'm going through patches for v3.1 now. Where are we on this issue? I
still don't really like the idea of having the kernel set the oom
policy by default. But I'm totally fine with changing the allocations
to use __GFP_NORETRY.

If you switch to __GFP_NORETRY, do you still see the issues you were
seeing? You could also add userspace helpers that set the oom
policy of anything accessing the ring buffer.

Either have a script that updates the ring buffer size and sets the oom
policy, or have a library with a helper routine.

Would those be fine for you?

-- Steve





* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process
  2011-06-07 23:07           ` Steven Rostedt
@ 2011-06-07 23:30             ` Vaibhav Nagarnaik
  0 siblings, 0 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-06-07 23:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Tue, Jun 7, 2011 at 4:07 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Hi Vaibhav,
>
> I'm going though patches for v3.1 now. Where are we on this issue. I
> still don't really like the ideal of having the kernel set the oom
> policy by default. But I'm totally fine with changing the allocations
> for NORETRY.
>
> If you switch it to NORETRY, do you still have the issues you were
> seeing? You could also add userspace helpers that would set the oom
> policy of things accessing the ring buffer.
>
> Either have a script that updates the ring buffer size and sets the oom
> policy, or have a library with a helper routine.
>
> Would those be fine for you?
>
> -- Steve
>
I understand your hesitation to change the kernel oom policy when
allocating the ring buffer. I am fine with just using __GFP_NORETRY on
the allocation calls so they fail more gracefully. That works for most
of the use cases I encounter.

We will make changes to our internal tools to make sure that
oom_score_adj is changed to pick the allocating process for oom-killing.

I will send the updated patch.


Thanks

Vaibhav Nagarnaik


* [PATCH] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-05-27 17:58 ` Vaibhav Nagarnaik
  2011-05-27 23:22   ` David Rientjes
@ 2011-06-07 23:41   ` Vaibhav Nagarnaik
  2011-06-07 23:47     ` Frederic Weisbecker
  2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
  1 sibling, 2 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-06-07 23:41 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker
  Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik

The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, OOM might happen which destabilizes
the system. Thus random processes might get killed during the
allocation.

This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
to make it fail more gracefully if the system will not be able to
complete the allocation request.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
 kernel/trace/ring_buffer.c |   15 ++++++++++-----
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2780e60..ec817d5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+				    GFP_KERNEL | __GFP_NORETRY,
+				    cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
 
@@ -1014,7 +1015,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, &pages);
 
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					GFP_KERNEL, 0);
+					GFP_KERNEL | __GFP_NORETRY, 0);
 		if (!page)
 			goto free_pages;
 		bpage->page = page_address(page);
@@ -1378,11 +1379,13 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
 			struct page *page;
 			bpage = kzalloc_node(ALIGN(sizeof(*bpage),
 						  cache_line_size()),
-					    GFP_KERNEL, cpu_to_node(cpu));
+					    GFP_KERNEL | __GFP_NORETRY,
+					    cpu_to_node(cpu));
 			if (!bpage)
 				goto free_pages;
 			list_add(&bpage->list, &pages);
-			page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL,
+			page = alloc_pages_node(cpu_to_node(cpu),
+						GFP_KERNEL | __GFP_NORETRY,
 						0);
 			if (!page)
 				goto free_pages;
@@ -3737,7 +3740,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
 	struct buffer_data_page *bpage;
 	struct page *page;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY,
+				0);
 	if (!page)
 		return NULL;
 
-- 
1.7.3.1



* Re: [PATCH] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-07 23:41   ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik
@ 2011-06-07 23:47     ` Frederic Weisbecker
  2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
  1 sibling, 0 replies; 30+ messages in thread
From: Frederic Weisbecker @ 2011-06-07 23:47 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Ingo Molnar, Michael Rubin, David Sharp, linux-kernel

On Tue, Jun 07, 2011 at 04:41:28PM -0700, Vaibhav Nagarnaik wrote:
> The tracing ring buffer is allocated from kernel memory. While
> allocating a large chunk of memory, OOM might happen which destabilizes
> the system. Thus random processes might get killed during the
> allocation.
> 
> This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
> to make it fail more gracefully if the system will not be able to
> complete the allocation request.
> 
> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
> ---
>  kernel/trace/ring_buffer.c |   15 ++++++++++-----
>  1 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 2780e60..ec817d5 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
>  	for (i = 0; i < nr_pages; i++) {
>  		struct page *page;
>  		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> -				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
> +				    GFP_KERNEL | __GFP_NORETRY,

Please put a comment in the code to explain the reason for this
__GFP_NORETRY; it's pretty hard to follow without the changelog.

Thanks.


* [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-07 23:41   ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik
  2011-06-07 23:47     ` Frederic Weisbecker
@ 2011-06-08  0:01     ` Vaibhav Nagarnaik
  2011-06-08  2:30       ` David Rientjes
                         ` (2 more replies)
  1 sibling, 3 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-06-08  0:01 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker
  Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik

The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, OOM might happen which destabilizes
the system. Thus random processes might get killed during the
allocation.

This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
to make it fail more gracefully if the system will not be able to
complete the allocation request.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
---
Changelog:
v2-v1: Added comment to explain use of __GFP_NORETRY

 kernel/trace/ring_buffer.c |   25 ++++++++++++++++++++-----
 1 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2780e60..bd588b6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1004,8 +1004,14 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
+		/*
+		 * __GFP_NORETRY flag makes sure that the allocation fails
+		 * gracefully without invoking oom-killer and the system is
+		 * not destabilized.
+		 */
 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+				    GFP_KERNEL | __GFP_NORETRY,
+				    cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
 
@@ -1014,7 +1020,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, &pages);
 
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					GFP_KERNEL, 0);
+					GFP_KERNEL | __GFP_NORETRY, 0);
 		if (!page)
 			goto free_pages;
 		bpage->page = page_address(page);
@@ -1376,13 +1382,20 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
 	for_each_buffer_cpu(buffer, cpu) {
 		for (i = 0; i < new_pages; i++) {
 			struct page *page;
+			/*
+			 * __GFP_NORETRY flag makes sure that the allocation
+			 * fails gracefully without invoking oom-killer and
+			 * the system is not destabilized.
+			 */
 			bpage = kzalloc_node(ALIGN(sizeof(*bpage),
 						  cache_line_size()),
-					    GFP_KERNEL, cpu_to_node(cpu));
+					    GFP_KERNEL | __GFP_NORETRY,
+					    cpu_to_node(cpu));
 			if (!bpage)
 				goto free_pages;
 			list_add(&bpage->list, &pages);
-			page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL,
+			page = alloc_pages_node(cpu_to_node(cpu),
+						GFP_KERNEL | __GFP_NORETRY,
 						0);
 			if (!page)
 				goto free_pages;
@@ -3737,7 +3750,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
 	struct buffer_data_page *bpage;
 	struct page *page;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY,
+				0);
 	if (!page)
 		return NULL;
 
-- 
1.7.3.1



* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
@ 2011-06-08  2:30       ` David Rientjes
  2011-06-09 11:37       ` KOSAKI Motohiro
  2011-07-05 12:54       ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik
  2 siblings, 0 replies; 30+ messages in thread
From: David Rientjes @ 2011-06-08  2:30 UTC (permalink / raw)
  To: Vaibhav Nagarnaik
  Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Tue, 7 Jun 2011, Vaibhav Nagarnaik wrote:

> The tracing ring buffer is allocated from kernel memory. While
> allocating a large chunk of memory, OOM might happen which destabilizes
> the system. Thus random processes might get killed during the
> allocation.
> 
> This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
> to make it fail more gracefully if the system will not be able to
> complete the allocation request.
> 
> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
  2011-06-08  2:30       ` David Rientjes
@ 2011-06-09 11:37       ` KOSAKI Motohiro
  2011-06-09 12:14         ` Steven Rostedt
  2011-07-05 12:54       ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik
  2 siblings, 1 reply; 30+ messages in thread
From: KOSAKI Motohiro @ 2011-06-09 11:37 UTC (permalink / raw)
  To: vnagarnaik
  Cc: kosaki.motohiro, rostedt, mingo, fweisbec, mrubin, dhsharp, linux-kernel

(2011/06/08 9:01), Vaibhav Nagarnaik wrote:
> The tracing ring buffer is allocated from kernel memory. While
> allocating a large chunk of memory, OOM might happen which destabilizes
> the system. Thus random processes might get killed during the
> allocation.
> 
> This patch adds __GFP_NORETRY flag to the ring buffer allocation calls
> to make it fail more gracefully if the system will not be able to
> complete the allocation request.
> 
> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
> ---
> Changelog:
> v2-v1: Added comment to explaing use of __GFP_NORETRY
> 
>  kernel/trace/ring_buffer.c |   25 ++++++++++++++++++++-----
>  1 files changed, 20 insertions(+), 5 deletions(-)

Unfortunately, __GFP_NORETRY is racy and doesn't work as expected.
If free memory is insufficient, the thread may start to reclaim, and
another thread can steal the reclaimed memory; thread0 then doesn't retry.

So thread0's page allocation may fail even though the system has enough
reclaimable memory.

        thread0                                        thread1
        ---------------------------------------------------------------
        alloc_pages()
           get_page_from_freelist() -> fail
           try_to_free_pages()
                                                    alloc_pages()
                                                       get_page_from_freelist() -> success
           get_page_from_freelist() -> fail again

I think this is an mm issue, and AFAIK Minchan and some other developers
are working on fixing it. But _right now_ your patch doesn't work.

Thanks.





* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-09 11:37       ` KOSAKI Motohiro
@ 2011-06-09 12:14         ` Steven Rostedt
  2011-06-09 18:41           ` Vaibhav Nagarnaik
  2011-06-09 19:42           ` David Rientjes
  0 siblings, 2 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-06-09 12:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: vnagarnaik, mingo, fweisbec, mrubin, dhsharp, linux-kernel

On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:

> Unfortunately, __GFP_NORETRY is racy and don't work as expected.
> If free memory is not enough, the thread may start to reclaim and
> another thread can steal the reclaimed memory. And thread0 don't retry.
> 
> Then, thread0's alloc page may fail even though system have enough reclaimable
> memory.
> 
>         thread0                                        thread1
>         ---------------------------------------------------------------
>         alloc_pages()
>            get_page_from_freelist() -> fail
>            try_to_free_pages()
>                                                     alloc_pages()
>                                                        get_page_from_freelist() -> success
>            get_page_from_freelist() -> fail again
> 
> I think this is mm issue, and afaik, Minchan and some developers are
> working on fixing it. but _now_ your patch doesn't work.

Have you seen this fail in practice?

I'm not too concerned if it only triggers when memory is tight. But if it
triggers in normal cases, then that worries me.

Thanks,

-- Steve




* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-09 12:14         ` Steven Rostedt
@ 2011-06-09 18:41           ` Vaibhav Nagarnaik
  2011-06-09 19:42           ` David Rientjes
  1 sibling, 0 replies; 30+ messages in thread
From: Vaibhav Nagarnaik @ 2011-06-09 18:41 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: KOSAKI Motohiro, Ingo Molnar, Frederic Weisbecker, Michael Rubin,
	David Sharp, linux-kernel

On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote:
> Unfortunately, __GFP_NORETRY is racy and don't work as expected.
> If free memory is not enough, the thread may start to reclaim and
> another thread can steal the reclaimed memory. And thread0 don't retry.
>
> Then, thread0's alloc page may fail even though system have enough reclaimable
> memory.
>
>         thread0                                        thread1
>         ---------------------------------------------------------------
>         alloc_pages()
>            get_page_from_freelist() -> fail
>            try_to_free_pages()
>                                                     alloc_pages()
>                                                        get_page_from_freelist() -> success
>            get_page_from_freelist() -> fail again
>
> I think this is mm issue, and afaik, Minchan and some developers are
> working on fixing it. but _now_ your patch doesn't work.

Thanks. I was not aware of this condition. I discussed it with the mm
team and it seems this is a well-known problem, though it is not that
bad for order-0 allocations, which is what the ring buffer uses.

Also, it happens under tight memory conditions, at which point allocating
the ring buffer might make things worse. I think notifying the user of
the failure is better than worsening the situation by allocating the
ring buffer anyway.

On Thu, Jun 9, 2011 at 5:14 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Have you seen this fail in practice?

Not in my normal test cases. I haven't stress-tested it, but stressing
doesn't really make sense for this patch. The idea is that the ring
buffer should be able to allocate memory as long as it doesn't impact
the system too much.


Vaibhav Nagarnaik


* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-09 12:14         ` Steven Rostedt
  2011-06-09 18:41           ` Vaibhav Nagarnaik
@ 2011-06-09 19:42           ` David Rientjes
  2011-06-09 19:52             ` Steven Rostedt
  1 sibling, 1 reply; 30+ messages in thread
From: David Rientjes @ 2011-06-09 19:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: KOSAKI Motohiro, vnagarnaik, mingo, fweisbec, mrubin, dhsharp,
	linux-kernel

On Thu, 9 Jun 2011, Steven Rostedt wrote:

> > Unfortunately, __GFP_NORETRY is racy and doesn't work as expected.
> > If free memory is low, the thread may start to reclaim, and
> > another thread can steal the reclaimed memory. thread0 doesn't retry.
> > 
> > Then thread0's page allocation may fail even though the system has
> > enough reclaimable memory.
> > 
> >         thread0                                        thread1
> >         ---------------------------------------------------------------
> >         alloc_pages()
> >            get_page_from_freelist() -> fail
> >            try_to_free_pages()
> >                                                     alloc_pages()
> >                                                        get_page_from_freelist() -> success
> >            get_page_from_freelist() -> fail again
> > 
> > I think this is an mm issue, and AFAIK Minchan and some other
> > developers are working on fixing it. But as of _now_, your patch doesn't work.
> 
> Have you seen this fail in practice?
> 
> I'm not too concerned if it only triggers when memory is tight. But if
> it is triggering in normal cases, then that worries me.
> 

It would only happen if there were an antagonist that stole the reclaimed 
pages before your __GFP_NORETRY allocation could allocate them, resulting 
in the system being oom again as it was before reclaim occurred.  Without 
__GFP_NORETRY, we'd automatically retry these allocations in a loop until 
we found the memory, since they are order-0, so the only side effect would 
be increased latency in the allocation.  I think if we still end up oom 
after reclaiming memory that was then allocated by another thread, we 
probably don't want to be expanding the ring buffer, and thus I see no 
problem with just failing.


* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-09 19:42           ` David Rientjes
@ 2011-06-09 19:52             ` Steven Rostedt
  0 siblings, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2011-06-09 19:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, vnagarnaik, mingo, fweisbec, mrubin, dhsharp,
	linux-kernel

On Thu, 2011-06-09 at 12:42 -0700, David Rientjes wrote:

> It would only happen if there were an antagonist that stole the reclaimed 
> pages before your __GFP_NORETRY allocation could allocate them, resulting 
> in the system being oom again as it was before reclaim occurred.  Without 
> __GFP_NORETRY, we'd automatically retry these allocations in a loop until 
> we found the memory, since they are order-0, so the only side effect would 
> be increased latency in the allocation.  I think if we still end up oom 
> after reclaiming memory that was then allocated by another thread, we 
> probably don't want to be expanding the ring buffer, and thus I see no 
> problem with just failing.

Agreed, which is why I already pushed the patch.

-- Steve




* [tip:perf/core] ring-buffer: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
  2011-06-08  2:30       ` David Rientjes
  2011-06-09 11:37       ` KOSAKI Motohiro
@ 2011-07-05 12:54       ` tip-bot for Vaibhav Nagarnaik
  2 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Vaibhav Nagarnaik @ 2011-07-05 12:54 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, vnagarnaik, hpa, mingo, dhsharp, fweisbec, rostedt,
	mrubin, tglx, rientjes

Commit-ID:  d7ec4bfed6c97405c6417970ba06c439e08ab8e3
Gitweb:     http://git.kernel.org/tip/d7ec4bfed6c97405c6417970ba06c439e08ab8e3
Author:     Vaibhav Nagarnaik <vnagarnaik@google.com>
AuthorDate: Tue, 7 Jun 2011 17:01:42 -0700
Committer:  Steven Rostedt <rostedt@goodmis.org>
CommitDate: Tue, 14 Jun 2011 22:48:51 -0400

ring-buffer: Set __GFP_NORETRY flag for ring buffer allocating process

The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, an OOM condition might occur and
destabilize the system, so random processes might get killed during the
allocation.

This patch adds the __GFP_NORETRY flag to the ring buffer allocation
calls to make them fail more gracefully if the system is unable to
complete the allocation request.

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c |   25 +++++++++++++++++++------
 1 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f00ede3..731201b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1004,9 +1004,14 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
-
+		/*
+		 * __GFP_NORETRY flag makes sure that the allocation fails
+		 * gracefully without invoking oom-killer and the system is
+		 * not destabilized.
+		 */
 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+				    GFP_KERNEL | __GFP_NORETRY,
+				    cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
 
@@ -1015,7 +1020,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, &pages);
 
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					GFP_KERNEL, 0);
+					GFP_KERNEL | __GFP_NORETRY, 0);
 		if (!page)
 			goto free_pages;
 		bpage->page = page_address(page);
@@ -1377,13 +1382,20 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
 	for_each_buffer_cpu(buffer, cpu) {
 		for (i = 0; i < new_pages; i++) {
 			struct page *page;
+			/*
+			 * __GFP_NORETRY flag makes sure that the allocation
+			 * fails gracefully without invoking oom-killer and
+			 * the system is not destabilized.
+			 */
 			bpage = kzalloc_node(ALIGN(sizeof(*bpage),
 						  cache_line_size()),
-					    GFP_KERNEL, cpu_to_node(cpu));
+					    GFP_KERNEL | __GFP_NORETRY,
+					    cpu_to_node(cpu));
 			if (!bpage)
 				goto free_pages;
 			list_add(&bpage->list, &pages);
-			page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+			page = alloc_pages_node(cpu_to_node(cpu),
+						GFP_KERNEL | __GFP_NORETRY, 0);
 			if (!page)
 				goto free_pages;
 			bpage->page = page_address(page);
@@ -3737,7 +3749,8 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
 	struct buffer_data_page *bpage;
 	struct page *page;
 
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY, 0);
 	if (!page)
 		return NULL;
 


end of thread, other threads:[~2011-07-05 12:54 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik
2011-05-26 20:04 ` Steven Rostedt
2011-05-26 20:22   ` David Rientjes
2011-05-26 20:23   ` Vaibhav Nagarnaik
2011-05-26 20:33     ` David Rientjes
2011-05-26 21:00       ` Steven Rostedt
2011-05-26 22:28         ` Vaibhav Nagarnaik
2011-05-26 23:38           ` Steven Rostedt
2011-05-27  9:43             ` David Rientjes
2011-05-27 12:48               ` Steven Rostedt
2011-05-26 23:23         ` Vaibhav Nagarnaik
2011-05-27 17:58 ` Vaibhav Nagarnaik
2011-05-27 23:22   ` David Rientjes
2011-05-28  0:44     ` Vaibhav Nagarnaik
2011-05-28  1:50       ` Steven Rostedt
2011-05-30  6:23         ` KOSAKI Motohiro
2011-05-30 23:54           ` Vaibhav Nagarnaik
2011-05-30 23:46         ` Vaibhav Nagarnaik
2011-06-07 23:07           ` Steven Rostedt
2011-06-07 23:30             ` Vaibhav Nagarnaik
2011-06-07 23:41   ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik
2011-06-07 23:47     ` Frederic Weisbecker
2011-06-08  0:01     ` [PATCH v2] " Vaibhav Nagarnaik
2011-06-08  2:30       ` David Rientjes
2011-06-09 11:37       ` KOSAKI Motohiro
2011-06-09 12:14         ` Steven Rostedt
2011-06-09 18:41           ` Vaibhav Nagarnaik
2011-06-09 19:42           ` David Rientjes
2011-06-09 19:52             ` Steven Rostedt
2011-07-05 12:54       ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik
