* [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process @ 2011-05-26 19:52 Vaibhav Nagarnaik 2011-05-26 20:04 ` Steven Rostedt 2011-05-27 17:58 ` Vaibhav Nagarnaik 0 siblings, 2 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-26 19:52 UTC (permalink / raw) To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik The tracing ring buffer is allocated from kernel memory. While allocating the memory, if OOM happens, the allocating process might not be the one that gets killed, since the ring-buffer memory is not allocated as process memory. Thus random processes might get killed during the allocation. This patch makes sure that the allocating process is considered the most likely oom-kill-able process while the allocation is going on. Thus if the oom-killer is invoked because of the ring-buffer allocation, it is easier for the ring-buffer memory to be freed, saving important system processes from being killed.
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> --- kernel/trace/trace.c | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b926578..15a667a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -37,6 +37,7 @@ #include <linux/init.h> #include <linux/poll.h> #include <linux/fs.h> +#include <linux/oom.h> #include "trace.h" #include "trace_output.h" @@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, unsigned long val; char buf[64]; int ret; + int oom_score_adj; if (cnt >= sizeof(buf)) return -EINVAL; @@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, /* value is in KB */ val <<= 10; + /* + * make sure this process is picked over others to be killed in OOM + * condition + */ + oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ret = tracing_resize_ring_buffer(val); + /* restore the original oom_score_adj value */ + test_set_oom_score_adj(oom_score_adj); if (ret < 0) return ret; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik @ 2011-05-26 20:04 ` Steven Rostedt 2011-05-26 20:22 ` David Rientjes 2011-05-26 20:23 ` Vaibhav Nagarnaik 2011-05-27 17:58 ` Vaibhav Nagarnaik 1 sibling, 2 replies; 30+ messages in thread From: Steven Rostedt @ 2011-05-26 20:04 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, 2011-05-26 at 12:52 -0700, Vaibhav Nagarnaik wrote: > The tracing ring buffer is allocated from kernel memory. While > allocating the memory, if OOM happens, the allocating process might not > be the one that gets killed, since the ring-buffer memory is not > allocated as process memory. Thus random processes might get killed > during the allocation. > > This patch makes sure that the allocating process is considered the most > likely oom-kill-able process while the allocating is going on. Thus if > oom-killer is invoked because of ring-buffer allocation, it is easier > for the ring buffer memory to be freed and save important system > processes from being killed. Hmm, have you tried this in practice? Yes we may kill the "echo" command but it doesn't stop the ring buffer from being allocated, and thus killing the echo command may not be enough, and those critical processes that you are trying to protect will be killed next. Perhaps we should change the allocation of the ring buffer or detect OOM triggering. Maybe make all the allocations ATOMIC, thus it will be either available or not, and fail instead of trying to swap out other memory for the allocation. 
-- Steve > > Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> > --- > kernel/trace/trace.c | 9 +++++++++ > 1 files changed, 9 insertions(+), 0 deletions(-) > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index b926578..15a667a 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -37,6 +37,7 @@ > #include <linux/init.h> > #include <linux/poll.h> > #include <linux/fs.h> > +#include <linux/oom.h> > > #include "trace.h" > #include "trace_output.h" > @@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, > unsigned long val; > char buf[64]; > int ret; > + int oom_score_adj; > > if (cnt >= sizeof(buf)) > return -EINVAL; > @@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, > /* value is in KB */ > val <<= 10; > > + /* > + * make sure this process is picked over others to be killed in OOM > + * condition > + */ > + oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); > ret = tracing_resize_ring_buffer(val); > + /* restore the original oom_score_adj value */ > + test_set_oom_score_adj(oom_score_adj); > if (ret < 0) > return ret; > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 20:04 ` Steven Rostedt @ 2011-05-26 20:22 ` David Rientjes 2011-05-26 20:23 ` Vaibhav Nagarnaik 1 sibling, 0 replies; 30+ messages in thread From: David Rientjes @ 2011-05-26 20:22 UTC (permalink / raw) To: Steven Rostedt Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, 26 May 2011, Steven Rostedt wrote: > Hmm, have you tried this in practice? Yes we may kill the "echo" command > but it doesn't stop the ring buffer from being allocated, and thus > killing the echo command may not be enough, and those critical processes > that you are trying to protect will be killed next. > > Perhaps we should change the allocation of the ring buffer or detect OOM > triggering. Maybe make all the allocations ATOMIC, thus it will be > either available or not, and fail instead of trying to swap out other > memory for the allocation. > My impression of this was that it was attempting to avoid killing a different process by means of the oom killer rather than avoiding swap. I don't think there's anything wrong with using GFP_KERNEL, but I'd suggest also using __GFP_NORETRY so the allocation calls into direct reclaim but will avoid oom killing anything and simply fail instead. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 20:04 ` Steven Rostedt 2011-05-26 20:22 ` David Rientjes @ 2011-05-26 20:23 ` Vaibhav Nagarnaik 2011-05-26 20:33 ` David Rientjes 1 sibling, 1 reply; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-26 20:23 UTC (permalink / raw) To: Steven Rostedt Cc: Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, May 26, 2011 at 1:04 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 2011-05-26 at 12:52 -0700, Vaibhav Nagarnaik wrote: >> The tracing ring buffer is allocated from kernel memory. While >> allocating the memory, if OOM happens, the allocating process might not >> be the one that gets killed, since the ring-buffer memory is not >> allocated as process memory. Thus random processes might get killed >> during the allocation. >> >> This patch makes sure that the allocating process is considered the most >> likely oom-kill-able process while the allocating is going on. Thus if >> oom-killer is invoked because of ring-buffer allocation, it is easier >> for the ring buffer memory to be freed and save important system >> processes from being killed. > > Hmm, have you tried this in practice? Yes we may kill the "echo" command > but it doesn't stop the ring buffer from being allocated, and thus > killing the echo command may not be enough, and those critical processes > that you are trying to protect will be killed next. > Yes I did try this and found that it works as we intend it to. When oom-killer is invoked, it picks the process which has lowest oom_score_adj and kills it or one of its children. When the process is getting killed, any memory allocation from it returns -ENOMEM, which gets handled in our allocation process and we free up previously allocated memory. This API is now being used in other parts of the kernel too, where it knows that the allocation could cause OOM. 
> Perhaps we should change the allocation of the ring buffer or detect OOM > triggering. Maybe make all the allocations ATOMIC, thus it will be > either available or not, and fail instead of trying to swap out other > memory for the allocation. > > -- Steve > Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 20:23 ` Vaibhav Nagarnaik @ 2011-05-26 20:33 ` David Rientjes 2011-05-26 21:00 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2011-05-26 20:33 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, 26 May 2011, Vaibhav Nagarnaik wrote: > > Hmm, have you tried this in practice? Yes we may kill the "echo" command > > but it doesn't stop the ring buffer from being allocated, and thus > > killing the echo command may not be enough, and those critical processes > > that you are trying to protect will be killed next. > > > > Yes I did try this and found that it works as we intend it to. When > oom-killer is invoked, it picks the process which has lowest > oom_score_adj and kills it or one of its children. s/lowest/highest/ > When the process is > getting killed, any memory allocation from it would be returned -ENOMEM, > which gets handled in our allocation process and we free up previously > allocated memory. > Not sure that's true, this is allocating with kzalloc_node(GFP_KERNEL), correct? If current is oom killed, it will have access to all memory reserves which will increase the likelihood that the allocation will succeed before handling the SIGKILL. > This API is now being used in other parts of kernel too, where it knows > that the allocation could cause OOM. > What's wrong with using __GFP_NORETRY to avoid oom killing entirely and then failing the ring buffer memory allocation? Seems like a better solution than relying on the oom killer, since there may be other threads with a max oom_score_adj as well that would appear in the tasklist first and get killed unnecessarily. Is there some ring buffer code that can't handle failing allocations appropriately? ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 20:33 ` David Rientjes @ 2011-05-26 21:00 ` Steven Rostedt 2011-05-26 22:28 ` Vaibhav Nagarnaik 2011-05-26 23:23 ` Vaibhav Nagarnaik 0 siblings, 2 replies; 30+ messages in thread From: Steven Rostedt @ 2011-05-26 21:00 UTC (permalink / raw) To: David Rientjes Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, 2011-05-26 at 13:33 -0700, David Rientjes wrote: > On Thu, 26 May 2011, Vaibhav Nagarnaik wrote: > > > Not sure that's true, this is allocating with kzalloc_node(GFP_KERNEL), > correct? If current is oom killed, it will have access to all memory > reserves which will increase the liklihood that the allocation will > succeed before handling the SIGKILL. Actually it uses get_free_page() > > > This API is now being used in other parts of kernel too, where it knows > > that the allocation could cause OOM. > > > > What's wrong with using __GFP_NORETRY to avoid oom killing entirely and I have no problem with NORETRY. > then failing the ring buffer memory allocation? Seems like a better > solution than relying on the oom killer, since there may be other threads > with a max oom_score_adj as well that would appear in the tasklist first > and get killed unnecessarily. Is there some ring buffer code that can't > handle failing allocations appropriately? The ring buffer code can handle failed allocations just fine, and will free up the pages it allocated before a full success. It allocates the pages one at a time and adds it to a list. After all pages it wants to allocate has successfully been allocated, it then applies them to the ring buffer. If it fails an allocation, all pages are freed that were not added to the ring buffer yet. But the issue is, if the process increasing the size of the ring buffer causes the oom, it will not handle the SIGKILL until after the ring buffer has finished allocating. 
Now, if it fails to allocate, then we are fine; but if it does not fail and we start killing processes, then we may be in trouble. I like the NORETRY better. But then, would this mean that if we have a lot of cached filesystems, we won't be able to extend the ring buffer? I'm thinking the oom killer used here got lucky. As it killed this task, we were still out of memory, and the ring buffer failed to get the memory it needed and freed up everything that it previously allocated, and returned. Then the process calling this function would be killed by the OOM. Ideally, the process shouldn't be killed and the ring buffer just returned -ENOMEM to the user. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
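[Editor's note: the allocate-everything-then-commit pattern Steven describes above can be illustrated with a plain userspace sketch. This is hypothetical code using malloc() in place of the kernel allocators, not the actual ring_buffer.c implementation.]

```c
#include <stdlib.h>

/* Sketch of the pattern: allocate every page up front, and unwind
 * everything on the first failure so the existing buffer is never
 * left partially grown. */
struct page_node {
	struct page_node *next;
	void *data;
};

static struct page_node *alloc_pages_list(size_t nr_pages, size_t page_size)
{
	struct page_node *head = NULL;

	for (size_t i = 0; i < nr_pages; i++) {
		struct page_node *node = malloc(sizeof(*node));
		void *data = node ? malloc(page_size) : NULL;

		if (!node || !data) {		/* mimic a failed GFP_KERNEL alloc */
			free(node);
			while (head) {		/* unwind everything allocated so far */
				struct page_node *tmp = head;
				head = head->next;
				free(tmp->data);
				free(tmp);
			}
			return NULL;
		}
		node->data = data;
		node->next = head;
		head = node;
	}
	return head;	/* only now would the pages be committed to the buffer */
}
```

If any single allocation fails, the caller sees NULL and the original buffer is untouched, which is the behavior described above.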
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 21:00 ` Steven Rostedt @ 2011-05-26 22:28 ` Vaibhav Nagarnaik 2011-05-26 23:38 ` Steven Rostedt 2011-05-26 23:23 ` Vaibhav Nagarnaik 1 sibling, 1 reply; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-26 22:28 UTC (permalink / raw) To: Steven Rostedt Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > But the issue is, if the process increasing the size of the ring buffer > causes the oom, it will not handle the SIGKILL until after the ring > buffer has finished allocating. Now, if it failed to allocate, then we > are fine, but if it does not fail, but now we start killing processes, > then we may be in trouble. > If I understand correctly, if a fatal signal is pending on a process while allocation is called, the allocation fails. Then we handle the freeing up memory correctly, though the echo gets killed once we return from the allocation process. > I like the NORETRY better. But then, would this mean that if we have a > lot of cached filesystems, we wont be able to extend the ring buffer? It doesn't seem so. I talked with the mm team and I understand that even if NORETRY is set, cached pages will be flushed out and allocation will succeed. But it still does not address the situation when the ring buffer allocation is going on and another process invokes OOM. If the oom_score_adj is not set to maximum, then random processes will still be killed before ring buffer allocation fails. > > I'm thinking the oom killer used here got lucky. 
> Ideally, the process shouldn't be killed and the ring buffer > just returned -ENOMEM to the user.

What do you think of this?

	test_set_oom_score_adj(MAXIMUM);
	allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
	test_set_oom_score_adj(original);

This makes sure that the allocation fails much sooner and more gracefully. If oom-killer is invoked in any circumstance, then the ring buffer allocation process gives up memory and is killed. Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 22:28 ` Vaibhav Nagarnaik @ 2011-05-26 23:38 ` Steven Rostedt 2011-05-27 9:43 ` David Rientjes 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2011-05-26 23:38 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel, Peter Zijlstra, Mel Gorman, Rik Van Riel, David Rientjes, Andrew Morton [ I added to the Cc people that understand MM more than I do ] On Thu, 2011-05-26 at 15:28 -0700, Vaibhav Nagarnaik wrote: > On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > But the issue is, if the process increasing the size of the ring buffer > > causes the oom, it will not handle the SIGKILL until after the ring > > buffer has finished allocating. Now, if it failed to allocate, then we > > are fine, but if it does not fail, but now we start killing processes, > > then we may be in trouble. > > > > If I understand correctly, if a fatal signal is pending on a process > while allocation is called, the allocation fails. Then we handle the > freeing up memory correctly, though the echo gets killed once we return > from the allocation process. > > > I like the NORETRY better. But then, would this mean that if we have a > > lot of cached filesystems, we wont be able to extend the ring buffer? > > It doesn't seem so. I talked with the mm- team and I understand that > even if NORETRY is set, cached pages will be flushed out and allocation > will succeed. But it still does not address the situation when the ring > buffer allocation is going on and another process invokes OOM. If the > oom_score_adj is not set to maximum, then random processes will still be > killed before ring buffer allocation fails. > > > > > I'm thinking the oom killer used here got lucky. 
As it killed this task, > > we were still out of memory, and the ring buffer failed to get the > > memory it needed and freed up everything that it previously allocated, > > and returned. Then the process calling this function would be killed by > > the OOM. Ideally, the process shouldn't be killed and the ring buffer > > just returned -ENOMEM to the user. > > What do you think of this?
>
> 	test_set_oom_score_adj(MAXIMUM);
> 	allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
> 	test_set_oom_score_adj(original);
>
> This makes sure that the allocation fails much sooner and more > gracefully. If oom-killer is invoked in any circumstance, then the ring > buffer allocation process gives up memory and is killed. I don't know. But as I had never seen this function before, I went and took a look. This test_set_oom_score_adj() is new, and coincidentally written by another google developer ;) As there's not really a precedent for this, if those that I added to the Cc give their acks, I'm happy to apply this for the next merge window. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 23:38 ` Steven Rostedt @ 2011-05-27 9:43 ` David Rientjes 2011-05-27 12:48 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: David Rientjes @ 2011-05-27 9:43 UTC (permalink / raw) To: Steven Rostedt Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel, Peter Zijlstra, Mel Gorman, Rik Van Riel, Andrew Morton On Thu, 26 May 2011, Steven Rostedt wrote: > > What do you think of this?
> >
> > 	test_set_oom_score_adj(MAXIMUM);
> > 	allocate_ring_buffer(GFP_KERNEL | __GFP_NORETRY);
> > 	test_set_oom_score_adj(original);
> >
> > This makes sure that the allocation fails much sooner and more > > gracefully. If oom-killer is invoked in any circumstance, then the ring > > buffer allocation process gives up memory and is killed. > > I don't know. But as I never seen this function before, I went and took > a look. This test_set_oom_score_adj() is new, and coincidentally written > by another google developer ;) > Ignore the history of the function; it simply duplicates the old PF_OOM_ORIGIN flag that is now removed. > As there's not really a precedence to this, if those that I added to the > Cc, give their acks, I'm happy to apply this for the next merge window. > This problem isn't new, it's always been possible that an allocation that is higher order, using GFP_ATOMIC or GFP_NOWAIT, or utilizing __GFP_NORETRY as I suggested here, would deplete memory at the same time that a GFP_FS allocation on another cpu would invoke the oom killer. If that happens between the time when tracing_resize_ring_buffer() goes oom and its nicely written error handling starts freeing memory, then it's possible that another task will be unfairly oom killed. 
Note that the suggested solution of test_set_oom_score_adj(OOM_SCORE_ADJ_MAX) doesn't prevent that in all cases: it's possible that another thread on the system also has an oom_score_adj of OOM_SCORE_ADJ_MAX and it would be killed in its place just because it appeared in the tasklist first (which is guaranteed if this is simply an echo command). Relying on the oom killer to kill this task for parallel blockable allocations doesn't seem like the best solution for the sole reason that the program that wrote to buffer_size_kb may count on its return value. It may be able to handle an -ENOMEM return value and, perhaps, try to write a smaller value? I think what this patch really wants to do is utilize __GFP_NORETRY as previously suggested and, if we're really concerned about parallel allocations in this instance even though the same situation exists all over the kernel, also create an oom notifier with register_oom_notifier() that may be called in oom conditions that would free memory when buffer->record_disabled is non-zero and prevent the oom. That would grow the ring buffer as large as possible up until oom even though it may not be to what the user requested. Otherwise, you'll just want to use oom_killer_disable() to prevent the oom killer altogether. ^ permalink raw reply [flat|nested] 30+ messages in thread
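[Editor's note: the notifier idea mentioned above might be sketched as follows. This is only an illustration: ring_buffer_shrink_pending() is a hypothetical helper, not an existing function; the callback shape follows the standard notifier_block API that register_oom_notifier() takes.]

```c
/* Sketch only: an OOM notifier that gives back pages from a half-grown
 * ring buffer before the OOM killer picks a victim. The callback reports
 * how much memory it freed through the pointer argument. */
static int ring_buffer_oom_notify(struct notifier_block *nb,
				  unsigned long unused, void *parm)
{
	unsigned long *freed = parm;

	/* ring_buffer_shrink_pending() is hypothetical: it would undo a
	 * partially completed resize while buffer->record_disabled is set */
	*freed += ring_buffer_shrink_pending(global_trace.buffer);
	return NOTIFY_OK;
}

static struct notifier_block ring_buffer_oom_nb = {
	.notifier_call = ring_buffer_oom_notify,
};

/* registered once, e.g. at tracer init time */
register_oom_notifier(&ring_buffer_oom_nb);
```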
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-27 9:43 ` David Rientjes @ 2011-05-27 12:48 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2011-05-27 12:48 UTC (permalink / raw) To: David Rientjes Cc: Vaibhav Nagarnaik, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel, Peter Zijlstra, Mel Gorman, Rik Van Riel, Andrew Morton On Fri, 2011-05-27 at 02:43 -0700, David Rientjes wrote: > This problem isn't new, it's always been possible that an allocation that > is higher order, using GFP_ATOMIC or GFP_NOWAIT, or utilizing > __GFP_NORETRY as I suggested here, would deplete memory at the same time > that a GFP_FS allocation on another cpu would invoke the oom killer. > > If that happens between the time when tracing_resize_ring_buffer() goes > oom and its nicely written error handling starts freeing memory, then it's > possible that another task will be unfairly oom killed. Note that the > suggested solution of test_set_oom_score_adj(OOM_SCORE_ADJ_MAX) doesn't > prevent that in all cases: it's possible that another thread on the system > also has an oom_score_adj of OOM_SCORE_ADJ_MAX and it would be killed in > its place just because it appeared in the tasklist first (which is > guaranteed if this is simply an echo command). > > Relying on the oom killer to kill this task for parallel blockable > allocations doesn't seem like the best solution for the sole reason that > the program that wrote to buffer_size_kb may count on its return value. > It may be able to handle an -ENOMEM return value and, perhaps, try to > write a smaller value? 
> > I think what this patch really wants to do is utilize __GFP_NORETRY as > previously suggested and, if we're really concerned about parallel > allocations in this instance even though the same situation exists all > over the kernel, also create an oom notifier with register_oom_notifier() > that may be called in oom conditions that would free memory when > buffer->record_disabled is non-zero and prevent the oom. That would > increase the size of the ring buffer as large as possible up until oom > even though it may not be to what the user requested. > > Otherwise, you'll just want to use oom_killer_disable() to preven the oom > killer altogether. Thanks for the detailed explanation. OK, I'm convinced. The proper solution looks to be both the use of __GFP_NORETRY and the use of test_set_oom_score_adj(). I don't think it is necessary to worry about multiple users of this score adj, as when we are in an OOM situation, things are just bad to begin with. But Google has to deal with bad situations more than others, so if it becomes an issue for you, then we can discuss those changes later. Vaibhav, can you send an updated patch? Thanks, -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 21:00 ` Steven Rostedt 2011-05-26 22:28 ` Vaibhav Nagarnaik @ 2011-05-26 23:23 ` Vaibhav Nagarnaik 1 sibling, 0 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-26 23:23 UTC (permalink / raw) To: Steven Rostedt Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, May 26, 2011 at 2:00 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > I'm thinking the oom killer used here got lucky. As it killed this task, > we were still out of memory, and the ring buffer failed to get the > memory it needed and freed up everything that it previously allocated, > and returned. Then the process calling this function would be killed by I have tested this multiple times and I don't see any other process being killed, regardless of who invoked oom-killer. The oom-killer always selects the "echo" process to kill because of its maximum oom_score_adj and sets TIF_MEMDIE. Since the ring buffer is still getting allocated, any further allocation from this process fails because of this flag and the fatal signal pending. This causes the ring buffer memory to be freed. I don't see the oom-killer getting lucky in my test cases. Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik 2011-05-26 20:04 ` Steven Rostedt @ 2011-05-27 17:58 ` Vaibhav Nagarnaik 2011-05-27 23:22 ` David Rientjes 2011-06-07 23:41 ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik 1 sibling, 2 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-27 17:58 UTC (permalink / raw) To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik The tracing ring buffer is allocated from kernel memory. While allocating the memory, if OOM happens, the allocating process might not be the one that gets killed, since the ring-buffer memory is not allocated as process memory. Thus random processes might get killed during the allocation. This patch makes sure that the allocating process is considered the most likely oom-kill-able process while the allocation is going on. Thus if the oom-killer is invoked because of the ring-buffer allocation, it is easier for the ring-buffer memory to be freed, saving important system processes from being killed. This patch also adds the __GFP_NORETRY flag to the ring buffer allocation calls to make it fail more gracefully if the system will not be able to complete the allocation request. 
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> --- kernel/trace/ring_buffer.c | 15 ++++++++++----- kernel/trace/trace.c | 9 +++++++++ 2 files changed, 19 insertions(+), 5 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 02b7896..0339f95 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, for (i = 0; i < nr_pages; i++) { struct page *page; bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu_buffer->cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu_buffer->cpu)); if (!bpage) goto free_pages; @@ -1014,7 +1015,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, &pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - GFP_KERNEL, 0); + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; bpage->page = page_address(page); @@ -1378,11 +1379,13 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) struct page *page; bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu)); if (!bpage) goto free_pages; list_add(&bpage->list, &pages); - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; @@ -3737,7 +3740,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu) struct buffer_data_page *bpage; struct page *page; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0); + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, + 0); if (!page) return NULL; diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b926578..15a667a 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -37,6 +37,7 @@ #include <linux/init.h> #include <linux/poll.h> 
#include <linux/fs.h> +#include <linux/oom.h> #include "trace.h" #include "trace_output.h" @@ -3498,6 +3499,7 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, unsigned long val; char buf[64]; int ret; + int oom_score_adj; if (cnt >= sizeof(buf)) return -EINVAL; @@ -3518,7 +3520,14 @@ tracing_entries_write(struct file *filp, const char __user *ubuf, /* value is in KB */ val <<= 10; + /* + * make sure this process is picked over others to be killed in OOM + * condition + */ + oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ret = tracing_resize_ring_buffer(val); + /* restore the original oom_score_adj value */ + test_set_oom_score_adj(oom_score_adj); if (ret < 0) return ret; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-27 17:58 ` Vaibhav Nagarnaik @ 2011-05-27 23:22 ` David Rientjes 2011-05-28 0:44 ` Vaibhav Nagarnaik 2011-06-07 23:41 ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik 1 sibling, 1 reply; 30+ messages in thread From: David Rientjes @ 2011-05-27 23:22 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Fri, 27 May 2011, Vaibhav Nagarnaik wrote: > The tracing ring buffer is allocated from kernel memory. While > allocating the memory, if OOM happens, the allocating process might not > be the one that gets killed, since the ring-buffer memory is not > allocated as process memory. Thus random processes might get killed > during the allocation. > > This patch makes sure that the allocating process is considered the most > likely oom-kill-able process while the allocating is going on. Thus if > oom-killer is invoked because of ring-buffer allocation, it is easier > for the ring buffer memory to be freed and save important system > processes from being killed. > > This patch also adds __GFP_NORETRY flag to the ring buffer allocation > calls to make it fail more gracefully if the system will not be able to > complete the allocation request. > > Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Still not sure this is what we want, I'm afraid. I like the addition of __GFP_NORETRY, but I don't understand the use of test_set_oom_score_adj() here. Why can't we use oom_killer_disable(), allocate with __GFP_NORETRY, and then do oom_killer_enable()? This prevents other tasks from getting oom killed themselves if they have oom_score_adj of OOM_SCORE_ADJ_MAX and allows the write to fail with -ENOMEM rather than being oom killed out from under us. So why is test_set_oom_score_adj() better? 
The alternative would be to set up an oom notifier for the ring buffer and stop allocating prior to killing a task and return a size that was smaller than what the user requested. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-27 23:22 ` David Rientjes @ 2011-05-28 0:44 ` Vaibhav Nagarnaik 2011-05-28 1:50 ` Steven Rostedt 0 siblings, 1 reply; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-28 0:44 UTC (permalink / raw) To: David Rientjes Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Fri, May 27, 2011 at 4:22 PM, David Rientjes <rientjes@google.com> wrote: > On Fri, 27 May 2011, Vaibhav Nagarnaik wrote: > >> The tracing ring buffer is allocated from kernel memory. While >> allocating the memory, if OOM happens, the allocating process might not >> be the one that gets killed, since the ring-buffer memory is not >> allocated as process memory. Thus random processes might get killed >> during the allocation. >> >> This patch makes sure that the allocating process is considered the most >> likely oom-kill-able process while the allocating is going on. Thus if >> oom-killer is invoked because of ring-buffer allocation, it is easier >> for the ring buffer memory to be freed and save important system >> processes from being killed. >> >> This patch also adds __GFP_NORETRY flag to the ring buffer allocation >> calls to make it fail more gracefully if the system will not be able to >> complete the allocation request. >> >> Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> > > Still not sure this is what we want, I'm afraid. > > I like the addition of __GFP_NORETRY, but I don't understand the use of > test_set_oom_score_adj() here. Why can't we use oom_killer_disable(), > allocate with __GFP_NORETRY, and then do oom_killer_enable()? > > This prevents other tasks from getting oom killed themselves if they have > oom_score_adj of OOM_SCORE_ADJ_MAX and allows the write to fail with > -ENOMEM rather then being oom killed out from under us. > > So why is test_set_oom_score_adj() better? 
> When I tested this change, I saw that oom-killer can be invoked by any process trying to allocate. If the oom_score_adj is not set to MAXIMUM, oom-killer will not pick the "echo" process allocating ring buffer memory since the memory is not counted against its RSS. This was the main reason to use the test_set_oom_score_adj() API. I saw this API being used in swapoff() to allocate RAM for the swapped-out pages, and as implemented, it seemed like a similar use case. This change is not for the case where memory is already constrained. It is for the case where the user has specified an invalid amount of memory to be allocated while running in a constrained container. We don't want to kill random processes while allocation of the ring buffer is going on. By using oom_killer_{enable|disable}, I think you mean setting the global variable oom_killer_disabled. Using it will upset the system's expectation of memory behavior. When GFP_NORETRY is set for our ring buffer allocating process, oom_killer_disabled is never checked. But it would cause NULLs to be returned to other allocating processes, which might destabilize them and the system. The oom-killer has a much better chance of reclaiming memory by killing the ring buffer allocating process at that point. For the other tasks with OOM_SCORE_ADJ_MAX set, one can argue that they expected to be killed even if they themselves were not the direct cause of the OOM situation. > The alternative would be to setup an oom notifier for the ring buffer and > stop allocating prior to killing a task and return a size that was smaller > than what the user requested. > I think this gets really complex for the ring buffer allocation scenario. I see that using __GFP_NORETRY is great for a graceful failure. In the rare case where the allocation is going on and some other process invokes the oom-killer, stopping the ring buffer allocation should be a priority to make sure that the system doesn't get destabilized. 
As far as I can see, setting the oom_score_adj to maximum is the best way to solve it. That said, I am open to changing it if Steven and you think using oom_killer_disabled is a better solution. Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
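The contract Vaibhav is leaning on can be sketched with a toy userspace model. This is purely illustrative — the names mirror the kernel's but none of this is kernel code — and it models only the one property the patch depends on: test_set_oom_score_adj() installs a new value and returns the previous one, so the caller can restore it after the allocation.

```c
#include <assert.h>

/* Toy model: score_adj stands in for current->signal->oom_score_adj. */
static int score_adj;

/*
 * Models the save-and-set contract of test_set_oom_score_adj():
 * install new_adj and hand back the old value for a later restore.
 */
static int test_set_score_adj(int new_adj)
{
	int old = score_adj;

	score_adj = new_adj;
	return old;
}

/* Stand-in for the risky allocation done between set and restore. */
static int resize_ring_buffer(void)
{
	return 0; /* pretend the allocation succeeded */
}

/* The pattern used by the patch in tracing_entries_write(). */
static int resize_with_max_badness(void)
{
	int old = test_set_score_adj(1000); /* 1000 == OOM_SCORE_ADJ_MAX */
	int ret = resize_ring_buffer();

	test_set_score_adj(old); /* restore the original value */
	return ret;
}
```

While resize_ring_buffer() runs, score_adj sits at the maximum, so an OOM kill during the allocation would target this task; afterwards the previous policy is back in force.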
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-28 0:44 ` Vaibhav Nagarnaik @ 2011-05-28 1:50 ` Steven Rostedt 2011-05-30 6:23 ` KOSAKI Motohiro 2011-05-30 23:46 ` Vaibhav Nagarnaik 0 siblings, 2 replies; 30+ messages in thread From: Steven Rostedt @ 2011-05-28 1:50 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote: > That said, I am open to changing it if Steven and you think using > oom_killer_disabled is a better solution. My biggest concern is that we are setting policy in the kernel. If you are concerned about this, why not just have the process that is going to increase the size of the ring buffer adjust its own oom policy with /proc/<pid>/oom_score_adj? Only a privileged process can increase the size of the ring buffer, so it's not like we are worried about any normal user task upping the ring buffer to kill other processes. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
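Steven's userspace-helper idea could look roughly like the following sketch. The /proc/self/oom_score_adj file and its -1000..1000 range come from proc(5); the helper functions themselves are hypothetical, not an existing tool, and note that writing a lower value back requires CAP_SYS_RESOURCE, so the restore step may fail for unprivileged callers.

```c
#include <stdio.h>

/* Read this process's current oom_score_adj; returns 0 on any error. */
static int read_oom_score_adj(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "r");
	int val = 0;

	if (!f)
		return 0;
	if (fscanf(f, "%d", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

/* Set this process's oom_score_adj; returns 0 on success, -1 on error. */
static int write_oom_score_adj(int val)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d", val) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) ? -1 : 0;
}
```

A resize wrapper would then save the old value, write 1000 (the maximum badness), echo the new size into the tracing buffer_size_kb file, and write the saved value back — the same save/set/restore dance the patch does in-kernel, but done as userspace policy.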
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-28 1:50 ` Steven Rostedt @ 2011-05-30 6:23 ` KOSAKI Motohiro 2011-05-30 23:54 ` Vaibhav Nagarnaik 0 siblings, 1 reply; 30+ messages in thread From: KOSAKI Motohiro @ 2011-05-30 6:23 UTC (permalink / raw) To: rostedt Cc: vnagarnaik, rientjes, mingo, fweisbec, mrubin, dhsharp, linux-kernel (2011/05/28 10:50), Steven Rostedt wrote: > On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote: > >> That said, I am open to changing it if Steven and you think using >> oom_killer_disabled is a better solution. > > My biggest concern is that we are setting policy in the kernel. If you > are concerned about this, why not just have the process that is going to > increase the size of the ring buffer adjust its own oom policy > with /proc/<pid>/oom_score_adj ? Only a privilege process can increase > the size of the ring buffer so it's not like we are worried about any > normal user task upping the ring buffer to kill other processes. I like Steven's approach, because even if we apply Vaibhav's patch, we still have an OOM issue: when the oom-killer kills the echo command, it doesn't shrink the ring buffer; it just dies. So the kernel is still under extreme memory shortage, and any admin operation may invoke the next oom-killer. And -personally- I think any tracing user should know the system RAM size and a proper ring buffer size. ;) ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-30 6:23 ` KOSAKI Motohiro @ 2011-05-30 23:54 ` Vaibhav Nagarnaik 0 siblings, 0 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-30 23:54 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Steven Rostedt, David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Sun, May 29, 2011 at 11:23 PM, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote: > (2011/05/28 10:50), Steven Rostedt wrote: >> On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote: >> >>> That said, I am open to changing it if Steven and you think using >>> oom_killer_disabled is a better solution. >> >> My biggest concern is that we are setting policy in the kernel. If you >> are concerned about this, why not just have the process that is going to >> increase the size of the ring buffer adjust its own oom policy >> with /proc/<pid>/oom_score_adj ? Only a privilege process can increase >> the size of the ring buffer so it's not like we are worried about any >> normal user task upping the ring buffer to kill other processes. > > I like Steven's approach. > > Because even if we apply Vaibhav's patch, we still have a oom issue. > because when oom-killer killed echo commands, it doesn't shrink ring > buffer. it only just die. So, the kernel is still under extreme memory > shortage. Any admins operation may invoke next oom-killer. The ring buffer allocation is done synchronously with the echo command. When a process is picked by the OOM killer, it has a fatal signal pending. Any further allocations by the process in that scenario will return with a failure. The ring buffer allocation code can handle this scenario very well and give up all the previously allocated memory. Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-28 1:50 ` Steven Rostedt 2011-05-30 6:23 ` KOSAKI Motohiro @ 2011-05-30 23:46 ` Vaibhav Nagarnaik 2011-06-07 23:07 ` Steven Rostedt 1 sibling, 1 reply; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-05-30 23:46 UTC (permalink / raw) To: Steven Rostedt Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Fri, May 27, 2011 at 6:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote: > >> That said, I am open to changing it if Steven and you think using >> oom_killer_disabled is a better solution. > > My biggest concern is that we are setting policy in the kernel. If you > are concerned about this, why not just have the process that is going to > increase the size of the ring buffer adjust its own oom policy > with /proc/<pid>/oom_score_adj ? Only a privilege process can increase > the size of the ring buffer so it's not like we are worried about any > normal user task upping the ring buffer to kill other processes. > > -- Steve You are right. Changing the value of oom_killer_disabled is setting kernel policy and I don't like the solution either. It would be great if the user always sets oom_score_adj while allocating ring buffer memory, but there are different use cases I see where it might not be feasible. One is the number of different applications which use ftrace and set the ring buffer memory size (trace-cmd, perf, powertop, etc). The other case is where a developer is setting the size for his own tracing. In all these cases, oom_score_adj *should* be set to maximum to be safe, but a developer might forget this step and destabilize the system. This is why I think that setting oom_score_adj to maximum in the kernel while allocating the buffer, as well as using __GFP_NORETRY, should suffice for the end goal of being a good memory citizen. 
Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-05-30 23:46 ` Vaibhav Nagarnaik @ 2011-06-07 23:07 ` Steven Rostedt 2011-06-07 23:30 ` Vaibhav Nagarnaik 0 siblings, 1 reply; 30+ messages in thread From: Steven Rostedt @ 2011-06-07 23:07 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Mon, 2011-05-30 at 16:46 -0700, Vaibhav Nagarnaik wrote: > On Fri, May 27, 2011 at 6:50 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Fri, 2011-05-27 at 17:44 -0700, Vaibhav Nagarnaik wrote: > > > >> That said, I am open to changing it if Steven and you think using > >> oom_killer_disabled is a better solution. > > > > My biggest concern is that we are setting policy in the kernel. If you > > are concerned about this, why not just have the process that is going to > > increase the size of the ring buffer adjust its own oom policy > > with /proc/<pid>/oom_score_adj ? Only a privilege process can increase > > the size of the ring buffer so it's not like we are worried about any > > normal user task upping the ring buffer to kill other processes. > > > > -- Steve > > You are right. Changing the value of oom_killer_disabled is setting > kernel policy and I don't like the solution either. > > It would be great if the user always sets oom_score_adj while > allocating ring buffer memory, but there are different use cases I see > where it might not be feasible. One is the number of different > applications which use ftrace and set the ring buffer memory size > (trace-cmd, perf, powertop, etc). The other case is where a developer > is setting the size for his own tracing. In all these cases, > oom_score_adj *should* be set to maximum to be safe, but it a > developer might forget this step and can de-stabilize the system. 
> This is why I think that setting oom_score_adj to maximum in the kernel > while allocating buffer as well as using __GFP_NORETRY should suffice > for the end goal of being a good memory citizen. Hi Vaibhav, I'm going through patches for v3.1 now. Where are we on this issue? I still don't really like the idea of having the kernel set the oom policy by default. But I'm totally fine with changing the allocations for NORETRY. If you switch it to NORETRY, do you still have the issues you were seeing? You could also add userspace helpers that would set the oom policy of things accessing the ring buffer. Either have a script that updates the ring buffer size and sets the oom policy, or have a library with a helper routine. Would those be fine for you? -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process 2011-06-07 23:07 ` Steven Rostedt @ 2011-06-07 23:30 ` Vaibhav Nagarnaik 0 siblings, 0 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-06-07 23:30 UTC (permalink / raw) To: Steven Rostedt Cc: David Rientjes, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Tue, Jun 7, 2011 at 4:07 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > Hi Vaibhav, > > I'm going though patches for v3.1 now. Where are we on this issue. I > still don't really like the ideal of having the kernel set the oom > policy by default. But I'm totally fine with changing the allocations > for NORETRY. > > If you switch it to NORETRY, do you still have the issues you were > seeing? You could also add userspace helpers that would set the oom > policy of things accessing the ring buffer. > > Either have a script that updates the ring buffer size and sets the oom > policy, or have a library with a helper routine. > > Would those be fine for you? > > -- Steve > I understand your hesitation to change the kernel oom policy for allocating the ring buffer. I am fine with just using __GFP_NORETRY for allocation calls to fail more gracefully. It works for most of the use cases I encounter. We will make changes to our internal tools to make sure that oom_score_adj is changed to pick the allocating process for oom-killing. I will send the updated patch. Thanks Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-05-27 17:58 ` Vaibhav Nagarnaik 2011-05-27 23:22 ` David Rientjes @ 2011-06-07 23:41 ` Vaibhav Nagarnaik 2011-06-07 23:47 ` Frederic Weisbecker 2011-06-08 0:01 ` [PATCH v2] " Vaibhav Nagarnaik 1 sibling, 2 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-06-07 23:41 UTC (permalink / raw) To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik The tracing ring buffer is allocated from kernel memory. While allocating a large chunk of memory, OOM might happen which destabilizes the system. Thus random processes might get killed during the allocation. This patch adds __GFP_NORETRY flag to the ring buffer allocation calls to make it fail more gracefully if the system will not be able to complete the allocation request. Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> --- kernel/trace/ring_buffer.c | 15 ++++++++++----- 1 files changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 2780e60..ec817d5 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, for (i = 0; i < nr_pages; i++) { struct page *page; bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu_buffer->cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu_buffer->cpu)); if (!bpage) goto free_pages; @@ -1014,7 +1015,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, &pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - GFP_KERNEL, 0); + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; bpage->page = page_address(page); @@ -1378,11 +1379,13 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) struct page *page; bpage = kzalloc_node(ALIGN(sizeof(*bpage), 
cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu)); if (!bpage) goto free_pages; list_add(&bpage->list, &pages); - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; @@ -3737,7 +3740,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu) struct buffer_data_page *bpage; struct page *page; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0); + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, + 0); if (!page) return NULL; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-07 23:41 ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik @ 2011-06-07 23:47 ` Frederic Weisbecker 2011-06-08 0:01 ` [PATCH v2] " Vaibhav Nagarnaik 1 sibling, 0 replies; 30+ messages in thread From: Frederic Weisbecker @ 2011-06-07 23:47 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: Steven Rostedt, Ingo Molnar, Michael Rubin, David Sharp, linux-kernel On Tue, Jun 07, 2011 at 04:41:28PM -0700, Vaibhav Nagarnaik wrote: > The tracing ring buffer is allocated from kernel memory. While > allocating a large chunk of memory, OOM might happen which destabilizes > the system. Thus random processes might get killed during the > allocation. > > This patch adds __GFP_NORETRY flag to the ring buffer allocation calls > to make it fail more gracefully if the system will not be able to > complete the allocation request. > > Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> > --- > kernel/trace/ring_buffer.c | 15 ++++++++++----- > 1 files changed, 10 insertions(+), 5 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 2780e60..ec817d5 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1005,7 +1005,8 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, > for (i = 0; i < nr_pages; i++) { > struct page *page; > bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), > - GFP_KERNEL, cpu_to_node(cpu_buffer->cpu)); > + GFP_KERNEL | __GFP_NORETRY, Please put a comment in the code to explain the reason of this __GFP_NORETRY. It's pretty hard without the changelog. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-07 23:41 ` [PATCH] trace: Set __GFP_NORETRY flag " Vaibhav Nagarnaik 2011-06-07 23:47 ` Frederic Weisbecker @ 2011-06-08 0:01 ` Vaibhav Nagarnaik 2011-06-08 2:30 ` David Rientjes ` (2 more replies) 1 sibling, 3 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-06-08 0:01 UTC (permalink / raw) To: Steven Rostedt, Ingo Molnar, Frederic Weisbecker Cc: Michael Rubin, David Sharp, linux-kernel, Vaibhav Nagarnaik The tracing ring buffer is allocated from kernel memory. While allocating a large chunk of memory, OOM might happen which destabilizes the system. Thus random processes might get killed during the allocation. This patch adds __GFP_NORETRY flag to the ring buffer allocation calls to make it fail more gracefully if the system will not be able to complete the allocation request. Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> --- Changelog: v2-v1: Added comment to explaing use of __GFP_NORETRY kernel/trace/ring_buffer.c | 25 ++++++++++++++++++++----- 1 files changed, 20 insertions(+), 5 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 2780e60..bd588b6 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1004,8 +1004,14 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, for (i = 0; i < nr_pages; i++) { struct page *page; + /* + * __GFP_NORETRY flag makes sure that the allocation fails + * gracefully without invoking oom-killer and the system is + * not destabilized. 
+ */ bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu_buffer->cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu_buffer->cpu)); if (!bpage) goto free_pages; @@ -1014,7 +1020,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, &pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - GFP_KERNEL, 0); + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; bpage->page = page_address(page); @@ -1376,13 +1382,20 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size) for_each_buffer_cpu(buffer, cpu) { for (i = 0; i < new_pages; i++) { struct page *page; + /* + * __GFP_NORETRY flag makes sure that the allocation + * fails gracefully without invoking oom-killer and + * the system is not destabilized. + */ bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()), - GFP_KERNEL, cpu_to_node(cpu)); + GFP_KERNEL | __GFP_NORETRY, + cpu_to_node(cpu)); if (!bpage) goto free_pages; list_add(&bpage->list, &pages); - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, 0); if (!page) goto free_pages; @@ -3737,7 +3750,9 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu) struct buffer_data_page *bpage; struct page *page; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0); + page = alloc_pages_node(cpu_to_node(cpu), + GFP_KERNEL | __GFP_NORETRY, + 0); if (!page) return NULL; -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-08 0:01 ` [PATCH v2] " Vaibhav Nagarnaik @ 2011-06-08 2:30 ` David Rientjes 2011-06-09 11:37 ` KOSAKI Motohiro 2011-07-05 12:54 ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik 2 siblings, 0 replies; 30+ messages in thread From: David Rientjes @ 2011-06-08 2:30 UTC (permalink / raw) To: Vaibhav Nagarnaik Cc: Steven Rostedt, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Tue, 7 Jun 2011, Vaibhav Nagarnaik wrote: > The tracing ring buffer is allocated from kernel memory. While > allocating a large chunk of memory, OOM might happen which destabilizes > the system. Thus random processes might get killed during the > allocation. > > This patch adds __GFP_NORETRY flag to the ring buffer allocation calls > to make it fail more gracefully if the system will not be able to > complete the allocation request. > > Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Acked-by: David Rientjes <rientjes@google.com> ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-08 0:01 ` [PATCH v2] " Vaibhav Nagarnaik 2011-06-08 2:30 ` David Rientjes @ 2011-06-09 11:37 ` KOSAKI Motohiro 2011-06-09 12:14 ` Steven Rostedt 2011-07-05 12:54 ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik 2 siblings, 1 reply; 30+ messages in thread From: KOSAKI Motohiro @ 2011-06-09 11:37 UTC (permalink / raw) To: vnagarnaik Cc: kosaki.motohiro, rostedt, mingo, fweisbec, mrubin, dhsharp, linux-kernel (2011/06/08 9:01), Vaibhav Nagarnaik wrote: > The tracing ring buffer is allocated from kernel memory. While > allocating a large chunk of memory, OOM might happen which destabilizes > the system. Thus random processes might get killed during the > allocation. > > This patch adds __GFP_NORETRY flag to the ring buffer allocation calls > to make it fail more gracefully if the system will not be able to > complete the allocation request. > > Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> > --- > Changelog: > v2-v1: Added comment to explaing use of __GFP_NORETRY > > kernel/trace/ring_buffer.c | 25 ++++++++++++++++++++----- > 1 files changed, 20 insertions(+), 5 deletions(-) Unfortunately, __GFP_NORETRY is racy and doesn't work as expected. If free memory is not enough, the thread may start to reclaim and another thread can steal the reclaimed memory. And thread0 doesn't retry. Then thread0's page allocation may fail even though the system has enough reclaimable memory.

thread0                                  thread1
---------------------------------------------------------------
alloc_pages()
  get_page_from_freelist() -> fail
  try_to_free_pages()
                                         alloc_pages()
                                           get_page_from_freelist() -> success
  get_page_from_freelist() -> fail again

I think this is an mm issue, and afaik, Minchan and some developers are working on fixing it. But _now_ your patch doesn't work. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
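The race KOSAKI describes can be made concrete with a deliberately simplified, single-threaded toy model (illustrative only — the real page allocator is far more involved): reclaimed pages go back onto a shared pool, nothing reserves them for the thread that did the reclaiming, so a no-retry allocation can fail even though its own reclaim pass succeeded.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the page allocator's free list. */
static int free_pages;

static bool get_page_from_freelist(void)
{
	if (free_pages > 0) {
		free_pages--;
		return true;
	}
	return false;
}

/*
 * Reclaim puts a page back on the shared pool -- it is NOT reserved
 * for the reclaiming thread, which is the root of the race.
 */
static void try_to_free_pages(void)
{
	free_pages++;
}

/*
 * A __GFP_NORETRY-style allocation as modeled here: one attempt, one
 * reclaim pass, one final attempt, then give up instead of looping.
 * If another thread takes the reclaimed page between the reclaim and
 * the final attempt, this fails even though reclaim worked.
 */
static bool alloc_page_noretry(void)
{
	if (get_page_from_freelist())
		return true;
	try_to_free_pages();
	return get_page_from_freelist();
}
```

Without __GFP_NORETRY the allocator would loop back to reclaim again (for order-0 this retries until memory is found), which is why dropping the retry trades the spurious failure KOSAKI shows against unbounded allocation latency.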
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-09 11:37 ` KOSAKI Motohiro @ 2011-06-09 12:14 ` Steven Rostedt 2011-06-09 18:41 ` Vaibhav Nagarnaik 2011-06-09 19:42 ` David Rientjes 0 siblings, 2 replies; 30+ messages in thread From: Steven Rostedt @ 2011-06-09 12:14 UTC (permalink / raw) To: KOSAKI Motohiro Cc: vnagarnaik, mingo, fweisbec, mrubin, dhsharp, linux-kernel On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote: > Unfortunately, __GFP_NORETRY is racy and don't work as expected. > If free memory is not enough, the thread may start to reclaim and > another thread can steal the reclaimed memory. And thread0 don't retry. > > Then, thread0's alloc page may fail even though system have enough reclaimable > memory. > > thread0 thread1 > --------------------------------------------------------------- > alloc_pages() > get_page_from_freelist() -> fail > try_to_free_pages() > alloc_pages() > get_page_from_freelist() -> success > get_page_from_freelist() -> fail again > > I think this is mm issue, and afaik, Minchan and some developers are > working on fixing it. but _now_ your patch doesn't work. Have you seen this fail in practice? I'm not too concerned if it only triggers when memory is tight. But if it is triggering on normal cases, then that worries me. Thanks, -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-09 12:14 ` Steven Rostedt @ 2011-06-09 18:41 ` Vaibhav Nagarnaik 2011-06-09 19:42 ` David Rientjes 1 sibling, 0 replies; 30+ messages in thread From: Vaibhav Nagarnaik @ 2011-06-09 18:41 UTC (permalink / raw) To: Steven Rostedt Cc: KOSAKI Motohiro, Ingo Molnar, Frederic Weisbecker, Michael Rubin, David Sharp, linux-kernel On Thu, 2011-06-09 at 20:37 +0900, KOSAKI Motohiro wrote: > Unfortunately, __GFP_NORETRY is racy and don't work as expected. > If free memory is not enough, the thread may start to reclaim and > another thread can steal the reclaimed memory. And thread0 don't retry. > > Then, thread0's alloc page may fail even though system have enough reclaimable > memory. > > thread0 thread1 > --------------------------------------------------------------- > alloc_pages() > get_page_from_freelist() -> fail > try_to_free_pages() > alloc_pages() > get_page_from_freelist() -> success > get_page_from_freelist() -> fail again > > I think this is mm issue, and afaik, Minchan and some developers are > working on fixing it. but _now_ your patch doesn't work. Thanks. I was not aware of this condition. I discussed it with the mm team and it seems this is a well-known problem, though it is not that bad for order-0 allocations, which is what the ring buffer uses. Also, it happens in tight memory conditions, at which point allocating the ring buffer might make it worse. I think providing the user a notification of this is better than just worsening the situation by allocating the ring buffer. On Thu, Jun 9, 2011 at 5:14 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > > Have you seen this fail in practice? Not in my normal test cases. I haven't stressed it though, but that doesn't make sense for this patch. The idea is that the ring buffer should be able to allocate memory as long as it doesn't impact the system too much. Vaibhav Nagarnaik ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-09 12:14 ` Steven Rostedt 2011-06-09 18:41 ` Vaibhav Nagarnaik @ 2011-06-09 19:42 ` David Rientjes 2011-06-09 19:52 ` Steven Rostedt 1 sibling, 1 reply; 30+ messages in thread From: David Rientjes @ 2011-06-09 19:42 UTC (permalink / raw) To: Steven Rostedt Cc: KOSAKI Motohiro, vnagarnaik, mingo, fweisbec, mrubin, dhsharp, linux-kernel On Thu, 9 Jun 2011, Steven Rostedt wrote: > > Unfortunately, __GFP_NORETRY is racy and don't work as expected. > > If free memory is not enough, the thread may start to reclaim and > > another thread can steal the reclaimed memory. And thread0 don't retry. > > > > Then, thread0's alloc page may fail even though system have enough reclaimable > > memory. > > > > thread0 thread1 > > --------------------------------------------------------------- > > alloc_pages() > > get_page_from_freelist() -> fail > > try_to_free_pages() > > alloc_pages() > > get_page_from_freelist() -> success > > get_page_from_freelist() -> fail again > > > > I think this is mm issue, and afaik, Minchan and some developers are > > working on fixing it. but _now_ your patch doesn't work. > > Have you seen this fail in practice? > > I'm not too concern if it only triggers when memory is tight. But if it > is triggering on normal cases, then that worries me. > It would only happen if there was an antagonist that stole the reclaimed pages before your __GFP_NORETRY allocation could allocate them, resulting in the system being oom again as it was before reclaim occurred. Without __GFP_NORETRY, we'd automatically retry these allocations in a loop until we found the memory since they are order-0, so the only side effect would be an increased latency in the allocation. I think if we still end up oom after reclaiming memory that was allocated by another thread that we probably don't want to be expanding the ring buffer and, thus, I see no problem with just failing. 
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH v2] trace: Set __GFP_NORETRY flag for ring buffer allocating process 2011-06-09 19:42 ` David Rientjes @ 2011-06-09 19:52 ` Steven Rostedt 0 siblings, 0 replies; 30+ messages in thread From: Steven Rostedt @ 2011-06-09 19:52 UTC (permalink / raw) To: David Rientjes Cc: KOSAKI Motohiro, vnagarnaik, mingo, fweisbec, mrubin, dhsharp, linux-kernel On Thu, 2011-06-09 at 12:42 -0700, David Rientjes wrote: > It would only happen if there was an antagonist that stole the reclaimed > pages before your __GFP_NORETRY allocation could allocate them, resulting > in the system being oom again as it was before reclaim occurred. Without > __GFP_NORETRY, we'd automatically retry these allocations in a loop until > we found the memory since they are order-0, so the only side effect would > be an increased latency in the allocation. I think if we still end up oom > after reclaiming memory that was allocated by another thread that we > probably don't want to be expanding the ring buffer and, thus, I see no > problem with just failing. Agreed, which is why I already pushed the patch. -- Steve ^ permalink raw reply [flat|nested] 30+ messages in thread
* [tip:perf/core] ring-buffer: Set __GFP_NORETRY flag for ring buffer allocating process
  2011-06-08  0:01 ` [PATCH v2] " Vaibhav Nagarnaik
  2011-06-08  2:30   ` David Rientjes
  2011-06-09 11:37   ` KOSAKI Motohiro
@ 2011-07-05 12:54 ` tip-bot for Vaibhav Nagarnaik
  2 siblings, 0 replies; 30+ messages in thread
From: tip-bot for Vaibhav Nagarnaik @ 2011-07-05 12:54 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, vnagarnaik, hpa, mingo, dhsharp, fweisbec, rostedt,
	mrubin, tglx, rientjes

Commit-ID:  d7ec4bfed6c97405c6417970ba06c439e08ab8e3
Gitweb:     http://git.kernel.org/tip/d7ec4bfed6c97405c6417970ba06c439e08ab8e3
Author:     Vaibhav Nagarnaik <vnagarnaik@google.com>
AuthorDate: Tue, 7 Jun 2011 17:01:42 -0700
Committer:  Steven Rostedt <rostedt@goodmis.org>
CommitDate: Tue, 14 Jun 2011 22:48:51 -0400

ring-buffer: Set __GFP_NORETRY flag for ring buffer allocating process

The tracing ring buffer is allocated from kernel memory. While
allocating a large chunk of memory, OOM might happen, which destabilizes
the system. Thus random processes might get killed during the
allocation.

This patch adds the __GFP_NORETRY flag to the ring buffer allocation
calls to make them fail more gracefully if the system is not able to
complete the allocation request.
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1307491302-9236-1-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c |   25 +++++++++++++++++++------
 1 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f00ede3..731201b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1004,9 +1004,14 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
-
+		/*
+		 * __GFP_NORETRY flag makes sure that the allocation fails
+		 * gracefully without invoking oom-killer and the system is
+		 * not destabilized.
+		 */
 		bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
-				    GFP_KERNEL, cpu_to_node(cpu_buffer->cpu));
+				    GFP_KERNEL | __GFP_NORETRY,
+				    cpu_to_node(cpu_buffer->cpu));
 		if (!bpage)
 			goto free_pages;
@@ -1015,7 +1020,7 @@ static int rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, &pages);
 		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-					GFP_KERNEL, 0);
+					GFP_KERNEL | __GFP_NORETRY, 0);
 		if (!page)
 			goto free_pages;
 		bpage->page = page_address(page);
@@ -1377,13 +1382,20 @@ int ring_buffer_resize(struct ring_buffer *buffer, unsigned long size)
 	for_each_buffer_cpu(buffer, cpu) {
 		for (i = 0; i < new_pages; i++) {
 			struct page *page;
+			/*
+			 * __GFP_NORETRY flag makes sure that the allocation
+			 * fails gracefully without invoking oom-killer and
+			 * the system is not destabilized.
+			 */
 			bpage = kzalloc_node(ALIGN(sizeof(*bpage),
						   cache_line_size()),
-					    GFP_KERNEL, cpu_to_node(cpu));
+					    GFP_KERNEL | __GFP_NORETRY,
+					    cpu_to_node(cpu));
 			if (!bpage)
 				goto free_pages;
 			list_add(&bpage->list, &pages);
-			page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+			page = alloc_pages_node(cpu_to_node(cpu),
+						GFP_KERNEL | __GFP_NORETRY, 0);
 			if (!page)
 				goto free_pages;
 			bpage->page = page_address(page);
@@ -3737,7 +3749,8 @@ void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
 	struct buffer_data_page *bpage;
 	struct page *page;
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY, 0);
 	if (!page)
 		return NULL;
end of thread, other threads:[~2011-07-05 12:54 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-26 19:52 [PATCH] trace: Set oom_score_adj to maximum for ring buffer allocating process Vaibhav Nagarnaik
2011-05-26 20:04 ` Steven Rostedt
2011-05-26 20:22 ` David Rientjes
2011-05-26 20:23 ` Vaibhav Nagarnaik
2011-05-26 20:33 ` David Rientjes
2011-05-26 21:00 ` Steven Rostedt
2011-05-26 22:28 ` Vaibhav Nagarnaik
2011-05-26 23:38 ` Steven Rostedt
2011-05-27  9:43 ` David Rientjes
2011-05-27 12:48 ` Steven Rostedt
2011-05-26 23:23 ` Vaibhav Nagarnaik
2011-05-27 17:58 ` Vaibhav Nagarnaik
2011-05-27 23:22 ` David Rientjes
2011-05-28  0:44 ` Vaibhav Nagarnaik
2011-05-28  1:50 ` Steven Rostedt
2011-05-30  6:23 ` KOSAKI Motohiro
2011-05-30 23:54 ` Vaibhav Nagarnaik
2011-05-30 23:46 ` Vaibhav Nagarnaik
2011-06-07 23:07 ` Steven Rostedt
2011-06-07 23:30 ` Vaibhav Nagarnaik
2011-06-07 23:41 ` [PATCH] trace: Set __GFP_NORETRY flag for ring buffer allocating process Vaibhav Nagarnaik
2011-06-07 23:47 ` Frederic Weisbecker
2011-06-08  0:01 ` [PATCH v2] " Vaibhav Nagarnaik
2011-06-08  2:30 ` David Rientjes
2011-06-09 11:37 ` KOSAKI Motohiro
2011-06-09 12:14 ` Steven Rostedt
2011-06-09 18:41 ` Vaibhav Nagarnaik
2011-06-09 19:42 ` David Rientjes
2011-06-09 19:52 ` Steven Rostedt
2011-07-05 12:54 ` [tip:perf/core] ring-buffer: " tip-bot for Vaibhav Nagarnaik