* [PATCH] oom killer: break from infinite loop
@ 2010-03-24 16:25 ` Anfei Zhou
0 siblings, 0 replies; 197+ messages in thread
From: Anfei Zhou @ 2010-03-24 16:25 UTC (permalink / raw)
To: akpm, rientjes, kosaki.motohiro, nishimura, kamezawa.hiroyu
Cc: linux-mm, linux-kernel
In a multi-threaded environment, if the current task (A) holds
the mm->mmap_sem semaphore and another thread (B) in the same
process is selected to be OOM killed, B cannot really be killed
because the two threads share the same semaphore, so
__alloc_pages_slowpath turns into an infinite loop. Set
TIF_MEMDIE on all the threads in the group so they get a chance
to break out and exit.
Signed-off-by: Anfei Zhou <anfei.zhou@gmail.com>
---
mm/oom_kill.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9b223af..aab9892 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
*/
static void __oom_kill_task(struct task_struct *p, int verbose)
{
+ struct task_struct *t;
+
if (is_global_init(p)) {
WARN_ON(1);
printk(KERN_WARNING "tried to kill init!\n");
@@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);
+ for (t = next_thread(p); t != p; t = next_thread(t))
+ set_tsk_thread_flag(t, TIF_MEMDIE);
force_sig(SIGKILL, p);
}
--
1.6.4.rc1
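The thread-ring walk in the patch can be sketched in plain userspace C. The `struct task` below is a hypothetical stand-in for the kernel's circular thread list (`next` models `next_thread()`, `memdie` models TIF_MEMDIE); it is an illustration of the loop's shape, not kernel code:

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's circular thread list: each
 * task points at the next thread in its group and the list wraps back
 * around to the start, which is why "t != p" terminates the walk. */
struct task {
    int memdie;             /* stands in for TIF_MEMDIE */
    struct task *next;      /* stands in for next_thread() */
};

/* Same shape as the patched __oom_kill_task(): mark p itself, then
 * walk the ring marking every sibling until we come back around to p. */
static void oom_mark_group(struct task *p)
{
    struct task *t;

    p->memdie = 1;
    for (t = p->next; t != p; t = t->next)
        t->memdie = 1;
}
```

Because the list is a ring, a single-threaded "group" (where `p->next == p`) is handled by the separate assignment on `p` before the loop.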
* Re: [PATCH] oom killer: break from infinite loop
2010-03-24 16:25 ` Anfei Zhou
@ 2010-03-25 2:51 ` KOSAKI Motohiro
-1 siblings, 0 replies; 197+ messages in thread
From: KOSAKI Motohiro @ 2010-03-25 2:51 UTC (permalink / raw)
To: Anfei Zhou
Cc: kosaki.motohiro, akpm, rientjes, nishimura, kamezawa.hiroyu,
linux-mm, linux-kernel
> In a multi-threaded environment, if the current task (A) holds
> the mm->mmap_sem semaphore and another thread (B) in the same
> process is selected to be OOM killed, B cannot really be killed
> because the two threads share the same semaphore, so
> __alloc_pages_slowpath turns into an infinite loop. Set
> TIF_MEMDIE on all the threads in the group so they get a chance
> to break out and exit.
>
> Signed-off-by: Anfei Zhou <anfei.zhou@gmail.com>
I like this patch very much.
Thanks, Anfei!
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
> mm/oom_kill.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9b223af..aab9892 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> */
> static void __oom_kill_task(struct task_struct *p, int verbose)
> {
> + struct task_struct *t;
> +
> if (is_global_init(p)) {
> WARN_ON(1);
> printk(KERN_WARNING "tried to kill init!\n");
> @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
> + for (t = next_thread(p); t != p; t = next_thread(t))
> + set_tsk_thread_flag(t, TIF_MEMDIE);
>
> force_sig(SIGKILL, p);
> }
> --
> 1.6.4.rc1
>
* Re: [PATCH] oom killer: break from infinite loop
2010-03-24 16:25 ` Anfei Zhou
@ 2010-03-26 22:08 ` Andrew Morton
-1 siblings, 0 replies; 197+ messages in thread
From: Andrew Morton @ 2010-03-26 22:08 UTC (permalink / raw)
To: Anfei Zhou
Cc: rientjes, kosaki.motohiro, nishimura, kamezawa.hiroyu, linux-mm,
linux-kernel, Oleg Nesterov
On Thu, 25 Mar 2010 00:25:05 +0800
Anfei Zhou <anfei.zhou@gmail.com> wrote:
> In a multi-threaded environment, if the current task (A) holds
> the mm->mmap_sem semaphore and another thread (B) in the same
> process is selected to be OOM killed, B cannot really be killed
> because the two threads share the same semaphore, so
> __alloc_pages_slowpath turns into an infinite loop. Set
> TIF_MEMDIE on all the threads in the group so they get a chance
> to break out and exit.
>
> Signed-off-by: Anfei Zhou <anfei.zhou@gmail.com>
> ---
> mm/oom_kill.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9b223af..aab9892 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> */
> static void __oom_kill_task(struct task_struct *p, int verbose)
> {
> + struct task_struct *t;
> +
> if (is_global_init(p)) {
> WARN_ON(1);
> printk(KERN_WARNING "tried to kill init!\n");
> @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
> + for (t = next_thread(p); t != p; t = next_thread(t))
> + set_tsk_thread_flag(t, TIF_MEMDIE);
>
> force_sig(SIGKILL, p);
Don't we need some sort of locking while walking that ring?
Unintuitively it appears to be spin_lock_irq(&p->sighand->siglock).
* Re: [PATCH] oom killer: break from infinite loop
2010-03-26 22:08 ` Andrew Morton
@ 2010-03-26 22:33 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-26 22:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Anfei Zhou, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On 03/26, Andrew Morton wrote:
>
> On Thu, 25 Mar 2010 00:25:05 +0800
> Anfei Zhou <anfei.zhou@gmail.com> wrote:
>
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > */
> > static void __oom_kill_task(struct task_struct *p, int verbose)
> > {
> > + struct task_struct *t;
> > +
> > if (is_global_init(p)) {
> > WARN_ON(1);
> > printk(KERN_WARNING "tried to kill init!\n");
> > @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> > */
> > p->rt.time_slice = HZ;
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > + for (t = next_thread(p); t != p; t = next_thread(t))
> > + set_tsk_thread_flag(t, TIF_MEMDIE);
> >
> > force_sig(SIGKILL, p);
>
> Don't we need some sort of locking while walking that ring?
This should be always called under tasklist_lock, I think.
At least this seems to be true in Linus's tree.
I'd suggest to do
- set_tsk_thread_flag(p, TIF_MEMDIE);
+ t = p;
+ do {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ } while_each_thread(p, t);
but this is matter of taste.
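In userspace terms, Oleg's do/while form visits the same ring, just folding the initial flag-setting on p into the loop body. In the kernel of that era, `while_each_thread(g, t)` expands to roughly `while ((t = next_thread(t)) != g)`, which the hypothetical model below mirrors (`next` and `memdie` again stand in for `next_thread()` and TIF_MEMDIE):

```c
#include <stddef.h>

/* Same hypothetical ring model as before, written as a do/while:
 * p is visited first, then each sibling exactly once, and the loop
 * stops when the ring wraps back to p. */
struct task {
    int memdie;             /* stands in for TIF_MEMDIE */
    struct task *next;      /* stands in for next_thread() */
};

static void oom_mark_group_all(struct task *p)
{
    struct task *t = p;

    do {
        t->memdie = 1;
    } while ((t = t->next) != p);   /* models while_each_thread(p, t) */
}
```

The two forms mark the same set of threads; the do/while simply avoids the separate pre-loop assignment on p.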
Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
need "force" semantics.
I wish I could understand the changelog ;)
Oleg.
* Re: [PATCH] oom killer: break from infinite loop
2010-03-24 16:25 ` Anfei Zhou
@ 2010-03-28 2:46 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-28 2:46 UTC (permalink / raw)
To: Anfei Zhou
Cc: Andrew Morton, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki,
linux-mm, linux-kernel
On Thu, 25 Mar 2010, Anfei Zhou wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9b223af..aab9892 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> */
> static void __oom_kill_task(struct task_struct *p, int verbose)
> {
> + struct task_struct *t;
> +
> if (is_global_init(p)) {
> WARN_ON(1);
> printk(KERN_WARNING "tried to kill init!\n");
> @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
> + for (t = next_thread(p); t != p; t = next_thread(t))
> + set_tsk_thread_flag(t, TIF_MEMDIE);
>
> force_sig(SIGKILL, p);
> }
I like the concept, but I agree that it would probably be better to write
it as Oleg suggested. The oom killer has been rewritten in the -mm tree,
so this patch doesn't apply cleanly; would it be possible to rebase it
onto mmotm with the suggested coding style and post it again?
See http://userweb.kernel.org/~akpm/mmotm/mmotm-readme.txt
Thanks!
* Re: [PATCH] oom killer: break from infinite loop
2010-03-26 22:33 ` Oleg Nesterov
@ 2010-03-28 14:55 ` anfei
-1 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-28 14:55 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On Fri, Mar 26, 2010 at 11:33:56PM +0100, Oleg Nesterov wrote:
> On 03/26, Andrew Morton wrote:
> >
> > On Thu, 25 Mar 2010 00:25:05 +0800
> > Anfei Zhou <anfei.zhou@gmail.com> wrote:
> >
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -381,6 +381,8 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > > */
> > > static void __oom_kill_task(struct task_struct *p, int verbose)
> > > {
> > > + struct task_struct *t;
> > > +
> > > if (is_global_init(p)) {
> > > WARN_ON(1);
> > > printk(KERN_WARNING "tried to kill init!\n");
> > > @@ -412,6 +414,8 @@ static void __oom_kill_task(struct task_struct *p, int verbose)
> > > */
> > > p->rt.time_slice = HZ;
> > > set_tsk_thread_flag(p, TIF_MEMDIE);
> > > + for (t = next_thread(p); t != p; t = next_thread(t))
> > > + set_tsk_thread_flag(t, TIF_MEMDIE);
> > >
> > > force_sig(SIGKILL, p);
> >
> > Don't we need some sort of locking while walking that ring?
>
> This should be always called under tasklist_lock, I think.
> At least this seems to be true in Linus's tree.
>
Yes, this function is always called with read_lock(&tasklist_lock), so
it should be okay.
> I'd suggest to do
>
> - set_tsk_thread_flag(p, TIF_MEMDIE);
> + t = p;
> + do {
> + set_tsk_thread_flag(t, TIF_MEMDIE);
> + } while_each_thread(p, t);
>
> but this is matter of taste.
>
Yes, this is better.
> Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
> need "force" semantics.
>
This may need a dedicated patch; there are some other places that
call force_sig(SIGKILL, ...) too.
> I'd wish I could understand the changelog ;)
>
Assume threads A and B are in the same group. If A runs into the OOM
killer and selects B as the victim, B won't exit, because at least in
exit_mm() it cannot take the mm->mmap_sem semaphore that A already
holds. So no memory is freed, and no other task will be selected to
kill.
I formatted the patch against the -mm tree as David suggested.
---
mm/oom_kill.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -418,8 +418,15 @@ static void dump_header(struct task_stru
*/
static void __oom_kill_task(struct task_struct *p)
{
+ struct task_struct *t;
+
p->rt.time_slice = HZ;
- set_tsk_thread_flag(p, TIF_MEMDIE);
+
+ t = p;
+ do {
+ set_tsk_thread_flag(t, TIF_MEMDIE);
+ } while_each_thread(p, t);
+
force_sig(SIGKILL, p);
}
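The deadlock behind the changelog can be sketched in userspace. Here `mmap_sem` is just a pthread mutex standing in for the kernel semaphore, and `victim_can_exit()` is a hypothetical stand-in for the part of the victim's exit path that must take mm->mmap_sem; none of these names are kernel APIs:

```c
#include <pthread.h>

/* Userspace sketch of the scenario: A holds "mmap_sem" while looping
 * in the allocator; the victim's exit path needs the same lock, so as
 * long as A holds it the victim can never finish exiting and free its
 * memory. */
static pthread_mutex_t mmap_sem = PTHREAD_MUTEX_INITIALIZER;

static int victim_can_exit(void)
{
    if (pthread_mutex_trylock(&mmap_sem) != 0)
        return 0;                    /* semaphore held by A: exit blocks */
    pthread_mutex_unlock(&mmap_sem);
    return 1;                        /* semaphore free: exit completes */
}
```

While A's allocation loop waits for the victim to die, the victim waits for A's lock: neither side can make progress unless A itself notices TIF_MEMDIE.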
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 14:55 ` anfei
@ 2010-03-28 16:28 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-28 16:28 UTC (permalink / raw)
To: anfei
Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On 03/28, anfei wrote:
>
> On Fri, Mar 26, 2010 at 11:33:56PM +0100, Oleg Nesterov wrote:
>
> > Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
> > need "force" semantics.
> >
> This may need a dedicated patch, there are some other places to
> force_sig(SIGKILL, ...) too.
Yes, yes, sure.
> > I'd wish I could understand the changelog ;)
> >
> Assume thread A and B are in the same group. If A runs into the oom,
> and selects B as the victim, B won't exit because at least in exit_mm(),
> it can not get the mm->mmap_sem semaphore which A has already got.
I see. But still I can't understand. To me, the problem is not that
B can't exit, the problem is that A doesn't know it should exit. All
threads should exit and free ->mm. Even if B could exit, this is not
enough. And, to some extent, it doesn't matter if it holds mmap_sem
or not.
Don't get me wrong. Even if I don't understand oom_kill.c the patch
looks obviously good to me, even from "common sense" pov. I am just
curious.
So, my understanding is: we are going to kill the whole thread group
but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
so that any thread can notice this flag and (say, __alloc_pages_slowpath)
fail asap.
Is my understanding correct?
Oleg.
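The fail-fast behavior Oleg describes can be modeled as a bounded userspace loop. `memdie` stands in for the per-thread TIF_MEMDIE test in __alloc_pages_slowpath, reclaim is assumed to make no progress as in the changelog, and the iteration cap exists only so this sketch terminates (the real loop is unbounded):

```c
/* Bounded model of the allocator retry loop: without the memdie flag
 * the loop just spins (the real loop would retry forever); with the
 * flag set, the allocation fails immediately so the task can exit. */
static int slowpath_iterations(int memdie, int cap)
{
    int iters = 0;

    while (iters < cap) {
        if (memdie)
            break;          /* TIF_MEMDIE set: fail the allocation */
        iters++;            /* reclaim made no progress; retry */
    }
    return iters;
}
```

This is why marking every thread in the group matters: whichever thread happens to be stuck in the allocator is the one that must see the flag.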
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 16:28 ` Oleg Nesterov
@ 2010-03-28 21:21 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-28 21:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Sun, 28 Mar 2010, Oleg Nesterov wrote:
> I see. But still I can't understand. To me, the problem is not that
> B can't exit, the problem is that A doesn't know it should exit. All
> threads should exit and free ->mm. Even if B could exit, this is not
> enough. And, to some extent, it doesn't matter if it holds mmap_sem
> or not.
>
> Don't get me wrong. Even if I don't understand oom_kill.c the patch
> looks obviously good to me, even from "common sense" pov. I am just
> curious.
>
> So, my understanding is: we are going to kill the whole thread group
> but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> fail asap.
>
> Is my understanding correct?
>
[Adding Mel Gorman <mel@csn.ul.ie> to the cc]
The problem with this approach is that we could easily deplete all memory
reserves if the oom-killed task has an extremely large number of threads.
There has always been only a single thread with TIF_MEMDIE set per cpuset
or memcg; for systems that run without cpusets or the memory controller,
this has been limited to one thread with TIF_MEMDIE for the entire system.
There's risk involved in suddenly allowing 1000 threads to have
TIF_MEMDIE set, and the chance of fully depleting all allowed zones is
much higher if they allocate memory prior to exit, for example.
An alternative is to fail allocations if they are failable and the
allocating task has a pending SIGKILL. It's better to preempt the oom
killer since current is going to be exiting anyway and this avoids a
needless kill.
That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
endlessly looping while making no progress.
Comments?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1610,13 +1610,21 @@ try_next_zone:
}
static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed)
{
/* Do not loop if specifically requested */
if (gfp_mask & __GFP_NORETRY)
return 0;
+ /* Loop if specifically requested */
+ if (gfp_mask & __GFP_NOFAIL)
+ return 1;
+
+ /* Task is killed, fail the allocation if possible */
+ if (fatal_signal_pending(p))
+ return 0;
+
/*
* In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
* means __GFP_NOFAIL, but that may not be true in other
@@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
return 1;
- /*
- * Don't let big-order allocations loop unless the caller
- * explicitly requests that.
- */
- if (gfp_mask & __GFP_NOFAIL)
- return 1;
-
return 0;
}
@@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
if (!in_interrupt() &&
((p->flags & PF_MEMALLOC) ||
+ (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
unlikely(test_thread_flag(TIF_MEMDIE))))
alloc_flags |= ALLOC_NO_WATERMARKS;
}
@@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int migratetype)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
+ const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
struct page *page = NULL;
int alloc_flags;
unsigned long pages_reclaimed = 0;
@@ -1876,7 +1879,7 @@ rebalance:
goto nopage;
/* Avoid allocations with no watermarks from looping endlessly */
- if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
+ if (test_thread_flag(TIF_MEMDIE) && !nofail)
goto nopage;
/* Try direct reclaim and then allocating */
@@ -1888,6 +1891,10 @@ rebalance:
if (page)
goto got_pg;
+ /* Task is killed, fail the allocation if possible */
+ if (fatal_signal_pending(p) && !nofail)
+ goto nopage;
+
/*
* If we failed to make any progress reclaiming, then we are
* running out of options and have to consider going OOM
@@ -1909,8 +1916,7 @@ rebalance:
* made, there are no other options and retrying is
* unlikely to help.
*/
- if (order > PAGE_ALLOC_COSTLY_ORDER &&
- !(gfp_mask & __GFP_NOFAIL))
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
goto nopage;
goto restart;
@@ -1919,7 +1925,7 @@ rebalance:
/* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
- if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+ if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
congestion_wait(BLK_RW_ASYNC, HZ/50);
goto rebalance;
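The reordered checks in the proposed should_alloc_retry() can be restated as a small pure function. The flag bits below are illustrative, not the kernel's real __GFP_* values, and the costly-order/__GFP_REPEAT checks of the real function are elided:

```c
#include <stdbool.h>

/* Illustrative flag bits; NOT the kernel's real __GFP_* values. */
#define GFP_NORETRY (1u << 0)
#define GFP_NOFAIL  (1u << 1)

/* Mirrors the check order in the proposed patch: __GFP_NORETRY never
 * loops, __GFP_NOFAIL always loops, and otherwise a pending SIGKILL
 * fails the allocation so the dying task can exit. */
static bool should_retry(unsigned int gfp, bool fatal_signal_pending)
{
    if (gfp & GFP_NORETRY)
        return false;       /* caller asked not to loop */
    if (gfp & GFP_NOFAIL)
        return true;        /* caller must never see failure */
    if (fatal_signal_pending)
        return false;       /* task is dying; fail the allocation */
    return true;            /* default small-order case: keep retrying */
}
```

The key ordering point is that __GFP_NOFAIL is tested before the pending-SIGKILL check, so nofail allocations still loop even in a dying task.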
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 21:21 ` David Rientjes
@ 2010-03-29 11:21 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-29 11:21 UTC (permalink / raw)
To: David Rientjes
Cc: anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/28, David Rientjes wrote:
>
> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads,
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
>
> There's risk involved with suddenly allowing 1000 threads to have
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer since current is going to be exiting anyway and this avoids a
> needless kill.
>
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
Can't comment, I do not understand these subtleties.
But I'd like to note that fatal_signal_pending() can be true when the
process wasn't killed, but another thread does exit_group/exec.
I am not saying this is wrong, just I'd like to be sure this didn't
escape your attention.
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +
> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +
> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 16:28 ` Oleg Nesterov
@ 2010-03-29 11:31 ` anfei
0 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-29 11:31 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On Sun, Mar 28, 2010 at 06:28:21PM +0200, Oleg Nesterov wrote:
> On 03/28, anfei wrote:
> >
> > On Fri, Mar 26, 2010 at 11:33:56PM +0100, Oleg Nesterov wrote:
> >
> > > Off-topic, but we shouldn't use force_sig(), SIGKILL doesn't
> > > need "force" semantics.
> > >
> > This may need a dedicated patch, there are some other places to
> > force_sig(SIGKILL, ...) too.
>
> Yes, yes, sure.
>
> > > I'd wish I could understand the changelog ;)
> > >
> > Assume thread A and B are in the same group. If A runs into the oom,
> > and selects B as the victim, B won't exit because at least in exit_mm(),
> > it can not get the mm->mmap_sem semaphore which A has already got.
>
> I see. But still I can't understand. To me, the problem is not that
> B can't exit, the problem is that A doesn't know it should exit. All
If B can exit, its memory will be freed, and A will be able to allocate
the memory, so A won't loop here.
Regards,
Anfei.
> threads should exit and free ->mm. Even if B could exit, this is not
> enough. And, to some extent, it doesn't matter if it holds mmap_sem
> or not.
>
> Don't get me wrong. Even if I don't understand oom_kill.c the patch
> looks obviously good to me, even from "common sense" pov. I am just
> curious.
>
> So, my understanding is: we are going to kill the whole thread group
> but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> fail asap.
>
> Is my understanding correct?
>
> Oleg.
>
* Re: [PATCH] oom killer: break from infinite loop
2010-03-29 11:31 ` anfei
@ 2010-03-29 11:46 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-29 11:46 UTC (permalink / raw)
To: anfei
Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On 03/29, anfei wrote:
>
> On Sun, Mar 28, 2010 at 06:28:21PM +0200, Oleg Nesterov wrote:
> > On 03/28, anfei wrote:
> > >
> > > Assume thread A and B are in the same group. If A runs into the oom,
> > > and selects B as the victim, B won't exit because at least in exit_mm(),
> > > it can not get the mm->mmap_sem semaphore which A has already got.
> >
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
>
> If B can exit, its memory will be freed,
Which memory? I thought, we are talking about the memory used by ->mm ?
Oleg.
* Re: [PATCH] oom killer: break from infinite loop
2010-03-29 11:46 ` Oleg Nesterov
@ 2010-03-29 12:09 ` anfei
0 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-29 12:09 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, rientjes, kosaki.motohiro, nishimura,
kamezawa.hiroyu, linux-mm, linux-kernel
On Mon, Mar 29, 2010 at 01:46:30PM +0200, Oleg Nesterov wrote:
> On 03/29, anfei wrote:
> >
> > On Sun, Mar 28, 2010 at 06:28:21PM +0200, Oleg Nesterov wrote:
> > > On 03/28, anfei wrote:
> > > >
> > > > Assume thread A and B are in the same group. If A runs into the oom,
> > > > and selects B as the victim, B won't exit because at least in exit_mm(),
> > > > it can not get the mm->mmap_sem semaphore which A has already got.
> > >
> > > I see. But still I can't understand. To me, the problem is not that
> > > B can't exit, the problem is that A doesn't know it should exit. All
> >
> > If B can exit, its memory will be freed,
>
> Which memory? I thought, we are talking about the memory used by ->mm ?
>
There are also a few small kernel structures tied to the task that can be
freed, but I think you are correct: the memory used by ->mm matters much
more, and it won't be freed even if B exits. So I agree with you on:
"
the problem is not that B can't exit, the problem is that A doesn't know
it should exit. All threads should exit and free ->mm. Even if B could
exit, this is not enough. And, to some extent, it doesn't matter if it
holds mmap_sem or not.
"
Thanks,
Anfei.
> Oleg.
>
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 21:21 ` David Rientjes
@ 2010-03-29 14:06 ` anfei
0 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-29 14:06 UTC (permalink / raw)
To: David Rientjes
Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
>
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> >
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> >
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> >
> > Is my understanding correct?
> >
>
> [Adding Mel Gorman <mel@csn.ul.ie> to the cc]
>
> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads,
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
>
> There's risk involved with suddenly allowing 1000 threads to have
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer since current is going to be exiting anyway and this avoids a
> needless kill.
>
I think this method is okay, but it can easily trigger another oom bug.
See select_bad_process():
if (!p->mm)
continue;
!p->mm is not always a reason to skip the task: e.g. "p" has been killed
and is exiting, and tsk->mm is set to NULL before the memory is actually
released. In a multi-threaded environment this happens much more often.
__out_of_memory() panics if select_bad_process() returns NULL. The
simple way to fix it is to do what mem_cgroup_out_of_memory() does.
So I think both of these two patches are needed.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index afeab2a..9aae208 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -588,12 +588,8 @@ retry:
if (PTR_ERR(p) == -1UL)
return;
- /* Found nothing?!?! Either we hang forever, or we panic. */
- if (!p) {
- read_unlock(&tasklist_lock);
- dump_header(NULL, gfp_mask, order, NULL);
- panic("Out of memory and no killable processes...\n");
- }
+ if (!p)
+ p = current;
if (oom_kill_process(p, gfp_mask, order, points, NULL,
"Out of memory"))
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +
> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +
> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
* Re: [PATCH] oom killer: break from infinite loop
@ 2010-03-29 14:06 ` anfei
0 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-29 14:06 UTC (permalink / raw)
To: David Rientjes
Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
>
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> >
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> >
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> >
> > Is my understanding correct?
> >
>
> [Adding Mel Gorman <mel@csn.ul.ie> to the cc]
>
> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads,
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
>
> There's risk involved with suddenly allowing 1000 threads to have
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer since current is going to be exiting anyway and this avoids a
> needless kill.
>
I think this method is okay, but it's easy to trigger another oom
bug. See select_bad_process():
if (!p->mm)
continue;
!p->mm is not always an unacceptable condition: e.g. "p" has been
killed and is exiting, and tsk->mm is set to NULL before the memory
is released. In a multi-threaded environment, this happens much more often.
In __out_of_memory(), it panics if select_bad_process() returns NULL.
The simple way to fix it is to do what mem_cgroup_out_of_memory() does.
So I think both of these patches are needed.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index afeab2a..9aae208 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -588,12 +588,8 @@ retry:
if (PTR_ERR(p) == -1UL)
return;
- /* Found nothing?!?! Either we hang forever, or we panic. */
- if (!p) {
- read_unlock(&tasklist_lock);
- dump_header(NULL, gfp_mask, order, NULL);
- panic("Out of memory and no killable processes...\n");
- }
+ if (!p)
+ p = current;
if (oom_kill_process(p, gfp_mask, order, points, NULL,
"Out of memory"))
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +
> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +
> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH] oom killer: break from infinite loop
2010-03-29 14:06 ` anfei
@ 2010-03-29 20:01 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-29 20:01 UTC (permalink / raw)
To: anfei
Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Mon, 29 Mar 2010, anfei wrote:
> I think this method is okay, but it's easy to trigger another bug of
> oom. See select_bad_process():
> if (!p->mm)
> continue;
> !p->mm is not always an unaccepted condition. e.g. "p" is killed and
> doing exit, setting tsk->mm to NULL is before releasing the memory.
> And in multi threading environment, this happens much more.
> In __out_of_memory(), it panics if select_bad_process returns NULL.
> The simple way to fix it is as mem_cgroup_out_of_memory() does.
>
This is fixed by
oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in
the -mm tree.
See
http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index afeab2a..9aae208 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -588,12 +588,8 @@ retry:
> if (PTR_ERR(p) == -1UL)
> return;
>
> - /* Found nothing?!?! Either we hang forever, or we panic. */
> - if (!p) {
> - read_unlock(&tasklist_lock);
> - dump_header(NULL, gfp_mask, order, NULL);
> - panic("Out of memory and no killable processes...\n");
> - }
> + if (!p)
> + p = current;
>
> if (oom_kill_process(p, gfp_mask, order, points, NULL,
> "Out of memory"))
The reason p wasn't selected is because it fails to meet the criteria for
candidacy in select_bad_process(), not necessarily because of a race with
the !p->mm check that the -mm patch cited above fixes. It's quite
possible that current has an oom_adj value of OOM_DISABLE, for example,
where this would be wrong.
* [patch] oom: give current access to memory reserves if it has been killed
2010-03-29 11:21 ` Oleg Nesterov
@ 2010-03-29 20:49 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-29 20:49 UTC (permalink / raw)
To: Oleg Nesterov, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
On Mon, 29 Mar 2010, Oleg Nesterov wrote:
> Can't comment, I do not understand these subtleties.
>
> But I'd like to note that fatal_signal_pending() can be true when the
> process wasn't killed, but another thread does exit_group/exec.
>
I'm not sure there's a difference between whether a process was oom killed
and received a SIGKILL that way or whether exit_group(2) was used, so I
don't think we need to test for (p->signal->flags & SIGNAL_GROUP_EXIT)
here.
We do need to guarantee that exiting tasks always can get memory, which is
the responsibility of setting TIF_MEMDIE. The only thing this patch does
is defer calling the oom killer when a task has a pending SIGKILL and then
fail the allocation when it would otherwise repeat. Instead of the
considerable risk involved in now failing GFP_KERNEL allocations under
PAGE_ALLOC_COSTLY_ORDER, which is typically never done, it may make
more sense to retry the allocation with TIF_MEMDIE on the second
iteration: in essence, automatically selecting current for oom kill
regardless of other oom killed tasks if it already has a pending SIGKILL.
oom: give current access to memory reserves if it has been killed
It's possible to livelock the page allocator if a thread has mm->mmap_sem and
fails to make forward progress because the oom killer selects another thread
sharing the same ->mm to kill that cannot exit until the semaphore is dropped.
The oom killer will not kill multiple tasks at the same time; each oom killed
task must exit before another task may be killed. Thus, if one thread is
holding mm->mmap_sem and cannot allocate memory, all threads sharing the same
->mm are blocked from exiting as well. In the oom kill case, that means the
thread holding mm->mmap_sem will never free additional memory since it cannot
get access to memory reserves and the thread that depends on it with access to
memory reserves cannot exit because it cannot acquire the semaphore. Thus,
the page allocator livelocks.
When the oom killer is called and current happens to have a pending SIGKILL,
this patch automatically selects it for kill so that it has access to memory
reserves and the better timeslice. Upon returning to the page allocator, its
allocation will hopefully succeed so it can quickly exit and free its memory.
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
}
/*
+ * If current has a pending SIGKILL, then automatically select it. The
+ * goal is to allow it to allocate so that it may quickly exit and free
+ * its memory.
+ */
+ if (fatal_signal_pending(current)) {
+ __oom_kill_task(current);
+ return;
+ }
+
+ /*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
* Re: [PATCH] oom killer: break from infinite loop
2010-03-29 20:01 ` David Rientjes
@ 2010-03-30 14:29 ` anfei
-1 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-03-30 14:29 UTC (permalink / raw)
To: David Rientjes
Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Mon, Mar 29, 2010 at 01:01:58PM -0700, David Rientjes wrote:
> On Mon, 29 Mar 2010, anfei wrote:
>
> > I think this method is okay, but it's easy to trigger another bug of
> > oom. See select_bad_process():
> > if (!p->mm)
> > continue;
> > !p->mm is not always an unaccepted condition. e.g. "p" is killed and
> > doing exit, setting tsk->mm to NULL is before releasing the memory.
> > And in multi threading environment, this happens much more.
> > In __out_of_memory(), it panics if select_bad_process returns NULL.
> > The simple way to fix it is as mem_cgroup_out_of_memory() does.
> >
>
> This is fixed by
> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch in
> the -mm tree.
>
> See
> http://userweb.kernel.org/~akpm/mmotm/broken-out/oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index afeab2a..9aae208 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -588,12 +588,8 @@ retry:
> > if (PTR_ERR(p) == -1UL)
> > return;
> >
> > - /* Found nothing?!?! Either we hang forever, or we panic. */
> > - if (!p) {
> > - read_unlock(&tasklist_lock);
> > - dump_header(NULL, gfp_mask, order, NULL);
> > - panic("Out of memory and no killable processes...\n");
> > - }
> > + if (!p)
> > + p = current;
> >
> > if (oom_kill_process(p, gfp_mask, order, points, NULL,
> > "Out of memory"))
>
> The reason p wasn't selected is because it fails to meet the criteria for
> candidacy in select_bad_process(), not necessarily because of a race with
> the !p->mm check that the -mm patch cited above fixes. It's quite
> possible that current has an oom_adj value of OOM_DISABLE, for example,
> where this would be wrong.
I see. And what about changing mem_cgroup_out_of_memory() too?
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cb1ca4..9e89a29 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -510,8 +510,10 @@ retry:
if (PTR_ERR(p) == -1UL)
goto out;
- if (!p)
- p = current;
+ if (!p) {
+ read_unlock(&tasklist_lock);
+ panic("Out of memory and no killable processes...\n");
+ }
if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-29 20:49 ` David Rientjes
@ 2010-03-30 15:46 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-30 15:46 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/29, David Rientjes wrote:
>
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> }
>
> /*
> + * If current has a pending SIGKILL, then automatically select it. The
> + * goal is to allow it to allocate so that it may quickly exit and free
> + * its memory.
> + */
> + if (fatal_signal_pending(current)) {
> + __oom_kill_task(current);
I am worried...
Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
->sighand != NULL. This is not true if out_of_memory() is called after
current has already passed exit_notify().
Hmm. looking at oom_kill.c... Afaics there are more problems with mt
applications. select_bad_process() does for_each_process() which can
only see the group leaders. This is fine, but what if ->group_leader
has already exited? In this case its ->mm == NULL, and we ignore the
whole thread group.
IOW, unless I missed something, it is very easy to hide the process
from oom-kill:
int main()
{
pthread_create(memory_hog_func);
syscall(__NR_exit);
}
probably we need something like
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -246,21 +246,27 @@ static enum oom_constraint constrained_a
static struct task_struct *select_bad_process(unsigned long *ppoints,
struct mem_cgroup *mem)
{
- struct task_struct *p;
+ struct task_struct *g, *p;
struct task_struct *chosen = NULL;
struct timespec uptime;
*ppoints = 0;
do_posix_clock_monotonic_gettime(&uptime);
- for_each_process(p) {
+ for_each_process(g) {
unsigned long points;
/*
* skip kernel threads and tasks which have already released
* their mm.
*/
+ p = g;
+ do {
+ if (p->mm)
+ break;
+ } while_each_thread(g, p);
if (!p->mm)
continue;
+
/* skip the init task */
if (is_global_init(p))
continue;
except it should be simplified and is_global_init() should check g.
No?
Oh... proc_oom_score() is racy. We can't trust ->group_leader even
under tasklist_lock. If we race with exit/exec it can point to
nowhere. I'll send the simple fix.
Oleg.
* [PATCH] oom: fix the unsafe proc_oom_score()->badness() call
2010-03-29 20:49 ` David Rientjes
@ 2010-03-30 16:39 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-30 16:39 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
proc_oom_score(task) has a reference to the task_struct, but that is all.
If this task was already released before we take tasklist_lock
- we can't use task->group_leader, it points to nowhere
- it is not safe to call badness() even if this task is
->group_leader, has_intersects_mems_allowed() assumes
it is safe to iterate over ->thread_group list.
Add the pid_alive() check to ensure __unhash_process() was not called.
Note: I think we shouldn't use ->group_leader, badness() should return
the same result for any sub-thread. However this is not true currently,
and I think that ->mm check and list_for_each_entry(p->children) in
badness are not right.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- 34-rc1/fs/proc/base.c~OOM_SCORE 2010-03-22 16:36:28.000000000 +0100
+++ 34-rc1/fs/proc/base.c 2010-03-30 18:23:50.000000000 +0200
@@ -430,12 +430,13 @@ static const struct file_operations proc
unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points;
+ unsigned long points = 0;
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ if (pid_alive(task))
+ points = badness(task->group_leader, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
* [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-30 16:39 ` Oleg Nesterov
@ 2010-03-30 17:43 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-30 17:43 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
->siglock is no longer needed to access task->signal, change
oom_adjust_read() and oom_adjust_write() to read/write oom_adj
lockless.
Yes, this means that "echo 2 >oom_adj" and "echo 1 >oom_adj"
can race and the second write can win, but I hope this is OK.
Also, clean up the EACCES case a bit.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 28 ++++++----------------------
1 file changed, 6 insertions(+), 22 deletions(-)
--- 34-rc1/fs/proc/base.c~PROC_5_OOM_ADJ 2010-03-30 18:23:50.000000000 +0200
+++ 34-rc1/fs/proc/base.c 2010-03-30 19:14:43.000000000 +0200
@@ -981,22 +981,16 @@ static ssize_t oom_adjust_read(struct fi
{
struct task_struct *task = get_proc_task(file->f_path.dentry->d_inode);
char buffer[PROC_NUMBUF];
+ int oom_adjust;
size_t len;
- int oom_adjust = OOM_DISABLE;
- unsigned long flags;
if (!task)
return -ESRCH;
- if (lock_task_sighand(task, &flags)) {
- oom_adjust = task->signal->oom_adj;
- unlock_task_sighand(task, &flags);
- }
-
+ oom_adjust = task->signal->oom_adj;
put_task_struct(task);
len = snprintf(buffer, sizeof(buffer), "%i\n", oom_adjust);
-
return simple_read_from_buffer(buf, count, ppos, buffer, len);
}
@@ -1006,7 +1000,6 @@ static ssize_t oom_adjust_write(struct f
struct task_struct *task;
char buffer[PROC_NUMBUF];
long oom_adjust;
- unsigned long flags;
int err;
memset(buffer, 0, sizeof(buffer));
@@ -1025,20 +1018,11 @@ static ssize_t oom_adjust_write(struct f
task = get_proc_task(file->f_path.dentry->d_inode);
if (!task)
return -ESRCH;
- if (!lock_task_sighand(task, &flags)) {
- put_task_struct(task);
- return -ESRCH;
- }
-
- if (oom_adjust < task->signal->oom_adj && !capable(CAP_SYS_RESOURCE)) {
- unlock_task_sighand(task, &flags);
- put_task_struct(task);
- return -EACCES;
- }
- task->signal->oom_adj = oom_adjust;
-
- unlock_task_sighand(task, &flags);
+ if (task->signal->oom_adj <= oom_adjust || capable(CAP_SYS_RESOURCE))
+ task->signal->oom_adj = oom_adjust;
+ else
+ count = -EACCES;
put_task_struct(task);
return count;
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-30 15:46 ` Oleg Nesterov
@ 2010-03-30 20:26 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-30 20:26 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -681,6 +681,16 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > }
> >
> > /*
> > + * If current has a pending SIGKILL, then automatically select it. The
> > + * goal is to allow it to allocate so that it may quickly exit and free
> > + * its memory.
> > + */
> > + if (fatal_signal_pending(current)) {
> > + __oom_kill_task(current);
>
> I am worried...
>
> Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> ->sighand != NULL. This is not true if out_of_memory() is called after
> current has already passed exit_notify().
>
We have an even bigger problem if current is in the oom killer at
exit_notify() since it has already detached its ->mm in exit_mm() :)
> Hmm. looking at oom_kill.c... Afaics there are more problems with mt
> applications. select_bad_process() does for_each_process() which can
> only see the group leaders. This is fine, but what if ->group_leader
> has already exited? In this case its ->mm == NULL, and we ignore the
> whole thread group.
>
> IOW, unless I missed something, it is very easy to hide the process
> from oom-kill:
>
> int main()
> {
> pthread_create(memory_hog_func);
> syscall(__NR_exit);
> }
>
The check for !p->mm was moved in the -mm tree (where the oom killer has
been entirely rewritten, so I encourage you to work off of it instead) by
oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch to
after the check for PF_EXITING. PF_EXITING is set in the exit path before
the ->mm is detached, so if the oom killer finds an already exiting task,
it becomes a no-op: that task should eventually free its memory, and a
needless oom kill is avoided.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-30 14:29 ` anfei
@ 2010-03-30 20:29 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-30 20:29 UTC (permalink / raw)
To: anfei, KAMEZAWA Hiroyuki
Cc: Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
Mel Gorman, linux-mm, linux-kernel
On Tue, 30 Mar 2010, anfei wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index afeab2a..9aae208 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -588,12 +588,8 @@ retry:
> > > if (PTR_ERR(p) == -1UL)
> > > return;
> > >
> > > - /* Found nothing?!?! Either we hang forever, or we panic. */
> > > - if (!p) {
> > > - read_unlock(&tasklist_lock);
> > > - dump_header(NULL, gfp_mask, order, NULL);
> > > - panic("Out of memory and no killable processes...\n");
> > > - }
> > > + if (!p)
> > > + p = current;
> > >
> > > if (oom_kill_process(p, gfp_mask, order, points, NULL,
> > > "Out of memory"))
> >
> > The reason p wasn't selected is because it fails to meet the criteria for
> > candidacy in select_bad_process(), not necessarily because of a race with
> > the !p->mm check that the -mm patch cited above fixes. It's quite
> > possible that current has an oom_adj value of OOM_DISABLE, for example,
> > where this would be wrong.
>
> I see. And what about changing mem_cgroup_out_of_memory() too?
>
The memory controller is different because it must kill a task even if
another task is exiting since the imposed limit has been reached.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 0cb1ca4..9e89a29 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -510,8 +510,10 @@ retry:
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> - if (!p)
> - p = current;
> + if (!p) {
> + read_unlock(&tasklist_lock);
> + panic("Out of memory and no killable processes...\n");
> + }
>
> if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> "Memory cgroup out of memory"))
>
This actually does appear to be necessary but for a different reason: if
current is unkillable because it has OOM_DISABLE, for example, then
oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
will infinitely loop.
Kame-san?
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-30 17:43 ` Oleg Nesterov
@ 2010-03-30 20:30 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-30 20:30 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> ->siglock is no longer needed to access task->signal, change
> oom_adjust_read() and oom_adjust_write() to read/write oom_adj
> lockless.
>
> Yes, this means that "echo 2 >oom_adj" and "echo 1 >oom_adj"
> can race and the second write can win, but I hope this is OK.
>
Ok, but could you base this on -mm at
http://userweb.kernel.org/~akpm/mmotm/ since an additional tunable has
been added (oom_score_adj), which does the same thing?
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom: fix the unsafe proc_oom_score()->badness() call
2010-03-30 16:39 ` Oleg Nesterov
@ 2010-03-30 20:32 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-30 20:32 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> proc_oom_score(task) has a reference to task_struct, but that is all.
> If this task was already released before we take tasklist_lock
>
> - we can't use task->group_leader, it points to nowhere
>
> - it is not safe to call badness() even if this task is
> ->group_leader, has_intersects_mems_allowed() assumes
> it is safe to iterate over ->thread_group list.
>
> Add the pid_alive() check to ensure __unhash_process() was not called.
>
> Note: I think we shouldn't use ->group_leader, badness() should return
> the same result for any sub-thread. However this is not true currently,
> and I think that ->mm check and list_for_each_entry(p->children) in
> badness are not right.
>
I think it would be better to just use task and not task->group_leader.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-30 20:29 ` David Rientjes
@ 2010-03-31 0:57 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31 0:57 UTC (permalink / raw)
To: David Rientjes
Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
Mel Gorman, linux-mm, linux-kernel
On Tue, 30 Mar 2010 13:29:29 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 0cb1ca4..9e89a29 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -510,8 +510,10 @@ retry:
> > if (PTR_ERR(p) == -1UL)
> > goto out;
> >
> > - if (!p)
> > - p = current;
> > + if (!p) {
> > + read_unlock(&tasklist_lock);
> > + panic("Out of memory and no killable processes...\n");
> > + }
> >
> > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > "Memory cgroup out of memory"))
> >
>
> This actually does appear to be necessary but for a different reason: if
> current is unkillable because it has OOM_DISABLE, for example, then
> oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> will infinitely loop.
>
> Kame-san?
>
When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
we can do nothing. (we can't panic because container's death != system death.)
Because memcg itself has a mutex+waitqueue for mutual exclusion of the OOM killer,
I think an infinite loop will not be a critical problem for the whole system.
And, now, memcg has oom-kill-disable + oom-kill-notifier features.
So, if a memcg goes into OOM and there is no killable process, but oom-kill is
not disabled for that memcg, it means the system admin's misconfiguration.
The admin can stop the infinite loop by hand, anyway:
# echo 1 > ..../group_A/memory.oom_control
Thanks,
-Kame
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 0:57 ` KAMEZAWA Hiroyuki
@ 2010-03-31 6:07 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 6:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 0cb1ca4..9e89a29 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -510,8 +510,10 @@ retry:
> > > if (PTR_ERR(p) == -1UL)
> > > goto out;
> > >
> > > - if (!p)
> > > - p = current;
> > > + if (!p) {
> > > + read_unlock(&tasklist_lock);
> > > + panic("Out of memory and no killable processes...\n");
> > > + }
> > >
> > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > "Memory cgroup out of memory"))
> > >
> >
> > This actually does appear to be necessary but for a different reason: if
> > current is unkillable because it has OOM_DISABLE, for example, then
> > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> > will infinitely loop.
> >
> > Kame-san?
> >
>
> When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> we can do nothing. (we can't panic because container's death != system death.)
>
> Because memcg itself has a mutex+waitqueue for mutual exclusion of the OOM killer,
> I think an infinite loop will not be a critical problem for the whole system.
>
> And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> So, if a memcg goes into OOM and there is no killable process, but oom-kill is
> not disabled for that memcg, it means the system admin's misconfiguration.
>
> The admin can stop the infinite loop by hand, anyway:
> # echo 1 > ..../group_A/memory.oom_control
>
Then we should be able to do this since current is by definition
unkillable since it was not found in select_bad_process(), right?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
read_lock(&tasklist_lock);
retry:
p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
- if (PTR_ERR(p) == -1UL)
+ if (!p || PTR_ERR(p) == -1UL)
goto out;
- if (!p)
- p = current;
-
if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
goto retry;
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 6:07 ` David Rientjes
@ 2010-03-31 6:13 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31 6:13 UTC (permalink / raw)
To: David Rientjes
Cc: anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro, nishimura,
Mel Gorman, linux-mm, linux-kernel, balbir
On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
>
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index 0cb1ca4..9e89a29 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -510,8 +510,10 @@ retry:
> > > > if (PTR_ERR(p) == -1UL)
> > > > goto out;
> > > >
> > > > - if (!p)
> > > > - p = current;
> > > > + if (!p) {
> > > > + read_unlock(&tasklist_lock);
> > > > + panic("Out of memory and no killable processes...\n");
> > > > + }
> > > >
> > > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > "Memory cgroup out of memory"))
> > > >
> > >
> > > This actually does appear to be necessary but for a different reason: if
> > > current is unkillable because it has OOM_DISABLE, for example, then
> > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> > > will infinitely loop.
> > >
> > > Kame-san?
> > >
> >
> > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > we can do nothing. (we can't panic because container's death != system death.)
> >
> > Because memcg itself has a mutex+waitqueue for mutual exclusion of the OOM killer,
> > I think an infinite loop will not be a critical problem for the whole system.
> >
> > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > So, if a memcg goes into OOM and there is no killable process, but oom-kill is
> > not disabled for that memcg, it means the system admin's misconfiguration.
> >
> > The admin can stop the infinite loop by hand, anyway:
> > # echo 1 > ..../group_A/memory.oom_control
> >
>
> Then we should be able to do this since current is by definition
> unkillable since it was not found in select_bad_process(), right?
To me, this patch is acceptable and seems reasonable.
But I had not joined memcg development when this check was added,
so I don't know why we kill current:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d
Adding Balbir to CC. Maybe the situation has changed now.
Because we can stop the infinite loop (by hand) and there are no rushing oom-kill
callers, this change is acceptable.
Thanks,
-Kame
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> read_lock(&tasklist_lock);
> retry:
> p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> - if (PTR_ERR(p) == -1UL)
> + if (!p || PTR_ERR(p) == -1UL)
> goto out;
>
> - if (!p)
> - p = current;
> -
> if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> "Memory cgroup out of memory"))
> goto retry;
>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 6:13 ` KAMEZAWA Hiroyuki
@ 2010-03-31 6:30 ` Balbir Singh
-1 siblings, 0 replies; 197+ messages in thread
From: Balbir Singh @ 2010-03-31 6:30 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: David Rientjes, anfei, Oleg Nesterov, Andrew Morton,
KOSAKI Motohiro, nishimura, Mel Gorman, linux-mm, linux-kernel
* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-31 15:13:56]:
> On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
> > On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> >
> > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > index 0cb1ca4..9e89a29 100644
> > > > > --- a/mm/oom_kill.c
> > > > > +++ b/mm/oom_kill.c
> > > > > @@ -510,8 +510,10 @@ retry:
> > > > > if (PTR_ERR(p) == -1UL)
> > > > > goto out;
> > > > >
> > > > > - if (!p)
> > > > > - p = current;
> > > > > + if (!p) {
> > > > > + read_unlock(&tasklist_lock);
> > > > > + panic("Out of memory and no killable processes...\n");
> > > > > + }
> > > > >
> > > > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > > "Memory cgroup out of memory"))
> > > > >
> > > >
> > > > This actually does appear to be necessary but for a different reason: if
> > > > current is unkillable because it has OOM_DISABLE, for example, then
> > > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> > > > will infinitely loop.
> > > >
> > > > Kame-san?
> > > >
> > >
> > > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > > we can do nothing. (we can't panic because container's death != system death.)
> > >
> > > Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer,
> > > I think infinite-loop will not be critical probelm for the whole system.
> > >
> > > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > > So, If a memcg goes into OOM and there is no killable process, but oom-kill is
> > > not disabled by memcg.....it means system admin's mis-configuraton.
> > >
> > > He can stop inifite loop by hand, anyway.
> > > # echo 1 > ..../group_A/memory.oom_control
> > >
> >
> > Then we should be able to do this since current is by definition
> > unkillable since it was not found in select_bad_process(), right?
>
> To me, this patch is acceptable and seems reasnoable.
>
> But I didn't joined to memcg development when this check was added
> and don't know why kill current..
>
The reason for adding current was that we did not want to loop
forever, since that stops forward progress - no error, no forward
progress. It made sense to OOM kill the current process so that the
cgroup admin could look at what went wrong.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d
>
> Addinc Balbir to CC. Maybe situation is changed now.
> Because we can stop inifinite loop (by hand) and there is no rushing oom-kill
> callers, this change is acceptable.
>
"By hand" is not always possible if we have a large number of cgroups
(I've seen a setup with 2000 cgroups on the libcgroup ML); 2000 cgroups
times the number of processes makes the situation complex. I think using
the OOM notifier is now another way of handling such a situation.
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 6:30 ` Balbir Singh
@ 2010-03-31 6:31 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31 6:31 UTC (permalink / raw)
To: balbir
Cc: David Rientjes, anfei, Oleg Nesterov, Andrew Morton,
KOSAKI Motohiro, nishimura, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010 12:00:07 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-31 15:13:56]:
>
> > On Tue, 30 Mar 2010 23:07:08 -0700 (PDT)
> > David Rientjes <rientjes@google.com> wrote:
> >
> > > On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> > >
> > > > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > > > index 0cb1ca4..9e89a29 100644
> > > > > > --- a/mm/oom_kill.c
> > > > > > +++ b/mm/oom_kill.c
> > > > > > @@ -510,8 +510,10 @@ retry:
> > > > > > if (PTR_ERR(p) == -1UL)
> > > > > > goto out;
> > > > > >
> > > > > > - if (!p)
> > > > > > - p = current;
> > > > > > + if (!p) {
> > > > > > + read_unlock(&tasklist_lock);
> > > > > > + panic("Out of memory and no killable processes...\n");
> > > > > > + }
> > > > > >
> > > > > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > > > "Memory cgroup out of memory"))
> > > > > >
> > > > >
> > > > > This actually does appear to be necessary but for a different reason: if
> > > > > current is unkillable because it has OOM_DISABLE, for example, then
> > > > > oom_kill_process() will repeatedly fail and mem_cgroup_out_of_memory()
> > > > > will infinitely loop.
> > > > >
> > > > > Kame-san?
> > > > >
> > > >
> > > > When a memcg goes into OOM and it only has unkillable processes (OOM_DISABLE),
> > > > we can do nothing. (we can't panic because container's death != system death.)
> > > >
> > > > Because memcg itself has mutex+waitqueue for mutual execusion of OOM killer,
> > > > I think infinite-loop will not be critical probelm for the whole system.
> > > >
> > > > And, now, memcg has oom-kill-disable + oom-kill-notifier features.
> > > > So, If a memcg goes into OOM and there is no killable process, but oom-kill is
> > > > not disabled by memcg.....it means system admin's mis-configuraton.
> > > >
> > > > He can stop inifite loop by hand, anyway.
> > > > # echo 1 > ..../group_A/memory.oom_control
> > > >
> > >
> > > Then we should be able to do this since current is by definition
> > > unkillable since it was not found in select_bad_process(), right?
> >
> > To me, this patch is acceptable and seems reasnoable.
> >
> > But I didn't joined to memcg development when this check was added
> > and don't know why kill current..
> >
>
> The reason for adding current was that we did not want to loop
> forever, since it stops forward progress - no error/no forward
> progress. It made sense to oom kill the current process, so that the
> cgroup admin could look at what went wrong.
>
Now, the notifier is triggered.
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ba5c9e8176704bfac0729875fa62798037584d
> >
> > Addinc Balbir to CC. Maybe situation is changed now.
> > Because we can stop inifinite loop (by hand) and there is no rushing oom-kill
> > callers, this change is acceptable.
> >
>
> By hand is not always possible if we have a large number of cgroups
> (I've seen a setup with 2000 cgroups on libcgroup ML). 2000 cgroups *
> number of processes make the situation complex. I think using OOM
> notifier is now another way of handling such a situation.
>
"By hand" includes "automatically with a daemon program", of course.
Hmm, in short, is your opinion "killing current is good for now"?
I have no strong opinion here. (Because I'll recommend that all customers
disable oom kill if they don't want any task to be killed automatically.)
Thanks,
-Kame
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 6:30 ` Balbir Singh
@ 2010-03-31 6:32 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 6:32 UTC (permalink / raw)
To: Balbir Singh
Cc: KAMEZAWA Hiroyuki, anfei, Oleg Nesterov, Andrew Morton,
KOSAKI Motohiro, nishimura, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Balbir Singh wrote:
> > To me, this patch is acceptable and seems reasnoable.
> >
> > But I didn't joined to memcg development when this check was added
> > and don't know why kill current..
> >
>
> The reason for adding current was that we did not want to loop
> forever, since it stops forward progress - no error/no forward
> progress. It made sense to oom kill the current process, so that the
> cgroup admin could look at what went wrong.
>
oom_kill_process() will fail on current since it wasn't selected as an
eligible task to kill in select_bad_process() and we know it to be a
member of the memcg, so there's no point in trying to kill it.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-31 6:31 ` KAMEZAWA Hiroyuki
@ 2010-03-31 7:04 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 7:04 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: balbir, anfei, Oleg Nesterov, Andrew Morton, KOSAKI Motohiro,
nishimura, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, KAMEZAWA Hiroyuki wrote:
> "By hand" includes "automatically with daemon program", of course.
>
> Hmm, in short, your opinion is "killing current is good for now" ?
>
> I have no strong opinion, here. (Because I'll recommend all customers to
> disable oom kill if they don't want any task to be killed automatically.)
>
I think there are a couple of options: either define threshold notifiers
with memory.usage_in_bytes so userspace can proactively address low memory
situations prior to oom, or use the oom notifier after setting
echo 1 > /dev/cgroup/blah/memory.oom_control to address those issues
in userspace as they happen. If userspace wants to defer back to the
kernel oom killer because it can't raise max_usage_in_bytes, then
echo 0 > /dev/cgroup/blah/memory.oom_control should take care of it
instantly. And I'd rather see a misconfigured memcg with tasks that are
OOM_DISABLE but not memcg->oom_kill_disable be starved of memory than
panic the entire system.
Those are good options for users having to deal with low memory
situations, thanks for continuing to work on it!
^ permalink raw reply [flat|nested] 197+ messages in thread
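The two toggles discussed above are plain writes to the cgroup v1 memcg control file. The path follows the /dev/cgroup/blah example used in the message; the actual mount point varies by system, and the read-back fields shown are those documented for the v1 memory controller:

```shell
# Hand this memcg's OOM handling to userspace: the kernel stops killing
# and tasks block until the notifier acts or the limit is raised.
echo 1 > /dev/cgroup/blah/memory.oom_control

# Return control to the kernel OOM killer.
echo 0 > /dev/cgroup/blah/memory.oom_control

# Read back the setting and whether the group is currently wedged in OOM
# (reports oom_kill_disable and under_oom).
cat /dev/cgroup/blah/memory.oom_control
```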
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-03-31 7:08 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
@ 2010-03-31 7:08 ` KAMEZAWA Hiroyuki
2010-03-31 8:04 ` Balbir Singh
2010-04-04 23:28 ` David Rientjes
2 siblings, 0 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-31 7:08 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura, Balbir Singh, linux-mm
On Wed, 31 Mar 2010 00:08:38 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Ah, okay. If current is killable, it should have been found by select_bad_process().
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> mm/oom_kill.c | 5 +----
> 1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> read_lock(&tasklist_lock);
> retry:
> p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> - if (PTR_ERR(p) == -1UL)
> + if (!p || PTR_ERR(p) == -1UL)
> goto out;
>
> - if (!p)
> - p = current;
> -
> if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> "Memory cgroup out of memory"))
> goto retry;
>
^ permalink raw reply [flat|nested] 197+ messages in thread
* [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-03-31 6:32 ` David Rientjes
@ 2010-03-31 7:08 ` David Rientjes
2010-03-31 7:08 ` KAMEZAWA Hiroyuki
` (2 more replies)
-1 siblings, 3 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 7:08 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
It's pointless to try to kill current if select_bad_process() did not
find an eligible task to kill in mem_cgroup_out_of_memory() since it's
guaranteed that current is a member of the memcg that is oom and it is,
by definition, unkillable.
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
read_lock(&tasklist_lock);
retry:
p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
- if (PTR_ERR(p) == -1UL)
+ if (!p || PTR_ERR(p) == -1UL)
goto out;
- if (!p)
- p = current;
-
if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
"Memory cgroup out of memory"))
goto retry;
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-03-31 7:08 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
2010-03-31 7:08 ` KAMEZAWA Hiroyuki
@ 2010-03-31 8:04 ` Balbir Singh
2010-03-31 10:38 ` David Rientjes
2010-04-04 23:28 ` David Rientjes
2 siblings, 1 reply; 197+ messages in thread
From: Balbir Singh @ 2010-03-31 8:04 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro,
nishimura, linux-mm
* David Rientjes <rientjes@google.com> [2010-03-31 00:08:38]:
> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
> mm/oom_kill.c | 5 +----
> 1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> read_lock(&tasklist_lock);
> retry:
> p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> - if (PTR_ERR(p) == -1UL)
> + if (!p || PTR_ERR(p) == -1UL)
> goto out;
Should we have a big fat WARN_ON_ONCE() here?
--
Three Cheers,
Balbir
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom: fix the unsafe proc_oom_score()->badness() call
2010-03-30 20:32 ` David Rientjes
@ 2010-03-31 9:16 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 9:16 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/30, David Rientjes wrote:
>
> On Tue, 30 Mar 2010, Oleg Nesterov wrote:
>
> > proc_oom_score(task) have a reference to task_struct, but that is all.
> > If this task was already released before we take tasklist_lock
> >
> > - we can't use task->group_leader, it points to nowhere
> >
> > - it is not safe to call badness() even if this task is
> > ->group_leader, has_intersects_mems_allowed() assumes
> > it is safe to iterate over ->thread_group list.
> >
> > Add the pid_alive() check to ensure __unhash_process() was not called.
> >
> > Note: I think we shouldn't use ->group_leader, badness() should return
> > the same result for any sub-thread. However this is not true currently,
> > and I think that ->mm check and list_for_each_entry(p->children) in
> > badness are not right.
> >
>
> I think it would be better to just use task and not task->group_leader.
Sure, agreed. I preserved ->group_leader just because I didn't understand
why the current code doesn't use task. But note that pid_alive() is still
needed.
I'll check the code in -mm and resend.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-30 20:30 ` David Rientjes
@ 2010-03-31 9:17 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 9:17 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/30, David Rientjes wrote:
>
> On Tue, 30 Mar 2010, Oleg Nesterov wrote:
>
> > ->siglock is no longer needed to access task->signal, change
> > oom_adjust_read() and oom_adjust_write() to read/write oom_adj
> > lockless.
> >
> > Yes, this means that "echo 2 >oom_adj" and "echo 1 >oom_adj"
> > can race and the second write can win, but I hope this is OK.
> >
>
> Ok, but could you base this on -mm at
> http://userweb.kernel.org/~akpm/mmotm/ since an additional tunable has
> been added (oom_score_adj), which does the same thing?
Ah, OK, will do.
Thanks David.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-03-31 8:04 ` Balbir Singh
@ 2010-03-31 10:38 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 10:38 UTC (permalink / raw)
To: Balbir Singh
Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro,
nishimura, linux-mm
On Wed, 31 Mar 2010, Balbir Singh wrote:
> > It's pointless to try to kill current if select_bad_process() did not
> > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > guaranteed that current is a member of the memcg that is oom and it is,
> > by definition, unkillable.
> >
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> > mm/oom_kill.c | 5 +----
> > 1 files changed, 1 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >  	read_lock(&tasklist_lock);
> >  retry:
> >  	p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > -	if (PTR_ERR(p) == -1UL)
> > +	if (!p || PTR_ERR(p) == -1UL)
> >  		goto out;
>
> Should we have a big fat WARN_ON_ONCE() here?
>
I'm not sure a WARN_ON_ONCE() is going to be too helpful to a sysadmin who
has misconfigured the memcg here, since all it will do is emit the stack
trace and line number; it's not going to be immediately obvious that this
is because all tasks in the cgroup are unkillable, so he or she should do
echo 1 > /dev/cgroup/blah/memory.oom_control as a remedy.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-30 20:26 ` David Rientjes
@ 2010-03-31 17:58 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 17:58 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/30, David Rientjes wrote:
>
> On Tue, 30 Mar 2010, Oleg Nesterov wrote:
>
> > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > ->sighand != NULL. This is not true if out_of_memory() is called after
> > current has already passed exit_notify().
>
> We have an even bigger problem if current is in the oom killer at
> exit_notify() since it has already detached its ->mm in exit_mm() :)
Can't understand... I thought that in theory even kmalloc(1) can trigger
oom.
Say, right after exit_mm() we are doing acct_process(), and f_op->write()
needs a page. So, you are saying that in this case __page_cache_alloc()
can never trigger out_of_memory() ?
> > IOW, unless I missed something, it is very easy to hide the process
> > from oom-kill:
> >
> > int main()
> > {
> > pthread_create(memory_hog_func);
> > syscall(__NR_exit);
> > }
> >
>
> The check for !p->mm was moved in the -mm tree (and the oom killer was
> entirely rewritten in that tree, so I encourage you to work off of it
> instead
OK, but I guess this !p->mm check is still wrong for the same reason.
In fact I do not understand why it is needed in select_bad_process()
right before oom_badness() which checks ->mm too (and this check is
equally wrong).
> with
> oom-avoid-race-for-oom-killed-tasks-detaching-mm-prior-to-exit.patch to
> even after the check for PF_EXITING. This is set in the exit path before
> the ->mm is detached
Yes. Then I do not understand "if (!p->mm)" completely.
> so if the oom killer finds an already exiting task,
> it will become a no-op since it should eventually free memory and avoids a
> needless oom kill.
No, afaics. And this reminds me that I already complained about this
PF_EXITING check.
Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
is set) but it can have sub-threads. This means, unless I missed something,
any user can trivially disable select_bad_process() forever.
Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it
would be better to fix these mt bugs first...
Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
Again, this is not right even if we forget about !child->mm check.
This list_for_each_entry() can only see the processes forked by the
main thread.
Likewise, oom_kill_process()->list_for_each_entry() is not right too.
Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-30 20:30 ` David Rientjes
@ 2010-03-31 18:59 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 18:59 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/30, David Rientjes wrote:
>
> On Tue, 30 Mar 2010, Oleg Nesterov wrote:
>
> > ->siglock is no longer needed to access task->signal, change
> > oom_adjust_read() and oom_adjust_write() to read/write oom_adj
> > lockless.
> >
> > Yes, this means that "echo 2 >oom_adj" and "echo 1 >oom_adj"
> > can race and the second write can win, but I hope this is OK.
>
> Ok, but could you base this on -mm at
> http://userweb.kernel.org/~akpm/mmotm/ since an additional tunable has
> been added (oom_score_adj), which does the same thing?
David, I just can't understand why
oom-badness-heuristic-rewrite.patch
duplicates the related code in fs/proc/base.c and why it preserves
the deprecated signal->oom_adj.
OK. Please forget about lock_task_sighand/signal issues. Can't we kill
signal->oom_adj and create a single helper for both
/proc/pid/{oom_adj,oom_score_adj} ?
static ssize_t oom_any_adj_write(struct file *file, const char __user *buf,
				 size_t count, bool deprecated_mode)
{
	struct task_struct *task;
	char buffer[PROC_NUMBUF];
	unsigned long flags;
	long oom_score_adj;
	int err;

	memset(buffer, 0, sizeof(buffer));
	if (count > sizeof(buffer) - 1)
		count = sizeof(buffer) - 1;
	if (copy_from_user(buffer, buf, count))
		return -EFAULT;

	err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
	if (err)
		return -EINVAL;

	if (deprecated_mode) {
		if (oom_score_adj == OOM_ADJUST_MAX)
			oom_score_adj = OOM_SCORE_ADJ_MAX;
		else
			oom_score_adj = (oom_score_adj * OOM_SCORE_ADJ_MAX) /
					-OOM_DISABLE;
	}

	if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
	    oom_score_adj > OOM_SCORE_ADJ_MAX)
		return -EINVAL;

	task = get_proc_task(file->f_path.dentry->d_inode);
	if (!task)
		return -ESRCH;
	if (!lock_task_sighand(task, &flags)) {
		put_task_struct(task);
		return -ESRCH;
	}
	if (oom_score_adj < task->signal->oom_score_adj &&
	    !capable(CAP_SYS_RESOURCE)) {
		unlock_task_sighand(task, &flags);
		put_task_struct(task);
		return -EACCES;
	}

	task->signal->oom_score_adj = oom_score_adj;
	unlock_task_sighand(task, &flags);
	put_task_struct(task);
	return count;
}
This is just the current oom_score_adj_write() + "if (deprecated_mode)"
which does the oom_adj -> oom_score_adj conversion.
Now,
static ssize_t oom_adjust_write(...)
{
	printk_once(KERN_WARNING "... deprecated ...\n");
	return oom_any_adj_write(..., true);
}

static ssize_t oom_score_adj_write(...)
{
	return oom_any_adj_write(..., false);
}
The same for oom_xxx_read().
What is the point to keep signal->oom_adj ?
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom: fix the unsafe proc_oom_score()->badness() call
2010-03-31 9:16 ` Oleg Nesterov
@ 2010-03-31 20:17 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 20:17 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/31, Oleg Nesterov wrote:
>
> On 03/30, David Rientjes wrote:
> >
> > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> >
> > > proc_oom_score(task) has a reference to task_struct, but that is all.
> > > If this task was already released before we take tasklist_lock
> > >
> > > - we can't use task->group_leader, it points to nowhere
> > >
> > > - it is not safe to call badness() even if this task is
> > > ->group_leader, has_intersects_mems_allowed() assumes
> > > it is safe to iterate over ->thread_group list.
> > >
> > > Add the pid_alive() check to ensure __unhash_process() was not called.
> > >
> > > Note: I think we shouldn't use ->group_leader, badness() should return
> > > the same result for any sub-thread. However this is not true currently,
> > > and I think that ->mm check and list_for_each_entry(p->children) in
> > > badness are not right.
> > >
> >
> > I think it would be better to just use task and not task->group_leader.
>
> Sure, agreed. I preserved ->group_leader just because I didn't understand
> why the current code doesn't use task. But note that pid_alive() is still
> needed.
Oh. No, with the current code in -mm pid_alive() is not needed if
we use task instead of task->group_leader. But once we fix
oom_forkbomb_penalty() it will be needed again.
But. Oh well. David, oom-badness-heuristic-rewrite.patch changed badness()
to consult p->signal->oom_score_adj. Until recently this was wrong when it
is called from proc_oom_score().
This means oom-badness-heuristic-rewrite.patch depends on
signals-make-task_struct-signal-immutable-refcountable.patch, or we
need the pid_alive() check again.
oom_badness() gets the new argument, long totalpages, and the callers
were updated. However, long uptime is not used any longer; probably
it makes sense to kill this arg and simplify the callers? Unless you
are going to take run-time into account later.
So, I think -mm needs the patch below, but I have no idea how to
write the changelog ;)
Oleg.
--- x/fs/proc/base.c
+++ x/fs/proc/base.c
@@ -430,12 +430,13 @@ static const struct file_operations proc
 /* The badness from the OOM killer */
 static int proc_oom_score(struct task_struct *task, char *buffer)
 {
-	unsigned long points;
+	unsigned long points = 0;
 	struct timespec uptime;
 	do_posix_clock_monotonic_gettime(&uptime);
 	read_lock(&tasklist_lock);
-	points = oom_badness(task->group_leader,
+	if (pid_alive(task))
+		points = oom_badness(task,
 			global_page_state(NR_INACTIVE_ANON) +
 			global_page_state(NR_ACTIVE_ANON) +
 			global_page_state(NR_INACTIVE_FILE) +
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 17:58 ` Oleg Nesterov
@ 2010-03-31 20:47 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 20:47 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/31, Oleg Nesterov wrote:
>
> OK, but I guess this !p->mm check is still wrong for the same reason.
> In fact I do not understand why it is needed in select_bad_process()
> right before oom_badness() which checks ->mm too (and this check is
> equally wrong).
Probably something like the patch below makes sense. Note that
"skip kernel threads" logic is wrong too, we should check PF_KTHREAD.
Probably it is better to check it in select_bad_process() instead,
near is_global_init().
The new helper, find_lock_task_mm(), should be used by
oom_forkbomb_penalty() too.
dump_tasks() doesn't need it, it does do_each_thread(). Cough,
__out_of_memory() and out_of_memory() call it without tasklist.
We are going to panic() anyway, but still.
Oleg.
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -129,6 +129,19 @@ static unsigned long oom_forkbomb_penalt
(child_rss / sysctl_oom_forkbomb_thres) : 0;
}
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+	struct task_struct *t = p;
+
+	do {
+		task_lock(t);
+		if (likely(t->mm && !(t->flags & PF_KTHREAD)))
+			return t;
+		task_unlock(t);
+	} while_each_thread(p, t);
+
+	return NULL;
+}
+
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
 	if (p->flags & PF_OOM_ORIGIN)
 		return 1000;
-	task_lock(p);
-	mm = p->mm;
-	if (!mm) {
-		task_unlock(p);
+	p = find_lock_task_mm(p);
+	if (!p)
 		return 0;
-	}
-
 	/*
 	 * The baseline for the badness score is the proportion of RAM that each
 	 * task's rss and swap space use.
@@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
 		*ppoints = 1000;
 	}
-	/*
-	 * skip kernel threads and tasks which have already released
-	 * their mm.
-	 */
-	if (!p->mm)
-		continue;
 	if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 		continue;
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 17:58 ` Oleg Nesterov
@ 2010-03-31 21:07 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 21:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> On 03/30, David Rientjes wrote:
> >
> > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> >
> > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > current has already passed exit_notify().
> >
> > We have an even bigger problem if current is in the oom killer at
> > exit_notify() since it has already detached its ->mm in exit_mm() :)
>
> Can't understand... I thought that in theory even kmalloc(1) can trigger
> oom.
>
__oom_kill_task() cannot be called on a task without an ->mm.
> > > IOW, unless I missed something, it is very easy to hide the process
> > > from oom-kill:
> > >
> > > int main()
> > > {
> > > pthread_create(memory_hog_func);
> > > syscall(__NR_exit);
> > > }
> > >
> >
> > The check for !p->mm was moved in the -mm tree (and the oom killer was
> > entirely rewritten in that tree, so I encourage you to work off of it
> > instead
>
> OK, but I guess this !p->mm check is still wrong for the same reason.
> In fact I do not understand why it is needed in select_bad_process()
> right before oom_badness() which checks ->mm too (and this check is
> equally wrong).
>
It prevents kthreads from being killed. We already identify tasks that
are in the exit path with PF_EXITING in select_bad_process() and have
chosen to make the oom killer a no-op when it's not current, so it can
exit and free its memory. If it is current, then we're ooming in the exit
path and we need to oom kill it so that it gets access to memory reserves
and is no longer blocking.
> > so if the oom killer finds an already exiting task,
> > it will become a no-op since it should eventually free memory and avoids a
> > needless oom kill.
>
> No, afaics, And this reminds that I already complained about this
> PF_EXITING check.
>
> Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
> is set) but it can have sub-threads. This means, unless I missed something,
> any user can trivially disable select_bad_process() forever.
>
The task is in the process of exiting and will do so if it's not current;
otherwise it will get access to memory reserves since we're obviously oom
in the exit path. Thus, we'll be freeing that memory soon, or recalling
the oom killer to kill additional tasks once those children have been
reparented (or one of its children was sacrificed).
>
> Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it
> would be better to fix these mt bugs first...
>
> Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> Again, this is not right even if we forget about !child->mm check.
> This list_for_each_entry() can only see the processes forked by the
> main thread.
>
That's the intention.
> Likewise, oom_kill_process()->list_for_each_entry() is not right too.
>
Why?
> Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
>
Right, but we need to ensure that the check for !child->mm || child->mm ==
tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
detach its mm prior to the dereference. It would be possible to move the
thread_group_cputime() out of this critical section, but I felt it was
better to filter out all tasks with child->mm == tsk->mm first before
unnecessarily finding the cputime for them.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
@ 2010-03-31 21:07 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 21:07 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> On 03/30, David Rientjes wrote:
> >
> > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> >
> > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > current has already passed exit_notify().
> >
> > We have an even bigger problem if current is in the oom killer at
> > exit_notify() since it has already detached its ->mm in exit_mm() :)
>
> Can't understand... I thought that in theory even kmalloc(1) can trigger
> oom.
>
__oom_kill_task() cannot be called on a task without an ->mm.
> > > IOW, unless I missed something, it is very easy to hide the process
> > > from oom-kill:
> > >
> > > int main()
> > > {
> > > pthread_create(memory_hog_func);
> > > syscall(__NR_exit);
> > > }
> > >
> >
> > The check for !p->mm was moved in the -mm tree (and the oom killer was
> > entirely rewritten in that tree, so I encourage you to work off of it
> > instead
>
> OK, but I guess this !p->mm check is still wrong for the same reason.
> In fact I do not understand why it is needed in select_bad_process()
> right before oom_badness() which checks ->mm too (and this check is
> equally wrong).
>
It prevents kthreads from being killed. We already identify tasks that
are in the exit path with PF_EXITING in select_bad_process() and chosen to
make the oom killer a no-op when it's not current so it can exit and free
its memory. If it is current, then we're ooming in the exit path and we
need to oom kill it so that it gets access to memory reserves so its no
longer blocking.
> > so if the oom killer finds an already exiting task,
> > it will become a no-op since it should eventually free memory and avoids a
> > needless oom kill.
>
> No, afaics, And this reminds that I already complained about this
> PF_EXITING check.
>
> Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
> is set) but it can have sub-threads. This means, unless I missed something,
> any user can trivially disable select_bad_process() forever.
>
The task is in the process of exiting and will do so if its not current,
otherwise it will get access to memory reserves since we're obviously oom
in the exit path. Thus, we'll be freeing that memory soon or recalling
the oom killer to kill additional tasks once those children have been
reparented (or one of its children was sacrificed).
>
> Well. Looks like, -mm has a lot of changes in oom_kill.c. Perhaps it
> would be better to fix these mt bugs first...
>
> Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> Again, this is not right even if we forget about !child->mm check.
> This list_for_each_entry() can only see the processes forked by the
> main thread.
>
That's the intention.
> Likewise, oom_kill_process()->list_for_each_entry() is not right too.
>
Why?
> Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
>
Right, but we need to ensure that the check for !child->mm || child->mm ==
tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
detach its mm prior to the dereference. It would be possible to move the
thread_group_cputime() out of this critical section, but I felt it was
better to filter all tasks with child->mm == tsk->mm first before
unnecessarily finding the cputime for them.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-31 18:59 ` Oleg Nesterov
@ 2010-03-31 21:14 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 21:14 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> David, I just can't understand why
> oom-badness-heuristic-rewrite.patch
> duplicates the related code in fs/proc/base.c and why it preserves
> the deprecated signal->oom_adj.
>
You could combine the two write functions together and then two read
functions together if you'd like.
> OK. Please forget about lock_task_sighand/signal issues. Can't we kill
> signal->oom_adj and create a single helper for both
> /proc/pid/{oom_adj,oom_score_adj} ?
>
> static ssize_t oom_any_adj_write(struct file *file, const char __user *buf,
> size_t count, bool deprecated_mode)
> {
> struct task_struct *task;
> char buffer[PROC_NUMBUF];
> unsigned long flags;
> long oom_score_adj;
> int err;
>
> memset(buffer, 0, sizeof(buffer));
> if (count > sizeof(buffer) - 1)
> count = sizeof(buffer) - 1;
> if (copy_from_user(buffer, buf, count))
> return -EFAULT;
>
> err = strict_strtol(strstrip(buffer), 0, &oom_score_adj);
> if (err)
> return -EINVAL;
>
> if (depraceted_mode) {
> if (oom_score_adj == OOM_ADJUST_MAX)
> oom_score_adj = OOM_SCORE_ADJ_MAX;
???
> else
> oom_score_adj = (oom_score_adj * OOM_SCORE_ADJ_MAX) /
> -OOM_DISABLE;
> }
>
> if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
> oom_score_adj > OOM_SCORE_ADJ_MAX)
That doesn't work for depraceted_mode (sic); you'd need to test for
OOM_ADJUST_MIN and OOM_ADJUST_MAX in that case.
> return -EINVAL;
>
> task = get_proc_task(file->f_path.dentry->d_inode);
> if (!task)
> return -ESRCH;
> if (!lock_task_sighand(task, &flags)) {
> put_task_struct(task);
> return -ESRCH;
> }
> if (oom_score_adj < task->signal->oom_score_adj &&
> !capable(CAP_SYS_RESOURCE)) {
> unlock_task_sighand(task, &flags);
> put_task_struct(task);
> return -EACCES;
> }
>
> task->signal->oom_score_adj = oom_score_adj;
>
> unlock_task_sighand(task, &flags);
> put_task_struct(task);
> return count;
> }
>
There have been efforts to reuse as much of this code as possible for
other sysctl handlers as well, you might be better off looking for other
users of the common read and write code and then merging them first
(comm_write, proc_coredump_filter_write, etc).
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 21:07 ` David Rientjes
@ 2010-03-31 22:50 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 22:50 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/31, David Rientjes wrote:
>
> On Wed, 31 Mar 2010, Oleg Nesterov wrote:
>
> > On 03/30, David Rientjes wrote:
> > >
> > > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> > >
> > > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > > current has already passed exit_notify().
> > >
> > > We have an even bigger problem if current is in the oom killer at
> > > exit_notify() since it has already detached its ->mm in exit_mm() :)
> >
> > Can't understand... I thought that in theory even kmalloc(1) can trigger
> > oom.
>
> __oom_kill_task() cannot be called on a task without an ->mm.
Why? You ignored this part:
Say, right after exit_mm() we are doing acct_process(), and f_op->write()
needs a page. So, you are saying that in this case __page_cache_alloc()
can never trigger out_of_memory() ?
Why is this not possible?
David, I am not arguing, I am asking.
> > > The check for !p->mm was moved in the -mm tree (and the oom killer was
> > > entirely rewritten in that tree, so I encourage you to work off of it
> > > instead
> >
> > OK, but I guess this !p->mm check is still wrong for the same reason.
> > In fact I do not understand why it is needed in select_bad_process()
> > right before oom_badness() which checks ->mm too (and this check is
> > equally wrong).
>
> It prevents kthreads from being killed.
No it doesn't, see use_mm(). See also another email I sent.
> > > so if the oom killer finds an already exiting task,
> > > it will become a no-op since it should eventually free memory and avoids a
> > > needless oom kill.
> >
> No, afaics. And this reminds me that I already complained about this
> > PF_EXITING check.
> >
> > Once again, p is the group leader. It can be dead (no ->mm, PF_EXITING
> > is set) but it can have sub-threads. This means, unless I missed something,
> > any user can trivially disable select_bad_process() forever.
> >
>
> The task is in the process of exiting and will do so if it's not current,
> otherwise it will get access to memory reserves since we're obviously oom
> in the exit path. Thus, we'll be freeing that memory soon or recalling
> the oom killer to kill additional tasks once those children have been
> reparented (or one of its children was sacrificed).
Just can't understand.
OK, a bad user does
	static void *sleep_forever(void *arg)
	{
		pause();
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;

		pthread_create(&tid, NULL, sleep_forever, NULL);
		syscall(__NR_exit, 0);	/* only the leader exits */
	}
Now, every time select_bad_process() is called it will find this process
and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
this process is not going to exit.
> > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > Again, this is not right even if we forget about !child->mm check.
> > This list_for_each_entry() can only see the processes forked by the
> > main thread.
> >
>
> That's the intention.
Why? Shouldn't oom_badness() return the same result for any thread
in a thread group? We should take all children into account.
> > Likewise, oom_kill_process()->list_for_each_entry() is not right too.
> >
>
> Why?
>
> > Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> > task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
> >
>
> Right, but we need to ensure that the check for !child->mm || child->mm ==
> tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
> detach its mm prior to the dereference.
Oh, yes sure, I mentioned get_mm_rss() above.
> It would be possible to move the
> thread_group_cputime() out of this critical section,
Yes, this is what I meant.
> but I felt it was
> better to do filter all tasks with child->mm == tsk->mm first before
> unnecessarily finding the cputime for them.
Yes, but we can check child->mm == tsk->mm, call get_mm_counter() and drop
task_lock().
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-31 21:14 ` David Rientjes
@ 2010-03-31 23:00 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 23:00 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/31, David Rientjes wrote:
>
> On Wed, 31 Mar 2010, Oleg Nesterov wrote:
>
> > David, I just can't understand why
> > oom-badness-heuristic-rewrite.patch
> > duplicates the related code in fs/proc/base.c and why it preserves
> > the deprecated signal->oom_adj.
>
> You could combine the two write functions together and then two read
> functions together if you'd like.
Yes,
> > static ssize_t oom_any_adj_write(struct file *file, const char __user *buf,
> > size_t count, bool deprecated_mode)
> > {
> >
> > if (depraceted_mode) {
> > if (oom_score_adj == OOM_ADJUST_MAX)
> > oom_score_adj = OOM_SCORE_ADJ_MAX;
>
> ???
What?
> > else
> > oom_score_adj = (oom_score_adj * OOM_SCORE_ADJ_MAX) /
> > -OOM_DISABLE;
> > }
> >
> > if (oom_score_adj < OOM_SCORE_ADJ_MIN ||
> > oom_score_adj > OOM_SCORE_ADJ_MAX)
>
> That doesn't work for depraceted_mode (sic), you'd need to test for
> OOM_ADJUST_MIN and OOM_ADJUST_MAX in that case.
Yes, probably "if (depraceted_mode)" should do more checks; I didn't try
to verify that MIN/MAX are correctly converted. I showed this code to explain
what I mean.
> There have been efforts to reuse as much of this code as possible for
> other sysctl handlers as well, you might be better off looking for
David, sorry ;) Right now I'd better try to stop the overloading of
->siglock. And, I'd like to shrink signal_struct if possible, but this
is minor.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 22:50 ` Oleg Nesterov
@ 2010-03-31 23:30 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-03-31 23:30 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/01, Oleg Nesterov wrote:
>
> On 03/31, David Rientjes wrote:
> >
> > On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> >
> > > On 03/30, David Rientjes wrote:
> > > >
> > > > On Tue, 30 Mar 2010, Oleg Nesterov wrote:
> > > >
> > > > > Note that __oom_kill_task() does force_sig(SIGKILL) which assumes that
> > > > > ->sighand != NULL. This is not true if out_of_memory() is called after
> > > > > current has already passed exit_notify().
> > > >
> > > > We have an even bigger problem if current is in the oom killer at
> > > > exit_notify() since it has already detached its ->mm in exit_mm() :)
> > >
> > > Can't understand... I thought that in theory even kmalloc(1) can trigger
> > > oom.
> >
> > __oom_kill_task() cannot be called on a task without an ->mm.
>
> Why? You ignored this part:
>
> Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> needs a page. So, you are saying that in this case __page_cache_alloc()
> can never trigger out_of_memory() ?
>
> why this is not possible?
>
> David, I am not arguing, I am asking.
In case I wasn't clear...
Yes, currently __oom_kill_task(p) is not possible if p->mm == NULL.
But your patch adds
if (fatal_signal_pending(current))
__oom_kill_task(current);
into out_of_memory().
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 23:30 ` Oleg Nesterov
@ 2010-03-31 23:48 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-03-31 23:48 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > Why? You ignored this part:
> >
> > Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> > needs a page. So, you are saying that in this case __page_cache_alloc()
> > can never trigger out_of_memory() ?
> >
> > why this is not possible?
> >
> > David, I am not arguing, I am asking.
>
> In case I wasn't clear...
>
> Yes, currently __oom_kill_task(p) is not possible if p->mm == NULL.
>
> But your patch adds
>
> if (fatal_signal_pending(current))
> __oom_kill_task(current);
>
> into out_of_memory().
>
Ok, and it's possible during the tasklist scan if current is PF_EXITING
and that gets passed to oom_kill_process(), so we need the following
patch. Can I have your acked-by and then I'll propose it to Andrew with a
follow-up that merges __oom_kill_task() into oom_kill_task() since it only
has one caller now anyway?
[ Both of these situations will be current, since the oom killer is a
no-op whenever another task is found to be PF_EXITING and
oom_kill_process() wouldn't get called with any other thread unless
oom_kill_quick is enabled or it's VM_FAULT_OOM, in which cases we kill
current as well. ]
Thanks Oleg.
---
mm/oom_kill.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* its children or threads, just set TIF_MEMDIE so it can die quickly
*/
if (p->flags & PF_EXITING) {
- __oom_kill_task(p);
+ set_tsk_thread_flag(p, TIF_MEMDIE);
return 0;
}
@@ -686,7 +686,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* its memory.
*/
if (fatal_signal_pending(current)) {
- __oom_kill_task(current);
+ set_tsk_thread_flag(current, TIF_MEMDIE);
return;
}
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom: fix the unsafe proc_oom_score()->badness() call
2010-03-31 20:17 ` Oleg Nesterov
@ 2010-04-01 7:41 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 7:41 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> But. Oh well. David, oom-badness-heuristic-rewrite.patch changed badness()
> to consult p->signal->oom_score_adj. Until recently this was wrong when it
> is called from proc_oom_score().
>
> This means oom-badness-heuristic-rewrite.patch depends on
> signals-make-task_struct-signal-immutable-refcountable.patch, or we
> need the pid_alive() check again.
>
oom-badness-heuristic-rewrite.patch didn't change anything; Linus' tree
currently dereferences p->signal->oom_adj, which is no different from
dereferencing p->signal->oom_score_adj without a refcount on the
signal_struct in -mm. oom_adj was moved to struct signal_struct in
2.6.32, see 28b83c5.
> oom_badness() gets a new argument, long totalpages, and the callers
> were updated. However, long uptime is no longer used; probably
> it makes sense to kill this arg and simplify the callers? Unless you
> are going to take run-time into account later.
>
> So, I think -mm needs the patch below, but I have no idea how to
> write the changelog ;)
>
> Oleg.
>
> --- x/fs/proc/base.c
> +++ x/fs/proc/base.c
> @@ -430,12 +430,13 @@ static const struct file_operations proc
> /* The badness from the OOM killer */
> static int proc_oom_score(struct task_struct *task, char *buffer)
> {
> - unsigned long points;
> + unsigned long points = 0;
> struct timespec uptime;
>
> do_posix_clock_monotonic_gettime(&uptime);
> read_lock(&tasklist_lock);
> - points = oom_badness(task->group_leader,
> + if (pid_alive(task))
> + points = oom_badness(task,
> global_page_state(NR_INACTIVE_ANON) +
> global_page_state(NR_ACTIVE_ANON) +
> global_page_state(NR_INACTIVE_FILE) +
This should be protected by the get_proc_task() on the inode before
this function is called from proc_info_read().
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 22:50 ` Oleg Nesterov
@ 2010-04-01 8:25 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 8:25 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> Why? You ignored this part:
>
> Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> needs a page. So, you are saying that in this case __page_cache_alloc()
> can never trigger out_of_memory() ?
>
> why this is not possible?
>
It can, but the check for p->mm is sufficient since exit_notify() takes
write_lock_irq(&tasklist_lock) that the oom killer holds for read, so the
rule is that whenever we have a valid p->mm, we have a valid p->sighand
and can do force_sig() while under tasklist_lock. The only time we call
oom_kill_process() without holding a readlock on tasklist_lock is for
current during pagefault ooms and we know it's not exiting because it's in
the oom killer.
> > > OK, but I guess this !p->mm check is still wrong for the same reason.
> > > In fact I do not understand why it is needed in select_bad_process()
> > > right before oom_badness() which checks ->mm too (and this check is
> > > equally wrong).
> >
> > It prevents kthreads from being killed.
>
> No it doesn't, see use_mm(). See also another email I sent.
>
We cannot rely on oom_badness() to filter this task because we still
select it as our chosen task even with a badness score of 0 if !chosen, so
we must filter these threads ahead of time:
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
}
Filtering on !p->mm saves us from writing "if (points > *ppoints ||
(!chosen && p->mm))"; it's just cleaner and makes this rule
explicit.
Your point about p->mm being non-NULL for kthreads using use_mm() is
taken, we should probably just change the is_global_init() check in
select_bad_process() to p->flags & PF_KTHREAD and ensure we reject
oom_kill_process() for them.
> > The task is in the process of exiting and will do so if it's not current,
> > otherwise it will get access to memory reserves since we're obviously oom
> > in the exit path. Thus, we'll be freeing that memory soon or recalling
> > the oom killer to kill additional tasks once those children have been
> > reparented (or one of its children was sacrificed).
>
> Just can't understand.
>
> OK, a bad user does
>
> void *sleep_forever(void *arg)
> {
> 	pause();
> }
>
> int main(void)
> {
> 	pthread_create(..., sleep_forever, ...);
> 	syscall(__NR_exit);
> }
>
> Now, every time select_bad_process() is called it will find this process
> and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
> this process is not going to exit.
>
Hmm, so it looks like we need to filter on !p->mm before checking for
PF_EXITING so that tasks that are EXIT_ZOMBIE won't make the oom killer
into a no-op.
> > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > Again, this is not right even if we forget about !child->mm check.
> > > This list_for_each_entry() can only see the processes forked by the
> > > main thread.
> > >
> >
> > That's the intention.
>
> Why? shouldn't oom_badness() return the same result for any thread
> in thread group? We should take all childs into account.
>
oom_forkbomb_penalty() only cares about first-descendant children that
do not share the same memory, so we purposely penalize the parent so that
it is more likely to be selected for oom kill, and it will then sacrifice
these children in oom_kill_process().
> > > Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> > > task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
> > >
> >
> > Right, but we need to ensure that the check for !child->mm || child->mm ==
> > tsk->mm fails before adding in get_mm_rss(child->mm). It can race and
> > detach its mm prior to the dereference.
>
> Oh, yes sure, I mentioned get_mm_rss() above.
>
> > It would be possible to move the
> > thread_group_cputime() out of this critical section,
>
> Yes, this is what I meant.
>
You could, but then you'd be calling thread_group_cputime() for all
threads even though they may not share the same ->mm as tsk.
> > but I felt it was
> > better to filter all tasks with child->mm == tsk->mm first before
> > unnecessarily finding the cputime for them.
>
> Yes, but we can check child->mm == tsk->mm, call get_mm_counter() and drop
> task_lock().
>
We need task_lock() to ensure child->mm hasn't detached between the check
for child->mm == tsk->mm and get_mm_rss(child->mm). So I'm not sure what
you're trying to improve with this variation, it's a tradeoff between
calling thread_group_cputime() under task_lock() for a subset of a task's
threads when we already need to hold task_lock() anyway vs. calling it for
all threads unconditionally.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-03-31 23:00 ` Oleg Nesterov
@ 2010-04-01 8:32 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 8:32 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > That doesn't work for depraceted_mode (sic), you'd need to test for
> > OOM_ADJUST_MIN and OOM_ADJUST_MAX in that case.
>
> Yes, probably "if (depraceted_mode)" should do more checks, I didn't try
> to verify that MIN/MAX are correctly converted. I showed this code to explain
> what I mean.
>
Ok, please cc me on the patch, it will be good to get rid of the duplicate
code and remove oom_adj from struct signal_struct.
> > There have been efforts to reuse as much of this code as possible for
> > other sysctl handlers as well, you might be better off looking for
>
> David, sorry ;) Right now I'd better try to stop the overloading of
> ->siglock. And, I'd like to shrink struct_signal if possible, but this
> is minor.
>
Do we need ->siglock? Why can't we just do
struct sighand_struct *sighand;
struct signal_struct *sig;
rcu_read_lock();
sighand = rcu_dereference(task->sighand);
if (!sighand) {
rcu_read_unlock();
return;
}
sig = task->signal;
... load/store to sig ...
rcu_read_unlock();
instead?
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 20:47 ` Oleg Nesterov
@ 2010-04-01 8:35 ` David Rientjes
-1 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 8:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Wed, 31 Mar 2010, Oleg Nesterov wrote:
> Probably something like the patch below makes sense. Note that
> "skip kernel threads" logic is wrong too, we should check PF_KTHREAD.
> Probably it is better to check it in select_bad_process() instead,
> near is_global_init().
>
is_global_init() will be true for p->flags & PF_KTHREAD.
> The new helper, find_lock_task_mm(), should be used by
> oom_forkbomb_penalty() too.
>
> dump_tasks() doesn't need it, it does do_each_thread(). Cough,
> __out_of_memory() and out_of_memory() call it without tasklist.
> We are going to panic() anyway, but still.
>
Indeed, good observation.
> Oleg.
>
> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -129,6 +129,19 @@ static unsigned long oom_forkbomb_penalt
> (child_rss / sysctl_oom_forkbomb_thres) : 0;
> }
>
> +static struct task_struct *find_lock_task_mm(struct task_struct *p)
> +{
> + struct task_struct *t = p;
> + do {
> + task_lock(t);
> + if (likely(t->mm && !(t->flags & PF_KTHREAD)))
> + return t;
> + task_unlock(t);
> + } while_each_thread(p, t);
> +
> + return NULL;
> +}
> +
> /**
> * oom_badness - heuristic function to determine which candidate task to kill
> * @p: task struct of which task we should calculate
> @@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
> if (p->flags & PF_OOM_ORIGIN)
> return 1000;
>
> - task_lock(p);
> - mm = p->mm;
> - if (!mm) {
> - task_unlock(p);
> + p = find_lock_task_mm(p);
> + if (!p)
> return 0;
> - }
> -
> /*
> * The baseline for the badness score is the proportion of RAM that each
> * task's rss and swap space use.
> @@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
> *ppoints = 1000;
> }
>
> - /*
> - * skip kernel threads and tasks which have already released
> - * their mm.
> - */
> - if (!p->mm)
> - continue;
> if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> continue;
You can't do this for the reason I cited in another email, oom_badness()
returning 0 does not exclude a task from being chosen by
select_bad_process(), it will use that task if nothing else has been found
yet. We must explicitly filter it from consideration by checking for
!p->mm.
^ permalink raw reply [flat|nested] 197+ messages in thread
* [patch -mm] oom: hold tasklist_lock when dumping tasks
2010-04-01 8:35 ` David Rientjes
(?)
@ 2010-04-01 8:57 ` David Rientjes
2010-04-01 14:27 ` Oleg Nesterov
-1 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-01 8:57 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, linux-mm
dump_header() always requires tasklist_lock to be held because it calls
dump_tasks() which iterates through the tasklist. There are a few places
where this isn't maintained, so make sure tasklist_lock is always held
whenever calling dump_header().
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 23 ++++++++++-------------
1 files changed, 10 insertions(+), 13 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -395,6 +395,9 @@ static void dump_tasks(const struct mem_cgroup *mem)
} while_each_thread(g, p);
}
+/*
+ * Call with tasklist_lock read-locked.
+ */
static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *mem)
{
@@ -641,8 +644,8 @@ retry:
/* Found nothing?!?! Either we hang forever, or we panic. */
if (!p) {
- read_unlock(&tasklist_lock);
dump_header(NULL, gfp_mask, order, NULL);
+ read_unlock(&tasklist_lock);
panic("Out of memory and no killable processes...\n");
}
@@ -675,11 +678,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
/* Got some memory back in the last second. */
return;
- if (sysctl_panic_on_oom == 2) {
- dump_header(NULL, gfp_mask, order, NULL);
- panic("out of memory. Compulsory panic_on_oom is selected.\n");
- }
-
/*
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
@@ -688,15 +686,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
&totalpages);
read_lock(&tasklist_lock);
if (unlikely(sysctl_panic_on_oom)) {
- /*
- * panic_on_oom only affects CONSTRAINT_NONE, the kernel
- * should not panic for cpuset or mempolicy induced memory
- * failures.
- */
- if (constraint == CONSTRAINT_NONE) {
+ if (sysctl_panic_on_oom == 2 || constraint == CONSTRAINT_NONE) {
dump_header(NULL, gfp_mask, order, NULL);
read_unlock(&tasklist_lock);
- panic("Out of memory: panic_on_oom is enabled\n");
+ panic("Out of memory: %s panic_on_oom is enabled\n",
+ sysctl_panic_on_oom == 2 ? "compulsory" :
+ "system-wide");
}
}
__out_of_memory(gfp_mask, order, totalpages, constraint, nodemask);
@@ -724,8 +719,10 @@ void pagefault_out_of_memory(void)
if (try_set_system_oom()) {
constrained_alloc(NULL, 0, NULL, &totalpages);
+ read_lock(&tasklist_lock);
err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
"Out of memory (pagefault)");
+ read_unlock(&tasklist_lock);
if (err)
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
--
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH 0/1] oom: fix the unsafe usage of badness() in proc_oom_score()
2010-04-01 7:41 ` David Rientjes
@ 2010-04-01 13:13 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 13:13 UTC (permalink / raw)
To: David Rientjes, Andrew Morton, Linus Torvalds
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel, stable
On 04/01, David Rientjes wrote:
>
> On Wed, 31 Mar 2010, Oleg Nesterov wrote:
>
> > But. Oh well. David, oom-badness-heuristic-rewrite.patch changed badness()
> > to consult p->signal->oom_score_adj. Until recently this was wrong when it
> > is called from proc_oom_score().
> >
> > This means oom-badness-heuristic-rewrite.patch depends on
> > signals-make-task_struct-signal-immutable-refcountable.patch, or we
> > need the pid_alive() check again.
> >
>
> oom-badness-heuristic-rewrite.patch didn't change anything, Linus' tree
> currently dereferences p->signal->oom_adj
Yes, I wrongly blamed oom-badness-heuristic-rewrite.patch, vanilla does
the same.
Now this is really bad, and I am resending my patch.
David, Andrew, I understand it (textually) conflicts with
oom-badness-heuristic-rewrite.patch, but this bug should be fixed imho
before other changes. I hope it will be easy to fixup this chunk
@@ -447,7 +447,13 @@ static int proc_oom_score(struct task_st
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ points = oom_badness(task->group_leader,
in that patch.
> > do_posix_clock_monotonic_gettime(&uptime);
> > read_lock(&tasklist_lock);
> > - points = oom_badness(task->group_leader,
> > + if (pid_alive(task))
> > + points = oom_badness(task,
> > global_page_state(NR_INACTIVE_ANON) +
> > global_page_state(NR_ACTIVE_ANON) +
> > global_page_state(NR_INACTIVE_FILE) +
>
> This should be protected by the get_proc_task() on the inode before
> this function is called from proc_info_read().
No, get_proc_task() shouldn't (and can't) do this. To clarify,
get_proc_task() does check that the task wasn't unhashed, but nothing can
prevent release_task() from running after that. Once again, only task_struct
itself is protected by get_task_struct(), nothing more.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH 1/1] oom: fix the unsafe usage of badness() in proc_oom_score()
2010-04-01 13:13 ` Oleg Nesterov
@ 2010-04-01 13:13 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 13:13 UTC (permalink / raw)
To: David Rientjes, Andrew Morton, Linus Torvalds
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel, stable
proc_oom_score(task) has a reference to task_struct, but that is all.
If this task was already released before we take tasklist_lock
- we can't use task->group_leader, it points to nowhere
- it is not safe to call badness() even if this task is
->group_leader, has_intersects_mems_allowed() assumes
it is safe to iterate over ->thread_group list.
- even worse, badness() can hit ->signal == NULL
Add the pid_alive() check to ensure __unhash_process() was not called.
Also, use "task" instead of task->group_leader. badness() should return
the same result for any sub-thread. Currently this is not true, but
this should be changed anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/base.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- TTT/fs/proc/base.c~PROC_OOM_SCORE 2010-03-11 13:11:50.000000000 +0100
+++ TTT/fs/proc/base.c 2010-04-01 14:41:17.000000000 +0200
@@ -442,12 +442,13 @@ static const struct file_operations proc
unsigned long badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points;
+ unsigned long points = 0;
struct timespec uptime;
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ if (pid_alive(task))
+ points = badness(task, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 8:35 ` David Rientjes
@ 2010-04-01 14:00 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 13:59 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/01, David Rientjes wrote:
>
> On Wed, 31 Mar 2010, Oleg Nesterov wrote:
>
> > Probably something like the patch below makes sense. Note that
> > "skip kernel threads" logic is wrong too, we should check PF_KTHREAD.
> > Probably it is better to check it in select_bad_process() instead,
> > near is_global_init().
>
> is_global_init() will be true for p->flags & PF_KTHREAD.
No, is_global_init() && PF_KTHREAD have nothing to do with each other.
> > @@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
> > if (p->flags & PF_OOM_ORIGIN)
> > return 1000;
> >
> > - task_lock(p);
> > - mm = p->mm;
> > - if (!mm) {
> > - task_unlock(p);
> > + p = find_lock_task_mm(p);
> > + if (!p)
> > return 0;
> > - }
> > -
> > /*
> > * The baseline for the badness score is the proportion of RAM that each
> > * task's rss and swap space use.
> > @@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
> > *ppoints = 1000;
> > }
> >
> > - /*
> > - * skip kernel threads and tasks which have already released
> > - * their mm.
> > - */
> > - if (!p->mm)
> > - continue;
> > if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > continue;
>
> You can't do this for the reason I cited in another email, oom_badness()
> returning 0 does not exclude a task from being chosen by
> select_bad_process(), it will use that task if nothing else has been found
> yet. We must explicitly filter it from consideration by checking for
> !p->mm.
Yes, you are right. OK, oom_badness() can never return points < 0,
we can make it int and oom_badness() can return -1 if !mm. IOW,
- unsigned int points;
+ int points;
...
points = oom_badness(...);
if (points >= 0 && (points > *ppoints || !chosen))
chosen = p;
Oleg.
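As an illustration of the signed-points scheme sketched above, here is a minimal userspace model (the struct and helper names are stand-ins, not the kernel's): a task without an mm scores -1 and can never be chosen, while a zero score remains selectable when nothing better exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; illustration only. */
struct task { void *mm; int rss; };

/* Sketch of a signed oom_badness(): -1 means "no mm, never select". */
static int oom_badness(struct task *p)
{
	if (!p->mm)
		return -1;
	return p->rss;	/* baseline: proportional to memory use */
}

/* Sketch of the selection loop with the points >= 0 filter. */
static struct task *select_bad_process(struct task *tasks, size_t n)
{
	struct task *chosen = NULL;
	int ppoints = 0;

	for (size_t i = 0; i < n; i++) {
		int points = oom_badness(&tasks[i]);
		if (points >= 0 && (points > ppoints || !chosen)) {
			chosen = &tasks[i];
			ppoints = points;
		}
	}
	return chosen;
}
```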
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] oom: hold tasklist_lock when dumping tasks
2010-04-01 8:57 ` [patch -mm] oom: hold tasklist_lock when dumping tasks David Rientjes
@ 2010-04-01 14:27 ` Oleg Nesterov
2010-04-01 19:16 ` David Rientjes
0 siblings, 1 reply; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 14:27 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, linux-mm
On 04/01, David Rientjes wrote:
>
> dump_header() always requires tasklist_lock to be held because it calls
> dump_tasks() which iterates through the tasklist. There are a few places
> where this isn't maintained, so make sure tasklist_lock is always held
> whenever calling dump_header().
Looks correct, but I'd suggest you to update the changelog.
Not only dump_tasks() needs tasklist, oom_kill_process() needs it too
for list_for_each_entry(children).
You fixed this:
> @@ -724,8 +719,10 @@ void pagefault_out_of_memory(void)
>
> if (try_set_system_oom()) {
> constrained_alloc(NULL, 0, NULL, &totalpages);
> + read_lock(&tasklist_lock);
> err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
> "Out of memory (pagefault)");
> + read_unlock(&tasklist_lock);
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-03-31 23:48 ` David Rientjes
@ 2010-04-01 14:39 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 14:39 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 03/31, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > > Why? You ignored this part:
> > >
> > > Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> > > needs a page. So, you are saying that in this case __page_cache_alloc()
> > > can never trigger out_of_memory() ?
> > >
> > > why this is not possible?
> > >
> > > David, I am not arguing, I am asking.
> >
> > In case I wasn't clear...
> >
> > Yes, currently __oom_kill_task(p) is not possible if p->mm == NULL.
> >
> > But your patch adds
> >
> > if (fatal_signal_pending(current))
> > __oom_kill_task(current);
> >
> > into out_of_memory().
> >
>
> Ok, and it's possible during the tasklist scan if current is PF_EXITING
> and that gets passed to oom_kill_process(),
Yes, but this is harmless, afaics. The task is either current or it was
found by select_bad_process() under tasklist. This means it is safe to
use force_sig (but as I said, we should not use force_sig() anyway).
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> * its children or threads, just set TIF_MEMDIE so it can die quickly
> */
> if (p->flags & PF_EXITING) {
> - __oom_kill_task(p);
> + set_tsk_thread_flag(p, TIF_MEMDIE);
So, probably this makes sense anyway but not strictly necessary, up to you.
> if (fatal_signal_pending(current)) {
> - __oom_kill_task(current);
> + set_tsk_thread_flag(current, TIF_MEMDIE);
Yes, I think this fix is needed.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 8:25 ` David Rientjes
@ 2010-04-01 15:26 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 15:26 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/01, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > Why? You ignored this part:
> >
> > Say, right after exit_mm() we are doing acct_process(), and f_op->write()
> > needs a page. So, you are saying that in this case __page_cache_alloc()
> > can never trigger out_of_memory() ?
> >
> > why this is not possible?
> >
>
> It can, but the check for p->mm is sufficient since exit_notify()
Yes, but I meant out_of_memory()->__oom_kill_task(current). OK, we
already discussed this in the previous emails.
> We cannot rely on oom_badness() to filter this task because we still
> select it as our chosen task even with a badness score of 0 if !chosen
Yes, see another email from me.
> Your point about p->mm being non-NULL for kthreads using use_mm() is
> taken, we should probably just change the is_global_init() check in
> select_bad_process() to p->flags & PF_KTHREAD and ensure we reject
> oom_kill_process() for them.
Yes, but we have to check both is_global_init() and PF_KTHREAD.
The "patch" I sent checks PF_KTHREAD in find_lock_task_mm(), but as I
said select_bad_process() is the better place.
> > OK, a bad user does
> >
> > int sleep_forever(void *)
> > {
> > pause();
> > }
> >
> > int main(void)
> > {
> > pthread_create(sleep_forever);
> > syscall(__NR_exit);
> > }
> >
> > Now, every time select_bad_process() is called it will find this process
> > and PF_EXITING is true, so it just returns ERR_PTR(-1UL). And note that
> > this process is not going to exit.
> >
>
> Hmm, so it looks like we need to filter on !p->mm before checking for
> PF_EXITING so that tasks that are EXIT_ZOMBIE won't make the oom killer
> into a no-op.
As it was already discussed, it is not easy to check !p->mm. Once
again, we must not filter out the task just because its ->mm == NULL.
Probably the best change for now is
- if (p->flags & PF_EXITING) {
+ if (p->flags & PF_EXITING && p->mm) {
This is not perfect too, but much better.
> > > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > > Again, this is not right even if we forget about !child->mm check.
> > > > This list_for_each_entry() can only see the processes forked by the
> > > > main thread.
> > > >
> > >
> > > That's the intention.
> >
> > Why? shouldn't oom_badness() return the same result for any thread
> > in a thread group? We should take all children into account.
> >
>
> oom_forkbomb_penalty() only cares about first-descendant children that
> do not share the same memory,
I see, but the code doesn't really do this. I mean, it doesn't really
see the first-descendant children, only those which were forked by the
main thread.
Look. We have a main thread M and the sub-thread T. T forks a lot of
processes which use a lot of memory. These processes _are_ the first
descendant children of the M+T thread group, they should be accounted.
But M->children list is empty.
oom_forkbomb_penalty() and oom_kill_process() should do
t = tsk;
do {
list_for_each_entry(child, &t->children, sibling) {
... take child into account ...
}
} while_each_thread(tsk, t);
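Oleg's point about the circular thread list can be modeled in userspace (illustrative names, not kernel code): counting children via the main thread alone misses everything forked by sub-threads, while walking the whole group, as while_each_thread() does, finds them all.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins: each thread has a next pointer forming the
 * circular thread-group list, plus a count of children it forked. */
struct thr {
	struct thr *next_thread;	/* circular, like while_each_thread() */
	int n_children;			/* children forked by this thread */
};

/* Wrong: only the children of the main thread are seen. */
static int count_children_main_only(struct thr *tsk)
{
	return tsk->n_children;
}

/* Right: walk the whole thread group, like
 *	t = tsk; do { ... } while_each_thread(tsk, t);
 */
static int count_children_all_threads(struct thr *tsk)
{
	int total = 0;
	struct thr *t = tsk;

	do {
		total += t->n_children;
		t = t->next_thread;
	} while (t != tsk);

	return total;
}
```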
> > > > Hmm. Why oom_forkbomb_penalty() does thread_group_cputime() under
> > > > task_lock() ? It seems, ->alloc_lock() is only needed for get_mm_rss().
> > > >
> [...snip...]
> We need task_lock() to ensure child->mm hasn't detached between the check
> for child->mm == tsk->mm and get_mm_rss(child->mm). So I'm not sure what
> you're trying to improve with this variation, it's a tradeoff between
> calling thread_group_cputime() under task_lock() for a subset of a task's
> threads when we already need to hold task_lock() anyway vs. calling it for
> all threads unconditionally.
See the patch below. Yes, this is minor, but it is always good to avoid
the unnecessary locks, and thread_group_cputime() is O(N).
Not only for performance reasons. This allows to change the locking in
thread_group_cputime() if needed without fear to deadlock with task_lock().
Oleg.
--- x/mm/oom_kill.c
+++ x/mm/oom_kill.c
@@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
return 0;
list_for_each_entry(child, &tsk->children, sibling) {
struct task_cputime task_time;
- unsigned long runtime;
+ unsigned long runtime, this_rss;
task_lock(child);
if (!child->mm || child->mm == tsk->mm) {
task_unlock(child);
continue;
}
+ this_rss = get_mm_rss(child->mm);
+ task_unlock(child);
+
thread_group_cputime(child, &task_time);
runtime = cputime_to_jiffies(task_time.utime) +
cputime_to_jiffies(task_time.stime);
@@ -113,10 +116,9 @@ static unsigned long oom_forkbomb_penalt
* get to execute at all in such cases anyway.
*/
if (runtime < HZ) {
- child_rss += get_mm_rss(child->mm);
+ child_rss += this_rss;
forkcount++;
}
- task_unlock(child);
}
/*
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-04-01 8:32 ` David Rientjes
@ 2010-04-01 15:37 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-01 15:37 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/01, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > > That doesn't work for depraceted_mode (sic), you'd need to test for
> > > OOM_ADJUST_MIN and OOM_ADJUST_MAX in that case.
> >
> > Yes, probably "if (depraceted_mode)" should do more checks, I didn't try
> > to verify that MIN/MAX are correctly converted. I showed this code to explain
> > what I mean.
> >
>
> Ok, please cc me on the patch, it will be good to get rid of the duplicate
> code and remove oom_adj from struct signal_struct.
OK, great, will do tomorrow.
> Do we need ->siglock? Why can't we just do
>
> struct sighand_struct *sighand;
> struct signal_struct *sig;
>
> rcu_read_lock();
> sighand = rcu_dereference(task->sighand);
> if (!sighand) {
> rcu_read_unlock();
> return;
> }
> sig = task->signal;
>
> ... load/store to sig ...
>
> rcu_read_unlock();
No.
Before signals-make-task_struct-signal-immutable-refcountable.patch (actually,
series of patches), this can't work. ->signal is not protected by rcu, and
->sighand != NULL doesn't mean ->signal != NULL.
(yes, thread_group_cputime() is wrong too, but currently it is never called
lockless).
After signals-make-task_struct-signal-immutable-refcountable.patch, we do not
need any checks at all, it is always safe to use ->signal.
But. Unless we kill signal->oom_adj, we have another reason for ->siglock,
we can't update both oom_adj and oom_score_adj atomically, and if we race
with another thread they can be inconsistent wrt each other. Yes, oom_adj
is not actually used, except we report it back to user-space, but still.
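The consistency problem described here, two related fields drifting apart under a race, is the standard reason to update them under one lock; a userspace sketch with hypothetical names and an illustrative conversion (not the kernel's actual formula):

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical pair of values that must stay mutually consistent,
 * in the spirit of signal->oom_adj and signal->oom_score_adj. */
struct sig {
	pthread_mutex_t lock;
	int oom_adj;		/* legacy scale, kept only for reporting */
	int oom_score_adj;	/* new scale, the one actually used */
};

/* Illustrative conversion between the two scales. */
static int adj_to_score(int adj)
{
	return adj * 1000 / 17;
}

/* Both fields are updated under the same lock, so no reader or
 * concurrent writer can ever observe an inconsistent pair. */
static void set_oom_adj(struct sig *s, int adj)
{
	pthread_mutex_lock(&s->lock);
	s->oom_adj = adj;
	s->oom_score_adj = adj_to_score(adj);
	pthread_mutex_unlock(&s->lock);
}
```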
So, I am going to send 2 patches. The first one factors out the code
in base.c and kills signal->oom_adj, the next one removes ->siglock.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 14:39 ` Oleg Nesterov
@ 2010-04-01 18:58 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 18:58 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -459,7 +459,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * its children or threads, just set TIF_MEMDIE so it can die quickly
> > */
> > if (p->flags & PF_EXITING) {
> > - __oom_kill_task(p);
> > + set_tsk_thread_flag(p, TIF_MEMDIE);
>
> So, probably this makes sense anyway but not strictly necessary, up to you.
>
It matches the already-existing comment that only says we need to set
TIF_MEMDIE so it can quickly exit rather than call __oom_kill_task(), so
it seems worthwhile.
> > if (fatal_signal_pending(current)) {
> > - __oom_kill_task(current);
> > + set_tsk_thread_flag(current, TIF_MEMDIE);
>
> Yes, I think this fix is needed.
>
Ok, I'll add your acked-by and send this to Andrew with a follow-up that
consolidates __oom_kill_task() into oom_kill_task(), thanks.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH 1/1] oom: fix the unsafe usage of badness() in proc_oom_score()
2010-04-01 13:13 ` Oleg Nesterov
@ 2010-04-01 19:03 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 19:03 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Linus Torvalds, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel, stable
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> proc_oom_score(task) has a reference to task_struct, but that is all.
> If this task was already released before we take tasklist_lock
>
> - we can't use task->group_leader, it points to nowhere
>
> - it is not safe to call badness() even if this task is
> ->group_leader, has_intersects_mems_allowed() assumes
> it is safe to iterate over ->thread_group list.
>
> - even worse, badness() can hit ->signal == NULL
>
> Add the pid_alive() check to ensure __unhash_process() was not called.
>
> Also, use "task" instead of task->group_leader. badness() should return
> the same result for any sub-thread. Currently this is not true, but
> this should be changed anyway.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Andrew, this is 2.6.34 material and should be backported to stable. It's
not introduced by the recent oom killer rewrite pending in -mm, but it
will require a trivial merge resolution on that work.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj
2010-04-01 15:37 ` Oleg Nesterov
@ 2010-04-01 19:04 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 19:04 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> But. Unless we kill signal->oom_adj, we have another reason for ->siglock,
> we can't update both oom_adj and oom_score_adj atomically, and if we race
> with another thread they can be inconsistent wrt each other. Yes, oom_adj
> is not actually used, except we report it back to user-space, but still.
>
> So, I am going to send 2 patches. The first one factors out the code
> in base.c and kills signal->oom_adj, the next one removes ->siglock.
>
Great, thanks!
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 14:00 ` Oleg Nesterov
@ 2010-04-01 19:12 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 19:12 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > > @@ -159,13 +172,9 @@ unsigned int oom_badness(struct task_str
> > > if (p->flags & PF_OOM_ORIGIN)
> > > return 1000;
> > >
> > > - task_lock(p);
> > > - mm = p->mm;
> > > - if (!mm) {
> > > - task_unlock(p);
> > > + p = find_lock_task_mm(p);
> > > + if (!p)
> > > return 0;
> > > - }
> > > -
> > > /*
> > > * The baseline for the badness score is the proportion of RAM that each
> > > * task's rss and swap space use.
> > > @@ -330,12 +339,6 @@ static struct task_struct *select_bad_pr
> > > *ppoints = 1000;
> > > }
> > >
> > > - /*
> > > - * skip kernel threads and tasks which have already released
> > > - * their mm.
> > > - */
> > > - if (!p->mm)
> > > - continue;
> > > if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > > continue;
> >
> > You can't do this for the reason I cited in another email, oom_badness()
> > returning 0 does not exclude a task from being chosen by
> > select_bad_process(), it will use that task if nothing else has been found
> > yet. We must explicitly filter it from consideration by checking for
> > !p->mm.
>
> Yes, you are right. OK, oom_badness() can never return points < 0,
> we can make it int and oom_badness() can return -1 if !mm. IOW,
>
> - unsigned int points;
> + int points;
> ...
>
> points = oom_badness(...);
> if (points >= 0 && (points > *ppoints || !chosen))
> chosen = p;
>
oom_badness() and its predecessor badness() in mainline never return
negative scores, so I don't see the value in doing this; just filter the
task in select_bad_process() with !p->mm as it has always been done.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] oom: hold tasklist_lock when dumping tasks
2010-04-01 14:27 ` Oleg Nesterov
@ 2010-04-01 19:16 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-01 19:16 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, linux-mm
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > dump_header() always requires tasklist_lock to be held because it calls
> > dump_tasks() which iterates through the tasklist. There are a few places
> > where this isn't maintained, so make sure tasklist_lock is always held
> > whenever calling dump_header().
>
> Looks correct, but I'd suggest you to update the changelog.
>
> Not only dump_tasks() needs tasklist, oom_kill_process() needs it too
> for list_for_each_entry(children).
>
> You fixed this:
>
> > @@ -724,8 +719,10 @@ void pagefault_out_of_memory(void)
> >
> > if (try_set_system_oom()) {
> > constrained_alloc(NULL, 0, NULL, &totalpages);
> > + read_lock(&tasklist_lock);
> > err = oom_kill_process(current, 0, 0, 0, totalpages, NULL,
> > "Out of memory (pagefault)");
> > + read_unlock(&tasklist_lock);
>
It's required both for that and because oom_kill_process() can call
dump_header(), which is mentioned in the changelog, so I don't think any
update is needed.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-03-28 21:21 ` David Rientjes
@ 2010-04-02 10:17 ` Mel Gorman
-1 siblings, 0 replies; 197+ messages in thread
From: Mel Gorman @ 2010-04-02 10:17 UTC (permalink / raw)
To: David Rientjes
Cc: Oleg Nesterov, anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, linux-mm, linux-kernel
On Sun, Mar 28, 2010 at 02:21:01PM -0700, David Rientjes wrote:
> On Sun, 28 Mar 2010, Oleg Nesterov wrote:
>
> > I see. But still I can't understand. To me, the problem is not that
> > B can't exit, the problem is that A doesn't know it should exit. All
> > threads should exit and free ->mm. Even if B could exit, this is not
> > enough. And, to some extent, it doesn't matter if it holds mmap_sem
> > or not.
> >
> > Don't get me wrong. Even if I don't understand oom_kill.c the patch
> > looks obviously good to me, even from "common sense" pov. I am just
> > curious.
> >
> > So, my understanding is: we are going to kill the whole thread group
> > but TIF_MEMDIE is per-thread. Mark the whole thread group as TIF_MEMDIE
> > so that any thread can notice this flag and (say, __alloc_pages_slowpath)
> > fail asap.
> >
> > Is my understanding correct?
> >
>
> [Adding Mel Gorman <mel@csn.ul.ie> to the cc]
>
Sorry for the delay.
> The problem with this approach is that we could easily deplete all memory
> reserves if the oom killed task has an extremely large number of threads,
> there has always been only a single thread with TIF_MEMDIE set per cpuset
> or memcg; for systems that don't run with cpusets or memory controller,
> this has been limited to one thread with TIF_MEMDIE for the entire system.
>
> There's risk involved with suddenly allowing 1000 threads to have
> TIF_MEMDIE set and the chances of fully depleting all allowed zones is
> much higher if they allocate memory prior to exit, for example.
>
> An alternative is to fail allocations if they are failable and the
> allocating task has a pending SIGKILL. It's better to preempt the oom
> killer since current is going to be exiting anyway and this avoids a
> needless kill.
>
> That's possible if it's guaranteed that __GFP_NOFAIL allocations with a
> pending SIGKILL are granted ALLOC_NO_WATERMARKS to prevent them from
> endlessly looping while making no progress.
>
> Comments?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1610,13 +1610,21 @@ try_next_zone:
> }
>
> static inline int
> -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> {
> /* Do not loop if specifically requested */
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Loop if specifically requested */
> + if (gfp_mask & __GFP_NOFAIL)
> + return 1;
> +
Meh, you could have preserved the comment but no biggie.
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p))
> + return 0;
> +
Seems reasonable. This will be checked on every major loop in the
allocator slow path.
> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
> @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> - /*
> - * Don't let big-order allocations loop unless the caller
> - * explicitly requests that.
> - */
> - if (gfp_mask & __GFP_NOFAIL)
> - return 1;
> -
> return 0;
> }
>
> @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> if (!in_interrupt() &&
> ((p->flags & PF_MEMALLOC) ||
> + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
that all threads with a fatal signal pending can ignore watermarks. This
is dangerous because if 1000 threads get killed, there is a possibility
of deadlocking the system.
Why not obey the watermarks and just not retry the loop later and fail
the allocation?
> unlikely(test_thread_flag(TIF_MEMDIE))))
> alloc_flags |= ALLOC_NO_WATERMARKS;
> }
> @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int migratetype)
> {
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> struct page *page = NULL;
> int alloc_flags;
> unsigned long pages_reclaimed = 0;
> @@ -1876,7 +1879,7 @@ rebalance:
> goto nopage;
>
> /* Avoid allocations with no watermarks from looping endlessly */
> - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> goto nopage;
>
> /* Try direct reclaim and then allocating */
> @@ -1888,6 +1891,10 @@ rebalance:
> if (page)
> goto got_pg;
>
> + /* Task is killed, fail the allocation if possible */
> + if (fatal_signal_pending(p) && !nofail)
> + goto nopage;
> +
Again, I would expect this to be caught by should_alloc_retry().
> /*
> * If we failed to make any progress reclaiming, then we are
> * running out of options and have to consider going OOM
> @@ -1909,8 +1916,7 @@ rebalance:
> * made, there are no other options and retrying is
> * unlikely to help.
> */
> - if (order > PAGE_ALLOC_COSTLY_ORDER &&
> - !(gfp_mask & __GFP_NOFAIL))
> + if (order > PAGE_ALLOC_COSTLY_ORDER && !nofail)
> goto nopage;
>
> goto restart;
> @@ -1919,7 +1925,7 @@ rebalance:
>
> /* Check if we should retry the allocation */
> pages_reclaimed += did_some_progress;
> - if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> + if (should_alloc_retry(p, gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
>
I'm ok with the should_alloc_retry() change but am a lot less ok with ignoring
watermarks just because a fatal signal is pending and I think the nofail
changes to __alloc_pages_slowpath() are unnecessary as should_alloc_retry()
should end up failing the allocations.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 19:12 ` David Rientjes
@ 2010-04-02 11:14 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 11:14 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/01, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > > You can't do this for the reason I cited in another email, oom_badness()
> > > returning 0 does not exclude a task from being chosen by
> > > select_bad_process(), it will use that task if nothing else has been found
> > > yet. We must explicitly filter it from consideration by checking for
> > > !p->mm.
> >
> > Yes, you are right. OK, oom_badness() can never return points < 0,
> > we can make it int and oom_badness() can return -1 if !mm. IOW,
> >
> > - unsigned int points;
> > + int points;
> > ...
> >
> > points = oom_badness(...);
> > if (points >= 0 && (points > *ppoints || !chosen))
> > chosen = p;
> >
>
> oom_badness() and its predecessor badness() in mainline never return
> negative scores, so I don't see the value in doing this; just filter the
> task in select_bad_process() with !p->mm as it has always been done.
David, you continue to ignore my arguments ;) select_bad_process()
must not filter out the tasks with ->mm == NULL.
Once again:
void *memory_hog_thread(void *arg)
{
for (;;)
malloc(A_LOT);
}
int main(void)
{
pthread_create(memory_hog_thread, ...);
syscall(__NR_exit, 0);
}
Now, even if we fix PF_EXITING check, select_bad_process() will always
ignore this process. The group leader has ->mm == NULL.
See?
That is why I think we need something like find_lock_task_mm() in the
pseudo-patch I sent.
Or I missed something?
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm 0/4] oom: linux has threads
2010-04-02 11:14 ` Oleg Nesterov
@ 2010-04-02 18:30 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:30 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
On 04/02, Oleg Nesterov wrote:
>
> Once again:
>
> void *memory_hog_thread(void *arg)
> {
> for (;;)
> malloc(A_LOT);
> }
>
> int main(void)
> {
> pthread_create(memory_hog_thread, ...);
> syscall(__NR_exit, 0);
> }
>
> Now, even if we fix PF_EXITING check, select_bad_process() will always
> ignore this process. The group leader has ->mm == NULL.
So. Please see the COMPLETELY UNTESTED patches I am sending. They need
your review, or feel free to redo these fixes. 4/4 is a bit off-topic.
Also, please note the "This patch is not enough" comment in 3/4.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm 1/4] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
2010-04-02 18:30 ` Oleg Nesterov
@ 2010-04-02 18:31 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:31 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
select_bad_process() thinks a kernel thread can't have ->mm != NULL,
this is not true due to use_mm().
Change the code to check PF_KTHREAD.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- MM/mm/oom_kill.c~1_FLITER_OUT_KTHREADS 2010-03-31 17:47:14.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 18:51:05.000000000 +0200
@@ -290,8 +290,8 @@ static struct task_struct *select_bad_pr
for_each_process(p) {
unsigned int points;
- /* skip the init task */
- if (is_global_init(p))
+ /* skip the init task and kthreads */
+ if (is_global_init(p) || (p->flags & PF_KTHREAD))
continue;
if (mem && !task_in_mem_cgroup(p, mem))
continue;
@@ -331,8 +331,7 @@ static struct task_struct *select_bad_pr
}
/*
- * skip kernel threads and tasks which have already released
- * their mm.
+ * skip the tasks which have already released their mm.
*/
if (!p->mm)
continue;
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account
2010-04-02 18:30 ` Oleg Nesterov
@ 2010-04-02 18:32 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:32 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
select_bad_process() checks PF_EXITING to detect the task which
is going to release its memory, but the logic is very wrong.
- a single process P with the dead group leader disables
select_bad_process() completely, it will always return
ERR_PTR() while P can live forever
- if the PF_EXITING task has already released its ->mm
it doesn't make sense to expect it is going to free
more memory (except task_struct/etc)
Change the code to ignore the PF_EXITING tasks without ->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- MM/mm/oom_kill.c~2_FIX_PF_EXITING 2010-04-02 18:51:05.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 18:58:37.000000000 +0200
@@ -322,7 +322,7 @@ static struct task_struct *select_bad_pr
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if (p->flags & PF_EXITING) {
+ if ((p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm 3/4] oom: introduce find_lock_task_mm() to fix !mm false positives
2010-04-02 18:30 ` Oleg Nesterov
@ 2010-04-02 18:32 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:32 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
Almost all ->mm == NULL checks in oom_kill.c are wrong.
The current code assumes that the task without ->mm has already
released its memory and ignores the process. However this is not
necessarily true when this process is multithreaded, other live
sub-threads can use this ->mm.
- Remove the "if (!p->mm)" check in select_bad_process(), it is
just wrong.
- Add the new helper, find_lock_task_mm(), which finds the live
thread which uses the memory and takes task_lock() to pin ->mm
- change oom_badness() to use this helper instead of just checking
->mm != NULL.
- As David pointed out, select_bad_process() must never choose the
task without ->mm, but no matter what oom_badness() returns the
task can be chosen if nothing else has been found yet.
Change oom_badness() to return int, change it to return -1 if
find_lock_task_mm() fails, and change select_bad_process() to
check points >= 0.
Note! This patch is not enough, we need more changes.
- oom_badness() was fixed, but oom_kill_task() still ignores
the task without ->mm
- oom_forkbomb_penalty() should use find_lock_task_mm() too,
and it also needs other changes to actually find the first
first-descendant children
This will be addressed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 2 +-
mm/oom_kill.c | 39 +++++++++++++++++++++------------------
2 files changed, 22 insertions(+), 19 deletions(-)
--- MM/include/linux/oom.h~3_FIX_MM_CHECKS 2010-03-31 17:47:14.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-02 19:14:05.000000000 +0200
@@ -40,7 +40,7 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern unsigned int oom_badness(struct task_struct *p,
+extern int oom_badness(struct task_struct *p,
unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- MM/mm/oom_kill.c~3_FIX_MM_CHECKS 2010-04-02 18:58:37.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 19:55:46.000000000 +0200
@@ -69,6 +69,19 @@ static bool has_intersects_mems_allowed(
return false;
}
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ do {
+ task_lock(t);
+ if (likely(t->mm))
+ return t;
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return NULL;
+}
+
/*
* Tasks that fork a very large number of children with seperate address spaces
* may be the result of a bug, user error, malicious applications, or even those
@@ -139,10 +152,9 @@ static unsigned long oom_forkbomb_penalt
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom conditions.
*/
-unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+int oom_badness(struct task_struct *p, unsigned long totalpages,
unsigned long uptime)
{
- struct mm_struct *mm;
int points;
/*
@@ -159,19 +171,15 @@ unsigned int oom_badness(struct task_str
if (p->flags & PF_OOM_ORIGIN)
return 1000;
- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
- return 0;
- }
-
+ p = find_lock_task_mm(p);
+ if (!p)
+ return -1;
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
*/
- points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
- totalpages;
+ points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) *
+ 1000 / totalpages;
task_unlock(p);
points += oom_forkbomb_penalty(p);
@@ -288,7 +296,7 @@ static struct task_struct *select_bad_pr
do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned int points;
+ int points;
/* skip the init task and kthreads */
if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -330,16 +338,11 @@ static struct task_struct *select_bad_pr
*ppoints = 1000;
}
- /*
- * skip the tasks which have already released their mm.
- */
- if (!p->mm)
- continue;
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
continue;
points = oom_badness(p, totalpages, uptime.tv_sec);
- if (points > *ppoints || !chosen) {
+ if (points >= 0 && (points > *ppoints || !chosen)) {
chosen = p;
*ppoints = points;
}
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm 3/4] oom: introduce find_lock_task_mm() to fix !mm false positives
@ 2010-04-02 18:32 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:32 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
Almost all ->mm == NULL checks in oom_kill.c are wrong.
The current code assumes that a task without ->mm has already
released its memory and ignores the process. However, this is not
necessarily true when the process is multithreaded: other live
sub-threads can still use this ->mm.
- Remove the "if (!p->mm)" check in select_bad_process(), it is
just wrong.
- Add the new helper, find_lock_task_mm(), which finds the live
thread which uses the memory and takes task_lock() to pin ->mm
- change oom_badness() to use this helper instead of just checking
->mm != NULL.
- As David pointed out, select_bad_process() must never choose the
task without ->mm, but no matter what oom_badness() returns the
task can be chosen if nothing else has been found yet.
Change oom_badness() to return int, change it to return -1 if
find_lock_task_mm() fails, and change select_bad_process() to
check points >= 0.
Note! This patch is not enough, we need more changes.
- oom_badness() was fixed, but oom_kill_task() still ignores
the task without ->mm
- oom_forkbomb_penalty() should use find_lock_task_mm() too,
and it also needs other changes to actually find the
first-descendant children
This will be addressed later.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 2 +-
mm/oom_kill.c | 39 +++++++++++++++++++++------------------
2 files changed, 22 insertions(+), 19 deletions(-)
--- MM/include/linux/oom.h~3_FIX_MM_CHECKS 2010-03-31 17:47:14.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-02 19:14:05.000000000 +0200
@@ -40,7 +40,7 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern unsigned int oom_badness(struct task_struct *p,
+extern int oom_badness(struct task_struct *p,
unsigned long totalpages, unsigned long uptime);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- MM/mm/oom_kill.c~3_FIX_MM_CHECKS 2010-04-02 18:58:37.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 19:55:46.000000000 +0200
@@ -69,6 +69,19 @@ static bool has_intersects_mems_allowed(
return false;
}
+static struct task_struct *find_lock_task_mm(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ do {
+ task_lock(t);
+ if (likely(t->mm))
+ return t;
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return NULL;
+}
+
/*
* Tasks that fork a very large number of children with seperate address spaces
* may be the result of a bug, user error, malicious applications, or even those
@@ -139,10 +152,9 @@ static unsigned long oom_forkbomb_penalt
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom conditions.
*/
-unsigned int oom_badness(struct task_struct *p, unsigned long totalpages,
+int oom_badness(struct task_struct *p, unsigned long totalpages,
unsigned long uptime)
{
- struct mm_struct *mm;
int points;
/*
@@ -159,19 +171,15 @@ unsigned int oom_badness(struct task_str
if (p->flags & PF_OOM_ORIGIN)
return 1000;
- task_lock(p);
- mm = p->mm;
- if (!mm) {
- task_unlock(p);
- return 0;
- }
-
+ p = find_lock_task_mm(p);
+ if (!p)
+ return -1;
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
*/
- points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
- totalpages;
+ points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) *
+ 1000 / totalpages;
task_unlock(p);
points += oom_forkbomb_penalty(p);
@@ -288,7 +296,7 @@ static struct task_struct *select_bad_pr
do_posix_clock_monotonic_gettime(&uptime);
for_each_process(p) {
- unsigned int points;
+ int points;
/* skip the init task and kthreads */
if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -330,16 +338,11 @@ static struct task_struct *select_bad_pr
*ppoints = 1000;
}
- /*
- * skip the tasks which have already released their mm.
- */
- if (!p->mm)
- continue;
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
continue;
points = oom_badness(p, totalpages, uptime.tv_sec);
- if (points > *ppoints || !chosen) {
+ if (points >= 0 && (points > *ppoints || !chosen)) {
chosen = p;
*ppoints = points;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: &lt;dont@kvack.org&gt;
^ permalink raw reply [flat|nested] 197+ messages in thread
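The core of the patch above is the thread-group walk: take each thread's task_lock() in turn, and keep the lock held only if that thread still owns an ->mm. The same pattern can be sketched in self-contained userspace C — a toy model, not kernel code: pthread mutexes stand in for task_lock(), a bare pointer for ->mm, and all names here are illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-ins for task_struct and ->mm; not kernel API. */
struct task {
	pthread_mutex_t lock;	/* plays the role of task_lock() */
	void *mm;		/* NULL once this thread has released its memory */
	struct task *next;	/* circular thread-group list */
};

/*
 * Walk the thread group starting at p; return the first live thread
 * that still owns an mm, with its lock held, or NULL if none does.
 * Mirrors the do { ... } while_each_thread(p, t) shape of the patch.
 */
static struct task *find_lock_task_mm(struct task *p)
{
	struct task *t = p;

	do {
		pthread_mutex_lock(&t->lock);
		if (t->mm)
			return t;	/* caller must unlock when done */
		pthread_mutex_unlock(&t->lock);
		t = t->next;
	} while (t != p);

	return NULL;
}
```

The caller is responsible for dropping the returned thread's lock, exactly as oom_badness() does with task_unlock(p) after reading p->mm.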
* [PATCH -mm 4/4] oom: oom_forkbomb_penalty: move thread_group_cputime() out of task_lock()
2010-04-02 18:30 ` Oleg Nesterov
@ 2010-04-02 18:33 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 18:33 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
It doesn't make sense to call thread_group_cputime() under task_lock();
we can drop the lock right after we read get_mm_rss() and save the
value in a local variable.
Note: it probably makes more sense to use sum_exec_runtime instead
of utime + stime, since it is much more precise. A task can eat a lot of
CPU time while its utime/stime can still be zero.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
mm/oom_kill.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
--- MM/mm/oom_kill.c~4_FORKBOMB_DROP_TASK_LOCK_EARLIER 2010-04-02 19:55:46.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-02 20:16:13.000000000 +0200
@@ -110,13 +110,16 @@ static unsigned long oom_forkbomb_penalt
return 0;
list_for_each_entry(child, &tsk->children, sibling) {
struct task_cputime task_time;
- unsigned long runtime;
+ unsigned long runtime, rss;
task_lock(child);
if (!child->mm || child->mm == tsk->mm) {
task_unlock(child);
continue;
}
+ rss = get_mm_rss(child->mm);
+ task_unlock(child);
+
thread_group_cputime(child, &task_time);
runtime = cputime_to_jiffies(task_time.utime) +
cputime_to_jiffies(task_time.stime);
@@ -126,10 +129,9 @@ static unsigned long oom_forkbomb_penalt
* get to execute at all in such cases anyway.
*/
if (runtime < HZ) {
- child_rss += get_mm_rss(child->mm);
+ child_rss += rss;
forkcount++;
}
- task_unlock(child);
}
/*
^ permalink raw reply [flat|nested] 197+ messages in thread
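The change above follows a common locking idiom: snapshot the locked data into a local variable, drop the lock, then do the expensive call (here thread_group_cputime()) outside the critical section. A minimal userspace sketch of that idiom, with toy names rather than kernel code:

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative analog of the 4/4 patch: snapshot shared state under a
 * short critical section, then do slow work with the lock dropped. */
struct counter {
	pthread_mutex_t lock;
	unsigned long value;	/* stands in for get_mm_rss(child->mm) */
};

static unsigned long expensive_work(unsigned long v)
{
	/* stands in for thread_group_cputime(); must not run under the lock */
	unsigned long r = 0;

	for (unsigned long i = 0; i < 1000; i++)
		r += v;
	return r;
}

static unsigned long snapshot_then_work(struct counter *c)
{
	unsigned long v;

	pthread_mutex_lock(&c->lock);
	v = c->value;			/* copy out while pinned */
	pthread_mutex_unlock(&c->lock);

	return expensive_work(v);	/* lock no longer held */
}
```

Shrinking the critical section this way reduces contention without changing the computed result, which is exactly the effect of moving thread_group_cputime() out of task_lock().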
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-02 11:14 ` Oleg Nesterov
@ 2010-04-02 19:02 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 19:02 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Fri, 2 Apr 2010, Oleg Nesterov wrote:
> > > Yes, you are right. OK, oom_badness() can never return points < 0,
> > > we can make it int and oom_badness() can return -1 if !mm. IOW,
> > >
> > > - unsigned int points;
> > > + int points;
> > > ...
> > >
> > > points = oom_badness(...);
> > > if (points >= 0 && (points > *ppoints || !chosen))
> > > chosen = p;
> > >
> >
> > oom_badness() and its predecessor badness() in mainline never return
> > negative scores, so I don't see the value in doing this; just filter the
> > task in select_bad_process() with !p->mm as it has always been done.
>
> David, you continue to ignore my arguments ;) select_bad_process()
> must not filter out the tasks with ->mm == NULL.
>
> Once again:
>
> void *memory_hog_thread(void *arg)
> {
> for (;;)
> malloc(A_LOT);
> }
>
> int main(void)
> {
> pthread_create(memory_hog_thread, ...);
> syscall(__NR_exit, 0);
> }
>
> Now, even if we fix PF_EXITING check, select_bad_process() will always
> ignore this process. The group leader has ->mm == NULL.
>
> See?
>
> That is why I think we need something like find_lock_task_mm() in the
> pseudo-patch I sent.
>
I'm not ignoring your arguments; I think you're ignoring what I'm
responding to. I prefer to keep oom_badness() in a positive range, as
it always has been (and /proc/pid/oom_score has always used an unsigned
qualifier), so I disagree that we need to change oom_badness() to return
anything other than 0 for such tasks. We need to filter them explicitly
in select_bad_process() instead, so please do this there.
^ permalink raw reply [flat|nested] 197+ messages in thread
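Oleg's memory_hog fragment above is pseudocode (pthread_create() takes more arguments, and an unbounded malloc() loop would genuinely exhaust memory). The structural point it makes — the old `!p->mm` test looks only at the group leader, while the memory can be owned by a surviving sub-thread — can be modeled in a few lines of self-contained C; every name here is illustrative, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the scenario being argued: a thread group whose leader
 * has already exited (leader mm == NULL) while another thread still
 * owns the memory. */
struct thread { void *mm; };

struct process {
	struct thread threads[8];
	int nr_threads;
};

/* Old-style check: look only at the group leader (threads[0]). */
static int old_check_skips(const struct process *p)
{
	return p->threads[0].mm == NULL;	/* "if (!p->mm) continue;" */
}

/* New-style check: the process still uses memory if any thread does. */
static int has_live_mm(const struct process *p)
{
	for (int i = 0; i < p->nr_threads; i++)
		if (p->threads[i].mm)
			return 1;
	return 0;
}
```

For the memory-hog shape — leader's mm gone, one sub-thread still allocating — old_check_skips() returns true while has_live_mm() returns true, which is precisely why select_bad_process() would never pick the process under the old filter.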
* Re: [PATCH -mm 4/4] oom: oom_forkbomb_penalty: move thread_group_cputime() out of task_lock()
2010-04-02 18:33 ` Oleg Nesterov
@ 2010-04-02 19:04 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 19:04 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Fri, 2 Apr 2010, Oleg Nesterov wrote:
> It doesn't make sense to call thread_group_cputime() under task_lock(),
> we can drop this lock right after we read get_mm_rss() and save the
> value in the local variable.
>
> Note: probably it makes more sense to use sum_exec_runtime instead
> of utime + stime, it is much more precise. A task can eat a lot of
> CPU time, but its Xtime can be zero.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm 1/4] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads
2010-04-02 18:31 ` Oleg Nesterov
@ 2010-04-02 19:05 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 19:05 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Fri, 2 Apr 2010, Oleg Nesterov wrote:
> select_bad_process() thinks a kernel thread can't have ->mm != NULL,
> this is not true due to use_mm().
>
> Change the code to check PF_KTHREAD.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-02 19:02 ` David Rientjes
@ 2010-04-02 19:14 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 19:14 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/02, David Rientjes wrote:
>
> On Fri, 2 Apr 2010, Oleg Nesterov wrote:
>
> > David, you continue to ignore my arguments ;) select_bad_process()
> > must not filter out the tasks with ->mm == NULL.
> >
> I'm not ignoring your arguments, I think you're ignoring what I'm
> responding to.
Ah, sorry, I misunderstood your replies.
> I prefer to keep oom_badness() to be a positive range as
> it always has been (and /proc/pid/oom_score has always used an unsigned
> qualifier),
Yes, I thought about /proc/pid/oom_score, but imho this is a minor issue.
We can s/%lu/%ld/ though, or just report 0 if oom_badness() returns -1.
Or something.
> so I disagree that we need to change oom_badness() to return
> anything other than 0 for such tasks. We need to filter them explicitly
> in select_bad_process() instead, so please do this there.
The problem is, we need task_lock() to pin ->mm. Or, we can change
find_lock_task_mm() to do get_task_mm() and return mm_struct *.
But then oom_badness() (and proc_oom_score!) needs many more changes:
it needs a new "struct mm_struct *mm" argument which is not necessarily
equal to p->mm.
So, I can't agree.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-02 19:14 ` Oleg Nesterov
@ 2010-04-02 19:46 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 19:46 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Fri, 2 Apr 2010, Oleg Nesterov wrote:
> > > David, you continue to ignore my arguments ;) select_bad_process()
> > > must not filter out the tasks with ->mm == NULL.
> > >
> > I'm not ignoring your arguments, I think you're ignoring what I'm
> > responding to.
>
> Ah, sorry, I misunderstood your replies.
>
> > I prefer to keep oom_badness() to be a positive range as
> > it always has been (and /proc/pid/oom_score has always used an unsigned
> > qualifier),
>
> Yes, I thought about /proc/pid/oom_score, but imho this is minor issue.
> We can s/%lu/%ld/ though, or just report 0 if oom_badness() returns -1.
> Or something.
>
Just have it return 0, meaning never kill, and then ensure "chosen" is
never set for an oom_badness() of 0, even if we don't have another task to
kill. That's how Documentation/filesystems/proc.txt describes it anyway.
^ permalink raw reply [flat|nested] 197+ messages in thread
* [patch -mm] oom: exclude tasks with badness score of 0 from being selected
2010-04-02 19:46 ` David Rientjes
@ 2010-04-02 19:54 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 19:54 UTC (permalink / raw)
To: Oleg Nesterov, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
An oom_badness() score of 0 means "never kill" according to
Documentation/filesystems/proc.txt, so explicitly exclude it from being
selected for kill. These tasks have either detached their p->mm or are
set to OOM_DISABLE.
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -336,6 +336,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
continue;
points = oom_badness(p, totalpages);
+ if (!points)
+ continue;
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-02 19:46 ` David Rientjes
@ 2010-04-02 20:55 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 20:55 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/02, David Rientjes wrote:
>
> On Fri, 2 Apr 2010, Oleg Nesterov wrote:
> >
> > > I prefer to keep oom_badness() to be a positive range as
> > > it always has been (and /proc/pid/oom_score has always used an unsigned
> > > qualifier),
> >
> > Yes, I thought about /proc/pid/oom_score, but imho this is minor issue.
> > We can s/%lu/%ld/ though, or just report 0 if oom_badness() returns -1.
> > Or something.
>
> Just have it return 0, meaning never kill, and then ensure "chosen" is
> never set for an oom_badness() of 0, even if we don't have another task to
> kill. That's how Documentation/filesystems/proc.txt describes it anyway.
OK, agreed, this makes more sense and is cleaner. I misunderstood you even
more before.
Thanks, I'll redo/resend 3/4.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] oom: exclude tasks with badness score of 0 from being selected
2010-04-02 19:54 ` David Rientjes
@ 2010-04-02 21:04 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-02 21:04 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/02, David Rientjes wrote:
>
> An oom_badness() score of 0 means "never kill" according to
> Documentation/filesystems/proc.txt, so explicitly exclude it from being
> selected for kill. These tasks have either detached their p->mm or are
> set to OOM_DISABLE.
Agreed, but
> @@ -336,6 +336,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> continue;
>
> points = oom_badness(p, totalpages);
> + if (!points)
> + continue;
> if (points > *ppoints || !chosen) {
then "|| !chosen" can be killed.
with this patch !chosen <=> !*ppoints, and since points > 0
if (points > *ppoints) {
is enough.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
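Putting David's `!points` filter together with Oleg's observation, the resulting selection loop can be sketched in self-contained userspace C (a toy model, not the kernel function): once a score of 0 means "never kill" and `*ppoints` starts at 0, only positive scores can win, so `points > *ppoints` alone picks the first highest-scoring task and `|| !chosen` is indeed redundant.

```c
#include <assert.h>

/* Toy version of select_bad_process() after the v2 patch. A badness
 * score of 0 means "never kill"; returns the index of the chosen
 * "task", or -1 if every task was unkillable. */
static int select_bad_index(const unsigned int *badness, int n,
			    unsigned int *ppoints)
{
	int chosen = -1;

	*ppoints = 0;
	for (int i = 0; i < n; i++) {
		unsigned int points = badness[i];

		if (!points)		/* 0 == never kill: skip entirely */
			continue;
		if (points > *ppoints) {	/* no "|| !chosen" needed */
			chosen = i;
			*ppoints = points;
		}
	}
	return chosen;
}
```

Because the comparison is strict, ties keep the first task encountered, matching the for_each_process() iteration order in the real loop.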
* [patch -mm v2] oom: exclude tasks with badness score of 0 from being selected
2010-04-02 21:04 ` Oleg Nesterov
@ 2010-04-02 21:22 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-02 21:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, linux-mm, linux-kernel
An oom_badness() score of 0 means "never kill" according to
Documentation/filesystems/proc.txt, so exclude it from being selected for
kill. These tasks have either detached their p->mm or are set to
OOM_DISABLE.
Also removes an unnecessary initialization of points to 0 in
mem_cgroup_out_of_memory(), select_bad_process() does this already.
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
mm/oom_kill.c | 13 ++-----------
1 files changed, 2 insertions(+), 11 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -326,17 +326,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*ppoints = 1000;
}
- /*
- * skip kernel threads and tasks which have already released
- * their mm.
- */
- if (!p->mm)
- continue;
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- continue;
-
points = oom_badness(p, totalpages);
- if (points > *ppoints || !chosen) {
+ if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
@@ -478,7 +469,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long limit;
- unsigned int points = 0;
+ unsigned int points;
struct task_struct *p;
if (sysctl_panic_on_oom == 2)
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-04-02 10:17 ` Mel Gorman
@ 2010-04-04 23:26 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-04 23:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Oleg Nesterov, anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, linux-mm, linux-kernel
On Fri, 2 Apr 2010, Mel Gorman wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1610,13 +1610,21 @@ try_next_zone:
> > }
> >
> > static inline int
> > -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> > unsigned long pages_reclaimed)
> > {
> > /* Do not loop if specifically requested */
> > if (gfp_mask & __GFP_NORETRY)
> > return 0;
> >
> > + /* Loop if specifically requested */
> > + if (gfp_mask & __GFP_NOFAIL)
> > + return 1;
> > +
>
> Meh, you could have preserved the comment but no biggie.
>
I'll remember to preserve it when it's proposed.
> > + /* Task is killed, fail the allocation if possible */
> > + if (fatal_signal_pending(p))
> > + return 0;
> > +
>
> Seems reasonable. This will be checked on every major loop in the
> allocator slow path.
>
> > /*
> > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > * means __GFP_NOFAIL, but that may not be true in other
> > @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > return 1;
> >
> > - /*
> > - * Don't let big-order allocations loop unless the caller
> > - * explicitly requests that.
> > - */
> > - if (gfp_mask & __GFP_NOFAIL)
> > - return 1;
> > -
> > return 0;
> > }
> >
> > @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > if (!in_interrupt() &&
> > ((p->flags & PF_MEMALLOC) ||
> > + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
>
> This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
> that all threads with a fatal signal pending can ignore watermarks. This
> is dangerous because if 1000 threads get killed, there is a possibility
> of deadlocking the system.
>
I don't quite understand the comment: this is only for __GFP_NOFAIL
allocations, which you say are rare, so a large number of threads won't
be doing this simultaneously.
> Why not obey the watermarks and just not retry the loop later and fail
> the allocation?
>
The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL))
essentially oom kills p without invoking the oom killer before direct
reclaim is invoked. We know it has a pending SIGKILL and wants to exit,
so we allow it to allocate beyond the min watermark to avoid costly
reclaim or needlessly killing another task.
> > unlikely(test_thread_flag(TIF_MEMDIE))))
> > alloc_flags |= ALLOC_NO_WATERMARKS;
> > }
> > @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > int migratetype)
> > {
> > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> > struct page *page = NULL;
> > int alloc_flags;
> > unsigned long pages_reclaimed = 0;
> > @@ -1876,7 +1879,7 @@ rebalance:
> > goto nopage;
> >
> > /* Avoid allocations with no watermarks from looping endlessly */
> > - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> > goto nopage;
> >
> > /* Try direct reclaim and then allocating */
> > @@ -1888,6 +1891,10 @@ rebalance:
> > if (page)
> > goto got_pg;
> >
> > + /* Task is killed, fail the allocation if possible */
> > + if (fatal_signal_pending(p) && !nofail)
> > + goto nopage;
> > +
>
> Again, I would expect this to be caught by should_alloc_retry().
>
It is, but only after the oom killer is called. We don't want to
needlessly kill another task here when p has already been killed but may
not be PF_EXITING yet.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-03-31 7:08 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
2010-03-31 7:08 ` KAMEZAWA Hiroyuki
2010-03-31 8:04 ` Balbir Singh
@ 2010-04-04 23:28 ` David Rientjes
2010-04-05 21:30 ` Andrew Morton
2 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-04 23:28 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
On Wed, 31 Mar 2010, David Rientjes wrote:
> It's pointless to try to kill current if select_bad_process() did not
> find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> guaranteed that current is a member of the memcg that is oom and it is,
> by definition, unkillable.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
> mm/oom_kill.c | 5 +----
> 1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> read_lock(&tasklist_lock);
> retry:
> p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> - if (PTR_ERR(p) == -1UL)
> + if (!p || PTR_ERR(p) == -1UL)
> goto out;
>
> - if (!p)
> - p = current;
> -
> if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> "Memory cgroup out of memory"))
> goto retry;
>
Are there any objections to merging this? It's pretty straightforward
given the fact that oom_kill_process() would fail if select_bad_process()
returns NULL even if p is set to current since it was not found to be
eligible during the tasklist scan.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH] oom killer: break from infinite loop
2010-04-04 23:26 ` David Rientjes
@ 2010-04-05 10:47 ` Mel Gorman
0 siblings, 0 replies; 197+ messages in thread
From: Mel Gorman @ 2010-04-05 10:47 UTC (permalink / raw)
To: David Rientjes
Cc: Oleg Nesterov, anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, linux-mm, linux-kernel
On Sun, Apr 04, 2010 at 04:26:38PM -0700, David Rientjes wrote:
> On Fri, 2 Apr 2010, Mel Gorman wrote:
>
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1610,13 +1610,21 @@ try_next_zone:
> > > }
> > >
> > > static inline int
> > > -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> > > unsigned long pages_reclaimed)
> > > {
> > > /* Do not loop if specifically requested */
> > > if (gfp_mask & __GFP_NORETRY)
> > > return 0;
> > >
> > > + /* Loop if specifically requested */
> > > + if (gfp_mask & __GFP_NOFAIL)
> > > + return 1;
> > > +
> >
> > Meh, you could have preserved the comment but no biggie.
> >
>
> I'll remember to preserve it when it's proposed.
>
> > > + /* Task is killed, fail the allocation if possible */
> > > + if (fatal_signal_pending(p))
> > > + return 0;
> > > +
> >
> > Seems reasonable. This will be checked on every major loop in the
> > allocator slow path.
> >
> > > /*
> > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > > * means __GFP_NOFAIL, but that may not be true in other
> > > @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > > return 1;
> > >
> > > - /*
> > > - * Don't let big-order allocations loop unless the caller
> > > - * explicitly requests that.
> > > - */
> > > - if (gfp_mask & __GFP_NOFAIL)
> > > - return 1;
> > > -
> > > return 0;
> > > }
> > >
> > > @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > > if (!in_interrupt() &&
> > > ((p->flags & PF_MEMALLOC) ||
> > > + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> >
> > This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
> > that all threads with a fatal signal pending can ignore watermarks. This
> > is dangerous because if 1000 threads get killed, there is a possibility
> > of deadlocking the system.
> >
>
> I don't quite understand the comment, this is only for __GFP_NOFAIL
> allocations, which you say are rare, so a large number of threads won't be
> doing this simultaneously.
>
> > Why not obey the watermarks and just not retry the loop later and fail
> > the allocation?
> >
>
> The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL))
> essentially oom kills p without invoking the oom killer before direct
> reclaim is invoked. We know it has a pending SIGKILL and wants to exit,
> so we allow it to allocate beyond the min watermark to avoid costly
> reclaim or needlessly killing another task.
>
Sorry, I typod.
GFP_NOFAIL is rare but this is basically saying that all threads with a
fatal signal and using NOFAIL can ignore watermarks.
I don't think any caller in an exit path will be using GFP_NOFAIL, as its
most common user is file-system related, but it still feels unnecessary
to check this case on every call to the slow path.
> > > unlikely(test_thread_flag(TIF_MEMDIE))))
> > > alloc_flags |= ALLOC_NO_WATERMARKS;
> > > }
> > > @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > int migratetype)
> > > {
> > > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> > > struct page *page = NULL;
> > > int alloc_flags;
> > > unsigned long pages_reclaimed = 0;
> > > @@ -1876,7 +1879,7 @@ rebalance:
> > > goto nopage;
> > >
> > > /* Avoid allocations with no watermarks from looping endlessly */
> > > - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > > + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> > > goto nopage;
> > >
> > > /* Try direct reclaim and then allocating */
> > > @@ -1888,6 +1891,10 @@ rebalance:
> > > if (page)
> > > goto got_pg;
> > >
> > > + /* Task is killed, fail the allocation if possible */
> > > + if (fatal_signal_pending(p) && !nofail)
> > > + goto nopage;
> > > +
> >
> > Again, I would expect this to be caught by should_alloc_retry().
> >
>
> It is, but only after the oom killer is called. We don't want to
> needlessly kill another task here when p has already been killed but may
> not be PF_EXITING yet.
>
Fair point. How about just checking before __alloc_pages_may_oom() is
called, then? That check would then be in a slower path.
I recognise this means it is also only checked when direct reclaim is
failing, but there is at least one good reason for that.
With this change, processes that have been sigkilled may now fail allocations
that they might not have failed before. It would be difficult to trigger,
but here is one possible problem with this change:
1. System was borderline with some thrashing
2. User starts a program that gobbles up lots of memory on page faults,
thrashing the system further and annoying the user
3. User sends SIGKILL
4. Process was faulting and returns NULL because a fatal signal was pending
5. Fault path returns VM_FAULT_OOM
6. Arch-specific path (on x86 anyway) calls out_of_memory again because
VM_FAULT_OOM was returned.
Ho hum, I hadn't thought about this before, but it's also possible that
a process that is faulting when it gets oom-killed will trigger a cascading
OOM kill. If the system was thrashing heavily, that could mean a large
number of processes get killed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm] oom: select_bad_process: never choose tasks with badness == 0
2010-04-02 18:30 ` Oleg Nesterov
@ 2010-04-05 14:23 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-05 14:23 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
This is David's patch, rediffed against the recent changes in -mm.
As David pointed out, we should fix select_bad_process(), which currently
always selects the first process that was not filtered out before
oom_badness(), no matter what oom_badness() returns.
Change the code to ignore the process if oom_badness() returns 0; this
matches Documentation/filesystems/proc.txt, and it merely looks better.
This also allows us to do more cleanups:
- no need to check OOM_SCORE_ADJ_MIN in select_bad_process(),
oom_badness() returns 0 in this case.
- oom_badness() can simply return 0 instead of -1 if the task
has no ->mm.
Now we can make it "unsigned" again, the signness and the
special "points >= 0" in select_bad_process() was added to
preserve the current behaviour.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 3 ++-
mm/oom_kill.c | 11 ++++-------
2 files changed, 6 insertions(+), 8 deletions(-)
--- MM/include/linux/oom.h~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-05 15:44:49.000000000 +0200
@@ -40,7 +40,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern int oom_badness(struct task_struct *p, unsigned long totalpages);
+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- MM/mm/oom_kill.c~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-05 16:09:58.000000000 +0200
@@ -153,7 +153,7 @@ static unsigned long oom_forkbomb_penalt
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom conditions.
*/
-int oom_badness(struct task_struct *p, unsigned long totalpages)
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
{
int points;
@@ -173,7 +173,7 @@ int oom_badness(struct task_struct *p, u
p = find_lock_task_mm(p);
if (!p)
- return -1;
+ return 0;
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
@@ -294,7 +294,7 @@ static struct task_struct *select_bad_pr
*ppoints = 0;
for_each_process(p) {
- int points;
+ unsigned int points;
/* skip the init task and kthreads */
if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -336,11 +336,8 @@ static struct task_struct *select_bad_pr
*ppoints = 1000;
}
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- continue;
-
points = oom_badness(p, totalpages);
- if (points >= 0 && (points > *ppoints || !chosen)) {
+ if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
^ permalink raw reply [flat|nested] 197+ messages in thread
* [PATCH -mm] oom: select_bad_process: never choose tasks with badness == 0
@ 2010-04-05 14:23 ` Oleg Nesterov
0 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-05 14:23 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: anfei, KOSAKI Motohiro, nishimura, KAMEZAWA Hiroyuki, Mel Gorman,
linux-mm, linux-kernel
This is the David's patch rediffed agains the recent changes in -mm.
As David pointed out, we should fix select_bad_process() which currently
always selects the first process which was not filtered out before
oom_badness(), no matter what oom_badness() returns.
Change the code to ignore the process if oom_badness() returns 0, this
matters Documentation/filesystems/proc.txt and this merely looks better.
This also allows us to do more cleanups:
- no need to check OOM_SCORE_ADJ_MIN in select_bad_process(),
oom_badness() returns 0 in this case.
- oom_badness() can simply return 0 instead of -1 if the task
has no ->mm.
Now we can make it "unsigned" again, the signness and the
special "points >= 0" in select_bad_process() was added to
preserve the current behaviour.
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/oom.h | 3 ++-
mm/oom_kill.c | 11 ++++-------
2 files changed, 6 insertions(+), 8 deletions(-)
--- MM/include/linux/oom.h~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/include/linux/oom.h 2010-04-05 15:44:49.000000000 +0200
@@ -40,7 +40,8 @@ enum oom_constraint {
CONSTRAINT_MEMORY_POLICY,
};
-extern int oom_badness(struct task_struct *p, unsigned long totalpages);
+extern unsigned int oom_badness(struct task_struct *p,
+ unsigned long totalpages);
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
--- MM/mm/oom_kill.c~5_BADNESS_DONT_RET_NEGATIVE 2010-04-05 15:39:21.000000000 +0200
+++ MM/mm/oom_kill.c 2010-04-05 16:09:58.000000000 +0200
@@ -153,7 +153,7 @@ static unsigned long oom_forkbomb_penalt
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom conditions.
*/
-int oom_badness(struct task_struct *p, unsigned long totalpages)
+unsigned int oom_badness(struct task_struct *p, unsigned long totalpages)
{
int points;
@@ -173,7 +173,7 @@ int oom_badness(struct task_struct *p, u
p = find_lock_task_mm(p);
if (!p)
- return -1;
+ return 0;
/*
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
@@ -294,7 +294,7 @@ static struct task_struct *select_bad_pr
*ppoints = 0;
for_each_process(p) {
- int points;
+ unsigned int points;
/* skip the init task and kthreads */
if (is_global_init(p) || (p->flags & PF_KTHREAD))
@@ -336,11 +336,8 @@ static struct task_struct *select_bad_pr
*ppoints = 1000;
}
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
- continue;
-
points = oom_badness(p, totalpages);
- if (points >= 0 && (points > *ppoints || !chosen)) {
+ if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-04 23:28 ` David Rientjes
@ 2010-04-05 21:30 ` Andrew Morton
2010-04-05 22:40 ` David Rientjes
0 siblings, 1 reply; 197+ messages in thread
From: Andrew Morton @ 2010-04-05 21:30 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
On Sun, 4 Apr 2010 16:28:01 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 31 Mar 2010, David Rientjes wrote:
>
> > It's pointless to try to kill current if select_bad_process() did not
> > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > guaranteed that current is a member of the memcg that is oom and it is,
> > by definition, unkillable.
> >
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > ---
> > mm/oom_kill.c | 5 +----
> > 1 files changed, 1 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> > read_lock(&tasklist_lock);
> > retry:
> > p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > - if (PTR_ERR(p) == -1UL)
> > + if (!p || PTR_ERR(p) == -1UL)
> > goto out;
> >
> > - if (!p)
> > - p = current;
> > -
> > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > "Memory cgroup out of memory"))
> > goto retry;
> >
>
> Are there any objections to merging this? It's pretty straight-forward
> given the fact that oom_kill_process() would fail if select_bad_process()
> returns NULL even if p is set to current since it was not found to be
> eligible during the tasklist scan.
I've lost the plot on the oom-killer patches. Half the things I'm
seeing don't even apply.
Perhaps I should drop the lot and we start again. We still haven't
resolved the procfs back-compat issue, either.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-05 21:30 ` Andrew Morton
@ 2010-04-05 22:40 ` David Rientjes
2010-04-05 22:49 ` Andrew Morton
0 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-05 22:40 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
On Mon, 5 Apr 2010, Andrew Morton wrote:
> > > It's pointless to try to kill current if select_bad_process() did not
> > > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > > guaranteed that current is a member of the memcg that is oom and it is,
> > > by definition, unkillable.
> > >
> > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > ---
> > > mm/oom_kill.c | 5 +----
> > > 1 files changed, 1 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> > > read_lock(&tasklist_lock);
> > > retry:
> > > p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > > - if (PTR_ERR(p) == -1UL)
> > > + if (!p || PTR_ERR(p) == -1UL)
> > > goto out;
> > >
> > > - if (!p)
> > > - p = current;
> > > -
> > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > "Memory cgroup out of memory"))
> > > goto retry;
> > >
> >
> > Are there any objections to merging this? It's pretty straight-forward
> > given the fact that oom_kill_process() would fail if select_bad_process()
> > returns NULL even if p is set to current since it was not found to be
> > eligible during the tasklist scan.
>
> I've lost the plot on the oom-killer patches. Half the things I'm
> seeing don't even apply.
>
This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see
anything that has been added since then that touches
mem_cgroup_out_of_memory().
> Perhaps I should drop the lot and we start again. We still haven't
> resolved the procfs back-compat issue, either.
I haven't seen any outstanding compatibility issues raised. The only
thing that isn't backwards compatible is consolidating
/proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into
/proc/sys/vm/oom_kill_quick. We can do that because we've enabled
oom_dump_tasks by default so that systems that use both of these tunables
need to now disable oom_dump_tasks to avoid the costly tasklist scan.
Both tunables would then have the same audience, i.e. users would never
want to enable one without the other, so it's possible to consolidate
them.
Nobody, to my knowledge, has objected to that reasoning and removing
dozens of patches from -mm isn't the answer for (yet to be raised)
questions about a single change.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-05 22:40 ` David Rientjes
@ 2010-04-05 22:49 ` Andrew Morton
2010-04-05 23:01 ` David Rientjes
0 siblings, 1 reply; 197+ messages in thread
From: Andrew Morton @ 2010-04-05 22:49 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
On Mon, 5 Apr 2010 15:40:27 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Mon, 5 Apr 2010, Andrew Morton wrote:
>
> > > > It's pointless to try to kill current if select_bad_process() did not
> > > > find an eligible task to kill in mem_cgroup_out_of_memory() since it's
> > > > guaranteed that current is a member of the memcg that is oom and it is,
> > > > by definition, unkillable.
> > > >
> > > > Signed-off-by: David Rientjes <rientjes@google.com>
> > > > ---
> > > > mm/oom_kill.c | 5 +----
> > > > 1 files changed, 1 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -500,12 +500,9 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> > > > read_lock(&tasklist_lock);
> > > > retry:
> > > > p = select_bad_process(&points, limit, mem, CONSTRAINT_NONE, NULL);
> > > > - if (PTR_ERR(p) == -1UL)
> > > > + if (!p || PTR_ERR(p) == -1UL)
> > > > goto out;
> > > >
> > > > - if (!p)
> > > > - p = current;
> > > > -
> > > > if (oom_kill_process(p, gfp_mask, 0, points, limit, mem,
> > > > "Memory cgroup out of memory"))
> > > > goto retry;
> > > >
> > >
> > > Are there any objections to merging this? It's pretty straight-forward
> > > given the fact that oom_kill_process() would fail if select_bad_process()
> > > returns NULL even if p is set to current since it was not found to be
> > > eligible during the tasklist scan.
> >
> > I've lost the plot on the oom-killer patches. Half the things I'm
> > seeing don't even apply.
> >
>
> This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see
> anything that has been added since then that touches
> mem_cgroup_out_of_memory().
I'm working on another mmotm at present.
> > Perhaps I should drop the lot and we start again. We still haven't
> > resolved the procfs back-compat issue, either.
>
> I haven't seen any outstanding compatibility issues raised. The only
> thing that isn't backwards compatible is consolidating
> /proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into
> /proc/sys/vm/oom_kill_quick. We can do that because we've enabled
> oom_dump_tasks by default so that systems that use both of these tunables
> need to now disable oom_dump_tasks to avoid the costly tasklist scan.
This can break stuff, as I've already described - if a startup tool is
correctly checking its syscall return values and a /procfs file
vanishes, the app may bail out and not work.
Others had other objections, iirc.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-05 22:49 ` Andrew Morton
@ 2010-04-05 23:01 ` David Rientjes
2010-04-06 12:08 ` KOSAKI Motohiro
0 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-05 23:01 UTC (permalink / raw)
To: Andrew Morton
Cc: KAMEZAWA Hiroyuki, anfei, KOSAKI Motohiro, nishimura,
Balbir Singh, linux-mm
On Mon, 5 Apr 2010, Andrew Morton wrote:
> > This patch applies cleanly on mmotm-2010-03-24-14-48 and I don't see
> > anything that has been added since then that touches
> > mem_cgroup_out_of_memory().
>
> I'm working on another mmotm at present.
>
Nothing else you've merged since mmotm-2010-03-24-14-48 has touched
mem_cgroup_out_of_memory() that I've been cc'd on. This patch should
apply cleanly.
> > I haven't seen any outstanding compatibility issues raised. The only
> > thing that isn't backwards compatible is consolidating
> > /proc/sys/vm/oom_kill_allocating_task and /proc/sys/vm/oom_dump_tasks into
> > /proc/sys/vm/oom_kill_quick. We can do that because we've enabled
> > oom_dump_tasks by default so that systems that use both of these tunables
> > need to now disable oom_dump_tasks to avoid the costly tasklist scan.
>
> This can break stuff, as I've already described - if a startup tool is
> correctly checking its syscall return values and a /procfs file
> vanishes, the app may bail out and not work.
>
This is not the first time we have changed or obsoleted tunables in
/proc/sys/vm. If a startup tool really is bailing out depending on
whether echo 1 > /proc/sys/vm/oom_kill_allocating_task succeeds, it should
be fixed regardless because you're not protecting anything by doing that
since you can't predict what task is allocating memory at the time of oom.
Those same startup tools will need to disable /proc/sys/vm/oom_dump_tasks
if we are to remove the consolidation into oom_kill_quick and maintain two
separate VM sysctls that are always used together by the same users.
Nobody can even cite a single example of oom_kill_allocating_task being
used in practice, yet we want to unnecessarily maintain these two separate
sysctls forever because it's possible that a buggy startup tool cares
about the return value of enabling it?
> Others had other objections, iirc.
>
I'm all ears.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account
2010-04-02 18:32 ` Oleg Nesterov
@ 2010-04-06 11:42 ` anfei
-1 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-04-06 11:42 UTC (permalink / raw)
To: Oleg Nesterov
Cc: David Rientjes, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Fri, Apr 02, 2010 at 08:32:16PM +0200, Oleg Nesterov wrote:
> select_bad_process() checks PF_EXITING to detect the task which
> is going to release its memory, but the logic is very wrong.
>
> - a single process P with the dead group leader disables
> select_bad_process() completely, it will always return
> ERR_PTR() while P can live forever
>
> - if the PF_EXITING task has already released its ->mm
> it doesn't make sense to expect it is going to free
> more memory (except task_struct/etc)
>
> Change the code to ignore the PF_EXITING tasks without ->mm.
>
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>
> mm/oom_kill.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- MM/mm/oom_kill.c~2_FIX_PF_EXITING 2010-04-02 18:51:05.000000000 +0200
> +++ MM/mm/oom_kill.c 2010-04-02 18:58:37.000000000 +0200
> @@ -322,7 +322,7 @@ static struct task_struct *select_bad_pr
> * the process of exiting and releasing its resources.
> * Otherwise we could get an easy OOM deadlock.
> */
> - if (p->flags & PF_EXITING) {
> + if ((p->flags & PF_EXITING) && p->mm) {
Even if this check is satisfied, it still can't say that p is a good
victim, or that it will release memory automatically if multi-threaded,
as the exiting of p doesn't mean the other threads are going to exit,
so the ->mm won't be released.
> if (p != current)
> return ERR_PTR(-1UL);
>
>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-05 23:01 ` David Rientjes
@ 2010-04-06 12:08 ` KOSAKI Motohiro
2010-04-06 21:47 ` David Rientjes
0 siblings, 1 reply; 197+ messages in thread
From: KOSAKI Motohiro @ 2010-04-06 12:08 UTC (permalink / raw)
To: David Rientjes, Andrew Morton
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
Balbir Singh, linux-mm
> This is not the first time we have changed or obsoleted tunables in
> /proc/sys/vm. If a startup tool really is bailing out depending on
> whether echo 1 > /proc/sys/vm/oom_kill_allocating_task succeeds, it should
> be fixed regardless because you're not protecting anything by doing that
> since you can't predict what task is allocating memory at the time of oom.
> Those same startup tools will need to disable /proc/sys/vm/oom_dump_tasks
> if we are to remove the consolidation into oom_kill_quick and maintain two
> separate VM sysctls that are always used together by the same users.
>
> Nobody can even cite a single example of oom_kill_allocating_task being
> used in practice, yet we want to unnecessarily maintain these two separate
> sysctls forever because it's possible that a buggy startup tool cares
> about the return value of enabling it?
>
> > Others had other objections, iirc.
> >
>
> I'm all ears.
Complaint:
Many people reviewed these patches, but the following four patches got no ack:
oom-badness-heuristic-rewrite.patch
oom-default-to-killing-current-for-pagefault-ooms.patch
oom-deprecate-oom_adj-tunable.patch
oom-replace-sysctls-with-quick-mode.patch
IIRC, Alan, Nick, and I NAKed those patches; everybody explained the reason.
We don't want to join a shouting contest or help start a flame war, but
that doesn't mean an explicit ack.
Andrew, if you really hope to merge these, I am not against it anymore,
but please put the following remark explicitly into the patches:
Nacked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujistu.com>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account
2010-04-06 11:42 ` anfei
@ 2010-04-06 12:18 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-06 12:18 UTC (permalink / raw)
To: anfei
Cc: David Rientjes, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/06, anfei wrote:
>
> On Fri, Apr 02, 2010 at 08:32:16PM +0200, Oleg Nesterov wrote:
> > select_bad_process() checks PF_EXITING to detect the task which
> > is going to release its memory, but the logic is very wrong.
> >
> > - a single process P with the dead group leader disables
> > select_bad_process() completely, it will always return
> > ERR_PTR() while P can live forever
> >
> > - if the PF_EXITING task has already released its ->mm
> > it doesn't make sense to expect it is going to free
> > more memory (except task_struct/etc)
> >
> > Change the code to ignore the PF_EXITING tasks without ->mm.
> >
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > ---
> >
> > mm/oom_kill.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > --- MM/mm/oom_kill.c~2_FIX_PF_EXITING 2010-04-02 18:51:05.000000000 +0200
> > +++ MM/mm/oom_kill.c 2010-04-02 18:58:37.000000000 +0200
> > @@ -322,7 +322,7 @@ static struct task_struct *select_bad_pr
> > * the process of exiting and releasing its resources.
> > * Otherwise we could get an easy OOM deadlock.
> > */
> > - if (p->flags & PF_EXITING) {
> > + if ((p->flags & PF_EXITING) && p->mm) {
>
> Even if this check is satisfied, it still can't say that p is a good
> victim, or that it will release memory automatically if multi-threaded,
> as the exiting of p doesn't mean the other threads are going to exit,
> so the ->mm won't be released.
Yes, completely agreed.
Unfortunately I forgot to copy this into the changelog, but when I
discussed this change I mentioned "still not perfect, but much better".
I do not really know what is the "right" solution. Even if we fix this
check for mt case, we also have CLONE_VM tasks.
So, this patch just tries to improve things, to avoid the easy-to-trigger
false positives.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account
2010-04-06 12:18 ` Oleg Nesterov
@ 2010-04-06 13:05 ` anfei
-1 siblings, 0 replies; 197+ messages in thread
From: anfei @ 2010-04-06 13:05 UTC (permalink / raw)
To: Oleg Nesterov
Cc: David Rientjes, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Tue, Apr 06, 2010 at 02:18:11PM +0200, Oleg Nesterov wrote:
> On 04/06, anfei wrote:
> >
> > On Fri, Apr 02, 2010 at 08:32:16PM +0200, Oleg Nesterov wrote:
> > > select_bad_process() checks PF_EXITING to detect the task which
> > > is going to release its memory, but the logic is very wrong.
> > >
> > > - a single process P with the dead group leader disables
> > > select_bad_process() completely, it will always return
> > > ERR_PTR() while P can live forever
> > >
> > > - if the PF_EXITING task has already released its ->mm
> > > it doesn't make sense to expect it is going to free
> > > more memory (except task_struct/etc)
> > >
> > > Change the code to ignore the PF_EXITING tasks without ->mm.
> > >
> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> > > ---
> > >
> > > mm/oom_kill.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > --- MM/mm/oom_kill.c~2_FIX_PF_EXITING 2010-04-02 18:51:05.000000000 +0200
> > > +++ MM/mm/oom_kill.c 2010-04-02 18:58:37.000000000 +0200
> > > @@ -322,7 +322,7 @@ static struct task_struct *select_bad_pr
> > > * the process of exiting and releasing its resources.
> > > * Otherwise we could get an easy OOM deadlock.
> > > */
> > > - if (p->flags & PF_EXITING) {
> > > + if ((p->flags & PF_EXITING) && p->mm) {
> >
> > Even if this check is satisfied, it still can't say that p is a good
> > victim, or that it will release memory automatically if multi-threaded,
> > as the exiting of p doesn't mean the other threads are going to exit,
> > so the ->mm won't be released.
>
> Yes, completely agreed.
>
> Unfortunately I forgot to copy this into the changelog, but when I
> discussed this change I mentioned "still not perfect, but much better".
>
> I do not really know what is the "right" solution. Even if we fix this
> check for mt case, we also have CLONE_VM tasks.
>
What about checking mm->mm_users too? If there are any other users,
just let badness() judge. CLONE_VM tasks that are not multi-threaded
seem rare, and badness() doesn't consider them either.
> So, this patch just tries to improve things, to avoid the easy-to-trigger
> false positives.
>
Agreed.
Thanks,
Anfei.
> Oleg.
>
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account
2010-04-06 13:05 ` anfei
@ 2010-04-06 13:38 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-06 13:38 UTC (permalink / raw)
To: anfei
Cc: David Rientjes, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/06, anfei wrote:
>
> On Tue, Apr 06, 2010 at 02:18:11PM +0200, Oleg Nesterov wrote:
> >
> > I do not really know what is the "right" solution. Even if we fix this
> > check for mt case, we also have CLONE_VM tasks.
> >
> What about checking mm->mm_users too? If there are any other users,
> just let badness() judge. CLONE_VM tasks that are not multi-threaded
> seem rare, and badness() doesn't consider them either.
Even if we forget about get_task_mm(), which increments mm_users, it is not
clear to me how to do this check correctly.
Say, mm_users > 1 but SIGNAL_GROUP_EXIT is set. This means the process is
exiting and (ignoring CLONE_VM tasks) it is going to release its ->mm. But
otoh its ->mm can be NULL.
Perhaps we can do
	if (((p->flags & PF_EXITING) && thread_group_empty(p)) ||
	    (p->signal->flags & SIGNAL_GROUP_EXIT)) {
		/* OK, it is exiting */
		bool has_mm = false;
		struct task_struct *t = p;

		do {
			if (t->mm) {
				has_mm = true;
				break;
			}
		} while_each_thread(p, t);

		if (!has_mm)
			continue;
		if (p != current)
			return ERR_PTR(-1UL);
...
}
I dunno.
Oleg.
^ permalink raw reply [flat|nested] 197+ messages in thread
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-06 12:08 ` KOSAKI Motohiro
@ 2010-04-06 21:47 ` David Rientjes
2010-04-07 0:20 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-06 21:47 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Andrew Morton, KAMEZAWA Hiroyuki, anfei, nishimura, Balbir Singh,
linux-mm
On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:
> Many people reviewed these patches, but following four patches got no ack.
>
> oom-badness-heuristic-rewrite.patch
Do you have any specific feedback that you could offer on why you decided
to nack this?
> oom-default-to-killing-current-for-pagefault-ooms.patch
Same, what is the specific concern that you have with this patch?
If you don't believe we should kill current first, could you please submit
patches for all other architectures like powerpc that already do this as
their only course of action for VM_FAULT_OOM and then make pagefault oom
killing consistent amongst architectures?
> oom-deprecate-oom_adj-tunable.patch
Alan had a concern about removing /proc/pid/oom_adj, or redefining it with
different semantics as I originally did, and then I updated the patchset
to deprecate the old tunable as Andrew suggested.
My somewhat arbitrary time of removal was approximately 18 months from
the date of deprecation which would give us 5-6 major kernel releases in
between. If you think that's too early of a deadline, then I'd happily
extend it by 6 months or a year.
Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if
there's a finer grained alternative available already unless you want
/proc/pid/oom_adj to actually mean something in which case you'll never be
able to separate oom badness scores from bitshifts. I believe everyone
agrees that a more understood and finer grained tunable is necessary as
compared to the current implementation that has very limited functionality
other than polarizing tasks.
> oom-replace-sysctls-with-quick-mode.patch
>
> IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
Which patch of the four you listed are you referring to here?
> We don't want to join a loud voice contest nor help start a flame war. But that
> doesn't mean an explicit ack.
>
If someone has a concern with a patch and then I reply to it and the reply
goes unanswered, what exactly does that imply? Do we want to stop
development because discussion occurred on a patch yet no rebuttal was
made that addressed specific points that I raised?
Arguing to keep /proc/sys/vm/oom_kill_allocating_task means that we should
also not enable /proc/sys/vm/oom_dump_tasks by default since the same systems
that use the former will need to now disable the latter to avoid costly
tasklist scans. So are you suggesting that we should not enable
oom_dump_tasks by default, as the rewrite does, even though it provides very
useful information to 99.9% (or perhaps 100%) of users about the memory
usage of their tasks, because you believe systems out there would flake out
with the tasklist scan it requires, even though you can't cite a single
example?
Now instead of not replying to these questions and insisting that your
nack stand based solely on the fact that you nacked it, please get
involved in the development process.
* Re: [PATCH] oom killer: break from infinite loop
2010-04-05 10:47 ` Mel Gorman
@ 2010-04-06 22:40 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-06 22:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Oleg Nesterov, anfei, Andrew Morton, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, linux-mm, linux-kernel
On Mon, 5 Apr 2010, Mel Gorman wrote:
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1610,13 +1610,21 @@ try_next_zone:
> > > > }
> > > >
> > > > static inline int
> > > > -should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > > +should_alloc_retry(struct task_struct *p, gfp_t gfp_mask, unsigned int order,
> > > > unsigned long pages_reclaimed)
> > > > {
> > > > /* Do not loop if specifically requested */
> > > > if (gfp_mask & __GFP_NORETRY)
> > > > return 0;
> > > >
> > > > + /* Loop if specifically requested */
> > > > + if (gfp_mask & __GFP_NOFAIL)
> > > > + return 1;
> > > > +
> > >
> > > Meh, you could have preserved the comment but no biggie.
> > >
> >
> > I'll remember to preserve it when it's proposed.
> >
> > > > + /* Task is killed, fail the allocation if possible */
> > > > + if (fatal_signal_pending(p))
> > > > + return 0;
> > > > +
> > >
> > > Seems reasonable. This will be checked on every major loop in the
> > > allocator slow patch.
> > >
> > > > /*
> > > > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > > > * means __GFP_NOFAIL, but that may not be true in other
> > > > @@ -1635,13 +1643,6 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > > > if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> > > > return 1;
> > > >
> > > > - /*
> > > > - * Don't let big-order allocations loop unless the caller
> > > > - * explicitly requests that.
> > > > - */
> > > > - if (gfp_mask & __GFP_NOFAIL)
> > > > - return 1;
> > > > -
> > > > return 0;
> > > > }
> > > >
> > > > @@ -1798,6 +1799,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > > > if (!in_interrupt() &&
> > > > ((p->flags & PF_MEMALLOC) ||
> > > > + (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL)) ||
> > >
> > > This is a lot less clear. GFP_NOFAIL is rare so this is basically saying
> > > that all threads with a fatal signal pending can ignore watermarks. This
> > > is dangerous because if 1000 threads get killed, there is a possibility
> > > of deadlocking the system.
> > >
> >
> > I don't quite understand the comment, this is only for __GFP_NOFAIL
> > allocations, which you say are rare, so a large number of threads won't be
> > doing this simultaneously.
> >
> > > Why not obey the watermarks and just not retry the loop later and fail
> > > the allocation?
> > >
> >
> > The above check for (fatal_signal_pending(p) && (gfp_mask & __GFP_NOFAIL))
> > essentially oom kills p without invoking the oom killer before direct
> > reclaim is invoked. We know it has a pending SIGKILL and wants to exit,
> > so we allow it to allocate beyond the min watermark to avoid costly
> > reclaim or needlessly killing another task.
> >
>
> Sorry, I typod.
>
> GFP_NOFAIL is rare but this is basically saying that all threads with a
> fatal signal and using NOFAIL can ignore watermarks.
>
> I don't think any caller in an exit path will be using GFP_NOFAIL, as its
> most common users are file-system related, but it still feels unnecessary
> to check this case on every call to the slow path.
>
Ok, that's reasonable. We already handle this case indirectly in -mm.
oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch in
-mm makes the oom killer set TIF_MEMDIE for current and return without
killing any other task; it's unnecessary to check if
!test_thread_flag(TIF_MEMDIE) before that since the oom killer will be a
no-op anyway if there exist TIF_MEMDIE threads.
The problem is that the should_alloc_retry() logic isn't checked when the
oom killer is called; we immediately retry instead, even if the oom killer
didn't do anything. So if the oom-killed task fails to exit because it's
looping in the page allocator, that's going to happen forever since
reclaim has failed and the oom killer can't kill anything else (or it's
__GFP_NOFAIL and __alloc_pages_may_oom() will infinitely loop without ever
returning).
I guess this could potentially deplete memory reserves if too many threads
have fatal signals and the oom killer is constantly invoked, regardless of
__GFP_NOFAIL or not. That's why we have always opted to kill a memory
hogging task instead via a tasklist scan: we want to set TIF_MEMDIE for as
few tasks as possible with a large upside of memory freeing.
I'm wondering if we should check should_alloc_retry() first; it seems like
we could get rid of a few different branches in the oom killer path by
doing so: the comparisons to PAGE_ALLOC_COSTLY_ORDER, __GFP_NORETRY, etc.
> > > > unlikely(test_thread_flag(TIF_MEMDIE))))
> > > > alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > }
> > > > @@ -1812,6 +1814,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > int migratetype)
> > > > {
> > > > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > > + const gfp_t nofail = gfp_mask & __GFP_NOFAIL;
> > > > struct page *page = NULL;
> > > > int alloc_flags;
> > > > unsigned long pages_reclaimed = 0;
> > > > @@ -1876,7 +1879,7 @@ rebalance:
> > > > goto nopage;
> > > >
> > > > /* Avoid allocations with no watermarks from looping endlessly */
> > > > - if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
> > > > + if (test_thread_flag(TIF_MEMDIE) && !nofail)
> > > > goto nopage;
> > > >
> > > > /* Try direct reclaim and then allocating */
> > > > @@ -1888,6 +1891,10 @@ rebalance:
> > > > if (page)
> > > > goto got_pg;
> > > >
> > > > + /* Task is killed, fail the allocation if possible */
> > > > + if (fatal_signal_pending(p) && !nofail)
> > > > + goto nopage;
> > > > +
> > >
> > > Again, I would expect this to be caught by should_alloc_retry().
> > >
> >
> > It is, but only after the oom killer is called. We don't want to
> > needlessly kill another task here when p has already been killed but may
> > not be PF_EXITING yet.
> >
>
> Fair point. How about just checking before __alloc_pages_may_oom() is
> called then? This check will be then in a slower path.
Yeah, that's what
oom-give-current-access-to-memory-reserves-if-it-has-been-killed.patch
effectively does.
> I recognise this means that it is also only checked when direct reclaim
> is failing but there is at least one good reason for it.
>
> With this change, processes that have been sigkilled may now fail allocations
> that they might not have failed before. It would be difficult to trigger
> but here is one possible problem with this change;
>
> 1. System was borderline with some thrashing
> 2. User starts program that gobbles up lots of memory on page faults,
> thrashing the system further and annoying the user
> 3. User sends SIGKILL
> 4. Process was faulting and returns NULL because fatal signal was pending
> 5. Fault path returns VM_FAULT_OOM
> 6. Arch-specific path (on x86 anyway) calls out_of_memory again because
> VM_FAULT_OOM was returned.
>
> Ho hum, I haven't thought about this before but it's also possible that
> a process that is faulting that gets oom-killed will trigger a cascading
> OOM kill. If the system was heavily thrashing, it might mean a large
> number of processes get killed.
>
Pagefault ooms default to killing current first in -mm and only kill
another task if current is unkillable for the architectures that use
pagefault_out_of_memory(); the rest of the architectures such as powerpc
just kill current. So while this scenario is plausible, I don't think
there would be a large number of processes getting killed:
pagefault_out_of_memory() will kill current and give it access to memory
reserves and the oom killer won't perform any needless oom killing while
that is happening.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-06 21:47 ` David Rientjes
@ 2010-04-07 0:20 ` KAMEZAWA Hiroyuki
2010-04-07 13:29 ` KOSAKI Motohiro
2010-04-08 17:36 ` David Rientjes
0 siblings, 2 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-07 0:20 UTC (permalink / raw)
To: David Rientjes
Cc: KOSAKI Motohiro, Andrew Morton, anfei, nishimura, Balbir Singh, linux-mm
On Tue, 6 Apr 2010 14:47:58 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:
>
> > Many people reviewed these patches, but following four patches got no ack.
> >
> > oom-badness-heuristic-rewrite.patch
>
> Do you have any specific feedback that you could offer on why you decided
> to nack this?
>
I like this patch. But I think no one can Ack this, because there is no
"correct" answer. At least, it shows good behavior in my environment.
> > oom-default-to-killing-current-for-pagefault-ooms.patch
>
> Same, what is the specific concern that you have with this patch?
>
I'm not sure about this. Personally, I feel pagefault-out-of-memory only
happens when drivers are corrupted. So I don't have much concern about this.
> If you don't believe we should kill current first, could you please submit
> patches for all other architectures like powerpc that already do this as
> their only course of action for VM_FAULT_OOM and then make pagefault oom
> killing consistent amongst architectures?
>
> > oom-deprecate-oom_adj-tunable.patch
>
> Alan had a concern about removing /proc/pid/oom_adj, or redefining it with
> different semantics as I originally did, and then I updated the patchset
> to deprecate the old tunable as Andrew suggested.
>
> My somewhat arbitrary time of removal was approximately 18 months from
> the date of deprecation which would give us 5-6 major kernel releases in
> between. If you think that's too early of a deadline, then I'd happily
> extend it by 6 months or a year.
>
> Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if
> there's a finer grained alternative available already unless you want
> /proc/pid/oom_adj to actually mean something in which case you'll never be
> able to seperate oom badness scores from bitshifts. I believe everyone
> agrees that a more understood and finer grained tunable is necessary as
> compared to the current implementation that has very limited functionality
> other than polarizing tasks.
>
If oom-badness-heuristic-rewrite.patch goes ahead, this should go too.
But my concern is that the administrator has to check every oom_score_adj and
tune it again if he adds more memory to the system.
Now, a not-small number of people use Virtual Machines or Containers. So this
oom_score_adj's sensitivity to the size of memory can put admins through hell.
Assume hosts A and B. A has 4G of memory, B has 8G.
Here is an application which consumes 2G of memory.
Then this application's oom_score will be 500 on A and 250 on B.
To make the oom_score 0 via oom_score_adj, the admin should set -500 on A and -250 on B.
I think this kind of interface is _bad_. If the admin is great and all machines
in the system have the same configuration, this oom_score_adj will work powerfully;
I admit it.
But usually admins are not great and the system includes irregular hosts.
I hope you add one more magic knob that lets admins state the importance of an
application independent of system configuration, which can work cooperatively with oom_score_adj.
> > oom-replace-sysctls-with-quick-mode.patch
> >
> > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
>
> Which patch of the four you listed are you referring to here?
>
Replacing a sysctl that is in use is a bad idea, in general.
I have no _strong_ opinion. I welcome the patch series. But the above are my concerns.
Thank you for your work.
Regards,
-Kame
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-07 0:20 ` KAMEZAWA Hiroyuki
@ 2010-04-07 13:29 ` KOSAKI Motohiro
2010-04-08 18:05 ` David Rientjes
2010-04-08 17:36 ` David Rientjes
1 sibling, 1 reply; 197+ messages in thread
From: KOSAKI Motohiro @ 2010-04-07 13:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, David Rientjes, Andrew Morton, anfei, nishimura,
Balbir Singh, linux-mm
> On Tue, 6 Apr 2010 14:47:58 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
> > On Tue, 6 Apr 2010, KOSAKI Motohiro wrote:
> >
> > > Many people reviewed these patches, but following four patches got no ack.
> > >
> > > oom-badness-heuristic-rewrite.patch
> >
> > Do you have any specific feedback that you could offer on why you decided
> > to nack this?
> >
>
> I like this patch. But I think no one can Ack this, because there is no
> "correct" answer. At least, it shows good behavior in my environment.
See the diffstat. That's complete crap; it obviously needs to be split into
separate individual patches. Who can review it?
Documentation/filesystems/proc.txt | 95 ++++----
Documentation/sysctl/vm.txt | 21 +
fs/proc/base.c | 98 ++++++++
include/linux/memcontrol.h | 8
include/linux/oom.h | 17 +
include/linux/sched.h | 3
kernel/fork.c | 1
kernel/sysctl.c | 9
mm/memcontrol.c | 18 +
mm/oom_kill.c | 319 ++++++++++++++-------------
10 files changed, 404 insertions(+), 185 deletions(-)
Additional comments are below.
> > > oom-default-to-killing-current-for-pagefault-ooms.patch
> >
> > Same, what is the specific concern that you have with this patch?
>
> I'm not sure about this. Personally, I feel pagefault-out-of-memory only
> happens when drivers are corrupted. So I don't have much concern about this.
If you are suggesting reverting pagefault_oom itself, that is worth
considering, though even then I don't think so.
To quote Nick's mail:
The thing I should explain is that user interfaces are most important
for their intended semantics. We don't generally call bugs or oversights
part of the interface, and they are to be fixed unless some program
relies on them.
Nowhere in the vm documentation does it say anything about "pagefault
ooms", and even in the kernel code, even to mm developers (who mostly
don't care about oom killer) probably wouldn't immediately think of
pagefault versus any other type of oom.
Given that, do you think it is reasonable, when panic_on_oom is set,
to allow a process to be killed due to oom condition? Or do you think
that was an oversight of the implementation?
Regardless of what architectures currently do. Yes there is a
consistency issue, and it should have been fixed earlier, but the
consistency issue goes both ways now. Some (the most widely tested
and used, if that matters) architectures, do it the right way.
So, this patch is purely a step backward; it breaks panic_on_oom.
If anyone posts a "pagefault_out_of_memory()-aware page fault handler for ppc"
or for another architecture, I'll gladly ack it.
> > If you don't believe we should kill current first, could you please submit
> > patches for all other architectures like powerpc that already do this as
> > their only course of action for VM_FAULT_OOM and then make pagefault oom
> > killing consistent amongst architectures?
>
> >
> > > oom-deprecate-oom_adj-tunable.patch
> >
> > Alan had a concern about removing /proc/pid/oom_adj, or redefining it with
> > different semantics as I originally did, and then I updated the patchset
> > to deprecate the old tunable as Andrew suggested.
> >
> > My somewhat arbitrary time of removal was approximately 18 months from
> > the date of deprecation which would give us 5-6 major kernel releases in
> > between. If you think that's too early of a deadline, then I'd happily
> > extend it by 6 months or a year.
> >
> > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if
> > there's a finer grained alternative available already unless you want
> > /proc/pid/oom_adj to actually mean something in which case you'll never be
> > able to seperate oom badness scores from bitshifts. I believe everyone
> > agrees that a more understood and finer grained tunable is necessary as
> > compared to the current implementation that has very limited functionality
> > other than polarizing tasks.
The problem is that oom_adj is one of the most widely used knobs. It is used
not only by admins, but also by applications. In addition, oom_score_adj is a
bad interface and not a good replacement for oom_adj. Kamezawa-san, as you mentioned:
> If oom-badness-heuristic-rewrite.patch goes ahead, this should go too.
> But my concern is that the administrator has to check every oom_score_adj and
> tune it again if he adds more memory to the system.
>
> Now, a not-small number of people use Virtual Machines or Containers. So this
> oom_score_adj's sensitivity to the size of memory can put admins through hell.
>
> Assume hosts A and B. A has 4G of memory, B has 8G.
> Here is an application which consumes 2G of memory.
> Then this application's oom_score will be 500 on A and 250 on B.
> To make the oom_score 0 via oom_score_adj, the admin should set -500 on A and -250 on B.
>
> I think this kind of interface is _bad_. If the admin is great and all machines
> in the system have the same configuration, this oom_score_adj will work powerfully;
> I admit it.
> But usually admins are not great and the system includes irregular hosts.
> I hope you add one more magic knob that lets admins state the importance of an
> application independent of system configuration, which can work cooperatively with oom_score_adj.
Agreed. oom_score_adj is complete crap and should go.
But the following pseudo scaling adjustment is crap too: it considers neither
page sharing nor mlocked pages. IOW, it never works correctly.
+ points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
+ totalpages;
>
> > > oom-replace-sysctls-with-quick-mode.patch
> > >
> > > IIRC, alan and nick and I NAKed such patch. everybody explained the reason.
> >
> > Which patch of the four you listed are you referring to here?
> >
> replacing used sysctl is bad idea, in general.
>
> I have no _strong_ opinion. I welcome the patch series. But aboves are my concern.
> Thank you for your work.
I really hate this "that is an _intentional_ regression" crap. Nowadays almost
all developers ignore bug reports and don't join the investigation work; I and
very few others do that. (OK, I agree you are among those few developers, thanks.)
Why can't we simply discard it? Please don't make crap.
Now, sadly, I can imagine why some active developers have preferred to
override ugly code immediately rather than go through code review and dialogue.
I feel down that I have to do it.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-07 0:20 ` KAMEZAWA Hiroyuki
2010-04-07 13:29 ` KOSAKI Motohiro
@ 2010-04-08 17:36 ` David Rientjes
1 sibling, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-08 17:36 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro, Andrew Morton, anfei, nishimura, Balbir Singh, linux-mm
On Wed, 7 Apr 2010, KAMEZAWA Hiroyuki wrote:
> > > oom-badness-heuristic-rewrite.patch
> >
> > Do you have any specific feedback that you could offer on why you decided
> > to nack this?
> >
>
> I like this patch. But I think no one can't Ack this because there is no
> "correct" answer. At least, this show good behavior on my environment.
>
Agreed. I think the new oom_badness() function is much better than the
current heuristic and should prevent X from being killed as we've
discussed fairly often on LKML over the past six months.
> > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if
> > there's a finer grained alternative available already unless you want
> > /proc/pid/oom_adj to actually mean something in which case you'll never be
> > able to separate oom badness scores from bitshifts. I believe everyone
> > agrees that a more understood and finer grained tunable is necessary as
> > compared to the current implementation that has very limited functionality
> > other than polarizing tasks.
> >
>
> If oom-badness-heuristic-rewrite.patch will go ahead, this should go.
> But my concern is administorator has to check all oom_score_adj and
> tune it again if he adds more memory to the system.
>
> Now, not-small amount of people use Virtual Machine or Contaienr. So, this
> oom_score_adj's sensivity to the size of memory can put admins to hell.
>
Would you necessarily want to change oom_score_adj when you add or remove
memory? I see the currently available pool of memory available (whether
it is system-wide, constrained to a cpuset mems, mempolicy nodes, or memcg
limits) as a shared resource so if you want to bias a task by 25% of
available memory by using an oom_score_adj of 250, that doesn't change if
we add or remove memory. It still means that the task should be biased by
that amount in comparison to other tasks.
My perspective is that we should define oom killing priorities in terms of
how much memory tasks are using compared to others, and that the actual
capacity itself is irrelevant if it's a shared resource. So when tasks are
moved into a memcg, for example, that becomes a "virtualized system" with
a more limited shared memory resource and has the same bias (or
preference) that it did when it was in the root cgroup.
In other words, I think it would be more inconvenient to update
oom_score_adj anytime a task changes memcg, is attached to a different
cpuset, or is bound to nodes by way of a mempolicy. In these scenarios, I
see them as simply having a restricted set of allowed memory yet the bias
can remain the same.
Users who do actually want to bias a task by a memory quantity can easily
do so, but I think they would be in the minority and we hope to avoid
adding unnecessary tunables when a conversion to the appropriate
oom_score_adj value is possible with a simple divide.
> > > oom-replace-sysctls-with-quick-mode.patch
> > >
> > > IIRC, Alan, Nick, and I NAKed such a patch. Everybody explained the reason.
> >
> > Which patch of the four you listed are you referring to here?
> >
> Replacing an in-use sysctl is a bad idea, in general.
>
I agree, but since the audience for both of these sysctls will need to do
echo 0 > /proc/sys/vm/oom_dump_tasks as the result of this patchset since
it is now enabled by default, do you think we can take this as an
opportunity to consolidate them down into one? Otherwise, we're obliged
to continue to support them indefinitely even though their only users are
the exact same systems.
> I have no _strong_ opinion. I welcome the patch series. But the above are my concerns.
> Thank you for your work.
>
Thanks, Kame, I appreciate that.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-07 13:29 ` KOSAKI Motohiro
@ 2010-04-08 18:05 ` David Rientjes
2010-04-21 19:17 ` Andrew Morton
0 siblings, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-08 18:05 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, Andrew Morton, anfei, nishimura, Balbir Singh,
linux-mm
On Wed, 7 Apr 2010, KOSAKI Motohiro wrote:
> > > > oom-badness-heuristic-rewrite.patch
> > >
> > > Do you have any specific feedback that you could offer on why you decided
> > > to nack this?
> > >
> >
> > I like this patch. But I think no one can Ack this because there is no
> > "correct" answer. At least, it shows good behavior in my environment.
>
> See the diffstat. That's far too big; it obviously needs to be split into
> separate individual patches. Who can review it?
>
> Documentation/filesystems/proc.txt | 95 ++++----
> Documentation/sysctl/vm.txt | 21 +
> fs/proc/base.c | 98 ++++++++
> include/linux/memcontrol.h | 8
> include/linux/oom.h | 17 +
> include/linux/sched.h | 3
> kernel/fork.c | 1
> kernel/sysctl.c | 9
> mm/memcontrol.c | 18 +
> mm/oom_kill.c | 319 ++++++++++++++-------------
> 10 files changed, 404 insertions(+), 185 deletions(-)
>
> Additional comments are below.
>
This specific change cannot be broken down into individual patches as much
as I'd like to. It's a complete rewrite of the badness() function and
requires two new tunables to be introduced, determination of the amount of
memory available to current, formals being changed around, and
documentation.
A review tip: the change itself is in the rewrite of the function now
called oom_badness(), so I recommend downloading mmotm and
reading it there as well as the documentation change. The remainder of
the patch fixes up the various callers of that function and isn't
interesting.
> If you are suggesting reverting pagefault_oom itself, that is worth
> considering, though I don't think we should.
>
> quote nick's mail
>
> The thing I should explain is that user interfaces are most important
> for their intended semantics. We don't generally call bugs or oversights
> part of the interface, and they are to be fixed unless some program
> relies on them.
>
I disagree, I believe the long-standing semantics of user interfaces such
as panic_on_oom are more important than what the name implies or what it
was intended for when it was introduced.
> Nowhere in the vm documentation does it say anything about "pagefault
> ooms", and even in the kernel code, even to mm developers (who mostly
> don't care about oom killer) probably wouldn't immediately think of
> pagefault versus any other type of oom.
>
> Given that, do you think it is reasonable, when panic_on_oom is set,
> to allow a process to be killed due to oom condition? Or do you think
> that was an oversight of the implementation?
>
Users have a well-defined and long-standing method of protecting their
applications from oom kill and that is OOM_DISABLE. With my patch, if
current is unkillable because it is OOM_DISABLE, then we fallback to a
tasklist scan iff panic_on_oom is unset.
> Regardless of what architectures currently do. Yes there is a
> consistency issue, and it should have been fixed earlier, but the
> consistency issue goes both ways now. Some (the most widely tested
> and used, if that matters) architectures, do it the right way.
>
> So this patch is purely a backstep; it breaks panic_on_oom.
> If anyone posts a "pagefault_out_of_memory()-aware pagefault" for ppc or
> some other architecture, I'll be glad to ack it.
>
It's not a backstep, it's making all architectures consistent as it sits
right now in mmotm. If someone would like to change all VM_FAULT_OOM
handlers to do a tasklist scan and not default to killing current, that is
an extension of this patchset. Likewise, if we want to ensure
panic_on_oom is respected even for pagefault ooms, then we need to do that
on all architectures so that we don't have multiple definitions depending
on machine type. The semantics of a sysctl shouldn't depend on the
architecture and right now it does, so this patch fixes that. In other
words: if you want to extend the definition of panic_on_oom, then do so
completely for all architectures first and then add it to the
documentation.
> > > > oom-deprecate-oom_adj-tunable.patch
> > >
> > > Alan had a concern about removing /proc/pid/oom_adj, or redefining it with
> > > different semantics as I originally did, and then I updated the patchset
> > > to deprecate the old tunable as Andrew suggested.
> > >
> > > My somewhat arbitrary time of removal was approximately 18 months from
> > > the date of deprecation which would give us 5-6 major kernel releases in
> > > between. If you think that's too early of a deadline, then I'd happily
> > > extend it by 6 months or a year.
> > >
> > > Keeping /proc/pid/oom_adj around indefinitely isn't very helpful if
> > > there's a finer grained alternative available already unless you want
> > > /proc/pid/oom_adj to actually mean something in which case you'll never be
> > > able to separate oom badness scores from bitshifts. I believe everyone
> > > agrees that a more understood and finer grained tunable is necessary as
> > > compared to the current implementation that has very limited functionality
> > > other than polarizing tasks.
>
> The problem is, oom_adj is one of the most widely used knobs. It is used not
> only by admins but also by applications. In addition, oom_score_adj is a bad
> interface and not a good replacement for oom_adj. Kamezawa-san, as you
> mentioned:
>
oom_adj is retained but deprecated, so I'm not sure what you're suggesting
here. Do you think we should instead keep oom_adj forever in parallel
with oom_score_adj? It's quite clear that a more powerful, finer-grained
solution is necessary than what oom_adj provides. I believe the
deprecation for 5-6 major kernel releases is enough, but we can certainly
talk about extending that by a year if you'd like.
Can you elaborate on why you believe oom_score_adj is a bad interface or
have had problems with it in your personal use?
> Agreed. oom_score_adj is completely crap and should go.
> But the following pseudo scaling adjustment is crap too: it doesn't consider
> either page sharing or mlocked pages. IOW, it never works correctly.
>
>
> + points = (get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS)) * 1000 /
> + totalpages;
>
That baseline actually does work much better than total_vm as we've
discussed multiple times on LKML leading up to the development of this
series, but if you'd like to propose additional considerations into the
heuristic, then please do so.
> > > > oom-replace-sysctls-with-quick-mode.patch
> > > >
> > > > IIRC, Alan, Nick, and I NAKed such a patch. Everybody explained the reason.
> > >
> > > Which patch of the four you listed are you referring to here?
> > >
> > Replacing an in-use sysctl is a bad idea, in general.
> >
> > I have no _strong_ opinion. I welcome the patch series. But the above are my concerns.
> > Thank you for your work.
>
> I really hate the "that is an _intentional_ regression" crap. Nowadays most
> developers ignore bug reports and don't join the investigation work; I and
> very few others do that. (OK, I agree you are among those few developers,
> thanks.)
>
> Why can't we simply discard it? Please don't make crap.
>
Perhaps you don't understand. The users of oom_kill_allocating_task are
those systems that have extremely large tasklists and so iterating through
it comes at a substantial cost. It was originally requested by SGI
because they preferred an alternative to the tasklist scan used for
cpuset-constrained ooms and were satisfied with simply killing something
quickly instead of iterating the tasklist.
This patchset, however, enables oom_dump_tasks by default because it
provides useful information to the user to understand the memory use of
their applications so they can hopefully determine why the oom occurred.
This requires a tasklist scan itself, so those same users of
oom_kill_allocating_task are no longer protected from that cost by simply
setting this sysctl. They must also disable oom_dump_tasks or we're at
the same efficiency that we were before oom_kill_allocating_task was
introduced.
Since they must modify their startup scripts, and since the users of both
of these sysctls are the same and nobody would use one without the other,
it should be possible to consolidate them into a single sysctl. If
additional changes are made to the oom killer in the future, it would then
be possible to test for this single sysctl, oom_kill_quick, instead
without introducing additional sysctls and polluting procfs.
Thus, it's completely unnecessary to keep oom_kill_allocating_task and we
can redefine it for those systems. What alternatives do you have in mind
or what part of this logic do you not agree with?
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-01 15:26 ` Oleg Nesterov
@ 2010-04-08 21:08 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-08 21:08 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On Thu, 1 Apr 2010, Oleg Nesterov wrote:
> > > > > Say, oom_forkbomb_penalty() does list_for_each_entry(tsk->children).
> > > > > Again, this is not right even if we forget about !child->mm check.
> > > > > This list_for_each_entry() can only see the processes forked by the
> > > > > main thread.
> > > > >
> > > >
> > > > That's the intention.
> > >
> > > Why? shouldn't oom_badness() return the same result for any thread
> > > in thread group? We should take all childs into account.
> > >
> >
> > oom_forkbomb_penalty() only cares about first-descendant children that
> > do not share the same memory,
>
> I see, but the code doesn't really do this. I mean, it doesn't really
> see the first-descendant children, only those which were forked by the
> main thread.
>
> Look. We have a main thread M and the sub-thread T. T forks a lot of
> processes which use a lot of memory. These processes _are_ the first
> descendant children of the M+T thread group, they should be accounted.
> But M->children list is empty.
>
> oom_forkbomb_penalty() and oom_kill_process() should do
>
> t = tsk;
> do {
> list_for_each_entry(child, &t->children, sibling) {
> ... take child into account ...
> }
> } while_each_thread(tsk, t);
>
>
In this case, it seems more appropriate that we would penalize T and not M
since it's not necessarily responsible for the behavior of the children it
forks. T is the buggy/malicious program, not M.
> See the patch below. Yes, this is minor, but it is always good to avoid
> the unnecessary locks, and thread_group_cputime() is O(N).
>
> Not only for performance reasons. This allows to change the locking in
> thread_group_cputime() if needed without fear to deadlock with task_lock().
>
> Oleg.
>
> --- x/mm/oom_kill.c
> +++ x/mm/oom_kill.c
> @@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
> return 0;
> list_for_each_entry(child, &tsk->children, sibling) {
> struct task_cputime task_time;
> - unsigned long runtime;
> + unsigned long runtime, this_rss;
>
> task_lock(child);
> if (!child->mm || child->mm == tsk->mm) {
> task_unlock(child);
> continue;
> }
> + this_rss = get_mm_rss(child->mm);
> + task_unlock(child);
> +
> thread_group_cputime(child, &task_time);
> runtime = cputime_to_jiffies(task_time.utime) +
> cputime_to_jiffies(task_time.stime);
> @@ -113,10 +116,9 @@ static unsigned long oom_forkbomb_penalt
> * get to execute at all in such cases anyway.
> */
> if (runtime < HZ) {
> - child_rss += get_mm_rss(child->mm);
> + child_rss += this_rss;
> forkcount++;
> }
> - task_unlock(child);
> }
>
> /*
This patch looks good, will you send it to Andrew with a changelog and
sign-off line? Also feel free to add:
Acked-by: David Rientjes <rientjes@google.com>
* Re: [patch] oom: give current access to memory reserves if it has been killed
2010-04-08 21:08 ` David Rientjes
@ 2010-04-09 12:38 ` Oleg Nesterov
-1 siblings, 0 replies; 197+ messages in thread
From: Oleg Nesterov @ 2010-04-09 12:38 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, anfei, KOSAKI Motohiro, nishimura,
KAMEZAWA Hiroyuki, Mel Gorman, linux-mm, linux-kernel
On 04/08, David Rientjes wrote:
>
> On Thu, 1 Apr 2010, Oleg Nesterov wrote:
>
> > Look. We have a main thread M and the sub-thread T. T forks a lot of
> > processes which use a lot of memory. These processes _are_ the first
> > descendant children of the M+T thread group, they should be accounted.
> > But M->children list is empty.
> >
> > oom_forkbomb_penalty() and oom_kill_process() should do
> >
> > t = tsk;
> > do {
> > list_for_each_entry(child, &t->children, sibling) {
> > ... take child into account ...
> > }
> > } while_each_thread(tsk, t);
> >
>
> In this case, it seems more appropriate that we would penalize T and not M
We can't. Any fatal signal sent to any sub-thread kills the whole thread
group. It is not possible to kill T but not M.
> since it's not necessarily responsible for the behavior of the children it
> forks. T is the buggy/malicious program, not M.
Since a) they share the same ->mm and b) they share their children, I
don't think we should separate T and M.
->children is per-thread. But this is only because we have some strange
historical oddities like __WNOTHREAD. Otherwise, it is not correct to
assume that the child of T is not the child of M. Any process is the
child of its parent's thread group, not the thread which actually called
fork().
> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -97,13 +97,16 @@ static unsigned long oom_forkbomb_penalt
> > return 0;
> > list_for_each_entry(child, &tsk->children, sibling) {
> > struct task_cputime task_time;
> > - unsigned long runtime;
> > + unsigned long runtime, this_rss;
> >
> > task_lock(child);
> > if (!child->mm || child->mm == tsk->mm) {
> > task_unlock(child);
> > continue;
> > }
> > + this_rss = get_mm_rss(child->mm);
> > + task_unlock(child);
> > +
> > /*
>
> This patch looks good, will you send it to Andrew with a changelog and
> sign-off line? Also feel free to add:
>
> Acked-by: David Rientjes <rientjes@google.com>
Thanks! already in -mm.
Oleg.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-08 18:05 ` David Rientjes
@ 2010-04-21 19:17 ` Andrew Morton
2010-04-21 22:04 ` David Rientjes
` (2 more replies)
0 siblings, 3 replies; 197+ messages in thread
From: Andrew Morton @ 2010-04-21 19:17 UTC (permalink / raw)
To: David Rientjes
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
Balbir Singh, linux-mm
fyi, I still consider these patches to be in the "stuck" state. So we
need to get them unstuck.
Hiroyuki (and anyone else): could you please summarise in the briefest
way possible what your objections are to David's oom-killer changes?
I'll start: we don't change the kernel ABI. Ever. And when we _do_
change it we don't change it without warning.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-21 19:17 ` Andrew Morton
@ 2010-04-21 22:04 ` David Rientjes
2010-04-22 0:23 ` KAMEZAWA Hiroyuki
2010-04-27 22:58 ` [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task David Rientjes
2010-04-22 7:23 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found Nick Piggin
2010-05-04 23:55 ` David Rientjes
2 siblings, 2 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-21 22:04 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
Balbir Singh, linux-mm
On Wed, 21 Apr 2010, Andrew Morton wrote:
> fyi, I still consider these patches to be in the "stuck" state. So we
> need to get them unstuck.
>
>
> Hiroyuki (and anyone else): could you please summarise in the briefest
> way possible what your objections are to David's oom-killer changes?
>
> I'll start: we don't change the kernel ABI. Ever. And when we _do_
> change it we don't change it without warning.
>
I'm not going to allow a simple cleanup to jeopardize the entire patchset,
so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task
that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then
warn about its deprecation. I don't believe we need to do the same thing
for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is
now enabled by default.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-21 22:04 ` David Rientjes
@ 2010-04-22 0:23 ` KAMEZAWA Hiroyuki
2010-04-22 8:34 ` David Rientjes
2010-04-27 22:58 ` [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task David Rientjes
1 sibling, 1 reply; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22 0:23 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KOSAKI Motohiro, anfei, nishimura, Balbir Singh, linux-mm
On Wed, 21 Apr 2010 15:04:27 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> On Wed, 21 Apr 2010, Andrew Morton wrote:
>
> > fyi, I still consider these patches to be in the "stuck" state. So we
> > need to get them unstuck.
> >
> >
> > Hiroyuki (and anyone else): could you please summarise in the briefest
> > way possible what your objections are to David's oom-killer changes?
> >
> > I'll start: we don't change the kernel ABI. Ever. And when we _do_
> > change it we don't change it without warning.
> >
>
> I'm not going to allow a simple cleanup to jeopardize the entire patchset,
> so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task
> that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then
> warn about its deprecation.
Yeah, I welcome it.
> I don't believe we need to do the same thing
> for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is
> now enabled by default.
>
> But a *warning* is always appreciated and will not make the whole patch
> set too dirty. So, please write one.
> BTW, I don't think there is an admin who turns off oom_dump_tasks.
> So just keeping the interface and putting this one on the feature-removal
> list is okay with me, if you want to clean up the sysctls eventually.
> Speaking for myself, I also want to remove/clean up some rarely used
> interfaces under memcg. But I don't, because we have users, and I won't
> clean them up as long as we can maintain them. That is why we have to be
> careful about adding interfaces.
Thanks,
-Kame
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-21 19:17 ` Andrew Morton
2010-04-21 22:04 ` David Rientjes
@ 2010-04-22 7:23 ` Nick Piggin
2010-04-22 7:25 ` KAMEZAWA Hiroyuki
2010-05-04 23:55 ` David Rientjes
2 siblings, 1 reply; 197+ messages in thread
From: Nick Piggin @ 2010-04-22 7:23 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei,
nishimura, Balbir Singh, linux-mm
On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
>
> fyi, I still consider these patches to be in the "stuck" state. So we
> need to get them unstuck.
>
>
> Hiroyuki (and anyone else): could you please summarise in the briefest
> way possible what your objections are to David's oom-killer changes?
>
> I'll start: we don't change the kernel ABI. Ever. And when we _do_
> change it we don't change it without warning.
How is this turning into such a big issue? It is totally ridiculous.
It is not even a "cleanup".
Just drop the ABI-changing patches, and I think the rest of them looked
OK, didn't they?
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 7:23 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found Nick Piggin
@ 2010-04-22 7:25 ` KAMEZAWA Hiroyuki
2010-04-22 10:09 ` Nick Piggin
0 siblings, 1 reply; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22 7:25 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
Balbir Singh, linux-mm
On Thu, 22 Apr 2010 17:23:19 +1000
Nick Piggin <npiggin@suse.de> wrote:
> On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> >
> > fyi, I still consider these patches to be in the "stuck" state. So we
> > need to get them unstuck.
> >
> >
> > Hiroyuki (and anyone else): could you please summarise in the briefest
> > > way possible what your objections are to David's oom-killer changes?
> >
> > I'll start: we don't change the kernel ABI. Ever. And when we _do_
> > change it we don't change it without warning.
>
> How is this turning into such a big issue? It is totally ridiculous.
> It is not even a "cleanup".
>
> Just drop the ABI-changing patches, and I think the rest of them looked
> OK, didn't they?
>
I agree with you.
-Kame
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 0:23 ` KAMEZAWA Hiroyuki
@ 2010-04-22 8:34 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-22 8:34 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, KOSAKI Motohiro, anfei, nishimura, Balbir Singh, linux-mm
On Thu, 22 Apr 2010, KAMEZAWA Hiroyuki wrote:
> > I'm not going to allow a simple cleanup to jeopardize the entire patchset,
> > so I can write a patch that readds /proc/sys/vm/oom_kill_allocating_task
> > that simply mirrors the setting of /proc/sys/vm/oom_kill_quick and then
> > warn about its deprecation.
>
> Yeah, I welcome it.
>
Ok, good.
> > I don't believe we need to do the same thing
> > for the removal of /proc/sys/vm/oom_dump_tasks since that functionality is
> > now enabled by default.
> >
>
> But a *warning* is always appreciated and will not make the whole patchset
> too dirty. So, please write one.
>
> BTW, I don't think there is an admin who turns off oom_dump_tasks...
> So, just keeping the interface and putting this one on the feature-removal
> list is okay for me if you want to clean up the sysctl.
>
Do we really need to keep oom_dump_tasks around since the result of this
patchset is that we've enabled it by default? It seems to me that users
who now want to disable it (something nobody is currently doing, since
disabled is the default in Linus' tree) can simply do
echo 1 > /proc/sys/vm/oom_kill_quick
instead, to suppress the tasklist scan for both the dump and the target
selection.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 7:25 ` KAMEZAWA Hiroyuki
@ 2010-04-22 10:09 ` Nick Piggin
2010-04-22 10:27 ` KAMEZAWA Hiroyuki
2010-04-22 10:28 ` David Rientjes
0 siblings, 2 replies; 197+ messages in thread
From: Nick Piggin @ 2010-04-22 10:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
Balbir Singh, linux-mm
On Thu, Apr 22, 2010 at 04:25:36PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 Apr 2010 17:23:19 +1000
> Nick Piggin <npiggin@suse.de> wrote:
>
> > On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> > >
> > > fyi, I still consider these patches to be in the "stuck" state. So we
> > > need to get them unstuck.
> > >
> > >
> > > Hiroyuki (and anyone else): could you please summarise in the briefest
> > > way possible what your objections are to David's oom-killer changes?
> > >
> > > I'll start: we don't change the kernel ABI. Ever. And when we _do_
> > > change it we don't change it without warning.
> >
> > How is this turning into such a big issue? It is totally ridiculous.
> > It is not even a "cleanup".
> >
> > Just drop the ABI-changing patches, and I think the rest of them looked
> > OK, didn't they?
> >
> I agree with you.
Oh actually what happened with the pagefault OOM / panic on oom thing?
We were talking around in circles about that too.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 10:09 ` Nick Piggin
@ 2010-04-22 10:27 ` KAMEZAWA Hiroyuki
2010-04-22 21:11 ` David Rientjes
2010-04-22 10:28 ` David Rientjes
1 sibling, 1 reply; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-22 10:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Andrew Morton, David Rientjes, KOSAKI Motohiro, anfei, nishimura,
Balbir Singh, linux-mm
On Thu, 22 Apr 2010 20:09:44 +1000
Nick Piggin <npiggin@suse.de> wrote:
> On Thu, Apr 22, 2010 at 04:25:36PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 22 Apr 2010 17:23:19 +1000
> > Nick Piggin <npiggin@suse.de> wrote:
> >
> > > On Wed, Apr 21, 2010 at 12:17:58PM -0700, Andrew Morton wrote:
> > > >
> > > > fyi, I still consider these patches to be in the "stuck" state. So we
> > > > need to get them unstuck.
> > > >
> > > >
> > > > Hiroyuki (and anyone else): could you please summarise in the briefest
> > > > way possible what your objections are to David's oom-killer changes?
> > > >
> > > > I'll start: we don't change the kernel ABI. Ever. And when we _do_
> > > > change it we don't change it without warning.
> > >
> > > How is this turning into such a big issue? It is totally ridiculous.
> > > It is not even a "cleanup".
> > >
> > > Just drop the ABI-changing patches, and I think the rest of them looked
> > > OK, didn't they?
> > >
> > I agree with you.
>
> Oh actually what happened with the pagefault OOM / panic on oom thing?
> We were talking around in circles about that too.
>
Hmm... checking again.
Maybe the related patches are:
1: oom-remove-special-handling-for-pagefault-ooms.patch
2: oom-default-to-killing-current-for-pagefault-ooms.patch
IIUC, (1) doesn't change behavior. But (2)...
Before (1)
- pagefault-oom kills someone by out_of_memory().
After (1)
- pagefault-oom calls out_of_memory() only when someone isn't being killed.
So, this patch helps to avoid a double kill, and I like this change.
Before (2)
At pagefault-out-of-memory
- panic_on_oom==2, panic always.
- panic_on_oom==1, panic when CONSTRAINT_NONE.
After (2)
At pagefault-out-of-memory, if there is no running OOM kill,
current is always killed. In this case, panic_on_oom doesn't work.
I think panic_on_oom==2 should still work... Hmm, why does this behavior change?
Thanks,
-Kame
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 10:09 ` Nick Piggin
2010-04-22 10:27 ` KAMEZAWA Hiroyuki
@ 2010-04-22 10:28 ` David Rientjes
2010-04-22 15:39 ` Nick Piggin
1 sibling, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-22 10:28 UTC (permalink / raw)
To: Nick Piggin
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
nishimura, Balbir Singh, linux-mm
On Thu, 22 Apr 2010, Nick Piggin wrote:
> Oh actually what happened with the pagefault OOM / panic on oom thing?
> We were talking around in circles about that too.
>
The oom killer rewrite attempts to kill current first, if possible, and
then will panic if panic_on_oom is set before falling back to selecting a
victim. This is consistent with all other architectures such as powerpc
that currently do not use pagefault_out_of_memory(). If all architectures
are eventually going to be converted to using pagefault_out_of_memory()
with additional work on top of -mm, it would be possible to define
consistent panic_on_oom semantics for this case. I welcome such an
addition since I believe it's a natural extension of panic_on_oom, but I
believe it should be done consistently so the sysctl doesn't have
different semantics depending on the underlying arch.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 10:28 ` David Rientjes
@ 2010-04-22 15:39 ` Nick Piggin
2010-04-22 21:09 ` David Rientjes
0 siblings, 1 reply; 197+ messages in thread
From: Nick Piggin @ 2010-04-22 15:39 UTC (permalink / raw)
To: David Rientjes
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
nishimura, Balbir Singh, linux-mm
On Thu, Apr 22, 2010 at 03:28:38AM -0700, David Rientjes wrote:
> On Thu, 22 Apr 2010, Nick Piggin wrote:
>
> > Oh actually what happened with the pagefault OOM / panic on oom thing?
> > We were talking around in circles about that too.
> >
>
> The oom killer rewrite attempts to kill current first, if possible, and
> then will panic if panic_on_oom is set before falling back to selecting a
> victim.
See, this is what we want to avoid. If the user sets panic_on_oom,
it is because they want the system to panic on oom. Not to kill
tasks and try to continue. The user does not know or care in the
slightest about "page fault oom". So I don't know why you think this
is a good idea.
> This is consistent with all other architectures such as powerpc
> that currently do not use pagefault_out_of_memory(). If all architectures
> are eventually going to be converted to using pagefault_out_of_memory()
Yes, architectures are going to be converted, it has already been
agreed, I dropped the ball and lazily hoped the arch people would do it.
But further work done should be to make it consistent in the right way,
not the wrong way.
> with additional work on top of -mm, it would be possible to define
> consistent panic_on_oom semantics for this case. I welcome such an
> addition since I believe it's a natural extension of panic_on_oom, but I
> believe it should be done consistently so the sysctl doesn't have
> different semantics depending on the underlying arch.
It's simply a bug rather than intentional semantics. "pagefault oom"
is basically a meaningless semantic for the user.
Let's do a deal. I'll split up the below patch and send it to arch
maintainers, and you don't change the sysctl interface or "fix" the
pagefault oom path.
--
Index: linux-2.6/arch/alpha/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/alpha/mm/fault.c
+++ linux-2.6/arch/alpha/mm/fault.c
@@ -188,16 +188,10 @@ do_page_fault(unsigned long address, uns
/* We ran out of memory, or some other thing happened to us that
made us unable to handle the page fault gracefully. */
out_of_memory:
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk(KERN_ALERT "VM: killing process %s(%d)\n",
- current->comm, task_pid_nr(current));
if (!user_mode(regs))
goto no_context;
- do_group_exit(SIGKILL);
+ pagefault_out_of_memory();
+ return;
do_sigbus:
/* Send a sigbus, regardless of whether we were in kernel
Index: linux-2.6/arch/avr32/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/avr32/mm/fault.c
+++ linux-2.6/arch/avr32/mm/fault.c
@@ -211,15 +211,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: Killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ pagefault_out_of_memory();
+ if (!user_mode(regs))
+ goto no_context;
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/cris/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/cris/mm/fault.c
+++ linux-2.6/arch/cris/mm/fault.c
@@ -245,10 +245,10 @@ do_page_fault(unsigned long address, str
out_of_memory:
up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/frv/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/frv/mm/fault.c
+++ linux-2.6/arch/frv/mm/fault.c
@@ -257,10 +257,10 @@ asmlinkage void do_page_fault(int datamm
*/
out_of_memory:
up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", current->comm);
- if (user_mode(__frame))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(__frame))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/ia64/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/ia64/mm/fault.c
+++ linux-2.6/arch/ia64/mm/fault.c
@@ -276,13 +276,7 @@ ia64_do_page_fault (unsigned long addres
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk(KERN_CRIT "VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
}
Index: linux-2.6/arch/m32r/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/m32r/mm/fault.c
+++ linux-2.6/arch/m32r/mm/fault.c
@@ -271,15 +271,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", tsk->comm);
if (error_code & ACE_USERMODE)
- do_group_exit(SIGKILL);
- goto no_context;
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/m68k/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/m68k/mm/fault.c
+++ linux-2.6/arch/m68k/mm/fault.c
@@ -180,15 +180,10 @@ good_area:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
-
- printk("VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
no_context:
current->thread.signo = SIGBUS;
Index: linux-2.6/arch/microblaze/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/microblaze/mm/fault.c
+++ linux-2.6/arch/microblaze/mm/fault.c
@@ -273,16 +273,11 @@ bad_area_nosemaphore:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- if (current->pid == 1) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
up_read(&mm->mmap_sem);
- printk(KERN_WARNING "VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- bad_page_fault(regs, address, SIGKILL);
+ if (!user_mode(regs))
+ bad_page_fault(regs, address, SIGKILL);
+ else
+ pagefault_out_of_memory();
return;
do_sigbus:
Index: linux-2.6/arch/mn10300/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/mn10300/mm/fault.c
+++ linux-2.6/arch/mn10300/mm/fault.c
@@ -338,11 +338,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- monitor_signal(regs);
- printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
- if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
- do_exit(SIGKILL);
- goto no_context;
+ if ((fault_code & MMUFCR_xFC_ACCESS) != MMUFCR_xFC_ACCESS_USR)
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/parisc/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/parisc/mm/fault.c
+++ linux-2.6/arch/parisc/mm/fault.c
@@ -264,8 +264,7 @@ no_context:
out_of_memory:
up_read(&mm->mmap_sem);
- printk(KERN_CRIT "VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
}
Index: linux-2.6/arch/powerpc/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/powerpc/mm/fault.c
+++ linux-2.6/arch/powerpc/mm/fault.c
@@ -359,15 +359,10 @@ bad_area_nosemaphore:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- return SIGKILL;
+ if (!user_mode(regs))
+ return SIGKILL;
+ pagefault_out_of_memory();
+ return 0;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/score/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/score/mm/fault.c
+++ linux-2.6/arch/score/mm/fault.c
@@ -167,15 +167,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/sh/mm/fault_32.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/fault_32.c
+++ linux-2.6/arch/sh/mm/fault_32.c
@@ -290,15 +290,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
Index: linux-2.6/arch/sh/mm/tlbflush_64.c
===================================================================
--- linux-2.6.orig/arch/sh/mm/tlbflush_64.c
+++ linux-2.6/arch/sh/mm/tlbflush_64.c
@@ -294,22 +294,11 @@ no_context:
* us unable to handle the page fault gracefully.
*/
out_of_memory:
- if (is_global_init(current)) {
- panic("INIT out of memory\n");
- yield();
- goto survive;
- }
- printk("fault:Out of memory\n");
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
printk("fault:Do sigbus\n");
Index: linux-2.6/arch/xtensa/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/xtensa/mm/fault.c
+++ linux-2.6/arch/xtensa/mm/fault.c
@@ -146,15 +146,10 @@ bad_area:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(current)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
- printk("VM: killing process %s\n", current->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- bad_page_fault(regs, address, SIGKILL);
+ if (!user_mode(regs))
+ bad_page_fault(regs, address, SIGKILL);
+ else
+ pagefault_out_of_memory();
return;
do_sigbus:
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 15:39 ` Nick Piggin
@ 2010-04-22 21:09 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-22 21:09 UTC (permalink / raw)
To: Nick Piggin
Cc: KAMEZAWA Hiroyuki, Andrew Morton, KOSAKI Motohiro, anfei,
nishimura, Balbir Singh, linux-mm
On Fri, 23 Apr 2010, Nick Piggin wrote:
> > The oom killer rewrite attempts to kill current first, if possible, and
> > then will panic if panic_on_oom is set before falling back to selecting a
> > victim.
>
> See, this is what we want to avoid. If the user sets panic_on_oom,
> it is because they want the system to panic on oom. Not to kill
> tasks and try to continue. The user does not know or care in the
> slightest about "page fault oom". So I don't know why you think this
> is a good idea.
>
Unless we unify the behavior of panic_on_oom, it would be possible for the
architectures not yet converted by your patch series to using
pagefault_out_of_memory() to kill tasks even if they have OOM_DISABLE
set. So, as it sits this second in -mm, the system will still try to kill
current first so that all architectures are consistent. Once all
architectures use pagefault_out_of_memory(), we can simply add
if (sysctl_panic_on_oom) {
read_lock(&tasklist_lock);
dump_header(NULL, 0, 0, NULL);
read_unlock(&tasklist_lock);
panic("Out of memory: panic_on_oom is enabled\n");
}
to pagefault_out_of_memory(). I simply opted for consistency across all
architectures before that was done.
> > This is consistent with all other architectures such as powerpc
> > that currently do not use pagefault_out_of_memory(). If all architectures
> > are eventually going to be converted to using pagefault_out_of_memory()
>
> Yes, architectures are going to be converted, it has already been
> agreed, I dropped the ball and lazily hoped the arch people would do it.
> But further work done should be to make it consistent in the right way,
> not the wrong way.
>
Thanks for doing the work and proposing the patchset; there were a couple
of patches that looked like they needed a v2, but overall it looked very
good. Once they're merged in upstream, I think we can add the panic to
pagefault_out_of_memory() in -mm since they'll probably make it to Linus
before the oom killer rewrite at the speed we're going here.
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-22 10:27 ` KAMEZAWA Hiroyuki
@ 2010-04-22 21:11 ` David Rientjes
0 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-04-22 21:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Nick Piggin, Andrew Morton, KOSAKI Motohiro, anfei, nishimura,
Balbir Singh, linux-mm
On Thu, 22 Apr 2010, KAMEZAWA Hiroyuki wrote:
> Hmm...checking again.
>
> Maybe related patches are:
> 1: oom-remove-special-handling-for-pagefault-ooms.patch
> 2: oom-default-to-killing-current-for-pagefault-ooms.patch
>
> IIUC, (1) doesn't change behavior. But (2)...
>
> Before(1)
> - pagefault-oom kills someone by out_of_memory().
> After (1)
> - pagefault-oom calls out_of_memory() only when someone isn't being killed.
>
> So, this patch helps to avoid double-kill and I like this change.
>
> Before (2)
> At pagefault-out-of-memory
> - panic_on_oom==2, panic always.
> - panic_on_oom==1, panic when CONSTRAINT_NONE.
>
> After (2)
> At pagefault-out-of-memory, if there is no running OOM kill,
> current is always killed. In this case, panic_on_oom doesn't work.
>
> I think panic_on_oom==2 should still work... Hmm, why does this behavior change?
>
We can re-add the panic_on_oom code once Nick's patchset is merged, which
unifies all architectures in using pagefault_out_of_memory() for
VM_FAULT_OOM. Otherwise, some architectures would panic in this case and
others would not (while they allow tasks to be SIGKILL'd even when
panic_on_oom == 2 is set, including OOM_DISABLE tasks!) so I think it's
better to be entirely consistent with sysctl semantics across
architectures.
* [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task
2010-04-21 22:04 ` David Rientjes
2010-04-22 0:23 ` KAMEZAWA Hiroyuki
@ 2010-04-27 22:58 ` David Rientjes
2010-04-28 0:57 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 197+ messages in thread
From: David Rientjes @ 2010-04-27 22:58 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
Balbir Singh, Nick Piggin, linux-mm
There's a concern that removing /proc/sys/vm/oom_kill_allocating_task
will unnecessarily break the userspace API as the result of the oom
killer rewrite.
This patch reintroduces the sysctl and deprecates it by adding an entry
to Documentation/feature-removal-schedule.txt with a suggested removal
date of December 2011 and emitting a warning the first time it is written
including the writing task's name and pid.
/proc/sys/vm/oom_kill_allocating_task mirrors the value of
/proc/sys/vm/oom_kill_quick.
Signed-off-by: David Rientjes <rientjes@google.com>
---
Documentation/feature-removal-schedule.txt | 19 +++++++++++++++++++
include/linux/oom.h | 2 ++
kernel/sysctl.c | 7 +++++++
mm/oom_kill.c | 14 ++++++++++++++
4 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -204,6 +204,25 @@ Who: David Rientjes <rientjes@google.com>
---------------------------
+What: /proc/sys/vm/oom_kill_allocating_task
+When: December 2011
+Why: /proc/sys/vm/oom_kill_allocating_task is equivalent to
+ /proc/sys/vm/oom_kill_quick. The two sysctls will mirror each other's
+ value when set.
+
+ Existing users of /proc/sys/vm/oom_kill_allocating_task should simply
+ write a non-zero value to /proc/sys/vm/oom_kill_quick. This will also
+ suppress a costly tasklist scan when dumping VM information for all
+ oom kill candidates.
+
+ A warning will be emitted to the kernel log if an application uses this
+ deprecated interface. After it is printed once, future warnings will be
+ suppressed until the kernel is rebooted.
+
+Who: David Rientjes <rientjes@google.com>
+
+---------------------------
+
What: remove EXPORT_SYMBOL(kernel_thread)
When: August 2006
Files: arch/*/kernel/*_ksyms.c
diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -67,5 +67,7 @@ extern int sysctl_panic_on_oom;
extern int sysctl_oom_forkbomb_thres;
extern int sysctl_oom_kill_quick;
+extern int oom_kill_allocating_task_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
#endif /* __KERNEL__*/
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -983,6 +983,13 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "oom_kill_allocating_task",
+ .data = &sysctl_oom_kill_quick,
+ .maxlen = sizeof(sysctl_oom_kill_quick),
+ .mode = 0644,
+ .proc_handler = oom_kill_allocating_task_handler,
+ },
+ {
.procname = "oom_forkbomb_thres",
.data = &sysctl_oom_forkbomb_thres,
.maxlen = sizeof(sysctl_oom_forkbomb_thres),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -37,6 +37,20 @@ int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
int sysctl_oom_kill_quick;
static DEFINE_SPINLOCK(zone_scan_lock);
+int oom_kill_allocating_task_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret;
+
+ ret = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (!ret && write)
+ printk_once(KERN_WARNING "%s (%d): "
+ "/proc/sys/vm/oom_kill_allocating_task is deprecated, "
+ "please use /proc/sys/vm/oom_kill_quick instead.\n",
+ current->comm, task_pid_nr(current));
+ return ret;
+}
+
/*
* Do all threads of the target process overlap our allowed nodes?
* @tsk: task struct of which task to consider
* Re: [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task
2010-04-27 22:58 ` [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task David Rientjes
@ 2010-04-28 0:57 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 197+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-28 0:57 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, KOSAKI Motohiro, anfei, nishimura, Balbir Singh,
Nick Piggin, linux-mm
On Tue, 27 Apr 2010 15:58:41 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:
> There's a concern that removing /proc/sys/vm/oom_kill_allocating_task
> will unnecessarily break the userspace API as the result of the oom
> killer rewrite.
>
> This patch reintroduces the sysctl and deprecates it by adding an entry
> to Documentation/feature-removal-schedule.txt with a suggested removal
> date of December 2011 and emitting a warning the first time it is written
> including the writing task's name and pid.
>
> /proc/sys/vm/oom_kill_allocating_task mirrors the value of
> /proc/sys/vm/oom_kill_quick.
>
> Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
> Documentation/feature-removal-schedule.txt | 19 +++++++++++++++++++
> include/linux/oom.h | 2 ++
> kernel/sysctl.c | 7 +++++++
> mm/oom_kill.c | 14 ++++++++++++++
> 4 files changed, 42 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
> --- a/Documentation/feature-removal-schedule.txt
> +++ b/Documentation/feature-removal-schedule.txt
> @@ -204,6 +204,25 @@ Who: David Rientjes <rientjes@google.com>
>
> ---------------------------
>
> +What: /proc/sys/vm/oom_kill_allocating_task
> +When: December 2011
> +Why: /proc/sys/vm/oom_kill_allocating_task is equivalent to
> + /proc/sys/vm/oom_kill_quick. The two sysctls will mirror each other's
> + value when set.
> +
> + Existing users of /proc/sys/vm/oom_kill_allocating_task should simply
> + write a non-zero value to /proc/sys/vm/oom_kill_quick. This will also
> + suppress a costly tasklist scan when dumping VM information for all
> + oom kill candidates.
> +
> + A warning will be emitted to the kernel log if an application uses this
> + deprecated interface. After it is printed once, future warnings will be
> + suppressed until the kernel is rebooted.
> +
> +Who: David Rientjes <rientjes@google.com>
> +
> +---------------------------
> +
> What: remove EXPORT_SYMBOL(kernel_thread)
> When: August 2006
> Files: arch/*/kernel/*_ksyms.c
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -67,5 +67,7 @@ extern int sysctl_panic_on_oom;
> extern int sysctl_oom_forkbomb_thres;
> extern int sysctl_oom_kill_quick;
>
> +extern int oom_kill_allocating_task_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos);
> #endif /* __KERNEL__*/
> #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -983,6 +983,13 @@ static struct ctl_table vm_table[] = {
> .proc_handler = proc_dointvec,
> },
> {
> + .procname = "oom_kill_allocating_task",
> + .data = &sysctl_oom_kill_quick,
> + .maxlen = sizeof(sysctl_oom_kill_quick),
> + .mode = 0644,
> + .proc_handler = oom_kill_allocating_task_handler,
> + },
> + {
> .procname = "oom_forkbomb_thres",
> .data = &sysctl_oom_forkbomb_thres,
> .maxlen = sizeof(sysctl_oom_forkbomb_thres),
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -37,6 +37,20 @@ int sysctl_oom_forkbomb_thres = DEFAULT_OOM_FORKBOMB_THRES;
> int sysctl_oom_kill_quick;
> static DEFINE_SPINLOCK(zone_scan_lock);
>
> +int oom_kill_allocating_task_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> + int ret;
> +
> + ret = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (!ret && write)
> + printk_once(KERN_WARNING "%s (%d): "
> + "/proc/sys/vm/oom_kill_allocating_task is deprecated, "
> + "please use /proc/sys/vm/oom_kill_quick instead.\n",
> + current->comm, task_pid_nr(current));
> + return ret;
> +}
> +
> /*
> * Do all threads of the target process overlap our allowed nodes?
> * @tsk: task struct of which task to consider
>
* Re: [patch -mm] memcg: make oom killer a no-op when no killable task can be found
2010-04-21 19:17 ` Andrew Morton
2010-04-21 22:04 ` David Rientjes
2010-04-22 7:23 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found Nick Piggin
@ 2010-05-04 23:55 ` David Rientjes
2 siblings, 0 replies; 197+ messages in thread
From: David Rientjes @ 2010-05-04 23:55 UTC (permalink / raw)
To: Andrew Morton
Cc: KOSAKI Motohiro, KAMEZAWA Hiroyuki, anfei, nishimura,
Balbir Singh, linux-mm
On Wed, 21 Apr 2010, Andrew Morton wrote:
>
> fyi, I still consider these patches to be in the "stuck" state. So we
> need to get them unstuck.
>
>
> Hiroyuki (and anyone else): could you please summarise in the briefest
way possible what your objections are to David's oom-killer changes?
>
> I'll start: we don't change the kernel ABI. Ever. And when we _do_
> change it we don't change it without warning.
>
Have we resolved all of the outstanding discussion concerning the oom
killer rewrite? I'm not aware of any pending issues.
end of thread
Thread overview: 197+ messages
2010-03-24 16:25 [PATCH] oom killer: break from infinite loop Anfei Zhou
2010-03-25 2:51 ` KOSAKI Motohiro
2010-03-26 22:08 ` Andrew Morton
2010-03-26 22:33 ` Oleg Nesterov
2010-03-28 14:55 ` anfei
2010-03-28 16:28 ` Oleg Nesterov
2010-03-28 21:21 ` David Rientjes
2010-03-29 11:21 ` Oleg Nesterov
2010-03-29 20:49 ` [patch] oom: give current access to memory reserves if it has been killed David Rientjes
2010-03-30 15:46 ` Oleg Nesterov
2010-03-30 20:26 ` David Rientjes
2010-03-31 17:58 ` Oleg Nesterov
2010-03-31 20:47 ` Oleg Nesterov
2010-04-01 8:35 ` David Rientjes
2010-04-01 8:57 ` [patch -mm] oom: hold tasklist_lock when dumping tasks David Rientjes
2010-04-01 14:27 ` Oleg Nesterov
2010-04-01 19:16 ` David Rientjes
2010-04-01 13:59 ` [patch] oom: give current access to memory reserves if it has been killed Oleg Nesterov
2010-04-01 14:00 ` Oleg Nesterov
2010-04-01 19:12 ` David Rientjes
2010-04-02 11:14 ` Oleg Nesterov
2010-04-02 18:30 ` [PATCH -mm 0/4] oom: linux has threads Oleg Nesterov
2010-04-02 18:31 ` [PATCH -mm 1/4] oom: select_bad_process: check PF_KTHREAD instead of !mm to skip kthreads Oleg Nesterov
2010-04-02 19:05 ` David Rientjes
2010-04-02 18:32 ` [PATCH -mm 2/4] oom: select_bad_process: PF_EXITING check should take ->mm into account Oleg Nesterov
2010-04-06 11:42 ` anfei
2010-04-06 12:18 ` Oleg Nesterov
2010-04-06 13:05 ` anfei
2010-04-06 13:38 ` Oleg Nesterov
2010-04-02 18:32 ` [PATCH -mm 3/4] oom: introduce find_lock_task_mm() to fix !mm false positives Oleg Nesterov
2010-04-02 18:33 ` [PATCH -mm 4/4] oom: oom_forkbomb_penalty: move thread_group_cputime() out of task_lock() Oleg Nesterov
2010-04-02 19:04 ` David Rientjes
2010-04-05 14:23 ` [PATCH -mm] oom: select_bad_process: never choose tasks with badness == 0 Oleg Nesterov
2010-04-02 19:02 ` [patch] oom: give current access to memory reserves if it has been killed David Rientjes
2010-04-02 19:14 ` Oleg Nesterov
2010-04-02 19:46 ` David Rientjes
2010-04-02 19:54 ` [patch -mm] oom: exclude tasks with badness score of 0 from being selected David Rientjes
2010-04-02 21:04 ` Oleg Nesterov
2010-04-02 21:22 ` [patch -mm v2] " David Rientjes
2010-04-02 20:55 ` [patch] oom: give current access to memory reserves if it has been killed Oleg Nesterov
2010-03-31 21:07 ` David Rientjes
2010-03-31 22:50 ` Oleg Nesterov
2010-03-31 23:30 ` Oleg Nesterov
2010-03-31 23:48 ` David Rientjes
2010-04-01 14:39 ` Oleg Nesterov
2010-04-01 18:58 ` David Rientjes
2010-04-01 8:25 ` David Rientjes
2010-04-01 15:26 ` Oleg Nesterov
2010-04-08 21:08 ` David Rientjes
2010-04-09 12:38 ` Oleg Nesterov
2010-03-30 16:39 ` [PATCH] oom: fix the unsafe proc_oom_score()->badness() call Oleg Nesterov
2010-03-30 17:43 ` [PATCH -mm] proc: don't take ->siglock for /proc/pid/oom_adj Oleg Nesterov
2010-03-30 20:30 ` David Rientjes
2010-03-31 9:17 ` Oleg Nesterov
2010-03-31 18:59 ` Oleg Nesterov
2010-03-31 21:14 ` David Rientjes
2010-03-31 23:00 ` Oleg Nesterov
2010-04-01 8:32 ` David Rientjes
2010-04-01 15:37 ` Oleg Nesterov
2010-04-01 19:04 ` David Rientjes
2010-03-30 20:32 ` [PATCH] oom: fix the unsafe proc_oom_score()->badness() call David Rientjes
2010-03-31 9:16 ` Oleg Nesterov
2010-03-31 20:17 ` Oleg Nesterov
2010-04-01 7:41 ` David Rientjes
2010-04-01 13:13 ` [PATCH 0/1] oom: fix the unsafe usage of badness() in proc_oom_score() Oleg Nesterov
2010-04-01 13:13 ` [PATCH 1/1] " Oleg Nesterov
2010-04-01 19:03 ` David Rientjes
2010-03-29 14:06 ` [PATCH] oom killer: break from infinite loop anfei
2010-03-29 20:01 ` David Rientjes
2010-03-30 14:29 ` anfei
2010-03-30 20:29 ` David Rientjes
2010-03-31 0:57 ` KAMEZAWA Hiroyuki
2010-03-31 6:07 ` David Rientjes
2010-03-31 6:13 ` KAMEZAWA Hiroyuki
2010-03-31 6:30 ` Balbir Singh
2010-03-31 6:31 ` KAMEZAWA Hiroyuki
2010-03-31 7:04 ` David Rientjes
2010-03-31 6:32 ` David Rientjes
2010-03-31 7:08 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found David Rientjes
2010-03-31 7:08 ` KAMEZAWA Hiroyuki
2010-03-31 8:04 ` Balbir Singh
2010-03-31 10:38 ` David Rientjes
2010-04-04 23:28 ` David Rientjes
2010-04-05 21:30 ` Andrew Morton
2010-04-05 22:40 ` David Rientjes
2010-04-05 22:49 ` Andrew Morton
2010-04-05 23:01 ` David Rientjes
2010-04-06 12:08 ` KOSAKI Motohiro
2010-04-06 21:47 ` David Rientjes
2010-04-07 0:20 ` KAMEZAWA Hiroyuki
2010-04-07 13:29 ` KOSAKI Motohiro
2010-04-08 18:05 ` David Rientjes
2010-04-21 19:17 ` Andrew Morton
2010-04-21 22:04 ` David Rientjes
2010-04-22 0:23 ` KAMEZAWA Hiroyuki
2010-04-22 8:34 ` David Rientjes
2010-04-27 22:58 ` [patch -mm] oom: reintroduce and deprecate oom_kill_allocating_task David Rientjes
2010-04-28 0:57 ` KAMEZAWA Hiroyuki
2010-04-22 7:23 ` [patch -mm] memcg: make oom killer a no-op when no killable task can be found Nick Piggin
2010-04-22 7:25 ` KAMEZAWA Hiroyuki
2010-04-22 10:09 ` Nick Piggin
2010-04-22 10:27 ` KAMEZAWA Hiroyuki
2010-04-22 21:11 ` David Rientjes
2010-04-22 10:28 ` David Rientjes
2010-04-22 15:39 ` Nick Piggin
2010-04-22 21:09 ` David Rientjes
2010-05-04 23:55 ` David Rientjes
2010-04-08 17:36 ` David Rientjes
2010-04-02 10:17 ` [PATCH] oom killer: break from infinite loop Mel Gorman
2010-04-04 23:26 ` David Rientjes
2010-04-05 10:47 ` Mel Gorman
2010-04-06 22:40 ` David Rientjes
2010-03-29 11:31 ` anfei
2010-03-29 11:46 ` Oleg Nesterov
2010-03-29 12:09 ` anfei
2010-03-28 2:46 ` David Rientjes