[PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
@ 2015-05-25 14:33 Tetsuo Handa
  2015-05-26 17:02 ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-25 14:33 UTC (permalink / raw)
  To: linux-mm

>From 3728807fe66ebc24a8a28455593754b9532bbe74 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 25 May 2015 22:26:07 +0900
Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.

If the mm struct which the OOM victim is using is shared by e.g. 1000
threads, and the lock dependency prevents all threads except the OOM
victim thread from terminating until they get TIF_MEMDIE flag, the OOM
killer will be invoked for 1000 times on this mm struct. As a result,
the kernel would emit

  "Kill process %d (%s) sharing same memory\n"

line for 1000 * 1000 / 2 times. But once these threads got pending SIGKILL,
emitting this information is nothing but noise. This patch filters them.

Similarly,

  "[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n"

lines in dump_task() might be sufficient for once per each mm struct. But
this patch does not filter them because we want a marker field in the mm
struct and a lock for protecting the marker if we want to eliminate
duplication.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39..d0eccbb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -583,6 +583,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		    !(p->flags & PF_KTHREAD)) {
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
+			if (fatal_signal_pending(p))
+				continue;

 			task_lock(p);	/* Protect ->comm from prctl() */
 			pr_err("Kill process %d (%s) sharing same memory\n",
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-25 14:33 [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
@ 2015-05-26 17:02 ` Michal Hocko
  2015-05-26 21:39   ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-05-26 17:02 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 25-05-15 23:33:31, Tetsuo Handa wrote:
> >From 3728807fe66ebc24a8a28455593754b9532bbe74 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 25 May 2015 22:26:07 +0900
> Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
> 
> If the mm struct which the OOM victim is using is shared by e.g. 1000
> threads, and the lock dependency prevents all threads except the OOM
> victim thread from terminating until they get TIF_MEMDIE flag, the OOM
> killer will be invoked for 1000 times on this mm struct. As a result,
> the kernel would emit
> 
>   "Kill process %d (%s) sharing same memory\n"
> 
> line for 1000 * 1000 / 2 times. But once these threads got pending SIGKILL,
> emitting this information is nothing but noise. This patch filters them.

OK, I can see this might be really annoying. But reducing this message
will not help much because it is the dump_header which generates a lot
of output. And there is clearly no reason to treat the selected victim
any differently than the current so why not simply do the following
instead?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39b3268..a67ce18b4b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -505,7 +505,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
 	 */
 	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
+	if (p->mm && (fatal_signal_pending(p) || task_will_free_mem(p))) {
 		mark_oom_victim(p);
 		task_unlock(p);
 		put_task_struct(p);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-26 17:02 ` Michal Hocko
@ 2015-05-26 21:39   ` Tetsuo Handa
  2015-05-27 16:45     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-26 21:39 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Mon 25-05-15 23:33:31, Tetsuo Handa wrote:
> > >From 3728807fe66ebc24a8a28455593754b9532bbe74 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Mon, 25 May 2015 22:26:07 +0900
> > Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
> > 
> > If the mm struct which the OOM victim is using is shared by e.g. 1000
> > threads, and the lock dependency prevents all threads except the OOM
> > victim thread from terminating until they get TIF_MEMDIE flag, the OOM
> > killer will be invoked for 1000 times on this mm struct. As a result,
> > the kernel would emit
> > 
> >   "Kill process %d (%s) sharing same memory\n"
> > 
> > line for 1000 * 1000 / 2 times. But once these threads got pending SIGKILL,
> > emitting this information is nothing but noise. This patch filters them.
> 
> OK, I can see this might be really annoying. But reducing this message
> will not help much because it is the dump_header which generates a lot
> of output. And there is clearly no reason to treat the selected victim
> any differently than the current so why not simply do the following
> instead?
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39b3268..a67ce18b4b35 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -505,7 +505,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
>  	 */
>  	task_lock(p);
> -	if (p->mm && task_will_free_mem(p)) {
> +	if (p->mm && (fatal_signal_pending(p) || task_will_free_mem(p))) {
>  		mark_oom_victim(p);
>  		task_unlock(p);
>  		put_task_struct(p);
> 

I don't think this is good, for this will omit sending SIGKILL to threads
sharing p->mm ("Kill all user processes sharing victim->mm in other thread
groups, if any.") when p already has pending SIGKILL.

By the way, if p with p->mm && task_will_free_mem(p) can get stuck due to
memory allocation deadlock, is it OK that currently we are not sending SIGKILL
to threads sharing p->mm ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-26 21:39   ` Tetsuo Handa
@ 2015-05-27 16:45     ` Michal Hocko
  2015-05-27 21:59       ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-05-27 16:45 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Wed 27-05-15 06:39:42, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 25-05-15 23:33:31, Tetsuo Handa wrote:
> > > >From 3728807fe66ebc24a8a28455593754b9532bbe74 Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Date: Mon, 25 May 2015 22:26:07 +0900
> > > Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
> > > 
> > > If the mm struct which the OOM victim is using is shared by e.g. 1000
> > > threads, and the lock dependency prevents all threads except the OOM
> > > victim thread from terminating until they get TIF_MEMDIE flag, the OOM
> > > killer will be invoked for 1000 times on this mm struct. As a result,
> > > the kernel would emit
> > > 
> > >   "Kill process %d (%s) sharing same memory\n"
> > > 
> > > line for 1000 * 1000 / 2 times. But once these threads got pending SIGKILL,
> > > emitting this information is nothing but noise. This patch filters them.
> > 
> > OK, I can see this might be really annoying. But reducing this message
> > will not help much because it is the dump_header which generates a lot
> > of output. And there is clearly no reason to treat the selected victim
> > any differently than the current so why not simply do the following
> > instead?
> > ---
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5cfda39b3268..a67ce18b4b35 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -505,7 +505,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> >  	 */
> >  	task_lock(p);
> > -	if (p->mm && task_will_free_mem(p)) {
> > +	if (p->mm && (fatal_signal_pending(p) || task_will_free_mem(p))) {
> >  		mark_oom_victim(p);
> >  		task_unlock(p);
> >  		put_task_struct(p);
> > 
> 
> I don't think this is good, for this will omit sending SIGKILL to threads
> sharing p->mm ("Kill all user processes sharing victim->mm in other thread
> groups, if any.")

threads? The whole thread group will die when the fatal signal is
send to the group leader no? This mm sharing handling is about
processes which are sharing mm but they are not in the same thread group
(aka CLONE_VM without CLONE_SIGHAND resp. CLONE_THREAD).

> when p already has pending SIGKILL.

yes we can select a task which has SIGKILL already pending and then
we wouldn't kill other processes which share the same mm but does it
matter?  I do not think so. Because if this is really the case and the
OOM condition continues even after p exits (which is very probable but
p alone might release some resources and free memory) we will find a
process with the same mm in the next round.

> By the way, if p with p->mm && task_will_free_mem(p) can get stuck due to
> memory allocation deadlock, is it OK that currently we are not sending SIGKILL
> to threads sharing p->mm ?

I am not sure I understand the question. Threads will die automatically
because we are sending group signal.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-27 16:45     ` Michal Hocko
@ 2015-05-27 21:59       ` Tetsuo Handa
  2015-05-28 18:05         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-27 21:59 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Wed 27-05-15 06:39:42, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 25-05-15 23:33:31, Tetsuo Handa wrote:
> > > > >From 3728807fe66ebc24a8a28455593754b9532bbe74 Mon Sep 17 00:00:00 2001
> > > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > > Date: Mon, 25 May 2015 22:26:07 +0900
> > > > Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
> > > > 
> > > > If the mm struct which the OOM victim is using is shared by e.g. 1000
> > > > threads, and the lock dependency prevents all threads except the OOM
> > > > victim thread from terminating until they get TIF_MEMDIE flag, the OOM
> > > > killer will be invoked for 1000 times on this mm struct. As a result,
> > > > the kernel would emit
> > > > 
> > > >   "Kill process %d (%s) sharing same memory\n"
> > > > 
> > > > line for 1000 * 1000 / 2 times. But once these threads got pending SIGKILL,
> > > > emitting this information is nothing but noise. This patch filters them.
> > > 
> > > OK, I can see this might be really annoying. But reducing this message
> > > will not help much because it is the dump_header which generates a lot
> > > of output. And there is clearly no reason to treat the selected victim
> > > any differently than the current so why not simply do the following
> > > instead?
> > > ---
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index 5cfda39b3268..a67ce18b4b35 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -505,7 +505,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > >  	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> > >  	 */
> > >  	task_lock(p);
> > > -	if (p->mm && task_will_free_mem(p)) {
> > > +	if (p->mm && (fatal_signal_pending(p) || task_will_free_mem(p))) {
> > >  		mark_oom_victim(p);
> > >  		task_unlock(p);
> > >  		put_task_struct(p);
> > > 
> > 
> > I don't think this is good, for this will omit sending SIGKILL to threads
> > sharing p->mm ("Kill all user processes sharing victim->mm in other thread
> > groups, if any.")
> 
> threads? The whole thread group will die when the fatal signal is
> send to the group leader no? This mm sharing handling is about
> processes which are sharing mm but they are not in the same thread group

OK. I should say "omit sending SIGKILL to processes which are sharing mm
but they are not in the same thread group".

> (aka CLONE_VM without CLONE_SIGHAND resp. CLONE_THREAD).

clone(CLONE_SIGHAND | CLONE_VM) ?

> 
> > when p already has pending SIGKILL.
> 
> yes we can select a task which has SIGKILL already pending and then
> we wouldn't kill other processes which share the same mm but does it
> matter?  I do not think so. Because if this is really the case and the
> OOM condition continues even after p exits (which is very probable but
> p alone might release some resources and free memory) we will find a
> process with the same mm in the next round.

I think it matters because p cannot call do_exit() when p is blocked by
processes which are sharing mm but they are not in the same thread group.

> 
> > By the way, if p with p->mm && task_will_free_mem(p) can get stuck due to
> > memory allocation deadlock, is it OK that currently we are not sending SIGKILL
> > to threads sharing p->mm ?
> 
> I am not sure I understand the question. Threads will die automatically
> because we are sending group signal.

I just imagined a case where p is blocked at down_read() in acct_collect() from
do_exit() when p is sharing mm with other processes, and other process is doing
blocking operation with mm->mmap_sem held for writing. Is such case impossible?

do_exit() {
  exit_signals(tsk);  /* sets PF_EXITING */
  acct_collect(code, group_dead) {
    if (group_dead && current->mm) {
      down_read(&current->mm->mmap_sem);
      up_read(&current->mm->mmap_sem);
    }
  }
  exit_mm(tsk) {
     down_read(&mm->mmap_sem);
     tsk->mm = NULL;
     up_read(&mm->mmap_sem);
  }
}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-27 21:59       ` Tetsuo Handa
@ 2015-05-28 18:05         ` Michal Hocko
  2015-05-29 12:40           ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-05-28 18:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Thu 28-05-15 06:59:32, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 27-05-15 06:39:42, Tetsuo Handa wrote:
[...]
> > > I don't think this is good, for this will omit sending SIGKILL to threads
> > > sharing p->mm ("Kill all user processes sharing victim->mm in other thread
> > > groups, if any.")
> > 
> > threads? The whole thread group will die when the fatal signal is
> > send to the group leader no? This mm sharing handling is about
> > processes which are sharing mm but they are not in the same thread group
> 
> OK. I should say "omit sending SIGKILL to processes which are sharing mm
> but they are not in the same thread group".
> 
> > (aka CLONE_VM without CLONE_SIGHAND resp. CLONE_THREAD).
> 
> clone(CLONE_SIGHAND | CLONE_VM) ?

no I meant clone(CLONE_VM | flags) where flags doesn't contain neither
CLONE_SIGHAND nor CLONE_THREAD.

[...]

> I just imagined a case where p is blocked at down_read() in acct_collect() from
> do_exit() when p is sharing mm with other processes, and other process is doing
> blocking operation with mm->mmap_sem held for writing. Is such case impossible?

It is very much possible and I have missed this case when proposing
my alternative. The other process could be doing an address space
operation e.g. mmap which requires an allocation.

We do not handle this case properly because we are doing this before
even going to select a victim.
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                goto out;
        }

I have to think some more about a potential fix...
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message.
  2015-05-28 18:05         ` Michal Hocko
@ 2015-05-29 12:40           ` Tetsuo Handa
  2015-05-29 14:49             ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-29 12:40 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Thu 28-05-15 06:59:32, Tetsuo Handa wrote:
> > I just imagined a case where p is blocked at down_read() in acct_collect() from
> > do_exit() when p is sharing mm with other processes, and other process is doing
> > blocking operation with mm->mmap_sem held for writing. Is such case impossible?
> 
> It is very much possible and I have missed this case when proposing
> my alternative. The other process could be doing an address space
> operation e.g. mmap which requires an allocation.

Are there locations that do memory allocations with mm->mmap_sem held for
writing? Is it possible that thread1 is doing memory allocation between
down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem),
thread2 sharing the same mm is waiting at down_read(&current->mm->mmap_sem),
and the OOM killer invoked by thread3 chooses thread2 as the OOM victim and
sets TIF_MEMDIE to thread2?

If yes, I think setting TIF_MEMDIE to only one thread can cause deadlock
problem when mm is shared by multiple threads, for thread2 cannot be terminated
because thread1 will not call up_write(&current->mm->mmap_sem) until thread1's
memory allocation completes, resulting in hang up. If thread2->mm &&
task_will_free_mem(thread2) was true when the OOM killer chooses thread2 as
the OOM victim, it will result in annoying silent hang up.

If there are locations that do memory allocations with mm->mmap_sem held for
writing, don't we need to send SIGKILL and set TIF_MEMDIE to all threads which
could block the OOM victim?

Maybe we can use "struct mm_struct"->"bool chosen_by_oom_killer" and checking
for (current->mm && current->mm->chosen_by_oom_killer) than
test_thread_flag(TIF_MEMDIE) inside the memory allocator?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message.
  2015-05-29 12:40           ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
@ 2015-05-29 14:49             ` Michal Hocko
  2015-05-29 17:20               ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-05-29 14:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Fri 29-05-15 21:40:47, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 28-05-15 06:59:32, Tetsuo Handa wrote:
> > > I just imagined a case where p is blocked at down_read() in acct_collect() from
> > > do_exit() when p is sharing mm with other processes, and other process is doing
> > > blocking operation with mm->mmap_sem held for writing. Is such case impossible?
> > 
> > It is very much possible and I have missed this case when proposing
> > my alternative. The other process could be doing an address space
> > operation e.g. mmap which requires an allocation.
> 
> Are there locations that do memory allocations with mm->mmap_sem held for
> writing?

Yes, I've written that in my previous email.

> Is it possible that thread1 is doing memory allocation between
> down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem),
> thread2 sharing the same mm is waiting at down_read(&current->mm->mmap_sem),
> and the OOM killer invoked by thread3 chooses thread2 as the OOM victim and
> sets TIF_MEMDIE to thread2?

Your usage of thread is confusing. Threads are of no concerns because
those get killed when the group leader is killed. If you refer to
processes then this is exactly what is handled by:
        for_each_process(p)
                if (p->mm == mm && !same_thread_group(p, victim) &&
                    !(p->flags & PF_KTHREAD)) {
                        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
                                continue;

                        task_lock(p);   /* Protect ->comm from prctl() */
                        pr_err("Kill process %d (%s) sharing same memory\n",
                                task_pid_nr(p), p->comm);
                        task_unlock(p);
                        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
                }
[...]
> Maybe we can use "struct mm_struct"->"bool chosen_by_oom_killer" and checking
> for (current->mm && current->mm->chosen_by_oom_killer) than
> test_thread_flag(TIF_MEMDIE) inside the memory allocator?

Bool is not sufficient because killing some of the processes might be
sufficient to resolve the OOM condition and the rest can survive. This
is quite unlikely, all right, but not impossible. And then you would
have a dangling chosen_by_oom_killer. So this should be a counter.

So I think, but I have to think more about this, a proper way to handle
this would be something like the following. The patch is obviously
incomplete because memcg OOM killer would need the same treatment which
calls for a common helper etc...

But this is a real corner case. It would have to be current to trigger
OOM killer and the userspace would have to be able to send the signal
at the right moment... So I am even not sure this needs fixing. Are you
able to trigger it?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39b3268..14128575fe86 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -428,6 +428,8 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+
+	atomic_inc(tsk->mm->under_oom);
 }
 
 /**
@@ -436,6 +438,7 @@ void mark_oom_victim(struct task_struct *tsk)
 void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
+	atomic_dec(current->active_mm->under_oom)
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -681,6 +684,16 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	}
 
 	/*
+	 * Processes which are sharing mm should die together. If one of them
+	 * was OOM killed already we should shoot others as well.
+	 */
+	if (current->mm && atomic_read(current->mm->under_oom)) {
+		mark_oom_victim(current);
+		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, current, true);
+		goto out;
+	}
+
+	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-29 14:49             ` Michal Hocko
@ 2015-05-29 17:20               ` Tetsuo Handa
  2015-05-31 11:10                 ` Tetsuo Handa
  2015-06-01  9:03                 ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-29 17:20 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Fri 29-05-15 21:40:47, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Thu 28-05-15 06:59:32, Tetsuo Handa wrote:
> > > > I just imagined a case where p is blocked at down_read() in acct_collect() from
> > > > do_exit() when p is sharing mm with other processes, and other process is doing
> > > > blocking operation with mm->mmap_sem held for writing. Is such case impossible?
> > > 
> > > It is very much possible and I have missed this case when proposing
> > > my alternative. The other process could be doing an address space
> > > operation e.g. mmap which requires an allocation.
> > 
> > Are there locations that do memory allocations with mm->mmap_sem held for
> > writing?
> 
> Yes, I've written that in my previous email.
> 
> > Is it possible that thread1 is doing memory allocation between
> > down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem),
> > thread2 sharing the same mm is waiting at down_read(&current->mm->mmap_sem),
> > and the OOM killer invoked by thread3 chooses thread2 as the OOM victim and
> > sets TIF_MEMDIE to thread2?
> 
> Your usage of thread is confusing. Threads are of no concerns because
> those get killed when the group leader is killed. If you refer to
> processes then this is exactly what is handled by:
>         for_each_process(p)
>                 if (p->mm == mm && !same_thread_group(p, victim) &&
>                     !(p->flags & PF_KTHREAD)) {
>                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
>                                 continue;
> 
>                         task_lock(p);   /* Protect ->comm from prctl() */
>                         pr_err("Kill process %d (%s) sharing same memory\n",
>                                 task_pid_nr(p), p->comm);
>                         task_unlock(p);
>                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>                 }

I refer to both "Thread-1 in process-1, thread-2 in process-1" case and
"thread-1 in process-1, thread-2 in process-2" case. Thread-3 can be in
process-1 or process-2 or neither.

When TIF_MEMDIE is set to only thread-2 waiting at down_read(), thread-1
between down_write() and up_write() cannot complete memory allocation.
The group leader is not important here because I'm talking about situations
when individual thread cannot arrive at exit_mm() after receiving SIGKILL
due to lock dependency.

> But this is a real corner case. It would have to be current to trigger
> OOM killer and the userspace would have to be able to send the signal
> at the right moment... So I am even not sure this needs fixing. Are you
> able to trigger it?

I'm not sure whether we are talking about the same problem.
I thought that we could get rid of TIF_MEMDIE like

    for_each_process(p) {
            if (p->mm == thread2->mm && !(p->flags & PF_KTHREAD) &&
                p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
                        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
    }
    thread2->mm->chosen_by_oom_killer = true;

if we need to set TIF_MEMDIE to all threads like

    for_each_process(p) {
            if (p->mm == thread2->mm && !(p->flags & PF_KTHREAD) &&
                p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
                        do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
                        for_each_thread(p, t)
                                mark_oom_victim(t);
            }
    }

in order to make sure that thread-1 can complete memory allocation.

If thread-1 and thread-2 do not share the same mm, setting TIF_MEMDIE to
all threads might not be sufficient because they can contend on e.g.
inode->i_mutex. But that's beyond scope of this suppress message patch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-29 17:20               ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
@ 2015-05-31 11:10                 ` Tetsuo Handa
  2015-06-01  9:58                   ` Michal Hocko
  2015-06-01 10:16                   ` Michal Hocko
  2015-06-01  9:03                 ` Michal Hocko
  1 sibling, 2 replies; 29+ messages in thread
From: Tetsuo Handa @ 2015-05-31 11:10 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> So I think, but I have to think more about this, a proper way to handle
> this would be something like the following. The patch is obviously
> incomplete because memcg OOM killer would need the same treatment which
> calls for a common helper etc...

I believe that current out_of_memory() code is too optimistic about exiting
task. Current code can easily result in either

  (1) silent hang up due to reporting nothing upon OOM deadlock
  (2) very noisy oom_kill_process() due to re-reporting the same mm struct

because we set TIF_MEMDIE to only one thread.

To avoid (1), we should remove

	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 *
	 * But don't select if current has already released its mm and cleared
	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
	 */
	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		goto out;
	}

in out_of_memory() and

	/*
	 * If the task is already exiting, don't alarm the sysadmin or kill
	 * its children or threads, just set TIF_MEMDIE so it can die quickly
	 */
	task_lock(p);
	if (p->mm && task_will_free_mem(p)) {
		mark_oom_victim(p);
		task_unlock(p);
		put_task_struct(p);
		return;
	}
	task_unlock(p);

in oom_kill_process() which set TIF_MEMDIE to only one thread.
Removing the former chunk helps when check_panic_on_oom() is configured to
call panic() (i.e. /proc/sys/vm/panic_on_oom is not 0) and then the system
fell into TIF_MEMDIE deadlock, for their systems will be rebooted
automatically than entering into silent hang up loop upon OOM condition.

To avoid (2), we should consider either

  (a) Add a bool to "struct mm_struct" and set that bool to true when that
      mm struct was chosen for the first time. Set TIF_MEMDIE to next thread
      without calling printk() unless that mm was chosen for the first time.

  (b) Set TIF_MEMDIE to all threads in all processes sharing the same mm
      struct, making oom_scan_process_thread() return OOM_SCAN_ABORT as
      long as there is a TIF_MEMDIE thread.

Untested patch for (a) would look like
------------------------------------------------------------
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2836da7..b43e523 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -495,6 +495,7 @@ struct mm_struct {
 	/* address of the bounds directory */
 	void __user *bd_addr;
 #endif
+	bool oom_report_done;
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..bb31a11 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -497,6 +497,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	struct task_struct *t;
 	struct mm_struct *mm;
 	unsigned int victim_points = 0;
+	bool silent;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
@@ -513,6 +514,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	}
 	task_unlock(p);
 
+	task_lock(p);
+	silent = (p->mm && p->mm->oom_report_done);
+	task_unlock(p);
+	if (silent)
+		goto silent_mode;
+
 	if (__ratelimit(&oom_rs))
 		dump_header(p, gfp_mask, order, memcg, nodemask);
 
@@ -521,6 +528,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		message, task_pid_nr(p), p->comm, points);
 	task_unlock(p);
 
+ silent_mode:
 	/*
 	 * If any of p's children has a different mm and is eligible for kill,
 	 * the one with the highest oom_badness() score is sacrificed for its
@@ -561,6 +569,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
+	mm->oom_report_done = true;
 	mark_oom_victim(victim);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
@@ -584,10 +593,12 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
 
-			task_lock(p);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(p), p->comm);
-			task_unlock(p);
+			if (!silent) {
+				task_lock(p);	/* Protect ->comm from prctl() */
+				pr_err("Kill process %d (%s) sharing same memory\n",
+				       task_pid_nr(p), p->comm);
+				task_unlock(p);
+			}
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 		}
 	rcu_read_unlock();
------------------------------------------------------------

Untested patch for (b) would look like
------------------------------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..8e47a1c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -581,6 +581,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	for_each_process(p)
 		if (p->mm == mm && !same_thread_group(p, victim) &&
 		    !(p->flags & PF_KTHREAD)) {
+			struct task_struct *t;
+
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
 
@@ -589,6 +591,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			for_each_thread(p, t)
+				mark_oom_victim(t);
 		}
 	rcu_read_unlock();
 
------------------------------------------------------------

If we forget about complete depletion of all memory, (b) is preferable from
the point of view of reducing the possibility of falling into TIF_MEMDIE
deadlock.

The TIF_MEMDIE is meant to facilitate setting tsk->mm = NULL so that memory
associated with the TIF_MEMDIE thread's mm struct is released soon. But the
algorithm for choosing a thread does not (more precisely, can not) take lock
dependency into account. There are locations where down_read(&tsk->mm->mmap_sem)
and up_read(&tsk->mm->mmap_sem) are called between getting PF_EXITING and
setting tsk->mm = NULL. Also, there are locations where memory allocations
are done between down_write(&current->mm->mmap_sem) and
up_write(&current->mm->mmap_sem). As a result, TIF_MEMDIE can be set to
a thread which is waiting at e.g. down_read(&current->mm->mmap_sem) when one
of threads sharing the same mm struct is doing memory allocations between
down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem).
When such case occurred, the TIF_MEMDIE thread can not be terminated because
memory allocation by non-TIF_MEMDIE thread cannot complete until
non-TIF_MEMDIE thread gets TIF_MEMDIE (assuming that the "too small to fail"
memory-allocation rule remains due to reasons explained at
http://marc.info/?l=linux-mm&m=143239200805478 ). It seems to me that the
description

  This prevents mm->mmap_sem livelock when an oom killed thread cannot exit
  because it requires the semaphore and its contended by another thread
  trying to allocate memory itself.

is not true, for sending SIGKILL cannot make another thread to return from
memory allocation attempt.



By the way, I got two mumbles.

Is "If any of p's children has a different mm and is eligible for kill," logic
in oom_kill_process() really needed? Didn't select_bad_process() which was
called proior to calling oom_kill_process() already choose a best victim
using for_each_process_thread() ?

Is "/* mm cannot safely be dereferenced after task_unlock(victim) */" true?
It seems to me that it should be "/* mm cannot safely be compared after
task_unlock(victim) */" because it is theoretically possible to have

  CPU 0                         CPU 1                   CPU 2
  task_unlock(victim);
                                victim exits and releases mm.
                                Usage count of the mm becomes 0 and thus released.
                                                        New mm is allocated and assigned to some thread.
  (p->mm == mm) matches the recreated mm and kill unrelated p.

sequence. We need to either get a reference to victim's mm before
task_unlock(victim) or do comparison before task_unlock(victim).

Below is just a guess which incorporated all changes described above.

------------------------------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..9bf9370 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -487,67 +487,20 @@ void oom_killer_enable(void)
  * Must be called while holding a reference to p, which will be released upon
  * returning.
  */
-void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
+void oom_kill_process(struct task_struct *victim, gfp_t gfp_mask, int order,
 		      unsigned int points, unsigned long totalpages,
 		      struct mem_cgroup *memcg, nodemask_t *nodemask,
 		      const char *message)
 {
-	struct task_struct *victim = p;
-	struct task_struct *child;
+	struct task_struct *p;
 	struct task_struct *t;
+	unsigned int killed = 0;
 	struct mm_struct *mm;
-	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
 
-	/*
-	 * If the task is already exiting, don't alarm the sysadmin or kill
-	 * its children or threads, just set TIF_MEMDIE so it can die quickly
-	 */
-	task_lock(p);
-	if (p->mm && task_will_free_mem(p)) {
-		mark_oom_victim(p);
-		task_unlock(p);
-		put_task_struct(p);
-		return;
-	}
-	task_unlock(p);
-
 	if (__ratelimit(&oom_rs))
-		dump_header(p, gfp_mask, order, memcg, nodemask);
-
-	task_lock(p);
-	pr_err("%s: Kill process %d (%s) score %u or sacrifice child\n",
-		message, task_pid_nr(p), p->comm, points);
-	task_unlock(p);
-
-	/*
-	 * If any of p's children has a different mm and is eligible for kill,
-	 * the one with the highest oom_badness() score is sacrificed for its
-	 * parent.  This attempts to lose the minimal amount of work done while
-	 * still freeing memory.
-	 */
-	read_lock(&tasklist_lock);
-	for_each_thread(p, t) {
-		list_for_each_entry(child, &t->children, sibling) {
-			unsigned int child_points;
-
-			if (child->mm == p->mm)
-				continue;
-			/*
-			 * oom_badness() returns 0 if the thread is unkillable
-			 */
-			child_points = oom_badness(child, memcg, nodemask,
-								totalpages);
-			if (child_points > victim_points) {
-				put_task_struct(victim);
-				victim = child;
-				victim_points = child_points;
-				get_task_struct(victim);
-			}
-		}
-	}
-	read_unlock(&tasklist_lock);
+		dump_header(victim, gfp_mask, order, memcg, nodemask);
 
 	p = find_lock_task_mm(victim);
 	if (!p) {
@@ -558,41 +511,24 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		put_task_struct(victim);
 		victim = p;
 	}
-
-	/* mm cannot safely be dereferenced after task_unlock(victim) */
 	mm = victim->mm;
-	mark_oom_victim(victim);
-	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
-		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
-		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
-	task_unlock(victim);
-
-	/*
-	 * Kill all user processes sharing victim->mm in other thread groups, if
-	 * any.  They don't get access to memory reserves, though, to avoid
-	 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
-	 * oom killed thread cannot exit because it requires the semaphore and
-	 * its contended by another thread trying to allocate memory itself.
-	 * That thread will now get access to memory reserves since it has a
-	 * pending fatal signal.
-	 */
 	rcu_read_lock();
 	for_each_process(p)
-		if (p->mm == mm && !same_thread_group(p, victim) &&
-		    !(p->flags & PF_KTHREAD)) {
-			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
-				continue;
-
-			task_lock(p);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
-				task_pid_nr(p), p->comm);
-			task_unlock(p);
+		if (p->mm == mm && !(p->flags & PF_KTHREAD) &&
+		    p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
+			killed++;
+			for_each_thread(p, t)
+				mark_oom_victim(t);
 		}
 	rcu_read_unlock();
-
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	pr_err("%s: Kill process %d (%s) score %u, total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+	       message, task_pid_nr(p), p->comm, points, K(mm->total_vm),
+	       K(get_mm_counter(mm, MM_ANONPAGES)),
+	       K(get_mm_counter(mm, MM_FILEPAGES)));
+	if (killed > 1)
+		pr_err("Killed %u processes sharing same memory\n", killed);
+	task_unlock(victim);
 	put_task_struct(victim);
 }
 #undef K
@@ -667,20 +603,6 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		goto out;
 
 	/*
-	 * If current has a pending SIGKILL or is exiting, then automatically
-	 * select it.  The goal is to allow it to allocate so that it may
-	 * quickly exit and free its memory.
-	 *
-	 * But don't select if current has already released its mm and cleared
-	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
-	 */
-	if (current->mm &&
-	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
-		mark_oom_victim(current);
-		goto out;
-	}
-
-	/*
 	 * Check if there were limitations on the allocation (only relevant for
 	 * NUMA) that may require different handling.
 	 */
------------------------------------------------------------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-29 17:20               ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
  2015-05-31 11:10                 ` Tetsuo Handa
@ 2015-06-01  9:03                 ` Michal Hocko
  2015-06-01 10:51                   ` Tetsuo Handa
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01  9:03 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sat 30-05-15 02:20:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 29-05-15 21:40:47, Tetsuo Handa wrote:
[...]
> > > Is it possible that thread1 is doing memory allocation between
> > > down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem),
> > > thread2 sharing the same mm is waiting at down_read(&current->mm->mmap_sem),
> > > and the OOM killer invoked by thread3 chooses thread2 as the OOM victim and
> > > sets TIF_MEMDIE to thread2?
> > 
> > Your usage of thread is confusing. Threads are of no concerns because
> > those get killed when the group leader is killed. If you refer to
> > processes then this is exactly what is handled by:
> >         for_each_process(p)
> >                 if (p->mm == mm && !same_thread_group(p, victim) &&
> >                     !(p->flags & PF_KTHREAD)) {
> >                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> >                                 continue;
> > 
> >                         task_lock(p);   /* Protect ->comm from prctl() */
> >                         pr_err("Kill process %d (%s) sharing same memory\n",
> >                                 task_pid_nr(p), p->comm);
> >                         task_unlock(p);
> >                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> >                 }
> 
> I refer to both "Thread-1 in process-1, thread-2 in process-1" case and
> "thread-1 in process-1, thread-2 in process-2" case. Thread-3 can be in
> process-1 or process-2 or neither.

And that makes it confusing because threads in the same thread group
case is not really interesting. All the threads have fatal signal
pending and they would get access to memory reserves as they hit the oom
killer.

If the mm sharing tasks are separate processes though we have to emulate
that behavior because they do not share the signal handling. That is
what the above for_each_process does. Again they will get access to
memory reserves once they hit the oom killer due to fatal signal
pending.

See? So the only remaining case is when we do not kill other processes
sharing the mm because as you pointed out, the first task even didn't go
through that for_each_process loop.

> When TIF_MEMDIE is set to only thread-2 waiting at down_read(), thread-1
> between down_write() and up_write() cannot complete memory allocation.

> The group leader is not important here

Yes you are right. I thought that sending SIGKILL to !leader will not
kill the whole group but now that I am looking at complete_signal it
really does (SIGKILL is always a group signal). So group leader really
doesn't play any role here. Sorry about the confusion.

> because I'm talking about situations
> when individual thread cannot arrive at exit_mm() after receiving SIGKILL
> due to lock dependency.
>
> > But this is a real corner case. It would have to be current to trigger
> > OOM killer and the userspace would have to be able to send the signal
> > at the right moment... So I am even not sure this needs fixing. Are you
> > able to trigger it?
> 
> I'm not sure whether we are talking about the same problem.
> I thought that we could get rid of TIF_MEMDIE like
> 
>     for_each_process(p) {
>             if (p->mm == thread2->mm && !(p->flags & PF_KTHREAD) &&
>                 p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
>                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>     }
>     thread2->mm->chosen_by_oom_killer = true;

why would you send SIGKILL to your threads in the same thread group again?

> if we need to set TIF_MEMDIE to all threads like
> 
>     for_each_process(p) {
>             if (p->mm == thread2->mm && !(p->flags & PF_KTHREAD) &&
>                 p->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
>                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>                         for_each_thread(p, t)
>                                 mark_oom_victim(t);
>             }
>     }
> 
> in order to make sure that thread-1 can complete memory allocation.

No we can't. See the big fat warning above the for_each_process loop.
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-31 11:10                 ` Tetsuo Handa
@ 2015-06-01  9:58                   ` Michal Hocko
  2015-06-01 10:16                   ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2015-06-01  9:58 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 31-05-15 20:10:23, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > So I think, but I have to think more about this, a proper way to handle
> > this would be something like the following. The patch is obviously
> > incomplete because memcg OOM killer would need the same treatment which
> > calls for a common helper etc...
> 
> I believe that current out_of_memory() code is too optimistic about exiting
> task. Current code can easily result in either
> 
>   (1) silent hang up due to reporting nothing upon OOM deadlock
>   (2) very noisy oom_kill_process() due to re-reporting the same mm struct

Are you able to trigger any of those?

> because we set TIF_MEMDIE to only one thread.

No, we select only one task. And that is a difference. Because we allow
to set TIF_MEMDIE to multiple tasks if they are on their way out. This
is what the below code snippet, which you want to remove, does.

> To avoid (1), we should remove
> 
> 	/*
> 	 * If current has a pending SIGKILL or is exiting, then automatically
> 	 * select it.  The goal is to allow it to allocate so that it may
> 	 * quickly exit and free its memory.
> 	 *
> 	 * But don't select if current has already released its mm and cleared
> 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> 	 */
> 	if (current->mm &&
> 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> 		mark_oom_victim(current);
> 		goto out;
> 	}
> 
> in out_of_memory() and
> 
> 	/*
> 	 * If the task is already exiting, don't alarm the sysadmin or kill
> 	 * its children or threads, just set TIF_MEMDIE so it can die quickly
> 	 */
> 	task_lock(p);
> 	if (p->mm && task_will_free_mem(p)) {
> 		mark_oom_victim(p);
> 		task_unlock(p);
> 		put_task_struct(p);
> 		return;
> 	}
> 	task_unlock(p);
>
> in oom_kill_process() which set TIF_MEMDIE to only one thread.
> Removing the former chunk helps when check_panic_on_oom() is configured to
> call panic() (i.e. /proc/sys/vm/panic_on_oom is not 0) and then the system
> fell into TIF_MEMDIE deadlock, for their systems will be rebooted
> automatically than entering into silent hang up loop upon OOM condition.

Hmm, I guess we should simply move check_panic_on_oom before the current
is dying check. It seems that reliable panic is more important than a
premature failure which is quite unlikely anyway. If somebody configures
panic_on_oom we should obey that and not try to be clever (will post a
patch later today).

> To avoid (2), we should consider either
> 
>   (a) Add a bool to "struct mm_struct" and set that bool to true when that
>       mm struct was chosen for the first time. Set TIF_MEMDIE to next thread
>       without calling printk() unless that mm was chosen for the first time.

Will not work in general as explained in other email.

> 
>   (b) Set TIF_MEMDIE to all threads in all processes sharing the same mm
>       struct, making oom_scan_process_thread() return OOM_SCAN_ABORT as
>       long as there is a TIF_MEMDIE thread.

Again, too risky to deplete the whole memory.

[...]
> If we forget about complete depletion of all memory, (b) is preferable from
> the point of view of reducing the possibility of falling into TIF_MEMDIE
> deadlock.

Why doesn't "lazy" mark_oom_victim work? I mean if the mm sharing tasks
get TIF_MEMDIE when they enter OOM killer?

> The TIF_MEMDIE is meant to facilitate setting tsk->mm = NULL so that memory
> associated with the TIF_MEMDIE thread's mm struct is released soon. But the
> algorithm for choosing a thread does not (more precisely, can not) take lock
> dependency into account. There are locations where down_read(&tsk->mm->mmap_sem)
> and up_read(&tsk->mm->mmap_sem) are called between getting PF_EXITING and
> setting tsk->mm = NULL. Also, there are locations where memory allocations
> are done between down_write(&current->mm->mmap_sem) and
> up_write(&current->mm->mmap_sem). As a result, TIF_MEMDIE can be set to
> a thread which is waiting at e.g. down_read(&current->mm->mmap_sem) when one
> of threads sharing the same mm struct is doing memory allocations between
> down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem).
> When such case occurred, the TIF_MEMDIE thread can not be terminated because
> memory allocation by non-TIF_MEMDIE thread cannot complete until
> non-TIF_MEMDIE thread gets TIF_MEMDIE (assuming that the "too small to fail"
> memory-allocation rule remains due to reasons explained at
> http://marc.info/?l=linux-mm&m=143239200805478 ). It seems to me that the
> description
> 
>   This prevents mm->mmap_sem livelock when an oom killed thread cannot exit
>   because it requires the semaphore and its contended by another thread
>   trying to allocate memory itself.
> 
> is not true, for sending SIGKILL cannot make another thread to return from
> memory allocation attempt.

It does because those processes would have fatal_signal_pending and so
they would get access to memory reserves which should help them to
succeed the allocation and release the lock. IMO we should be able to
get rid of this loop and rely on a counter in mm and use it for lazy
TIF_MEMDIE for processes sharing the mm struct (similar to
fatal_signal_pending heuristic). But I am still not convinced this is
really worth bothering. Processes sharing mm are rare and the current
code should be able to deal with them quite well, modulo the race you
have pointed out but that is quite unlikely. If you are able to trigger
it, though, then I would prefer we do the counter thing and then getting
rid of the loop would be probably better.
 
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-05-31 11:10                 ` Tetsuo Handa
  2015-06-01  9:58                   ` Michal Hocko
@ 2015-06-01 10:16                   ` Michal Hocko
  2015-06-01 12:02                     ` Tetsuo Handa
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 10:16 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Sun 31-05-15 20:10:23, Tetsuo Handa wrote:
[...]
> By the way, I got two mumbles.
> 
> Is "If any of p's children has a different mm and is eligible for kill," logic
> in oom_kill_process() really needed? Didn't select_bad_process() which was
> called proior to calling oom_kill_process() already choose a best victim
> using for_each_process_thread() ?

This tries to have smaller effect on the system. It tries to kill
younger tasks because this might be and quite often is sufficient to
resolve the OOM condition.

> Is "/* mm cannot safely be dereferenced after task_unlock(victim) */" true?
> It seems to me that it should be "/* mm cannot safely be compared after
> task_unlock(victim) */" because it is theoretically possible to have
> 
>   CPU 0                         CPU 1                   CPU 2
>   task_unlock(victim);
>                                 victim exits and releases mm.
>                                 Usage count of the mm becomes 0 and thus released.
>                                                         New mm is allocated and assigned to some thread.
>   (p->mm == mm) matches the recreated mm and kill unrelated p.
> 
> sequence. We need to either get a reference to victim's mm before
> task_unlock(victim) or do comparison before task_unlock(victim).

Hmm, I guess you are right. The race is theoretically possible,
especially when there are many tasks when iterating over the list might
take some time. reference to the mm would solve this. Care to send a
patch?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01  9:03                 ` Michal Hocko
@ 2015-06-01 10:51                   ` Tetsuo Handa
  2015-06-01 11:43                     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 10:51 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Sat 30-05-15 02:20:23, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 29-05-15 21:40:47, Tetsuo Handa wrote:
> [...]
> > > > Is it possible that thread1 is doing memory allocation between
> > > > down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem),
> > > > thread2 sharing the same mm is waiting at down_read(&current->mm->mmap_sem),
> > > > and the OOM killer invoked by thread3 chooses thread2 as the OOM victim and
> > > > sets TIF_MEMDIE to thread2?
> > > 
> > > Your usage of thread is confusing. Threads are of no concerns because
> > > those get killed when the group leader is killed. If you refer to
> > > processes then this is exactly what is handled by:
> > >         for_each_process(p)
> > >                 if (p->mm == mm && !same_thread_group(p, victim) &&
> > >                     !(p->flags & PF_KTHREAD)) {
> > >                         if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> > >                                 continue;
> > > 
> > >                         task_lock(p);   /* Protect ->comm from prctl() */
> > >                         pr_err("Kill process %d (%s) sharing same memory\n",
> > >                                 task_pid_nr(p), p->comm);
> > >                         task_unlock(p);
> > >                         do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> > >                 }
> > 
> > I refer to both "Thread-1 in process-1, thread-2 in process-1" case and
> > "thread-1 in process-1, thread-2 in process-2" case. Thread-3 can be in
> > process-1 or process-2 or neither.
> 
> And that makes it confusing because threads in the same thread group
> case is not really interesting. All the threads have fatal signal
> pending and they would get access to memory reserves as they hit the oom
> killer.

Excuse me, but I didn't understand it.

TIF_MEMDIE is per a "struct task_struct" attribute which is set on
its corresponding "struct thread_info"->flags member, isn't it?
Two "struct task_struct" can't share the same "struct thread_info"->flags
member, can it?

And the condition which we allow access to memory reserves is not
"whether SIGKILL is pending or not" but "whether TIF_MEMDIE is set or not",
doesn't it?

----------
static inline int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
(...snipped...)
	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
			 ((current->flags & PF_MEMALLOC) ||
			  unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}
(...snipped...)
}
----------

How can all fatal_signal_pending() "struct task_struct" get access to memory
reserves when only one of fatal_signal_pending() "struct task_struct" has
TIF_MEMDIE ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 10:51                   ` Tetsuo Handa
@ 2015-06-01 11:43                     ` Michal Hocko
  2015-06-01 12:10                       ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 11:43 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 01-06-15 19:51:05, Tetsuo Handa wrote:
[...]
> How can all fatal_signal_pending() "struct task_struct" get access to memory
> reserves when only one of fatal_signal_pending() "struct task_struct" has
> TIF_MEMDIE ?

Because of 
	/*
	 * If current has a pending SIGKILL or is exiting, then automatically
	 * select it.  The goal is to allow it to allocate so that it may
	 * quickly exit and free its memory.
	 *
	 * But don't select if current has already released its mm and cleared
	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
	 */
	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		goto out;
	}
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 10:16                   ` Michal Hocko
@ 2015-06-01 12:02                     ` Tetsuo Handa
  2015-06-01 12:15                       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 12:02 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Sun 31-05-15 20:10:23, Tetsuo Handa wrote:
> [...]
> > By the way, I got two mumbles.
> > 
> > Is "If any of p's children has a different mm and is eligible for kill," logic
> > in oom_kill_process() really needed? Didn't select_bad_process() which was
> > called proior to calling oom_kill_process() already choose a best victim
> > using for_each_process_thread() ?
> 
> This tries to have smaller effect on the system. It tries to kill
> younger tasks because this might be and quite often is sufficient to
> resolve the OOM condition.
> 
> > Is "/* mm cannot safely be dereferenced after task_unlock(victim) */" true?
> > It seems to me that it should be "/* mm cannot safely be compared after
> > task_unlock(victim) */" because it is theoretically possible to have
> > 
> >   CPU 0                         CPU 1                   CPU 2
> >   task_unlock(victim);
> >                                 victim exits and releases mm.
> >                                 Usage count of the mm becomes 0 and thus released.
> >                                                         New mm is allocated and assigned to some thread.
> >   (p->mm == mm) matches the recreated mm and kill unrelated p.
> > 
> > sequence. We need to either get a reference to victim's mm before
> > task_unlock(victim) or do comparison before task_unlock(victim).
> 
> Hmm, I guess you are right. The race is theoretically possible,
> especially when there are many tasks when iterating over the list might
> take some time. reference to the mm would solve this. Care to send a
> patch?
> 
> -- 
> Michal Hocko
> SUSE Labs
> 
I see. Here is a patch.
mmput() may sleep. But oom_kill_process() is a sleep-able context, right?
----------------------------------------
>From 15afd1f40b132719c323e81d58064ff7115206f9 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Mon, 1 Jun 2015 20:54:14 +0900
Subject: [PATCH] mm,oom: Fix potentially killing unrelated process or
 depleting memory.

At the for_each_process() loop in oom_kill_process(), we are comparing
address of OOM victim's mm without holding a reference to that mm.
If there are a lot of processes to compare or a lot of "Kill process
%d (%s) sharing same memory" messages to print, for_each_process() loop
could take very long time.

It is possible that meanwhile the OOM victim exits and releases its mm,
and then mm is allocated with the same address and assigned to some
unrelated process. When we hit such race, the unrelated process will be
killed by error. To make sure that the OOM victim's mm does not go away
until for_each_process() loop finishes, get a reference on the OOM
victim's mm before calling task_unlock(victim).

Likewise, move do_send_sig_info(SIGKILL, victim) to before
mark_oom_victim(victim) in case for_each_process() took very long time,
for the OOM victim can abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE via e.g.
memset() in user space until SIGKILL is delivered.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..5eb1e65 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -559,14 +559,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		victim = p;
 	}
 
-	/* mm cannot safely be dereferenced after task_unlock(victim) */
+	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
-	mark_oom_victim(victim);
+	atomic_inc(&mm->mm_users);
 	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
-		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
-		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
-		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
+		task_pid_nr(victim), victim->comm, K(mm->total_vm),
+		K(get_mm_counter(mm, MM_ANONPAGES)),
+		K(get_mm_counter(mm, MM_FILEPAGES)));
 	task_unlock(victim);
+	/* Send SIGKILL before setting TIF_MEMDIE. */
+	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	mark_oom_victim(victim);
 
 	/*
 	 * Kill all user processes sharing victim->mm in other thread groups, if
@@ -592,7 +595,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 		}
 	rcu_read_unlock();
 
-	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
+	mmput(mm);
 	put_task_struct(victim);
 }
 #undef K
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 11:43                     ` Michal Hocko
@ 2015-06-01 12:10                       ` Tetsuo Handa
  2015-06-01 12:17                         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 12:10 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Mon 01-06-15 19:51:05, Tetsuo Handa wrote:
> [...]
> > How can all fatal_signal_pending() "struct task_struct" get access to memory
> > reserves when only one of fatal_signal_pending() "struct task_struct" has
> > TIF_MEMDIE ?
> 
> Because of 
> 	/*
> 	 * If current has a pending SIGKILL or is exiting, then automatically
> 	 * select it.  The goal is to allow it to allocate so that it may
> 	 * quickly exit and free its memory.
> 	 *
> 	 * But don't select if current has already released its mm and cleared
> 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> 	 */
> 	if (current->mm &&
> 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> 		mark_oom_victim(current);
> 		goto out;
> 	}

Then, what guarantees that the thread which is between
down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem)
(or whatever locks which are blocking the OOM victim) calls out_of_memory() ?
That thread might be doing !__GFP_FS allocation request.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 12:02                     ` Tetsuo Handa
@ 2015-06-01 12:15                       ` Michal Hocko
  2015-06-01 13:04                         ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 12:15 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 01-06-15 21:02:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sun 31-05-15 20:10:23, Tetsuo Handa wrote:
> > [...]
> > > By the way, I got two mumbles.
> > > 
> > > Is "If any of p's children has a different mm and is eligible for kill," logic
> > > in oom_kill_process() really needed? Didn't select_bad_process() which was
> > > called proior to calling oom_kill_process() already choose a best victim
> > > using for_each_process_thread() ?
> > 
> > This tries to have smaller effect on the system. It tries to kill
> > younger tasks because this might be and quite often is sufficient to
> > resolve the OOM condition.
> > 
> > > Is "/* mm cannot safely be dereferenced after task_unlock(victim) */" true?
> > > It seems to me that it should be "/* mm cannot safely be compared after
> > > task_unlock(victim) */" because it is theoretically possible to have
> > > 
> > >   CPU 0                         CPU 1                   CPU 2
> > >   task_unlock(victim);
> > >                                 victim exits and releases mm.
> > >                                 Usage count of the mm becomes 0 and thus released.
> > >                                                         New mm is allocated and assigned to some thread.
> > >   (p->mm == mm) matches the recreated mm and kill unrelated p.
> > > 
> > > sequence. We need to either get a reference to victim's mm before
> > > task_unlock(victim) or do comparison before task_unlock(victim).
> > 
> > Hmm, I guess you are right. The race is theoretically possible,
> > especially when there are many tasks when iterating over the list might
> > take some time. reference to the mm would solve this. Care to send a
> > patch?
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> > 
> I see. Here is a patch.

It would be preferable to post this in a separate email thread.
Conflating different issues in the same thread has already proven messy.

> mmput() may sleep. But oom_kill_process() is a sleep-able context, right?

Sure.

> ----------------------------------------
> >From 15afd1f40b132719c323e81d58064ff7115206f9 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Mon, 1 Jun 2015 20:54:14 +0900
> Subject: [PATCH] mm,oom: Fix potentially killing unrelated process or
>  depleting memory.
> 
> At the for_each_process() loop in oom_kill_process(), we are comparing
> address of OOM victim's mm without holding a reference to that mm.
> If there are a lot of processes to compare or a lot of "Kill process
> %d (%s) sharing same memory" messages to print, for_each_process() loop
> could take very long time.
> 
> It is possible that meanwhile the OOM victim exits and releases its mm,
> and then mm is allocated with the same address and assigned to some
> unrelated process. When we hit such race, the unrelated process will be
> killed by error. To make sure that the OOM victim's mm does not go away
> until for_each_process() loop finishes, get a reference on the OOM
> victim's mm before calling task_unlock(victim).

OK

> Likewise, move do_send_sig_info(SIGKILL, victim) to before
> mark_oom_victim(victim) in case for_each_process() took very long time,
> for the OOM victim can abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE via e.g.
> memset() in user space until SIGKILL is delivered.

This is unrelated and I believe even not necessary.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dff991e..5eb1e65 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -559,14 +559,17 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		victim = p;
>  	}
>  
> -	/* mm cannot safely be dereferenced after task_unlock(victim) */
> +	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
> -	mark_oom_victim(victim);
> +	atomic_inc(&mm->mm_users);
>  	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
> -		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
> -		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> -		K(get_mm_counter(victim->mm, MM_FILEPAGES)));
> +		task_pid_nr(victim), victim->comm, K(mm->total_vm),
> +		K(get_mm_counter(mm, MM_ANONPAGES)),
> +		K(get_mm_counter(mm, MM_FILEPAGES)));
>  	task_unlock(victim);
> +	/* Send SIGKILL before setting TIF_MEMDIE. */
> +	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	mark_oom_victim(victim);
>  
>  	/*
>  	 * Kill all user processes sharing victim->mm in other thread groups, if
> @@ -592,7 +595,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
>  		}
>  	rcu_read_unlock();
>  
> -	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
> +	mmput(mm);
>  	put_task_struct(victim);
>  }
>  #undef K
> -- 
> 1.8.3.1
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 12:10                       ` Tetsuo Handa
@ 2015-06-01 12:17                         ` Michal Hocko
  2015-06-01 12:34                           ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 12:17 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 01-06-15 21:10:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 01-06-15 19:51:05, Tetsuo Handa wrote:
> > [...]
> > > How can all fatal_signal_pending() "struct task_struct" get access to memory
> > > reserves when only one of fatal_signal_pending() "struct task_struct" has
> > > TIF_MEMDIE ?
> > 
> > Because of 
> > 	/*
> > 	 * If current has a pending SIGKILL or is exiting, then automatically
> > 	 * select it.  The goal is to allow it to allocate so that it may
> > 	 * quickly exit and free its memory.
> > 	 *
> > 	 * But don't select if current has already released its mm and cleared
> > 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> > 	 */
> > 	if (current->mm &&
> > 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > 		mark_oom_victim(current);
> > 		goto out;
> > 	}
> 
> Then, what guarantees that the thread which is between
> down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem)
> (or whatever locks which are blocking the OOM victim) calls out_of_memory() ?
> That thread might be doing !__GFP_FS allocation request.

Could you point to such a place?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 12:17                         ` Michal Hocko
@ 2015-06-01 12:34                           ` Tetsuo Handa
  2015-06-01 13:05                             ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 12:34 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Mon 01-06-15 21:10:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Mon 01-06-15 19:51:05, Tetsuo Handa wrote:
> > > [...]
> > > > How can all fatal_signal_pending() "struct task_struct" get access to memory
> > > > reserves when only one of fatal_signal_pending() "struct task_struct" has
> > > > TIF_MEMDIE ?
> > > 
> > > Because of 
> > > 	/*
> > > 	 * If current has a pending SIGKILL or is exiting, then automatically
> > > 	 * select it.  The goal is to allow it to allocate so that it may
> > > 	 * quickly exit and free its memory.
> > > 	 *
> > > 	 * But don't select if current has already released its mm and cleared
> > > 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> > > 	 */
> > > 	if (current->mm &&
> > > 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > > 		mark_oom_victim(current);
> > > 		goto out;
> > > 	}
> > 
> > Then, what guarantees that the thread which is between
> > down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem)
> > (or whatever locks which are blocking the OOM victim) calls out_of_memory() ?
> > That thread might be doing !__GFP_FS allocation request.
> 
> Could you point to such a place?

I think sequence shown below is possible.

[Thread1-in-Porcess1         Thread2-in-Porcess1]    [Thread3-in-Process2]

mutex_lock(&inode->i_mutex);
                                                     kmalloc(GFP_KERNEL)
                                                     Invokes the OOM killer
                             Receives TIF_MEMDIE
Receives SIGKILL
                             Receives SIGKILL
                             mutex_lock(&inode->i_mutex); <= Waiting forever
kmalloc(GFP_NOFS); <= Can't return because out_of_memory() is not called.
mutex_unlock(&inode->i_mutex);
                             kmalloc(GFP_NOFS);
                             mutex_unlock(&inode->i_mutex);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 12:15                       ` Michal Hocko
@ 2015-06-01 13:04                         ` Tetsuo Handa
  2015-06-01 13:12                           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 13:04 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> > Likewise, move do_send_sig_info(SIGKILL, victim) to before
> > mark_oom_victim(victim) in case for_each_process() took very long time,
> > for the OOM victim can abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE via e.g.
> > memset() in user space until SIGKILL is delivered.
> 
> This is unrelated and I believe even not necessary.

Why unnecessary? If serial console is configured and printing a series of
"Kill process %d (%s) sharing same memory" took a few seconds, the OOM
victim can consume all memory via malloc() + memset(), can't it?
What to do if the OOM victim cannot die immediately after consuming
all memory? I think that sending SIGKILL before setting TIF_MEMDIE
helps reducing consumption of memory reserves.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 12:34                           ` Tetsuo Handa
@ 2015-06-01 13:05                             ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 13:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 01-06-15 21:34:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 01-06-15 21:10:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Mon 01-06-15 19:51:05, Tetsuo Handa wrote:
> > > > [...]
> > > > > How can all fatal_signal_pending() "struct task_struct" get access to memory
> > > > > reserves when only one of fatal_signal_pending() "struct task_struct" has
> > > > > TIF_MEMDIE ?
> > > > 
> > > > Because of 
> > > > 	/*
> > > > 	 * If current has a pending SIGKILL or is exiting, then automatically
> > > > 	 * select it.  The goal is to allow it to allocate so that it may
> > > > 	 * quickly exit and free its memory.
> > > > 	 *
> > > > 	 * But don't select if current has already released its mm and cleared
> > > > 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> > > > 	 */
> > > > 	if (current->mm &&
> > > > 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> > > > 		mark_oom_victim(current);
> > > > 		goto out;
> > > > 	}
> > > 
> > > Then, what guarantees that the thread which is between
> > > down_write(&current->mm->mmap_sem) and up_write(&current->mm->mmap_sem)
> > > (or whatever locks which are blocking the OOM victim) calls out_of_memory() ?
> > > That thread might be doing !__GFP_FS allocation request.
> > 
> > Could you point to such a place?
> 
> I think sequence shown below is possible.
> 
> [Thread1-in-Porcess1         Thread2-in-Porcess1]    [Thread3-in-Process2]
> 
> mutex_lock(&inode->i_mutex);
>                                                      kmalloc(GFP_KERNEL)
>                                                      Invokes the OOM killer
>                              Receives TIF_MEMDIE
> Receives SIGKILL
>                              Receives SIGKILL
>                              mutex_lock(&inode->i_mutex); <= Waiting forever
> kmalloc(GFP_NOFS); <= Can't return because out_of_memory() is not called.
> mutex_unlock(&inode->i_mutex);
>                              kmalloc(GFP_NOFS);
>                              mutex_unlock(&inode->i_mutex);

But this doesn't have anything to do with mm sharing, does it? I mean
this needs a fix anyway and giving access to memory reserves to threads
or processes sharing the mm doesn't fix the problem. This needs a more
generic solution.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 13:04                         ` Tetsuo Handa
@ 2015-06-01 13:12                           ` Michal Hocko
  2015-06-01 15:27                             ` Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 13:12 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Mon 01-06-15 22:04:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > Likewise, move do_send_sig_info(SIGKILL, victim) to before
> > > mark_oom_victim(victim) in case for_each_process() took very long time,
> > > for the OOM victim can abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE via e.g.
> > > memset() in user space until SIGKILL is delivered.
> > 
> > This is unrelated and I believe even not necessary.
> 
> Why unnecessary? If serial console is configured and printing a series of
> "Kill process %d (%s) sharing same memory" took a few seconds, the OOM
> victim can consume all memory via malloc() + memset(), can't it?

Can? You are generating one corner case after another. All of them
without actually showing it can happen in the real life. There are
million+1 corner cases possible yet we would prefer to handle those that
have changes to happen in the real life. So let's focus on the real life
scenarios.

> What to do if the OOM victim cannot die immediately after consuming
> all memory? I think that sending SIGKILL before setting TIF_MEMDIE
> helps reducing consumption of memory reserves.

I think that SIGKILL before or after mark_oom_victim has close to zero
effect. If you think that we should send SIGKILL before looking for
tasks sharing mm then why not - BUT AGAIN A SEPARATE PATCH WITH A
JUSTIFICATION please.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 13:12                           ` Michal Hocko
@ 2015-06-01 15:27                             ` Tetsuo Handa
  2015-06-01 15:42                               ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-06-01 15:27 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm

Michal Hocko wrote:
> On Mon 01-06-15 22:04:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > Likewise, move do_send_sig_info(SIGKILL, victim) to before
> > > > mark_oom_victim(victim) in case for_each_process() took very long time,
> > > > for the OOM victim can abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE via e.g.
> > > > memset() in user space until SIGKILL is delivered.
> > > 
> > > This is unrelated and I believe even not necessary.
> > 
> > Why unnecessary? If serial console is configured and printing a series of
> > "Kill process %d (%s) sharing same memory" took a few seconds, the OOM
> > victim can consume all memory via malloc() + memset(), can't it?
> 
> Can? You are generating one corner case after another. All of them
> without actually showing it can happen in the real life. There are
> million+1 corner cases possible yet we would prefer to handle those that
> have changes to happen in the real life. So let's focus on the real life
> scenarios.

I worked at support center for three years. I saw many unexplained hangup
cases. Some of them could be caused by these corner cases. But I can't prove
that it happened in the real life because I don't have reproducer for hangups
occurred in customer's systems. Analyzing syslog / vmcore did not help because
memory allocator gives me no hints. What I can do is to imagine possible
corner cases, but my goal is not to identify all corner cases. My goal is to
propose a backportable workaround that enterprise customers can use now.
While I feel sorry for bothering you, I also feel sorry for customers for
not being able to offer one. "[PATCH] mm: Introduce timeout based OOM killing"
is what I can come up with, without identifying one corner case after another.

I've been asking for backportable workaround for many months. I spent time for
finding potential bugs ( http://marc.info/?l=linux-mm&m=141684929114209 ).
If you are already aware that there are million+1 corner cases possible yet
(that is, we have too many potential bugs to identify and fix), why do you
keep refusing to offer for-now workaround (that is, paper over potential
bugs) ? I don't want to see customers and support staff suffering with OOM
corner cases any more...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-06-01 15:27                             ` Tetsuo Handa
@ 2015-06-01 15:42                               ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2015-06-01 15:42 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Tue 02-06-15 00:27:44, Tetsuo Handa wrote:
[...]
> I've been asking for backportable workaround for many months. I spent time for
> finding potential bugs ( http://marc.info/?l=linux-mm&m=141684929114209 ).
> If you are already aware that there are million+1 corner cases possible yet
> (that is, we have too many potential bugs to identify and fix), why do you
> keep refusing to offer for-now workaround (that is, paper over potential
> bugs) ? I don't want to see customers and support staff suffering with OOM
> corner cases any more...

For-now workarounds tend to make the code even more complex and
fragile. It is much more preferable to come up with a systematic solution
rather than a pile of workarounds. Pushing workarounds just because they
are easy to backport to distribution kernels is a wrong criteria.

The current OOM killer code is far from ideal. Some of the heuristics
might be suboptimal or even outright wrong. But piling more on them is
not a way forward.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message.
  2015-09-02 11:27   ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
@ 2015-09-02 14:24     ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2015-09-02 14:24 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm

On Wed 02-09-15 20:27:20, Tetsuo Handa wrote:
[...]
> >From 7268b614a159cd7cb307c7dfab6241b72d9cef93 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 2 Sep 2015 20:03:16 +0900
> Subject: [PATCH v2] mm/oom: Suppress unnecessary "sharing same memory" message.
> 
> oom_kill_process() sends SIGKILL to other thread groups sharing
> victim's mm. But printing
> 
>   "Kill process %d (%s) sharing same memory\n"
> 
> lines makes no sense if they already have pending SIGKILL.
> This patch reduces the "Kill process" lines by printing
> that line with info level only if SIGKILL is not pending.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

I do not expect this would have a big effect in practice but it is true
that it saves some pointless output without clobbering the code much so
it makes sense to me.

Acked-by: Michal Hocko <mhocko@suse.com>

Wrt. pr_err -> pr_info I think this makes some sense as well because it
is true that users might intentionally push away the useful information
about the killed task this way. Not that it is a big problem but
still...

But who knows maybe somebody depends on this information even on the
pr_err loglevel. I know that David is relying on parsing oom reports
quite heavily so he might have more to tell about it.

> ---
>  mm/oom_kill.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..610da01 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -576,9 +576,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  		    !(p->flags & PF_KTHREAD)) {
>  			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
>  				continue;
> +			if (fatal_signal_pending(p))
> +				continue;
>  
>  			task_lock(p);	/* Protect ->comm from prctl() */
> -			pr_err("Kill process %d (%s) sharing same memory\n",
> +			pr_info("Kill process %d (%s) sharing same memory\n",
>  				task_pid_nr(p), p->comm);
>  			task_unlock(p);
>  			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message.
  2015-09-01 13:34 ` Michal Hocko
@ 2015-09-02 11:27   ` Tetsuo Handa
  2015-09-02 14:24     ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-09-02 11:27 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

Michal Hocko wrote:
> Dunno, the argumentation seems quite artificial to me and not really
> relevant. Even when you see "Killed process..." then it doesn't mean
> anything. And you are quite likely to get swamped by the same messages
> the first time you hit sysrq+f.

It is no problem for me for the first time. I think I saw somewhere in
the past (maybe auditing related discussion?) that "Killed process..."
line gives us important information because that line is the only clue
when some PID unexpectedly ended.

If it is not such important, printing only number of co-killed thread
groups might be sufficient, for thread groups sharing mm can be guessed
via

  pr_info("[%5d] %5d %5d %8lu %8lu %7ld %7ld %8lu         %5hd %s\n",

lines. We can shorten RCU period dominated by

  pr_info("Kill process %d (%s) sharing same memory\n",

for the first time.

> 
> I do agree that repeating those messages is quite annoying though and it
> doesn't make sense to print them if the task is known to have
> fatal_signal_pending already. So I do agree with the patch but I would
> really appreciate rewording of the changelog.

This changelog was intended for referencing from future patches.

> 
> I would be also tempted to change pr_err to pr_info for "Kill process %d
> (%s) sharing same memory\n"

OK. Here is an updated patch.
----------------------------------------
>From 7268b614a159cd7cb307c7dfab6241b72d9cef93 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 2 Sep 2015 20:03:16 +0900
Subject: [PATCH v2] mm/oom: Suppress unnecessary "sharing same memory" message.

oom_kill_process() sends SIGKILL to other thread groups sharing
victim's mm. But printing

  "Kill process %d (%s) sharing same memory\n"

lines makes no sense if they already have pending SIGKILL.
This patch reduces the "Kill process" lines by printing
that line with info level only if SIGKILL is not pending.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..610da01 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -576,9 +576,11 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		    !(p->flags & PF_KTHREAD)) {
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
+			if (fatal_signal_pending(p))
+				continue;
 
 			task_lock(p);	/* Protect ->comm from prctl() */
-			pr_err("Kill process %d (%s) sharing same memory\n",
+			pr_info("Kill process %d (%s) sharing same memory\n",
 				task_pid_nr(p), p->comm);
 			task_unlock(p);
 			do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
  2015-08-29 11:14 Tetsuo Handa
@ 2015-09-01 13:34 ` Michal Hocko
  2015-09-02 11:27   ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2015-09-01 13:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: rientjes, hannes, linux-mm

On Sat 29-08-15 20:14:54, Tetsuo Handa wrote:
[...]
> >From 540e1ba8db5e7044134d838a256f28080cdba0f0 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Sat, 29 Aug 2015 19:24:06 +0900
> Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
> 
> If the mm struct which an OOM victim is using is shared by e.g. 1000
> other thread groups, the kernel would emit the
> 
>   "Kill process %d (%s) sharing same memory\n"
> 
> line for 1000 times.
> 
> Currently, OOM killer by SysRq-f can get stuck (i.e. SysRq-f is unable
> to kill a different task due to choosing the same OOM victim forever)
> if there is already an OOM victim. The user who presses SysRq-f need to
> check the
> 
>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n"
> 
> line in order to judge whether SysRq-f got stuck or not, but the 1000
> "Kill process" lines sweeps the "Killed process" line out of console
> screen, making it impossible to judge whether OOM killer by SysRq-f got
> stuck or not.

Dunno, the argumentation seems quite artificial to me and not really
relevant. Even when you see "Killed process..." then it doesn't mean
anything. And you are quite likely to get swamped by the same messages
the first time you hit sysrq+f.

I do agree that repeating those messages is quite annoying though and it
doesn't make sense to print them if the task is known to have
fatal_signal_pending already. So I do agree with the patch but I would
really appreciate rewording of the changelog.

I would be also tempted to change pr_err to pr_info for "Kill process %d
(%s) sharing same memory\n"

> Fixing the stuck problem is outside of this patch's scope.

> This patch
> reduces the "Kill process" lines by printing that line only if SIGKILL
> is not pending.
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/oom_kill.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ecc0bc..4816fb7 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -576,6 +576,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  		    !(p->flags & PF_KTHREAD)) {
>  			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
>  				continue;
> +			if (fatal_signal_pending(p))
> +				continue;
>  
>  			task_lock(p);	/* Protect ->comm from prctl() */
>  			pr_err("Kill process %d (%s) sharing same memory\n",
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.
@ 2015-08-29 11:14 Tetsuo Handa
  2015-09-01 13:34 ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Tetsuo Handa @ 2015-08-29 11:14 UTC (permalink / raw)
  To: mhocko; +Cc: rientjes, hannes, linux-mm

Reposting a patch at http://marc.info/?l=linux-mm&m=143256441501204 :

---------- fool-sysrq-f.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int file_writer(void *unused)
{
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, "", 1) == 1);
	return 0;
}

static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(3);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	for (i = 0; i < 1000; i++)
		clone(file_writer, malloc(4 * 1024) + 4 * 1024, CLONE_SIGHAND | CLONE_VM, NULL);
	clone(memory_consumer, malloc(4 * 1024) + 4 * 1024, CLONE_SIGHAND | CLONE_VM, NULL);
	pause();
	return 0;
}
---------- fool-sysrq-f.c end ----------

You can see that the "sharing same memory" lines are reduced.

---------- console log start ----------
[   70.947578] fool-sysrq-f invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[   70.949288] fool-sysrq-f cpuset=/ mems_allowed=0
[   70.950521] CPU: 3 PID: 5803 Comm: fool-sysrq-f Tainted: G        W       4.2.0-rc8-next-20150828+ #87
(...snipped...)
[   72.997636] [ 5802]  1000  5802   542739   390721     774       6        0             0 fool-sysrq-f
[   72.999494] [ 5803]  1000  5803   542739   390721     774       6        0             0 fool-sysrq-f
[   73.001334] Out of memory: Kill process 4802 (fool-sysrq-f) score 874 or sacrifice child
[   73.003061] Killed process 4802 (fool-sysrq-f) total-vm:2170956kB, anon-rss:1562884kB, file-rss:0kB
[   73.004929] Kill process 4803 (fool-sysrq-f) sharing same memory
(...snipped...)
[   74.244404] Kill process 5801 (fool-sysrq-f) sharing same memory
[   74.245666] Kill process 5802 (fool-sysrq-f) sharing same memory
[   74.246943] Kill process 5803 (fool-sysrq-f) sharing same memory
[   74.250564] kworker/3:2 invoked oom-killer: gfp_mask=0x2000d0, order=0, oom_score_adj=0
[   74.252211] kworker/3:2 cpuset=/ mems_allowed=0
(...snipped...)
[   75.321323] [ 5801]  1000  5801   542739   390808     774       6        0             0 fool-sysrq-f
[   75.321324] [ 5802]  1000  5802   542739   390808     774       6        0             0 fool-sysrq-f
[   75.321326] Out of memory: Kill process 4803 (fool-sysrq-f) score 874 or sacrifice child
[   75.321327] Killed process 4803 (fool-sysrq-f) total-vm:2170956kB, anon-rss:1563232kB, file-rss:0kB
[   94.887669] sysrq: SysRq : Manual OOM execution
[   94.889230] kworker/2:0 invoked oom-killer: gfp_mask=0xd0, order=-1, oom_score_adj=0
[   94.890933] kworker/2:0 cpuset=/ mems_allowed=0
(...snipped...)
[   96.913896] [ 5801]  1000  5801   542739   390808     774       6        0             0 fool-sysrq-f
[   96.915737] [ 5802]  1000  5802   542739   390808     774       6        0             0 fool-sysrq-f
[   96.917568] Out of memory: Kill process 4803 (fool-sysrq-f) score 874 or sacrifice child
[   96.919224] Killed process 4803 (fool-sysrq-f) total-vm:2170956kB, anon-rss:1563232kB, file-rss:0kB
[  108.279680] sysrq: SysRq : Manual OOM execution
[  108.283249] kworker/2:0 invoked oom-killer: gfp_mask=0xd0, order=-1, oom_score_adj=0
[  108.284892] kworker/2:0 cpuset=/ mems_allowed=0
(...snipped...)
[  110.311167] [ 5801]  1000  5801   542739   390808     774       6        0             0 fool-sysrq-f
[  110.313006] [ 5802]  1000  5802   542739   390808     774       6        0             0 fool-sysrq-f
[  110.314847] Out of memory: Kill process 4803 (fool-sysrq-f) score 874 or sacrifice child
[  110.316494] Killed process 4803 (fool-sysrq-f) total-vm:2170956kB, anon-rss:1563232kB, file-rss:0kB
[  128.055355] sysrq: SysRq : Manual OOM execution
[  128.056931] kworker/2:0 invoked oom-killer: gfp_mask=0xd0, order=-1, oom_score_adj=0
[  128.058587] kworker/2:0 cpuset=/ mems_allowed=0
(...snipped...)
[  194.871314] sysrq: SysRq : Manual OOM execution
[  194.873328] kworker/2:0 invoked oom-killer: gfp_mask=0xd0, order=-1, oom_score_adj=0
[  194.875491] kworker/2:0 cpuset=/ mems_allowed=0
(...snipped...)
[  196.939486] [ 5801]  1000  5801   542739   390808     774       6        0             0 fool-sysrq-f
[  196.941329] [ 5802]  1000  5802   542739   390808     774       6        0             0 fool-sysrq-f
[  196.943291] Out of memory: Kill process 4803 (fool-sysrq-f) score 874 or sacrifice child
[  196.944922] Killed process 4803 (fool-sysrq-f) total-vm:2170956kB, anon-rss:1563232kB, file-rss:0kB
[  204.274562] sysrq: SysRq : Resetting
[  204.275513] ACPI MEMORY or I/O RESET_REG.
---------- console log end ----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20150829.txt.xz .



By the way, quoting from http://marc.info/?l=linux-mm&m=144014507122830 :
> As Tetsuo has shown, such a load can be
> generated from the userspace without root privileges so it is much
> easier to make the system _completely_ unusable with this patch. Not that
> having an OOM deadlock would be great but you still have emergency tools
> like sysrq triggered OOM killer to attempt to sort the situation out.

We still have emergency tools like sysrq triggered OOM killer but we can't
assume it works. Allowing small !__GFP_FS allocations to fail would make it
work for several cases, but not all cases...


----------------------------------------
>From 540e1ba8db5e7044134d838a256f28080cdba0f0 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sat, 29 Aug 2015 19:24:06 +0900
Subject: [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message.

If the mm struct which an OOM victim is using is shared by e.g. 1000
other thread groups, the kernel would emit the

  "Kill process %d (%s) sharing same memory\n"

line for 1000 times.

Currently, OOM killer by SysRq-f can get stuck (i.e. SysRq-f is unable
to kill a different task due to choosing the same OOM victim forever)
if there is already an OOM victim. The user who presses SysRq-f need to
check the

  "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n"

line in order to judge whether SysRq-f got stuck or not, but the 1000
"Kill process" lines sweeps the "Killed process" line out of console
screen, making it impossible to judge whether OOM killer by SysRq-f got
stuck or not.

Fixing the stuck problem is outside of this patch's scope. This patch
reduces the "Kill process" lines by printing that line only if SIGKILL
is not pending.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ecc0bc..4816fb7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -576,6 +576,8 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 		    !(p->flags & PF_KTHREAD)) {
 			if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
 				continue;
+			if (fatal_signal_pending(p))
+				continue;
 
 			task_lock(p);	/* Protect ->comm from prctl() */
 			pr_err("Kill process %d (%s) sharing same memory\n",
-- 
1.8.3.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2015-09-02 14:24 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-25 14:33 [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
2015-05-26 17:02 ` Michal Hocko
2015-05-26 21:39   ` Tetsuo Handa
2015-05-27 16:45     ` Michal Hocko
2015-05-27 21:59       ` Tetsuo Handa
2015-05-28 18:05         ` Michal Hocko
2015-05-29 12:40           ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
2015-05-29 14:49             ` Michal Hocko
2015-05-29 17:20               ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory" message Tetsuo Handa
2015-05-31 11:10                 ` Tetsuo Handa
2015-06-01  9:58                   ` Michal Hocko
2015-06-01 10:16                   ` Michal Hocko
2015-06-01 12:02                     ` Tetsuo Handa
2015-06-01 12:15                       ` Michal Hocko
2015-06-01 13:04                         ` Tetsuo Handa
2015-06-01 13:12                           ` Michal Hocko
2015-06-01 15:27                             ` Tetsuo Handa
2015-06-01 15:42                               ` Michal Hocko
2015-06-01  9:03                 ` Michal Hocko
2015-06-01 10:51                   ` Tetsuo Handa
2015-06-01 11:43                     ` Michal Hocko
2015-06-01 12:10                       ` Tetsuo Handa
2015-06-01 12:17                         ` Michal Hocko
2015-06-01 12:34                           ` Tetsuo Handa
2015-06-01 13:05                             ` Michal Hocko
2015-08-29 11:14 Tetsuo Handa
2015-09-01 13:34 ` Michal Hocko
2015-09-02 11:27   ` [PATCH] mm/oom: Suppress unnecessary "sharing same memory"message Tetsuo Handa
2015-09-02 14:24     ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.