* [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
@ 2021-12-07 21:49 Joel Savitz
  2021-12-07 22:34 ` Joel Savitz
  2021-12-07 23:47 ` Andrew Morton
  0 siblings, 2 replies; 14+ messages in thread
From: Joel Savitz @ 2021-12-07 21:49 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Savitz, Andrew Morton, Waiman Long, linux-mm, Nico Pache

In the case that two or more processes share a futex located within
a shared mmapped region, such as a process that shares a lock between
itself and a number of child processes, we have observed that when
a process holding the lock is OOM-killed, at least one waiter is never
woken and simply continues to wait.

This is visible via pthreads by checking the __owner field of the
pthread_mutex_t structure within a waiting process, perhaps with gdb.

We confirm reproduction of this issue by attaching to a waiting process
of a test program, viewing the contents of its pthread_mutex_t and noting
the value of the __owner field, and then checking dmesg to see whether that
owner has already been killed.

This issue can be tricky to reproduce, but with this small patch applied
I have found it impossible to reproduce. There may be additional
considerations that I have not taken into account in this patch, and I
welcome any comments and criticism.

Co-developed-by: Nico Pache <npache@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..fa58bd10a0df 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/futex.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	 * in order to prevent the OOM victim from depleting the memory
 	 * reserves from the user space under its control.
 	 */
+	futex_exit_release(victim);
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
 	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
@@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 		 */
 		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
+		futex_exit_release(p);
 		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
 	}
 	rcu_read_unlock();
-- 
2.33.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-07 21:49 [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex Joel Savitz
@ 2021-12-07 22:34 ` Joel Savitz
  2021-12-07 23:47 ` Andrew Morton
  1 sibling, 0 replies; 14+ messages in thread
From: Joel Savitz @ 2021-12-07 22:34 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Waiman Long, linux-mm, Nico Pache

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
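
For readers who just want the shape of the test without the repository, below
is a minimal sketch of the setup the commit message describes: a robust,
process-shared mutex in a MAP_SHARED mapping whose owner gets itself
OOM-killed while another process waits. This is an illustrative
reconstruction, not the linked reproducer; the allocation size and the
sleep() used for ordering are arbitrary. With robust futexes behaving
correctly, the parent's pthread_mutex_lock() should return EOWNERDEAD once
the child dies; the reported bug is that the parent blocks forever.
(Build with: gcc -pthread test.c -o test)

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *lock = mmap(NULL, sizeof(*lock),
				     PROT_READ | PROT_WRITE,
				     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	/* robust + process-shared: owner death should hand waiters EOWNERDEAD */
	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(lock, &attr);

	if (fork() == 0) {
		/* child: take the lock, then allocate until the OOM killer fires */
		pthread_mutex_lock(lock);
		for (;;) {
			char *p = malloc(1 << 20);
			if (p)
				memset(p, 1, 1 << 20);	/* touch pages so they count */
		}
	}

	sleep(1);	/* crude ordering: let the child grab the lock first */
	int ret = pthread_mutex_lock(lock);
	printf("parent: pthread_mutex_lock() returned %d (%s)\n",
	       ret, strerror(ret));
	if (ret == EOWNERDEAD)		/* expected once the owner is killed */
		pthread_mutex_consistent(lock);
	wait(NULL);
	return 0;
}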


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-07 21:49 [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex Joel Savitz
  2021-12-07 22:34 ` Joel Savitz
@ 2021-12-07 23:47 ` Andrew Morton
  2021-12-08  0:46   ` Nico Pache
  2021-12-08  9:01   ` Michal Hocko
  1 sibling, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2021-12-07 23:47 UTC (permalink / raw)
  To: Joel Savitz
  Cc: linux-kernel, Waiman Long, linux-mm, Nico Pache, Peter Zijlstra,
	Michal Hocko

(cc's added)

On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:

> In the case that two or more processes share a futex located within
> a shared mmaped region, such as a process that shares a lock between
> itself and a number of child processes, we have observed that when
> a process holding the lock is oom killed, at least one waiter is never
> alerted to this new development and simply continues to wait.

Well dang.  Is there any way of killing off that waiting process, or do
we have a resource leak here?

> This is visible via pthreads by checking the __owner field of the
> pthread_mutex_t structure within a waiting process, perhaps with gdb.
> 
> We identify reproduction of this issue by checking a waiting process of
> a test program and viewing the contents of the pthread_mutex_t, taking note
> of the value in the owner field, and then checking dmesg to see if the
> owner has already been killed.
> 
> This issue can be tricky to reproduce, but with the modifications of
> this small patch, I have found it to be impossible to reproduce. There
> may be additional considerations that I have not taken into account in
> this patch and I welcome any comments and criticism.

> Co-developed-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> ---
>  mm/oom_kill.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1ddabefcfb5a..fa58bd10a0df 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -44,6 +44,7 @@
>  #include <linux/kthread.h>
>  #include <linux/init.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/futex.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	 * in order to prevent the OOM victim from depleting the memory
>  	 * reserves from the user space under its control.
>  	 */
> +	futex_exit_release(victim);
>  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>  	mark_oom_victim(victim);
>  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  		 */
>  		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> +		futex_exit_release(p);
>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
>  	}
>  	rcu_read_unlock();
> -- 
> 2.33.1

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-07 23:47 ` Andrew Morton
@ 2021-12-08  0:46   ` Nico Pache
  2021-12-08  1:58     ` Andrew Morton
  2021-12-08  9:01   ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Nico Pache @ 2021-12-08  0:46 UTC (permalink / raw)
  To: Andrew Morton, Joel Savitz
  Cc: linux-kernel, Waiman Long, linux-mm, Peter Zijlstra, Michal Hocko



On 12/7/21 18:47, Andrew Morton wrote:
> (cc's added)
> 
> On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> 
>> In the case that two or more processes share a futex located within
>> a shared mmaped region, such as a process that shares a lock between
>> itself and a number of child processes, we have observed that when
>> a process holding the lock is oom killed, at least one waiter is never
>> alerted to this new development and simply continues to wait.
> 
> Well dang.  Is there any way of killing off that waiting process, or do
> we have a resource leak here?

If I understood your question correctly, there is a way to recover the system by
killing the process that is utilizing the futex; however, the purpose of robust
futexes is to avoid having to do this.

From my work with Joel on this it seems like a race is occurring between the
oom_reaper and the exit signal sent to the OOM-killed process. By calling
futex_exit_release() before these signals are sent we avoid this.

> 
>> This is visible via pthreads by checking the __owner field of the
>> pthread_mutex_t structure within a waiting process, perhaps with gdb.
>>
>> We identify reproduction of this issue by checking a waiting process of
>> a test program and viewing the contents of the pthread_mutex_t, taking note
>> of the value in the owner field, and then checking dmesg to see if the
>> owner has already been killed.
>>
>> This issue can be tricky to reproduce, but with the modifications of
>> this small patch, I have found it to be impossible to reproduce. There
>> may be additional considerations that I have not taken into account in
>> this patch and I welcome any comments and criticism.
> 
>> Co-developed-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
>> ---
>>  mm/oom_kill.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 1ddabefcfb5a..fa58bd10a0df 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -44,6 +44,7 @@
>>  #include <linux/kthread.h>
>>  #include <linux/init.h>
>>  #include <linux/mmu_notifier.h>
>> +#include <linux/futex.h>
>>  
>>  #include <asm/tlb.h>
>>  #include "internal.h"
>> @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>>  	 * in order to prevent the OOM victim from depleting the memory
>>  	 * reserves from the user space under its control.
>>  	 */
>> +	futex_exit_release(victim);
>>  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>>  	mark_oom_victim(victim);
>>  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
>> @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>>  		 */
>>  		if (unlikely(p->flags & PF_KTHREAD))
>>  			continue;
>> +		futex_exit_release(p);
>>  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
>>  	}
>>  	rcu_read_unlock();
>> -- 
>> 2.33.1
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-08  0:46   ` Nico Pache
@ 2021-12-08  1:58     ` Andrew Morton
  2021-12-08  3:38       ` Joel Savitz
  0 siblings, 1 reply; 14+ messages in thread
From: Andrew Morton @ 2021-12-08  1:58 UTC (permalink / raw)
  To: Nico Pache
  Cc: Joel Savitz, linux-kernel, Waiman Long, linux-mm, Peter Zijlstra,
	Michal Hocko

On Tue, 7 Dec 2021 19:46:57 -0500 Nico Pache <npache@redhat.com> wrote:

> 
> 
> On 12/7/21 18:47, Andrew Morton wrote:
> > (cc's added)
> > 
> > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > 
> >> In the case that two or more processes share a futex located within
> >> a shared mmaped region, such as a process that shares a lock between
> >> itself and a number of child processes, we have observed that when
> >> a process holding the lock is oom killed, at least one waiter is never
> >> alerted to this new development and simply continues to wait.
> > 
> > Well dang.  Is there any way of killing off that waiting process, or do
> > we have a resource leak here?
> 
> If I understood your question correctly, there is a way to recover the system by
> killing the process that is utilizing the futex; however, the purpose of robust
> futexes is to avoid having to do this.

OK.  My concern was whether we have a way in which userspace can
permanently leak memory, which opens a (lame) form of denial-of-service
attack.

> From my work with Joel on this it seems like a race is occurring between the
> oom_reaper and the exit signal sent to the OOM-killed process. By calling
> futex_exit_release() before these signals are sent we avoid this.

OK.  It would be nice if the patch had some comments explaining *why*
we're doing this strange futex thing here.  Although that wouldn't be
necessary if futex_exit_release() was documented...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-08  1:58     ` Andrew Morton
@ 2021-12-08  3:38       ` Joel Savitz
  0 siblings, 0 replies; 14+ messages in thread
From: Joel Savitz @ 2021-12-08  3:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nico Pache, linux-kernel, Waiman Long, linux-mm, Peter Zijlstra,
	Michal Hocko

On Tue, Dec 7, 2021 at 8:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 7 Dec 2021 19:46:57 -0500 Nico Pache <npache@redhat.com> wrote:
>
> >
> >
> > On 12/7/21 18:47, Andrew Morton wrote:
> > > (cc's added)
> > >
> > > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > >
> > >> In the case that two or more processes share a futex located within
> > >> a shared mmaped region, such as a process that shares a lock between
> > >> itself and a number of child processes, we have observed that when
> > >> a process holding the lock is oom killed, at least one waiter is never
> > >> alerted to this new development and simply continues to wait.
> > >
> > > Well dang.  Is there any way of killing off that waiting process, or do
> > > we have a resource leak here?
> >
> > If I understood your question correctly, there is a way to recover the system by
> > killing the process that is utilizing the futex; however, the purpose of robust
> > futexes is to avoid having to do this.
>
> OK.  My concern was whether we have a way in which userspace can
> permanently leak memory, which opens a (lame) form of denial-of-service
> attack.
I believe the resources are freed when the process is killed, so to my
knowledge there is no resource leak in the case we were investigating.

> > From my work with Joel on this it seems like a race is occurring between the
> > oom_reaper and the exit signal sent to the OOM-killed process. By calling
> > futex_exit_release() before these signals are sent we avoid this.
>
> OK.  It would be nice if the patch had some comments explaining *why*
> we're doing this strange futex thing here.  Although that wouldn't be
> necessary if futex_exit_release() was documented...
>
Sounds good, will send a v2 tomorrow

Best,
Joel Savitz


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-07 23:47 ` Andrew Morton
  2021-12-08  0:46   ` Nico Pache
@ 2021-12-08  9:01   ` Michal Hocko
  2021-12-08 16:05     ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-12-08  9:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Savitz, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> (cc's added)

Extend CC to have all futex maintainers on board.
 
> On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> 
> > In the case that two or more processes share a futex located within
> > a shared mmaped region, such as a process that shares a lock between
> > itself and a number of child processes, we have observed that when
> > a process holding the lock is oom killed, at least one waiter is never
> > alerted to this new development and simply continues to wait.
> 
> Well dang.  Is there any way of killing off that waiting process, or do
> we have a resource leak here?
> 
> > This is visible via pthreads by checking the __owner field of the
> > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > 
> > We identify reproduction of this issue by checking a waiting process of
> > a test program and viewing the contents of the pthread_mutex_t, taking note
> > of the value in the owner field, and then checking dmesg to see if the
> > owner has already been killed.
> > 
> > This issue can be tricky to reproduce, but with the modifications of
> > this small patch, I have found it to be impossible to reproduce. There
> > may be additional considerations that I have not taken into account in
> > this patch and I welcome any comments and criticism.

Why does the OOM killer need any special handling? All the oom killer does
is send a fatal signal to the victim. Why is this any different from
sending SIGKILL from userspace?

> > Co-developed-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> > ---
> >  mm/oom_kill.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 1ddabefcfb5a..fa58bd10a0df 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -44,6 +44,7 @@
> >  #include <linux/kthread.h>
> >  #include <linux/init.h>
> >  #include <linux/mmu_notifier.h>
> > +#include <linux/futex.h>
> >  
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > @@ -890,6 +891,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  	 * in order to prevent the OOM victim from depleting the memory
> >  	 * reserves from the user space under its control.
> >  	 */
> > +	futex_exit_release(victim);
> >  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> >  	mark_oom_victim(victim);
> >  	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> > @@ -930,6 +932,7 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >  		 */
> >  		if (unlikely(p->flags & PF_KTHREAD))
> >  			continue;
> > +		futex_exit_release(p);
> >  		do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
> >  	}
> >  	rcu_read_unlock();
> > -- 
> > 2.33.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-08  9:01   ` Michal Hocko
@ 2021-12-08 16:05     ` Michal Hocko
  2021-12-09  2:59       ` Joel Savitz
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-12-08 16:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Savitz, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

On Wed 08-12-21 10:01:44, Michal Hocko wrote:
> On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> > (cc's added)
> 
> Extend CC to have all futex maintainers on board.
>  
> > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > 
> > > In the case that two or more processes share a futex located within
> > > a shared mmaped region, such as a process that shares a lock between
> > > itself and a number of child processes, we have observed that when
> > > a process holding the lock is oom killed, at least one waiter is never
> > > alerted to this new development and simply continues to wait.
> > 
> > Well dang.  Is there any way of killing off that waiting process, or do
> > we have a resource leak here?
> > 
> > > This is visible via pthreads by checking the __owner field of the
> > > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > > 
> > > We identify reproduction of this issue by checking a waiting process of
> > > a test program and viewing the contents of the pthread_mutex_t, taking note
> > > of the value in the owner field, and then checking dmesg to see if the
> > > owner has already been killed.
> > > 
> > > This issue can be tricky to reproduce, but with the modifications of
> > > this small patch, I have found it to be impossible to reproduce. There
> > > may be additional considerations that I have not taken into account in
> > > this patch and I welcome any comments and criticism.
> 
> Why does OOM killer need a special handling. All the oom killer does is
> to send a fatal signal to the victim. Why is this any different from
> sending SIGKILL from the userspace?

I have had a closer look and I guess I can see what you are trying to
achieve. futex_exit_release() is normally called from the exit_mm context.
You are likely seeing a situation where the oom victim is blocked and cannot
exit. That is certainly possible but it shouldn't be a permanent state.
So I would be more interested in your particular issue and how long
the task has been stuck unable to exit.
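
For context, the normal (non-OOM) path referred to above looks roughly like
this in kernels of this era (call chain sketched from memory; exact function
names and file locations are an approximation and may differ by version):

    do_exit()
      exit_mm()
        exit_mm_release()             kernel/fork.c
          futex_exit_release(tsk)     marks the task's futex state dead
            futex_cleanup(tsk)
              exit_robust_list(tsk)   walks the user-space robust list,
                                      sets FUTEX_OWNER_DIED and wakes waiters
              exit_pi_state_list(tsk)

i.e. the robust-list walk normally runs while the exiting task can still
fault in its own user-space memory, which is exactly the property the
oom_reaper can take away.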

Whether this is safe to be called from the oom killer context I cannot
really judge. That would be a question to Futex folks.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-08 16:05     ` Michal Hocko
@ 2021-12-09  2:59       ` Joel Savitz
  2021-12-09  7:51         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Savitz @ 2021-12-09  2:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

On Wed, Dec 8, 2021 at 11:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 08-12-21 10:01:44, Michal Hocko wrote:
> > On Tue 07-12-21 15:47:59, Andrew Morton wrote:
> > > (cc's added)
> >
> > Extend CC to have all futex maintainers on board.
> >
> > > On Tue,  7 Dec 2021 16:49:02 -0500 Joel Savitz <jsavitz@redhat.com> wrote:
> > >
> > > > In the case that two or more processes share a futex located within
> > > > a shared mmaped region, such as a process that shares a lock between
> > > > itself and a number of child processes, we have observed that when
> > > > a process holding the lock is oom killed, at least one waiter is never
> > > > alerted to this new development and simply continues to wait.
> > >
> > > Well dang.  Is there any way of killing off that waiting process, or do
> > > we have a resource leak here?
> > >
> > > > This is visible via pthreads by checking the __owner field of the
> > > > pthread_mutex_t structure within a waiting process, perhaps with gdb.
> > > >
> > > > We identify reproduction of this issue by checking a waiting process of
> > > > a test program and viewing the contents of the pthread_mutex_t, taking note
> > > > of the value in the owner field, and then checking dmesg to see if the
> > > > owner has already been killed.
> > > >
> > > > This issue can be tricky to reproduce, but with the modifications of
> > > > this small patch, I have found it to be impossible to reproduce. There
> > > > may be additional considerations that I have not taken into account in
> > > > this patch and I welcome any comments and criticism.
> >
> > Why does OOM killer need a special handling. All the oom killer does is
> > to send a fatal signal to the victim. Why is this any different from
> > sending SIGKILL from the userspace?
>
> I have had a closer look and I guess I can see what you are trying to
> achieve. futex_exit_release is normally called from exit_mm context. You
> are likely seeing a situation when the oom victim is blocked and cannot
> exit. That is certainly possible but it shouldn't be a permanent state.
> So I would be more interested about your particular issue and how long
> the task has been stuck unable to exit.

Before applying this patch I never saw a waiting task eventually exit
while reproducing this system state.
Every task in this waiting-on-a-dead-owner situation appeared to be
permanently blocked until a user killed it manually.

>
> Whether this is safe to be called from the oom killer context I cannot
> really judge. That would be a question to Futex folks.

I am also very interested in feedback from the Futex folks.
This is the first fix for the bug that I have found but I am not sure
whether this introduces other issues due to the context.

> --
> Michal Hocko
> SUSE Labs
>

Best,
Joel Savitz


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-09  2:59       ` Joel Savitz
@ 2021-12-09  7:51         ` Michal Hocko
  2022-01-14 14:39           ` Joel Savitz
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2021-12-09  7:51 UTC (permalink / raw)
  To: Joel Savitz
  Cc: Andrew Morton, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

On Wed 08-12-21 21:59:29, Joel Savitz wrote:
> On Wed, Dec 8, 2021 at 11:05 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > I have had a closer look and I guess I can see what you are trying to
> > achieve. futex_exit_release is normally called from exit_mm context. You
> > are likely seeing a situation when the oom victim is blocked and cannot
> > exit. That is certainly possible but it shouldn't be a permanent state.
> > So I would be more interested about your particular issue and how long
> > the task has been stuck unable to exit.
> 
> Before applying this patch I never saw a task eventually exit during
> the reproduction of this system state.

What has happened to the oom victim, and why has it never exited?

Side note. I have noticed that you have sent v2 with minor changes. It
is usually better to resolve review feedback before posting a new
version. Discussion gets rather hard to follow otherwise.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2021-12-09  7:51         ` Michal Hocko
@ 2022-01-14 14:39           ` Joel Savitz
  2022-01-14 14:55             ` Waiman Long
  2022-01-17 11:33             ` Michal Hocko
  0 siblings, 2 replies; 14+ messages in thread
From: Joel Savitz @ 2022-01-14 14:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

> What has happened to the oom victim and why it has never exited?

What appears to happen is that the oom victim is sent SIGKILL by the
process that triggers the oom while also being marked as an oom
victim.

As you mention in your patchset introducing the oom reaper in commit
aac4536355496 ("mm, oom: introduce oom reaper"), the purpose of the
oom reaper is to free memory more quickly than the exit path otherwise
would, on the assumption that anonymous or swapped-out pages won't be
needed in the exit path because the owner is already dying.
However, this assumption is violated by the futex_cleanup() path,
which needs access to userspace in fetch_robust_entry() when it is
called from exit_robust_list(). trace_printk()s in this failure path
reveal an apparent race between the oom reaper thread reaping the
victim's mm and the futex_cleanup() path. There may be other ways that
this race manifests but we have been most consistently able to trace
that one.
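
A rough sketch of the suspected interleaving (assembled from the traces
described above; the exact ordering inside do_exit() is approximate):

    oom killer / oom_reaper thread          OOM victim
    ------------------------------          ----------
    __oom_kill_process()
      sends SIGKILL, wake_oom_reaper()
                                            handles SIGKILL, enters do_exit()
    oom_reaper: __oom_reap_task_mm()
      unmaps/frees the victim's private
      anonymous memory
                                            futex_exit_release()
                                              exit_robust_list()
                                                fetch_robust_entry() can no
                                                longer read the reaped user
                                                memory, the walk fails, and
                                                waiters on the shared futex
                                                are never woken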

Since in the case of an oom victim using robust futexes the core
assumption of the oom reaper is violated, we propose to solve this
problem by either canceling or delaying the waking of the oom reaper
thread by wake_oom_reaper in the case that tsk->robust_list is
non-NULL.

e.g. the bug does not reproduce with this patch (from npache@redhat.com):

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 989f35a2bbb1..b8c518fdcf4d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
        if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
                return;

+#ifdef CONFIG_FUTEX
+       /*
+        * Don't wake the oom_reaper thread if we still have a robust
+        * list to handle. This will then rely on the SIGKILL to handle
+        * the cleanup of memory.
+        */
+       if (tsk->robust_list)
+               return;
+#ifdef CONFIG_COMPAT
+       if (tsk->compat_robust_list)
+               return;
+#endif
+#endif
+
        get_task_struct(tsk);

        spin_lock(&oom_reaper_lock);

Best,
Joel Savitz


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2022-01-14 14:39           ` Joel Savitz
@ 2022-01-14 14:55             ` Waiman Long
  2022-01-14 14:58               ` Waiman Long
  2022-01-17 11:33             ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Waiman Long @ 2022-01-14 14:55 UTC (permalink / raw)
  To: Joel Savitz, Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

On 1/14/22 09:39, Joel Savitz wrote:
>> What has happened to the oom victim and why it has never exited?
> What appears to happen is that the oom victim is sent SIGKILL by the
> process that triggers the oom while also being marked as an oom
> victim.
>
> As you mention in your patchset introducing the oom reaper in commit
> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose the the
> oom reaper is to try and free more memory more quickly than it
> otherwise would have been by assuming anonymous or swapped out pages
> won't be needed in the exit path as the owner is already dying.
> However, this assumption is violated by the futex_cleanup() path,
> which needs access to userspace in fetch_robust_entry() when it is
> called in exit_robust_list(). Trace_printk()s in this failure path
> reveal an apparent race between the oom reaper thread reaping the
> victim's mm and the futex_cleanup() path. There may be other ways that
> this race manifests but we have been most consistently able to trace
> that one.
>
> Since in the case of an oom victim using robust futexes the core
> assumption of the oom reaper is violated, we propose to solve this
> problem by either canceling or delaying the waking of the oom reaper
> thread by wake_oom_reaper in the case that tsk->robust_list is
> non-NULL.
>
> e.g. the bug does not reproduce with this patch (from npache@redhat.com):
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 989f35a2bbb1..b8c518fdcf4d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
>          if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>                  return;
>
> +#ifdef CONFIG_FUTEX
> +       /*
> +        * Don't wake the oom_reaper thread if we still have a robust
> +        * list to handle. This will then rely on the SIGKILL to handle
> +        * the cleanup of memory.
> +        */
> +       if (tsk->robust_list)
> +               return;
> +#ifdef CONFIG_COMPAT
> +       if (tsk->compat_robust_list)
> +               return;
> +#endif
> +#endif
> +
>          get_task_struct(tsk);
>
>          spin_lock(&oom_reaper_lock);

OK, that can explain why the robust futex is not properly cleaned up.
Could you post a more formal v2 patch with a description of the
possible race condition?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2022-01-14 14:55             ` Waiman Long
@ 2022-01-14 14:58               ` Waiman Long
  0 siblings, 0 replies; 14+ messages in thread
From: Waiman Long @ 2022-01-14 14:58 UTC (permalink / raw)
  To: Joel Savitz, Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida


On 1/14/22 09:55, Waiman Long wrote:
> On 1/14/22 09:39, Joel Savitz wrote:
>>> What has happened to the oom victim and why it has never exited?
>> What appears to happen is that the oom victim is sent SIGKILL by the
>> process that triggers the oom while also being marked as an oom
>> victim.
>>
>> As you mention in your patchset introducing the oom reaper in commit
>> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose the the
>> oom reaper is to try and free more memory more quickly than it
>> otherwise would have been by assuming anonymous or swapped out pages
>> won't be needed in the exit path as the owner is already dying.
>> However, this assumption is violated by the futex_cleanup() path,
>> which needs access to userspace in fetch_robust_entry() when it is
>> called in exit_robust_list(). Trace_printk()s in this failure path
>> reveal an apparent race between the oom reaper thread reaping the
>> victim's mm and the futex_cleanup() path. There may be other ways that
>> this race manifests but we have been most consistently able to trace
>> that one.
>>
>> Since in the case of an oom victim using robust futexes the core
>> assumption of the oom reaper is violated, we propose to solve this
>> problem by either canceling or delaying the waking of the oom reaper
>> thread by wake_oom_reaper in the case that tsk->robust_list is
>> non-NULL.
>>
>> e.g. the bug does not reproduce with this patch (from 
>> npache@redhat.com):
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 989f35a2bbb1..b8c518fdcf4d 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -665,6 +665,19 @@ static void wake_oom_reaper(struct task_struct *tsk)
>>          if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
>>                  return;
>>
>> +#ifdef CONFIG_FUTEX
>> +       /*
>> +        * Don't wake the oom_reaper thread if we still have a robust
>> +        * list to handle. This will then rely on the SIGKILL to handle
>> +        * the cleanup of memory.
>> +        */
>> +       if (tsk->robust_list)
>> +               return;
>> +#ifdef CONFIG_COMPAT
>> +       if (tsk->compat_robust_list)
>> +               return;
>> +#endif
>> +#endif
>> +
>>          get_task_struct(tsk);
>>
>>          spin_lock(&oom_reaper_lock);
>
> OK, that can explain why the robust futex is not properly cleaned up. 
> Could you post a more formal v2 patch with description about the 
> possible race condition?
>
It should be v3. Sorry for the mix-up.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex
  2022-01-14 14:39           ` Joel Savitz
  2022-01-14 14:55             ` Waiman Long
@ 2022-01-17 11:33             ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2022-01-17 11:33 UTC (permalink / raw)
  To: Joel Savitz
  Cc: Andrew Morton, linux-kernel, Waiman Long, linux-mm, Nico Pache,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Darren Hart,
	Davidlohr Bueso, André Almeida

I have only noticed your email now after replying to v3 so our emails
have crossed.

On Fri 14-01-22 09:39:55, Joel Savitz wrote:
> > What has happened to the oom victim and why it has never exited?
> 
> What appears to happen is that the oom victim is sent SIGKILL by the
> process that triggers the oom while also being marked as an oom
> victim.
> 
> As you mention in your patchset introducing the oom reaper in commit
> aac4536355496 ("mm, oom: introduce oom reaper"), the purpose the the
> oom reaper is to try and free more memory more quickly than it
> otherwise would have been by assuming anonymous or swapped out pages
> won't be needed in the exit path as the owner is already dying.
> However, this assumption is violated by the futex_cleanup() path,
> which needs access to userspace in fetch_robust_entry() when it is
> called in exit_robust_list(). Trace_printk()s in this failure path
> reveal an apparent race between the oom reaper thread reaping the
> victim's mm and the futex_cleanup() path. There may be other ways that
> this race manifests but we have been most consistently able to trace
> that one.

Please let's continue the discussion in the v3 email thread:
http://lkml.kernel.org/r/20220114180135.83308-1-npache@redhat.com
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 14+ messages in thread

Thread overview: 14+ messages
2021-12-07 21:49 [PATCH] mm/oom_kill: wake futex waiters before annihilating victim shared mutex Joel Savitz
2021-12-07 22:34 ` Joel Savitz
2021-12-07 23:47 ` Andrew Morton
2021-12-08  0:46   ` Nico Pache
2021-12-08  1:58     ` Andrew Morton
2021-12-08  3:38       ` Joel Savitz
2021-12-08  9:01   ` Michal Hocko
2021-12-08 16:05     ` Michal Hocko
2021-12-09  2:59       ` Joel Savitz
2021-12-09  7:51         ` Michal Hocko
2022-01-14 14:39           ` Joel Savitz
2022-01-14 14:55             ` Waiman Long
2022-01-14 14:58               ` Waiman Long
2022-01-17 11:33             ` Michal Hocko
