* [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list
@ 2022-03-09  0:25 Nico Pache
  2022-03-09  0:48 ` Waiman Long
  2022-03-09 13:09 ` Michal Hocko
  0 siblings, 2 replies; 5+ messages in thread
From: Nico Pache @ 2022-03-09  0:25 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Rafael Aquini, Waiman Long, Baoquan He,
	Christoph von Recklinghausen, Don Dutile, Herton R . Krzesinski,
	David Rientjes, Michal Hocko, Andrea Arcangeli, Andrew Morton,
	tglx, mingo, dvhart, dave, andrealmeid, peterz, Joel Savitz

The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
be targeted by the oom reaper. This mapping is also used to store the futex
robust list; the kernel does not keep a copy of the robust list and instead
references a userspace address to maintain the robustness during a process
death. A race can occur between exit_mm and the oom reaper that allows
the oom reaper to clear the memory of the futex robust list before the
exit path has handled the futex death.

Prevent the OOM reaper from concurrently reaping the mappings if the dying
process contains a robust_list. If the dying task_struct does not contain
a pointer in tsk->robust_list, we can assume there was either never one
setup for this task struct, or futex_cleanup has properly handled the
futex death and we can safely reap this memory.

Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer

[1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Christoph von Recklinghausen <crecklin@redhat.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Herton R. Krzesinski <herton@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <tglx@linutronix.de>
Cc: <mingo@redhat.com>
Cc: <dvhart@infradead.org>
Cc: <dave@stgolabs.net>
Cc: <andrealmeid@collabora.com>
Cc: <peterz@infradead.org>
Co-developed-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Joel Savitz <jsavitz@redhat.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/oom_kill.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 989f35a2bbb1..37af902494d8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -587,6 +587,25 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto out_unlock;
 	}
 
+	/* Don't reap a process holding a robust_list as the pthread
+	 * struct is allocated in userspace using PRIVATE | ANONYMOUS
+	 * memory which when reaped before futex_cleanup() can leave
+	 * the waiting process stuck.
+	 */
+#ifdef CONFIG_FUTEX
+	bool robust = false;
+
+	robust = tsk->robust_list != NULL;
+#ifdef CONFIG_COMPAT
+	robust |= tsk->compat_robust_list != NULL;
+#endif
+	if (robust) {
+		trace_skip_task_reaping(tsk->pid);
+		pr_info("oom_reaper: skipping task as it contains a robust list");
+		goto out_finish;
+	}
+#endif
+
 	trace_start_task_reaping(tsk->pid);
 
 	/* failed to reap part of the address space. Try again later */
-- 
2.35.1



* Re: [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list
  2022-03-09  0:25 [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list Nico Pache
@ 2022-03-09  0:48 ` Waiman Long
  2022-03-09 13:09 ` Michal Hocko
  1 sibling, 0 replies; 5+ messages in thread
From: Waiman Long @ 2022-03-09  0:48 UTC (permalink / raw)
  To: Nico Pache, linux-mm, linux-kernel
  Cc: Rafael Aquini, Baoquan He, Christoph von Recklinghausen,
	Don Dutile, Herton R . Krzesinski, David Rientjes, Michal Hocko,
	Andrea Arcangeli, Andrew Morton, tglx, mingo, dvhart, dave,
	andrealmeid, peterz, Joel Savitz

On 3/8/22 19:25, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is also used to store the futex
> robust list; the kernel does not keep a copy of the robust list and instead
> references a userspace address to maintain the robustness during a process
> death. A race can occur between exit_mm and the oom reaper that allows
> the oom reaper to clear the memory of the futex robust list before the
> exit path has handled the futex death.
>
> Prevent the OOM reaper from concurrently reaping the mappings if the dying
> process contains a robust_list. If the dying task_struct does not contain
> a pointer in tsk->robust_list, we can assume there was either never one
> setup for this task struct, or futex_cleanup has properly handled the
> futex death and we can safely reap this memory.
>
> Reproducer: https://gitlab.com/jsavitz/oom_futex_reproducer
>
> [1] https://elixir.bootlin.com/glibc/latest/source/nptl/allocatestack.c#L370
>
> Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
> Cc: Rafael Aquini <aquini@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Christoph von Recklinghausen <crecklin@redhat.com>
> Cc: Don Dutile <ddutile@redhat.com>
> Cc: Herton R. Krzesinski <herton@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: <tglx@linutronix.de>
> Cc: <mingo@redhat.com>
> Cc: <dvhart@infradead.org>
> Cc: <dave@stgolabs.net>
> Cc: <andrealmeid@collabora.com>
> Cc: <peterz@infradead.org>
> Co-developed-by: Joel Savitz <jsavitz@redhat.com>
> Signed-off-by: Joel Savitz <jsavitz@redhat.com>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>   mm/oom_kill.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 989f35a2bbb1..37af902494d8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -587,6 +587,25 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto out_unlock;
>   	}
>   
> +	/* Don't reap a process holding a robust_list as the pthread
> +	 * struct is allocated in userspace using PRIVATE | ANONYMOUS
> +	 * memory which when reaped before futex_cleanup() can leave
> +	 * the waiting process stuck.
> +	 */
> +#ifdef CONFIG_FUTEX
> +	bool robust = false;
> +
> +	robust = tsk->robust_list != NULL;
> +#ifdef CONFIG_COMPAT
> +	robust |= tsk->compat_robust_list != NULL;
> +#endif
> +	if (robust) {
> +		trace_skip_task_reaping(tsk->pid);
> +		pr_info("oom_reaper: skipping task as it contains a robust list");
> +		goto out_finish;
> +	}
> +#endif
> +
>   	trace_start_task_reaping(tsk->pid);
>   
>   	/* failed to reap part of the address space. Try again later */

I believe it will be easier to read if you define a helper function like

#ifdef CONFIG_FUTEX
static inline bool has_robust_list(struct task_struct *tsk)
{
         bool robust = !!tsk->robust_list;

#ifdef CONFIG_COMPAT
         robust |= !!tsk->compat_robust_list;
#endif
         return robust;
}
#else
static inline bool has_robust_list(struct task_struct *tsk)
{
         return false;
}
#endif

Then you don't need any #if/endif in oom_reap_task_mm().
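
The hunk in oom_reap_task_mm() would then reduce to something like (untested,
just to illustrate, reusing the calls already in your patch):

	if (has_robust_list(tsk)) {
		trace_skip_task_reaping(tsk->pid);
		pr_info("oom_reaper: skipping task as it contains a robust list\n");
		goto out_finish;
	}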

Cheers,
Longman



* Re: [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list
  2022-03-09  0:25 [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list Nico Pache
  2022-03-09  0:48 ` Waiman Long
@ 2022-03-09 13:09 ` Michal Hocko
  2022-03-09 21:34   ` Nico Pache
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2022-03-09 13:09 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-kernel, Rafael Aquini, Waiman Long, Baoquan He,
	Christoph von Recklinghausen, Don Dutile, Herton R . Krzesinski,
	David Rientjes, Andrea Arcangeli, Andrew Morton, tglx, mingo,
	dvhart, dave, andrealmeid, peterz, Joel Savitz

On Tue 08-03-22 17:25:50, Nico Pache wrote:
> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> be targeted by the oom reaper. This mapping is also used to store the futex
> robust list; the kernel does not keep a copy of the robust list and instead
> references a userspace address to maintain the robustness during a process
> death. A race can occur between exit_mm and the oom reaper that allows
> the oom reaper to clear the memory of the futex robust list before the
> exit path has handled the futex death.

The above is missing the important part of the problem description. So
the oom_reaper frees the memory which is backing the robust list. It
would be useful to link that to the lockup on the futex.

> Prevent the OOM reaper from concurrently reaping the mappings if the dying
> process contains a robust_list. If the dying task_struct does not contain
> a pointer in tsk->robust_list, we can assume there was either never one
> setup for this task struct, or futex_cleanup has properly handled the
> futex death and we can safely reap this memory.

I do agree with Waiman that this should go into a helper function. This
would be a quick workaround but I believe that it would be much better
to either do the futex cleanup in the oom_reaper context if that could
be done without blocking. If that is really not feasible for some reason
then we could skip over vmas which are backing the robust list. Have you
considered any of those solutions?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list
  2022-03-09 13:09 ` Michal Hocko
@ 2022-03-09 21:34   ` Nico Pache
  2022-03-10  8:46     ` Michal Hocko
  0 siblings, 1 reply; 5+ messages in thread
From: Nico Pache @ 2022-03-09 21:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Rafael Aquini, Baoquan He,
	Christoph von Recklinghausen, Don Dutile, Herton R . Krzesinski,
	David Rientjes, Andrea Arcangeli, Andrew Morton, tglx, mingo,
	dvhart, dave, andrealmeid, peterz, Joel Savitz, Waiman Long



On 3/9/22 06:09, Michal Hocko wrote:
> On Tue 08-03-22 17:25:50, Nico Pache wrote:
>> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
>> be targeted by the oom reaper. This mapping is also used to store the futex
>> robust list; the kernel does not keep a copy of the robust list and instead
>> references a userspace address to maintain the robustness during a process
>> death. A race can occur between exit_mm and the oom reaper that allows
>> the oom reaper to clear the memory of the futex robust list before the
>> exit path has handled the futex death.
> 
> The above is missing the important part of the problem description. So
> the oom_reaper frees the memory which is backing the robust list. It
> would be useful to link that to the lockup on the futex.
That is basically what the last sentence of the paragraph is saying. Perhaps I
should change "clear" -> "free" to be more precise. Do you mean describe the
race in the standard format, i.e.:

    CPU1                CPU2
-------------------------------------------
    page_fault
    do_exit "signal"
    wake_oom_reaper	
			oom_reaper
                        oom_reap_task_mm (invalidates the private|anon mapping)
    exit_mm
    exit_mm_release
    futex_exit_release
    futex_cleanup
    exit_robust_list
    get_user (tries to access the memory we invalidated)
> 
>> Prevent the OOM reaper from concurrently reaping the mappings if the dying
>> process contains a robust_list. If the dying task_struct does not contain
>> a pointer in tsk->robust_list, we can assume there was either never one
>> setup for this task struct, or futex_cleanup has properly handled the
>> futex death and we can safely reap this memory.
> 
> I do agree with Waiman that this should go into a helper function. This
OK, I will do so in my v5.
> would be a quick workaround but I believe that it would be much better
> to either do the futex cleanup in the oom_reaper context if that could
> be done without blocking.

The futex cleanup is protected by two locks, a sleeping mutex and a spinlock,
and each seems to protect a different race condition. If we create a function
like futex_try_exit_release() that uses a non-sleeping mutex_trylock() we
could solve the issue, but in the case where we fail to acquire the lock we
leave the race window open, so from a correctness standpoint this approach
falls short.
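
For reference, the kind of helper we were thinking of, which would have to
live next to the existing futex_cleanup_begin()/futex_cleanup_end() helpers in
the futex code, would look roughly like this (hypothetical and untested, only
to illustrate the idea):

/* Sketch: a non-blocking variant of futex_exit_release() */
bool futex_try_exit_release(struct task_struct *tsk)
{
	/* The exit path may already own the cleanup; don't sleep on it. */
	if (!mutex_trylock(&tsk->futex_exit_mutex))
		return false;

	raw_spin_lock_irq(&tsk->pi_lock);
	tsk->futex_state = FUTEX_STATE_EXITING;
	raw_spin_unlock_irq(&tsk->pi_lock);

	futex_cleanup(tsk);
	futex_cleanup_end(tsk, FUTEX_STATE_DEAD);
	return true;
}

If the trylock fails we are back to choosing between skipping the reap and
leaving the window open, which is why we went with the unconditional check.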

> If that is really not feasible for some reason
> then we could skip over vmas which are backing the robust list. Have you
> considered any of those solutions?

We did explore some other possibilities, such as adding
'&& !vma->vm_mm->owner->robust_list' to the VMA checks in __oom_reap_task_mm(),
which did not work for some reason but should have the same effect as this
patch, while wasting cycles checking every VMA of a given process for a
robust_list that we could just check for once in the parent caller. The only
benefit I could see from that approach is that if the robust list gets set to
NULL halfway through the VMA walk, the oom reaper could still reclaim the
remaining mappings.

Unless you know of some other way to determine whether a given VMA is the
pthread mapping, or a way to hide a certain mmap from the OOM reaper, I think
this is our best approach.

We have also considered implementing a lock between futex_cleanup() and the
oom_reaper that would have a similar effect to this patch, but instead of
skipping the oom reap when a robust list is present we would wait for the
cleanup to finish and then continue reaping. Not sure we want to introduce
that complexity, though.

Cheers,
-- Nico



* Re: [PATCH v4] mm/oom_kill.c: futex: Don't OOM reap a process with a futex robust list
  2022-03-09 21:34   ` Nico Pache
@ 2022-03-10  8:46     ` Michal Hocko
  0 siblings, 0 replies; 5+ messages in thread
From: Michal Hocko @ 2022-03-10  8:46 UTC (permalink / raw)
  To: Nico Pache
  Cc: linux-mm, linux-kernel, Rafael Aquini, Baoquan He,
	Christoph von Recklinghausen, Don Dutile, Herton R . Krzesinski,
	David Rientjes, Andrea Arcangeli, Andrew Morton, tglx, mingo,
	dvhart, dave, andrealmeid, peterz, Joel Savitz, Waiman Long

On Wed 09-03-22 14:34:49, Nico Pache wrote:
> 
> 
> On 3/9/22 06:09, Michal Hocko wrote:
> > On Tue 08-03-22 17:25:50, Nico Pache wrote:
> >> The pthread struct is allocated on PRIVATE|ANONYMOUS memory [1] which can
> >> be targeted by the oom reaper. This mapping is also used to store the futex
> >> robust list; the kernel does not keep a copy of the robust list and instead
> >> references a userspace address to maintain the robustness during a process
> >> death. A race can occur between exit_mm and the oom reaper that allows
> >> the oom reaper to clear the memory of the futex robust list before the
> >> exit path has handled the futex death.
> > 
> > The above is missing the important part of the problem description. So
> > the oom_reaper frees the memory which is backing the robust list. It
> > would be useful to link that to the lockup on the futex.
> That is basically what the last sentence of the paragraph is saying. Perhaps I
> should change "clear" -> "free" to be more precise. Do you mean describe the
> race in the standard format, i.e.:
> 
>     CPU1                CPU2
> -------------------------------------------
>     page_fault
>     do_exit "signal"
>     wake_oom_reaper	
> 			oom_reaper
>                         oom_reap_task_mm (invalidates the private|anon mapping)
>     exit_mm
>     exit_mm_release
>     futex_exit_release
>     futex_cleanup
>     exit_robust_list
>     get_user (tries to access the memory we invalidated)

get_user() will return with -EFAULT, and that prevents the wakeup of
anybody on that robust list, so they will stay stuck waiting for the
lock, right?

And yes that form would be much easier to follow.

> >> Prevent the OOM reaper from concurrently reaping the mappings if the dying
> >> process contains a robust_list. If the dying task_struct does not contain
> >> a pointer in tsk->robust_list, we can assume there was either never one
> >> setup for this task struct, or futex_cleanup has properly handled the
> >> futex death and we can safely reap this memory.
> > 
> > I do agree with Waiman that this should go into a helper function. This
> OK, I will do so in my v5.
> > would be a quick workaround but I believe that it would be much better
> > to either do the futex cleanup in the oom_reaper context if that could
> > be done without blocking.
> 
> The futex cleanup is protected by two locks, a sleeping mutex and a spinlock,
> and each seems to protect a different race condition. If we create a function
> like futex_try_exit_release() that uses a non-sleeping mutex_trylock() we
> could solve the issue, but in the case where we fail to acquire the lock we
> leave the race window open, so from a correctness standpoint this approach
> falls short.

If we fail to take the lock we can just bail out on the oom_reaper side
and retry, the same way we retry on the mmap_sem lock. I do not see a
correctness issue here. This would definitely be less prone to pointless
bailouts than failing unconditionally whenever a robust list is present.
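
I.e. something along these lines (untested sketch; futex_try_exit_release()
being the hypothetical non-blocking variant of futex_exit_release() discussed
above):

	/* in oom_reap_task_mm(), after mmap_read_trylock() has succeeded */
	if (!futex_try_exit_release(tsk)) {
		/* the exit path owns the futex cleanup, come back later */
		trace_skip_task_reaping(tsk->pid);
		mmap_read_unlock(mm);
		return false;	/* oom_reap_task() will retry */
	}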

> > If that is really not feasible for some reason
> > then we could skip over vmas which are backing the robust list. Have you
> > considered any of those solutions?
> 
> We did explore some other possibilities, such as adding
> '&& !vma->vm_mm->owner->robust_list' to the VMA checks in __oom_reap_task_mm(),
> which did not work for some reason but should have the same effect as this
> patch, while wasting cycles checking every VMA of a given process for a
> robust_list that we could just check for once in the parent caller. The only
> benefit I could see from that approach is that if the robust list gets set to
> NULL halfway through the VMA walk, the oom reaper could still reclaim the
> remaining mappings.

I thought we could simply check the virtual address(es) which host the
robust lists against each vma's address range and skip over that vma.
oom_reaper is not a hot path and trivial checks like this one are not
really a problem.
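
Something along these lines in the vma walk in __oom_reap_task_mm() (sketch
only; it assumes the robust list head address gets passed down to that
function, which is not the case today):

	/* skip the vma which backs the userspace robust list head */
	if (robust_list_addr &&
	    robust_list_addr >= vma->vm_start && robust_list_addr < vma->vm_end)
		continue;

where robust_list_addr would be (unsigned long)tsk->robust_list captured
before the walk.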

> Unless you know of some other way to determine whether a given VMA is the
> pthread mapping, or a way to hide a certain mmap from the OOM reaper, I think
> this is our best approach.
> 
> We have also considered implementing a lock between futex_cleanup() and the
> oom_reaper that would have a similar effect to this patch, but instead of
> skipping the oom reap when a robust list is present we would wait for the
> cleanup to finish and then continue reaping. Not sure we want to introduce
> that complexity, though.

No, the point of the oom_reaper is to guarantee forward progress when
an oom victim cannot terminate by itself. Such a mechanism cannot really
depend on any other locks which can be involved in memory allocations
(i.e. on any other sleeping locks).
> 
> Cheers,
> -- Nico

-- 
Michal Hocko
SUSE Labs

