linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
@ 2020-07-28 16:00 Mathieu Desnoyers
  2020-08-04 14:34 ` peterz
  0 siblings, 1 reply; 5+ messages in thread
From: Mathieu Desnoyers @ 2020-07-28 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E . McKenney,
	Nicholas Piggin, Andy Lutomirski, Thomas Gleixner,
	Linus Torvalds, Alan Stern, linux-mm

exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

The following comment in exit_mm() seems rather puzzling though:

        /* more a memory barrier than a real lock */
        task_lock(current);

So considering that spinlocks are not full memory barriers nowadays,
some digging into the origins of this comment led me to 2002, at the
earliest commit tracked by Thomas Gleixner's bitkeeper era's history
at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/

At that point, this comment was followed by:

+               /* more a memory barrier than a real lock */
+               task_lock(tsk);
+               tsk->mm = NULL;
+               task_unlock(tsk);

Which seems to indicate that grabbing the lock is really about acting as
a memory barrier, but I wonder if it is meant to be a full barrier or
just an acquire.

If a full memory barrier happens to be needed even without membarrier,
perhaps this fix also corrects other issues ? It unclear from the
comment what this memory barrier orders though. Is it the chain
exit -> waitpid, or is it related to entering lazy TLB ?

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
 kernel/exit.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 727150f28103..ce272ec55cdc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -474,6 +474,14 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb();
 	current->mm = NULL;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-07-28 16:00 [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier Mathieu Desnoyers
@ 2020-08-04 14:34 ` peterz
  2020-08-04 14:48   ` Mathieu Desnoyers
  0 siblings, 1 reply; 5+ messages in thread
From: peterz @ 2020-08-04 14:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, Paul E . McKenney, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses,
> before clearing current->mm, to order user-space memory accesses
> performed prior to exit_mm before clearing tsk->mm, which has the
> effect of skipping the membarrier private expedited IPIs.
> 
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.

I'm still wonder what the exact failure case is though; exit_mm() is on
the exit path (as the name very much implies) and the thread is about to
die. The context switch that follows guarantees a full barrier before we
run anything else again.

> The following comment in exit_mm() seems rather puzzling though:
> 
>         /* more a memory barrier than a real lock */
>         task_lock(current);
> 
> So considering that spinlocks are not full memory barriers nowadays,
> some digging into the origins of this comment led me to 2002, at the
> earliest commit tracked by Thomas Gleixner's bitkeeper era's history
> at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/
> 
> At that point, this comment was followed by:
> 
> +               /* more a memory barrier than a real lock */
> +               task_lock(tsk);
> +               tsk->mm = NULL;
> +               task_unlock(tsk);
> 
> Which seems to indicate that grabbing the lock is really about acting as
> a memory barrier, but I wonder if it is meant to be a full barrier or
> just an acquire.
> 
> If a full memory barrier happens to be needed even without membarrier,
> perhaps this fix also corrects other issues ? It unclear from the
> comment what this memory barrier orders though. Is it the chain
> exit -> waitpid, or is it related to entering lazy TLB ?

I'm as puzzled by that comment as are you.

It does seem to be required for glorious stuff like get_task_mm()
though.

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
>  kernel/exit.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 727150f28103..ce272ec55cdc 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -474,6 +474,14 @@ static void exit_mm(void)
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/*
> +	 * When a thread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb();

So like stated above, I'm not sure how missing that IPI is going to be a
problem, we're dying...

>  	current->mm = NULL;
>  	mmap_read_unlock(mm);
>  	enter_lazy_tlb(mm, current);
> -- 
> 2.11.0
> 


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 14:34 ` peterz
@ 2020-08-04 14:48   ` Mathieu Desnoyers
  2020-08-04 16:51     ` peterz
  0 siblings, 1 reply; 5+ messages in thread
From: Mathieu Desnoyers @ 2020-08-04 14:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm



----- On Aug 4, 2020, at 10:34 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses,
>> before clearing current->mm, to order user-space memory accesses
>> performed prior to exit_mm before clearing tsk->mm, which has the
>> effect of skipping the membarrier private expedited IPIs.
>> 
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
> 
> I'm still wonder what the exact failure case is though; exit_mm() is on
> the exit path (as the name very much implies) and the thread is about to
> die. The context switch that follows guarantees a full barrier before we
> run anything else again.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

The AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 14:48   ` Mathieu Desnoyers
@ 2020-08-04 16:51     ` peterz
  2020-08-04 17:25       ` Mathieu Desnoyers
  0 siblings, 1 reply; 5+ messages in thread
From: peterz @ 2020-08-04 16:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
> Here is the scenario I have in mind:

> Userspace variables:
> 
> int x = 0, y = 0;
> 
> CPU 0                   CPU 1
> Thread A                Thread B
> (in thread group A)     (in thread group B)
> 
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
>                         r1 = load y
>                         membarrier()
>                           skips CPU 0 (no IPI) because its current mm is NULL
>                         r2 = load x
>                         BUG_ON(r1 == 1 && r2 == 0)
> 

Ah, yes of course.

We really should have a bunch of these scenarios in membarrier.c.



Now, the above cannot happen because we have an unconditional
atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
relying on that is a wee bit dodgy.


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 16:51     ` peterz
@ 2020-08-04 17:25       ` Mathieu Desnoyers
  0 siblings, 0 replies; 5+ messages in thread
From: Mathieu Desnoyers @ 2020-08-04 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

----- On Aug 4, 2020, at 12:51 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
>> Here is the scenario I have in mind:
> 
>> Userspace variables:
>> 
>> int x = 0, y = 0;
>> 
>> CPU 0                   CPU 1
>> Thread A                Thread B
>> (in thread group A)     (in thread group B)
>> 
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>>                         r1 = load y
>>                         membarrier()
>>                           skips CPU 0 (no IPI) because its current mm is NULL
>>                         r2 = load x
>>                         BUG_ON(r1 == 1 && r2 == 0)
>> 
> 
> Ah, yes of course.
> 
> We really should have a bunch of these scenarios in membarrier.c.

Good point.

> 
> 
> 
> Now, the above cannot happen because we have an unconditional
> atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
> relying on that is a wee bit dodgy.

I am not against using this already existing barrier to provide the
guarantee we need, but it would have to be documented in the code.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-08-04 17:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-28 16:00 [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier Mathieu Desnoyers
2020-08-04 14:34 ` peterz
2020-08-04 14:48   ` Mathieu Desnoyers
2020-08-04 16:51     ` peterz
2020-08-04 17:25       ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).