linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
@ 2020-07-28 16:00 Mathieu Desnoyers
  2020-07-28 16:00 ` [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm Mathieu Desnoyers
  2020-08-04 14:34 ` [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier peterz
  0 siblings, 2 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-07-28 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Thomas Gleixner,
	Linus Torvalds, Alan Stern, linux-mm

exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.
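
For illustration, a minimal user-space sketch of how such a pair of
thread groups sharing one mm can be created (error handling omitted;
the function names are only for the example):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

static int group_b_main(void *arg)
{
        /* Runs in its own thread group, but shares the creator's mm. */
        return 0;
}

int main(void)
{
        const size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);

        /*
         * CLONE_VM without CLONE_THREAD: shared address space, separate
         * thread group, so membarrier() issued from one group can run
         * concurrently with exit() in the other. The stack grows down,
         * hence stack + stack_size.
         */
        clone(group_b_main, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
        wait(NULL);
        return 0;
}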

The following comment in exit_mm() seems rather puzzling though:

        /* more a memory barrier than a real lock */
        task_lock(current);

So considering that spinlocks are not full memory barriers nowadays,
some digging into the origins of this comment led me to 2002, at the
earliest commit tracked by Thomas Gleixner's bitkeeper era's history
at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/

At that point, this comment was followed by:

+               /* more a memory barrier than a real lock */
+               task_lock(tsk);
+               tsk->mm = NULL;
+               task_unlock(tsk);

Which seems to indicate that grabbing the lock is really about acting as
a memory barrier, but I wonder if it is meant to be a full barrier or
just an acquire.

If a full memory barrier happens to be needed even without membarrier,
perhaps this fix also corrects other issues? It is unclear from the
comment what this memory barrier orders though. Is it the chain
exit -> waitpid, or is it related to entering lazy TLB?

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: linux-mm@kvack.org
---
 kernel/exit.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/exit.c b/kernel/exit.c
index 727150f28103..ce272ec55cdc 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -474,6 +474,14 @@ static void exit_mm(void)
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/*
+	 * When a thread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb();
 	current->mm = NULL;
 	mmap_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
-- 
2.11.0



* [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-07-28 16:00 [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier Mathieu Desnoyers
@ 2020-07-28 16:00 ` Mathieu Desnoyers
  2020-08-04 14:51   ` peterz
  2020-08-04 14:34 ` [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier peterz
  1 sibling, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-07-28 16:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Mathieu Desnoyers, Will Deacon, Paul E. McKenney,
	Nicholas Piggin, Andy Lutomirski, Andrew Morton

Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.

Given that no prior kthread uses this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.
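
For illustration, the kind of kthread user this guarantee is meant to
cover looks roughly like the following sketch (the struct and its
fields are hypothetical, not taken from an existing kthread):

#include <linux/kthread.h>
#include <linux/mm_types.h>
#include <linux/uaccess.h>

struct my_work {                        /* hypothetical */
        struct mm_struct *mm;           /* mm reference handed to the kthread */
        int __user *uptr;               /* user-space address to read */
};

static int my_worker(void *data)
{
        struct my_work *w = data;
        int val;

        kthread_use_mm(w->mm);          /* start operating on the address space */
        if (get_user(val, w->uptr))     /* user-space access covered by membarrier */
                val = 0;
        kthread_unuse_mm(w->mm);        /* stop operating on it */
        return 0;
}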

Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.

This patch applies on top of this patch from Peter Zijlstra:
"mm: fix kthread_use_mm() vs TLB invalidate" currently in Andrew
Morton's tree.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 kernel/kthread.c          | 19 +++++++++++++++++++
 kernel/sched/membarrier.c |  8 ++------
 2 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 48925b17920e..ef2435517f14 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1258,8 +1258,19 @@ void kthread_use_mm(struct mm_struct *mm)
 	finish_arch_post_lock_switch();
 #endif
 
+	/*
+	 * When a kthread starts operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after storing to tsk->mm, before accessing
+	 * user-space memory. A full memory barrier for membarrier
+	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
+	 * mmdrop().
+	 */
 	if (active_mm != mm)
 		mmdrop(active_mm);
+	else
+		smp_mb();
 
 	to_kthread(tsk)->oldfs = get_fs();
 	set_fs(USER_DS);
@@ -1280,6 +1291,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
 	set_fs(to_kthread(tsk)->oldfs);
 
 	task_lock(tsk);
+	/*
+	 * When a kthread stops operating on an address space, the loop
+	 * in membarrier_{private,global}_expedited() may not observe
+	 * that tsk->mm, and not issue an IPI. Membarrier requires a
+	 * memory barrier after accessing user-space memory, before
+	 * clearing tsk->mm.
+	 */
+	smp_mb();
 	sync_mm_rss(mm);
 	local_irq_disable();
 	tsk->mm = NULL;
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 168479a7d61b..8a294483074d 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
 		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
 			continue;
 
-		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
-		 */
+		/* Skip the CPU if it runs the idle thread. */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (is_idle_task(p))
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);
-- 
2.11.0



* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-07-28 16:00 [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier Mathieu Desnoyers
  2020-07-28 16:00 ` [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm Mathieu Desnoyers
@ 2020-08-04 14:34 ` peterz
  2020-08-04 14:48   ` Mathieu Desnoyers
  1 sibling, 1 reply; 14+ messages in thread
From: peterz @ 2020-08-04 14:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, Paul E. McKenney, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
> exit_mm should issue memory barriers after user-space memory accesses,
> before clearing current->mm, to order user-space memory accesses
> performed prior to exit_mm before clearing tsk->mm, which has the
> effect of skipping the membarrier private expedited IPIs.
> 
> The membarrier system call can be issued concurrently with do_exit
> if we have thread groups created with CLONE_VM but not CLONE_THREAD.

I still wonder what the exact failure case is though; exit_mm() is on
the exit path (as the name very much implies) and the thread is about to
die. The context switch that follows guarantees a full barrier before we
run anything else again.

> The following comment in exit_mm() seems rather puzzling though:
> 
>         /* more a memory barrier than a real lock */
>         task_lock(current);
> 
> So considering that spinlocks are not full memory barriers nowadays,
> some digging into the origins of this comment led me to 2002, at the
> earliest commit tracked by Thomas Gleixner's bitkeeper era's history
> at https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/
> 
> At that point, this comment was followed by:
> 
> +               /* more a memory barrier than a real lock */
> +               task_lock(tsk);
> +               tsk->mm = NULL;
> +               task_unlock(tsk);
> 
> Which seems to indicate that grabbing the lock is really about acting as
> a memory barrier, but I wonder if it is meant to be a full barrier or
> just an acquire.
> 
> If a full memory barrier happens to be needed even without membarrier,
> perhaps this fix also corrects other issues? It is unclear from the
> comment what this memory barrier orders though. Is it the chain
> exit -> waitpid, or is it related to entering lazy TLB?

I'm as puzzled by that comment as are you.

It does seem to be required for glorious stuff like get_task_mm()
though.
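
For reference, get_task_mm() leans on task_lock() to read ->mm stably;
roughly (paraphrased from kernel/fork.c, not verbatim):

struct mm_struct *get_task_mm(struct task_struct *task)
{
        struct mm_struct *mm;

        task_lock(task);
        mm = task->mm;
        if (mm) {
                if (task->flags & PF_KTHREAD)
                        mm = NULL;      /* kthreads only borrow an mm */
                else
                        mmget(mm);      /* pin it for the caller */
        }
        task_unlock(task);
        return mm;
}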

> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: linux-mm@kvack.org
> ---
>  kernel/exit.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 727150f28103..ce272ec55cdc 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -474,6 +474,14 @@ static void exit_mm(void)
>  	BUG_ON(mm != current->active_mm);
>  	/* more a memory barrier than a real lock */
>  	task_lock(current);
> +	/*
> +	 * When a thread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb();

So like stated above, I'm not sure how missing that IPI is going to be a
problem, we're dying...

>  	current->mm = NULL;
>  	mmap_read_unlock(mm);
>  	enter_lazy_tlb(mm, current);
> -- 
> 2.11.0
> 


* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 14:34 ` [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier peterz
@ 2020-08-04 14:48   ` Mathieu Desnoyers
  2020-08-04 16:51     ` peterz
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-08-04 14:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm



----- On Aug 4, 2020, at 10:34 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 28, 2020 at 12:00:09PM -0400, Mathieu Desnoyers wrote:
>> exit_mm should issue memory barriers after user-space memory accesses,
>> before clearing current->mm, to order user-space memory accesses
>> performed prior to exit_mm before clearing tsk->mm, which has the
>> effect of skipping the membarrier private expedited IPIs.
>> 
>> The membarrier system call can be issued concurrently with do_exit
>> if we have thread groups created with CLONE_VM but not CLONE_THREAD.
> 
> I still wonder what the exact failure case is though; exit_mm() is on
> the exit path (as the name very much implies) and the thread is about to
> die. The context switch that follows guarantees a full barrier before we
> run anything else again.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

Then AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)
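
For concreteness, the membarrier() step above corresponds to user-space
calls along these lines (a minimal sketch; thread group B registers the
expedited command once beforehand):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

static void thread_group_b_init(void)
{
        /* Register once before relying on the expedited command. */
        membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0);
}

static void thread_b_membarrier(void)
{
        /* The call issued between "r1 = load y" and "r2 = load x". */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}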

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-07-28 16:00 ` [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm Mathieu Desnoyers
@ 2020-08-04 14:51   ` peterz
  2020-08-04 14:59     ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: peterz @ 2020-08-04 14:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, Paul E. McKenney, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
> to allow the effect of membarrier(2) to apply to kthreads accessing
> user-space memory as well.
> 
> Given that no prior kthread uses this guarantee and that it only affects
> kthreads, adding this guarantee does not affect user-space ABI.
> 
> Refine the check in membarrier_global_expedited to exclude runqueues
> running the idle thread rather than all kthreads from the IPI cpumask.
> 
> This patch applies on top of this patch from Peter Zijlstra:
> "mm: fix kthread_use_mm() vs TLB invalidate" currently in Andrew
> Morton's tree.
> 
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  kernel/kthread.c          | 19 +++++++++++++++++++
>  kernel/sched/membarrier.c |  8 ++------
>  2 files changed, 21 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 48925b17920e..ef2435517f14 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -1258,8 +1258,19 @@ void kthread_use_mm(struct mm_struct *mm)
>  	finish_arch_post_lock_switch();
>  #endif
>  
> +	/*
> +	 * When a kthread starts operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after storing to tsk->mm, before accessing
> +	 * user-space memory. A full memory barrier for membarrier
> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
> +	 * mmdrop().
> +	 */
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> +	else
> +		smp_mb();
>  
>  	to_kthread(tsk)->oldfs = get_fs();
>  	set_fs(USER_DS);
> @@ -1280,6 +1291,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>  	set_fs(to_kthread(tsk)->oldfs);
>  
>  	task_lock(tsk);
> +	/*
> +	 * When a kthread stops operating on an address space, the loop
> +	 * in membarrier_{private,global}_expedited() may not observe
> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> +	 * memory barrier after accessing user-space memory, before
> +	 * clearing tsk->mm.
> +	 */
> +	smp_mb();
>  	sync_mm_rss(mm);
>  	local_irq_disable();

Would it make sense to put the smp_mb() inside the IRQ disable region?

>  	tsk->mm = NULL;
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index 168479a7d61b..8a294483074d 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>  			continue;
>  
> -		/*
> -		 * Skip the CPU if it runs a kernel thread. The scheduler
> -		 * leaves the prior task mm in place as an optimization when
> -		 * scheduling a kthread.
> -		 */
> +		/* Skip the CPU if it runs the idle thread. */
>  		p = rcu_dereference(cpu_rq(cpu)->curr);
> -		if (p->flags & PF_KTHREAD)
> +		if (is_idle_task(p))
>  			continue;

Do we want to add a:

	WARN_ON_ONCE(current->mm);

in play_idle_precise() ?

Because, if I read this right, we rely on the idle thread not having an
mm.


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-04 14:51   ` peterz
@ 2020-08-04 14:59     ` Mathieu Desnoyers
  2020-08-04 17:01       ` peterz
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-08-04 14:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
>> Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
>> to allow the effect of membarrier(2) to apply to kthreads accessing
>> user-space memory as well.
>> 
>> Given that no prior kthread uses this guarantee and that it only affects
>> kthreads, adding this guarantee does not affect user-space ABI.
>> 
>> Refine the check in membarrier_global_expedited to exclude runqueues
>> running the idle thread rather than all kthreads from the IPI cpumask.
>> 
>> This patch applies on top of this patch from Peter Zijlstra:
>> "mm: fix kthread_use_mm() vs TLB invalidate" currently in Andrew
>> Morton's tree.
>> 
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> Cc: Nicholas Piggin <npiggin@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>  kernel/kthread.c          | 19 +++++++++++++++++++
>>  kernel/sched/membarrier.c |  8 ++------
>>  2 files changed, 21 insertions(+), 6 deletions(-)
>> 
>> diff --git a/kernel/kthread.c b/kernel/kthread.c
>> index 48925b17920e..ef2435517f14 100644
>> --- a/kernel/kthread.c
>> +++ b/kernel/kthread.c
>> @@ -1258,8 +1258,19 @@ void kthread_use_mm(struct mm_struct *mm)
>>  	finish_arch_post_lock_switch();
>>  #endif
>>  
>> +	/*
>> +	 * When a kthread starts operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after storing to tsk->mm, before accessing
>> +	 * user-space memory. A full memory barrier for membarrier
>> +	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
>> +	 * mmdrop().
>> +	 */
>>  	if (active_mm != mm)
>>  		mmdrop(active_mm);
>> +	else
>> +		smp_mb();
>>  
>>  	to_kthread(tsk)->oldfs = get_fs();
>>  	set_fs(USER_DS);
>> @@ -1280,6 +1291,14 @@ void kthread_unuse_mm(struct mm_struct *mm)
>>  	set_fs(to_kthread(tsk)->oldfs);
>>  
>>  	task_lock(tsk);
>> +	/*
>> +	 * When a kthread stops operating on an address space, the loop
>> +	 * in membarrier_{private,global}_expedited() may not observe
>> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> +	 * memory barrier after accessing user-space memory, before
>> +	 * clearing tsk->mm.
>> +	 */
>> +	smp_mb();
>>  	sync_mm_rss(mm);
>>  	local_irq_disable();
> 
> Would it make sense to put the smp_mb() inside the IRQ disable region?

I've initially placed it right after task_lock so we could eventually
have a smp_mb__after_non_raw_spinlock or something with a much better naming,
which would allow removing the extra barrier when it is implied by the
spinlock.

I don't see moving the barrier inside the irq off region as having any
significant effect as far as membarrier is concerned. Is it something you
need for the TLB flush?

> 
>>  	tsk->mm = NULL;
>> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
>> index 168479a7d61b..8a294483074d 100644
>> --- a/kernel/sched/membarrier.c
>> +++ b/kernel/sched/membarrier.c
>> @@ -100,13 +100,9 @@ static int membarrier_global_expedited(void)
>>  		    MEMBARRIER_STATE_GLOBAL_EXPEDITED))
>>  			continue;
>>  
>> -		/*
>> -		 * Skip the CPU if it runs a kernel thread. The scheduler
>> -		 * leaves the prior task mm in place as an optimization when
>> -		 * scheduling a kthread.
>> -		 */
>> +		/* Skip the CPU if it runs the idle thread. */
>>  		p = rcu_dereference(cpu_rq(cpu)->curr);
>> -		if (p->flags & PF_KTHREAD)
>> +		if (is_idle_task(p))
>>  			continue;
> 
> Do we want to add a:
> 
>	WARN_ON_ONCE(current->mm);
> 
> in play_idle_precise() ?
> 
> Because, if I read this right, we rely on the idle thread not having an
> mm.

Yes, that's a good idea.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 14:48   ` Mathieu Desnoyers
@ 2020-08-04 16:51     ` peterz
  2020-08-04 17:25       ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: peterz @ 2020-08-04 16:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
> Here is the scenario I have in mind:

> Userspace variables:
> 
> int x = 0, y = 0;
> 
> CPU 0                   CPU 1
> Thread A                Thread B
> (in thread group A)     (in thread group B)
> 
> x = 1
> barrier()
> y = 1
> exit()
> exit_mm()
> current->mm = NULL;
>                         r1 = load y
>                         membarrier()
>                           skips CPU 0 (no IPI) because its current mm is NULL
>                         r2 = load x
>                         BUG_ON(r1 == 1 && r2 == 0)
> 

Ah, yes of course.

We really should have a bunch of these scenarios in membarrier.c.



Now, the above cannot happen because we have an unconditional
atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
relying on that is a wee bit dodgy.
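
For reference, the relevant shape of do_exit() is roughly the following
(paraphrased, details elided); the value-returning atomic RMW implies a
full memory barrier and runs before exit_mm():

void __noreturn do_exit(long code)
{
        struct task_struct *tsk = current;
        int group_dead;

        /* ... earlier teardown elided ... */
        group_dead = atomic_dec_and_test(&tsk->signal->live);  /* full barrier */
        /* ... accounting elided ... */
        exit_mm();              /* clears current->mm under task_lock() */
        /* ... group_dead is consumed later by exit_notify() and friends ... */
}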


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-04 14:59     ` Mathieu Desnoyers
@ 2020-08-04 17:01       ` peterz
  2020-08-05 10:59         ` peterz
  0 siblings, 1 reply; 14+ messages in thread
From: peterz @ 2020-08-04 17:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

On Tue, Aug 04, 2020 at 10:59:33AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:
> > On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:

> >>  	task_lock(tsk);
> >> +	/*
> >> +	 * When a kthread stops operating on an address space, the loop
> >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> +	 * memory barrier after accessing user-space memory, before
> >> +	 * clearing tsk->mm.
> >> +	 */
> >> +	smp_mb();
> >>  	sync_mm_rss(mm);
> >>  	local_irq_disable();
> > 
> > Would it make sense to put the smp_mb() inside the IRQ disable region?
> 
> I've initially placed it right after task_lock so we could eventually
> have a smp_mb__after_non_raw_spinlock or something with a much better naming,
> which would allow removing the extra barrier when it is implied by the
> spinlock.

Oh, right, fair enough. I'll go think about if smp_mb__after_spinlock()
will work for mutexes too.

It basically needs to upgrade atomic*_acquire() to smp_mb(). So that's
all architectures that have their own _acquire() and an actual
smp_mb__after_atomic().

Which, from the top of my head are only arm64, power and possibly riscv.
And if I then git-grep smp_mb__after_spinlock, all those seem to be
covered.

But let me do a better audit..


* Re: [RFC PATCH 1/2] sched: Fix exit_mm vs membarrier
  2020-08-04 16:51     ` peterz
@ 2020-08-04 17:25       ` Mathieu Desnoyers
  0 siblings, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-08-04 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Alan Stern,
	linux-mm

----- On Aug 4, 2020, at 12:51 PM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 04, 2020 at 10:48:41AM -0400, Mathieu Desnoyers wrote:
>> Here is the scenario I have in mind:
> 
>> Userspace variables:
>> 
>> int x = 0, y = 0;
>> 
>> CPU 0                   CPU 1
>> Thread A                Thread B
>> (in thread group A)     (in thread group B)
>> 
>> x = 1
>> barrier()
>> y = 1
>> exit()
>> exit_mm()
>> current->mm = NULL;
>>                         r1 = load y
>>                         membarrier()
>>                           skips CPU 0 (no IPI) because its current mm is NULL
>>                         r2 = load x
>>                         BUG_ON(r1 == 1 && r2 == 0)
>> 
> 
> Ah, yes of course.
> 
> We really should have a bunch of these scenarios in membarrier.c.

Good point.

> 
> 
> 
> Now, the above cannot happen because we have an unconditional
> atomic_dec_and_test() in do_exit() before exit_mm(), but I'm sure
> relying on that is a wee bit dodgy.

I am not against using this already existing barrier to provide the
guarantee we need, but it would have to be documented in the code.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-04 17:01       ` peterz
@ 2020-08-05 10:59         ` peterz
  2020-08-05 15:22           ` Mathieu Desnoyers
  0 siblings, 1 reply; 14+ messages in thread
From: peterz @ 2020-08-05 10:59 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

On Tue, Aug 04, 2020 at 07:01:53PM +0200, peterz@infradead.org wrote:
> On Tue, Aug 04, 2020 at 10:59:33AM -0400, Mathieu Desnoyers wrote:
> > ----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:
> > > On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
> 
> > >>  	task_lock(tsk);
> > >> +	/*
> > >> +	 * When a kthread stops operating on an address space, the loop
> > >> +	 * in membarrier_{private,global}_expedited() may not observe
> > >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> > >> +	 * memory barrier after accessing user-space memory, before
> > >> +	 * clearing tsk->mm.
> > >> +	 */
> > >> +	smp_mb();
> > >>  	sync_mm_rss(mm);
> > >>  	local_irq_disable();
> > > 
> > > Would it make sense to put the smp_mb() inside the IRQ disable region?
> > 
> > I've initially placed it right after task_lock so we could eventually
> > have a smp_mb__after_non_raw_spinlock or something with a much better naming,
> > which would allow removing the extra barrier when it is implied by the
> > spinlock.
> 
> Oh, right, fair enough. I'll go think about if smp_mb__after_spinlock()
> will work for mutexes too.
> 
> It basically needs to upgrade atomic*_acquire() to smp_mb(). So that's
> all architectures that have their own _acquire() and an actual
> smp_mb__after_atomic().
> 
> Which, from the top of my head are only arm64, power and possibly riscv.
> And if I then git-grep smp_mb__after_spinlock, all those seem to be
> covered.
> 
> But let me do a better audit..

All I could find is csky, which, afaict, defines a superfluous
smp_mb__after_spinlock.

The relevant architectures are indeed power, arm64 and riscv, they all
have custom acquire/release and all define smp_mb__after_spinlock()
appropriately.
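
For reference, the pattern is roughly (paraphrased, not verbatim):

/*
 * Generic default (include/linux/spinlock.h): architectures whose
 * lock acquire already implies a full barrier leave this a no-op.
 */
#ifndef smp_mb__after_spinlock
#define smp_mb__after_spinlock()        do { } while (0)
#endif

/*
 * arm64 and powerpc (and riscv, with its own fence) override it with
 * a real full barrier, e.g.:
 *
 *      #define smp_mb__after_spinlock()        smp_mb()
 */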

Should we rename it to smp_mb__after_acquire()?


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-05 10:59         ` peterz
@ 2020-08-05 15:22           ` Mathieu Desnoyers
  2020-08-06 12:13             ` Will Deacon
  0 siblings, 1 reply; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-08-05 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Will Deacon, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

----- On Aug 5, 2020, at 6:59 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Tue, Aug 04, 2020 at 07:01:53PM +0200, peterz@infradead.org wrote:
>> On Tue, Aug 04, 2020 at 10:59:33AM -0400, Mathieu Desnoyers wrote:
>> > ----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:
>> > > On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
>> 
>> > >>  	task_lock(tsk);
>> > >> +	/*
>> > >> +	 * When a kthread stops operating on an address space, the loop
>> > >> +	 * in membarrier_{private,global}_expedited() may not observe
>> > >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> > >> +	 * memory barrier after accessing user-space memory, before
>> > >> +	 * clearing tsk->mm.
>> > >> +	 */
>> > >> +	smp_mb();
>> > >>  	sync_mm_rss(mm);
>> > >>  	local_irq_disable();
>> > > 
>> > > Would it make sense to put the smp_mb() inside the IRQ disable region?
>> > 
>> > I've initially placed it right after task_lock so we could eventually
>> > have a smp_mb__after_non_raw_spinlock or something with a much better naming,
>> > which would allow removing the extra barrier when it is implied by the
>> > spinlock.
>> 
>> Oh, right, fair enough. I'll go think about if smp_mb__after_spinlock()
>> will work for mutexes too.
>> 
>> It basically needs to upgrade atomic*_acquire() to smp_mb(). So that's
>> all architectures that have their own _acquire() and an actual
>> smp_mb__after_atomic().
>> 
>> Which, from the top of my head are only arm64, power and possibly riscv.
>> And if I then git-grep smp_mb__after_spinlock, all those seem to be
>> covered.
>> 
>> But let me do a better audit..
> 
> All I could find is csky, which, afaict, defines a superfluous
> smp_mb__after_spinlock.
> 
> The relevant architectures are indeed power, arm64 and riscv, they all
> have custom acquire/release and all define smp_mb__after_spinlock()
> appropriately.
> 
> Should we rename it to smp_mb__after_acquire() ?

As discussed over IRC, smp_mb__after_atomic_acquire() would be better, because
load_acquire and spin_lock have different semantics.

We could keep a define of smp_mb__after_spinlock to smp_mb__after_atomic_acquire
to make the transition simpler.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-05 15:22           ` Mathieu Desnoyers
@ 2020-08-06 12:13             ` Will Deacon
  2020-08-06 12:48               ` peterz
  2020-08-06 12:57               ` Mathieu Desnoyers
  0 siblings, 2 replies; 14+ messages in thread
From: Will Deacon @ 2020-08-06 12:13 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

On Wed, Aug 05, 2020 at 11:22:36AM -0400, Mathieu Desnoyers wrote:
> ----- On Aug 5, 2020, at 6:59 AM, Peter Zijlstra peterz@infradead.org wrote:
> > On Tue, Aug 04, 2020 at 07:01:53PM +0200, peterz@infradead.org wrote:
> >> On Tue, Aug 04, 2020 at 10:59:33AM -0400, Mathieu Desnoyers wrote:
> >> > ----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:
> >> > > On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
> >> > >>  	task_lock(tsk);
> >> > >> +	/*
> >> > >> +	 * When a kthread stops operating on an address space, the loop
> >> > >> +	 * in membarrier_{private,global}_expedited() may not observe
> >> > >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
> >> > >> +	 * memory barrier after accessing user-space memory, before
> >> > >> +	 * clearing tsk->mm.
> >> > >> +	 */
> >> > >> +	smp_mb();
> >> > >>  	sync_mm_rss(mm);
> >> > >>  	local_irq_disable();
> >> > > 
> >> > > Would it make sense to put the smp_mb() inside the IRQ disable region?
> >> > 
> >> > I've initially placed it right after task_lock so we could eventually
> >> > have a smp_mb__after_non_raw_spinlock or something with a much better naming,
> >> > which would allow removing the extra barrier when it is implied by the
> >> > spinlock.
> >> 
> >> Oh, right, fair enough. I'll go think about if smp_mb__after_spinlock()
> >> will work for mutexes too.
> >> 
> >> It basically needs to upgrade atomic*_acquire() to smp_mb(). So that's
> >> all architectures that have their own _acquire() and an actual
> >> smp_mb__after_atomic().
> >> 
> >> Which, from the top of my head are only arm64, power and possibly riscv.
> >> And if I then git-grep smp_mb__after_spinlock, all those seem to be
> >> covered.
> >> 
> >> But let me do a better audit..
> > 
> > All I could find is csky, which, afaict, defines a superfluous
> > smp_mb__after_spinlock.
> > 
> > The relevant architectures are indeed power, arm64 and riscv, they all
> > have custom acquire/release and all define smp_mb__after_spinlock()
> > appropriately.
> > 
> > Should we rename it to smp_mb__after_acquire()?
> 
> As discussed over IRC, smp_mb__after_atomic_acquire() would be better, because
> load_acquire and spin_lock have different semantics.

Just to clarify here, are you talking about acquire on atomic RMW operations
being different to non-RMW operations, or are you talking about
atomic_read_acquire() being different to smp_load_acquire() (which I don't
think is the case, but wanted to check)?

We need to write this stuff down.

> We could keep a define of smp_mb__after_spinlock to smp_mb__after_atomic_acquire
> to make the transition simpler.

I'm not sure I really see the benefit of the rename, to be honest with you,
especially if smp_mb__after_spinlock() doesn't disappear at the same time.
The only reason you'd use this barrier is because the atomic is hidden away
behind a locking API, otherwise you'd just have used the full-barrier variant
of the atomic op to start with, wouldn't you?

Will


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-06 12:13             ` Will Deacon
@ 2020-08-06 12:48               ` peterz
  2020-08-06 12:57               ` Mathieu Desnoyers
  1 sibling, 0 replies; 14+ messages in thread
From: peterz @ 2020-08-06 12:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mathieu Desnoyers, linux-kernel, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

On Thu, Aug 06, 2020 at 01:13:46PM +0100, Will Deacon wrote:
> I'm not sure I really see the benefit of the rename, to be honest with you,
> especially if smp_mb__after_spinlock() doesn't disappear at the same time.

The reason I proposed a rename is because:

	mutex_lock(&foo);
	smp_mb__after_spinlock();

looks weird. But, afaict, it will work as expected. As the only possible
way to implement any lock() is with atomic*_acquire() or stronger.

Another possible name would be: smp_mb__after_lock().


* Re: [RFC PATCH 2/2] sched: membarrier: cover kthread_use_mm
  2020-08-06 12:13             ` Will Deacon
  2020-08-06 12:48               ` peterz
@ 2020-08-06 12:57               ` Mathieu Desnoyers
  1 sibling, 0 replies; 14+ messages in thread
From: Mathieu Desnoyers @ 2020-08-06 12:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, linux-kernel, paulmck, Nicholas Piggin,
	Andy Lutomirski, Andrew Morton

----- On Aug 6, 2020, at 8:13 AM, Will Deacon will@kernel.org wrote:

> On Wed, Aug 05, 2020 at 11:22:36AM -0400, Mathieu Desnoyers wrote:
>> ----- On Aug 5, 2020, at 6:59 AM, Peter Zijlstra peterz@infradead.org wrote:
>> > On Tue, Aug 04, 2020 at 07:01:53PM +0200, peterz@infradead.org wrote:
>> >> On Tue, Aug 04, 2020 at 10:59:33AM -0400, Mathieu Desnoyers wrote:
>> >> > ----- On Aug 4, 2020, at 10:51 AM, Peter Zijlstra peterz@infradead.org wrote:
>> >> > > On Tue, Jul 28, 2020 at 12:00:10PM -0400, Mathieu Desnoyers wrote:
>> >> > >>  	task_lock(tsk);
>> >> > >> +	/*
>> >> > >> +	 * When a kthread stops operating on an address space, the loop
>> >> > >> +	 * in membarrier_{private,global}_expedited() may not observe
>> >> > >> +	 * that tsk->mm, and not issue an IPI. Membarrier requires a
>> >> > >> +	 * memory barrier after accessing user-space memory, before
>> >> > >> +	 * clearing tsk->mm.
>> >> > >> +	 */
>> >> > >> +	smp_mb();
>> >> > >>  	sync_mm_rss(mm);
>> >> > >>  	local_irq_disable();
>> >> > > 
>> >> > > Would it make sense to put the smp_mb() inside the IRQ disable region?
>> >> > 
>> >> > I've initially placed it right after task_lock so we could eventually
>> >> > have a smp_mb__after_non_raw_spinlock or something with a much better naming,
>> >> > which would allow removing the extra barrier when it is implied by the
>> >> > spinlock.
>> >> 
>> >> Oh, right, fair enough. I'll go think about if smp_mb__after_spinlock()
>> >> will work for mutexes too.
>> >> 
>> >> It basically needs to upgrade atomic*_acquire() to smp_mb(). So that's
>> >> all architectures that have their own _acquire() and an actual
>> >> smp_mb__after_atomic().
>> >> 
>> >> Which, from the top of my head are only arm64, power and possibly riscv.
>> >> And if I then git-grep smp_mb__after_spinlock, all those seem to be
>> >> covered.
>> >> 
>> >> But let me do a better audit..
>> > 
>> > All I could find is csky, which, afaict, defines a superfluous
>> > smp_mb__after_spinlock.
>> > 
>> > The relevant architectures are indeed power, arm64 and riscv, they all
>> > have custom acquire/release and all define smp_mb__after_spinlock()
>> > appropriately.
>> > 
>> > Should we rename it to smp_mb__after_acquire()?
>> 
>> As discussed over IRC, smp_mb__after_atomic_acquire() would be better, because
>> load_acquire and spin_lock have different semantic.
> 
> Just to clarify here, are you talking about acquire on atomic RMW operations
> being different to non-RMW operations, or are you talking about
> atomic_read_acquire() being different to smp_load_acquire() (which I don't
> think is the case, but wanted to check)?

I was referring to the two following APIs:

- spin_lock()
- smp_load_acquire()

On x86, spin_lock() happens to be implemented with an atomic instruction, which
implies a full memory barrier. However, its smp_load_acquire() does not provide
a full memory barrier. Therefore, if we implement a smp_mb__after_acquire() as
proposed by Peter, we could expect it to cover both APIs, which is tricky to do
efficiently without adding a superfluous barrier.
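
A rough illustration of the difference (paraphrased, not verbatim):

/*
 * x86 smp_load_acquire(): a plain load plus a compiler barrier, since
 * x86 loads already have acquire semantics. This is NOT a full
 * memory barrier:
 *
 *      #define __smp_load_acquire(p)                           \
 *      ({                                                      \
 *              typeof(*p) ___p1 = READ_ONCE(*p);               \
 *              barrier();                                      \
 *              ___p1;                                          \
 *      })
 *
 * x86 spin_lock(): ends up executing a LOCK-prefixed RMW instruction,
 * which is a full memory barrier, so smp_mb__after_spinlock() can be
 * a no-op there, while a barrier that also had to cover
 * smp_load_acquire() could not be.
 */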

Hence the discussion about making its naming more specific, so it does not cover
smp_load_acquire.

> 
> We need to write this stuff down.
> 
>> We could keep a define of smp_mb__after_spinlock to smp_mb__after_atomic_acquire
>> to make the transition simpler.
> 
> I'm not sure I really see the benefit of the rename, to be honest with you,
> especially if smp_mb__after_spinlock() doesn't disappear at the same time.
> The only reason you'd use this barrier is because the atomic is hidden away
> behind a locking API, otherwise you'd just have used the full-barrier variant
> of the atomic op to start with, wouldn't you?

Good point.

As long as we can state that smp_mb__after_spinlock applies both to raw_spinlock
and non-raw spinlock (which AFAIU are mutexes on RT), I think it would suffice for
our immediate needs.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


