* [PATCH 0/2] sched: finish_switch_mm hook
@ 2013-11-13  8:16 Martin Schwidefsky
  2013-11-13  8:16 ` [PATCH 1/2] sched/mm: add finish_switch_mm function Martin Schwidefsky
  2013-11-13  8:16 ` [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries Martin Schwidefsky
  0 siblings, 2 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-13  8:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: Martin Schwidefsky

Hi,

in order to implement TLB flushing on s390 in a way that is 100%
compliant with the architecture, I would like to introduce a new
hook into the scheduler. Conceptually it pairs with the switch_mm
function and is called after all locks have been released, to
complete the switch_mm operation. The use case for s390 boils
down to the ability to wait for another CPU in switch_mm/
finish_switch_mm.

Right now the issue is theoretical and can only show up under
an extreme race condition in combination with a buggy user space
program. But to implement the local TLB flushing that has been
added with the latest machine (zEC12), the finish_switch_mm hook
will be needed.

Patch #2 of the series is included as a reference and shows how the
finish_switch_mm hook would be used in the s390 code. The description
of the patch has more details on why the hook is needed.
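
To illustrate the intended pairing, here is a minimal sketch (the
wrapper function below is hypothetical and only shows the calling
convention; the real wiring is in patch #1):

/*
 * Sketch only: switch_mm() runs with scheduler locks held and may
 * merely record that work is pending; finish_switch_mm() runs after
 * all locks have been dropped, so an architecture may wait/spin there.
 */
static inline void example_mm_switch(struct mm_struct *prev,
				     struct mm_struct *next,
				     struct task_struct *tsk)
{
	/* called under the runqueue lock and/or task_lock */
	switch_mm(prev, next, tsk);

	/* ... locks released (rq unlock / task_unlock) ... */

	/* called lock-free; s390 waits here for concurrent TLB flushers */
	finish_switch_mm(next, tsk);
}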

-- 
1.7.9.5



* [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13  8:16 [PATCH 0/2] sched: finish_switch_mm hook Martin Schwidefsky
@ 2013-11-13  8:16 ` Martin Schwidefsky
  2013-11-13 11:41   ` Peter Zijlstra
  2013-11-13  8:16 ` [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries Martin Schwidefsky
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-13  8:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: Martin Schwidefsky

The switch_mm function is called with the task_lock and/or with the
runqueue lock held. Add finish_switch_mm to allow an architecture
to execute some code after the mm has been switched but without
any locks held. One use case is the s390 architecture, which will
use this to wait for the completion of TLB flush operations.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 include/linux/mmu_context.h |    6 ++++++
 kernel/sched/core.c         |    7 +++++--
 mm/mmu_context.c            |    3 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeb..0971c37 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,15 @@
 #ifndef _LINUX_MMU_CONTEXT_H
 #define _LINUX_MMU_CONTEXT_H
 
+#include <asm/mmu_context.h>
+
 struct mm_struct;
 
 void use_mm(struct mm_struct *mm);
 void unuse_mm(struct mm_struct *mm);
 
+#ifndef finish_switch_mm
+#define finish_switch_mm(mm, tsk) do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1deccd7..89409cb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -32,7 +32,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/capability.h>
 #include <linux/completion.h>
@@ -1996,6 +1996,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	perf_event_task_sched_in(prev, current);
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
+	finish_switch_mm(current->mm, current);
 
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
@@ -4140,8 +4141,10 @@ void idle_task_exit(void)
 
 	BUG_ON(cpu_online(smp_processor_id()));
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
 		switch_mm(mm, &init_mm, current);
+		finish_switch_mm(&init_mm, current);
+	}
 	mmdrop(mm);
 }
 
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 8a8cd02..11b3d47 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -8,8 +8,6 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 
-#include <asm/mmu_context.h>
-
 /*
  * use_mm
  *	Makes the calling kernel thread take on the specified
@@ -31,6 +29,7 @@ void use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	switch_mm(active_mm, mm, tsk);
 	task_unlock(tsk);
+	finish_switch_mm(mm, tsk);
 
 	if (active_mm != mm)
 		mmdrop(active_mm);
-- 
1.7.9.5



* [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-13  8:16 [PATCH 0/2] sched: finish_switch_mm hook Martin Schwidefsky
  2013-11-13  8:16 ` [PATCH 1/2] sched/mm: add finish_switch_mm function Martin Schwidefsky
@ 2013-11-13  8:16 ` Martin Schwidefsky
  2013-11-13 16:16   ` Catalin Marinas
  1 sibling, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-13  8:16 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, linux-kernel; +Cc: Martin Schwidefsky

Git commit 050eef364ad70059 "[S390] fix tlb flushing vs. concurrent
/proc accesses" introduced the attach counter to avoid using the
mm_users value to decide between IPTE for every PTE and lazy TLB
flushing with IDTE. That fixed the problem with mm_users but it
introduced another subtle race, fortunately one that is very hard
to hit.
The background is the requirement of the architecture that a valid
PTE may not be changed while it can be used concurrently by another
cpu. The decision between IPTE and lazy TLB flushing needs to be
made while the PTE is still valid. Now if the virtual cpu is
temporarily stopped after the decision to use lazy TLB flushing but
before the invalid bit of the PTE has been set, another cpu can attach
the mm, find that flush_mm is set, do the IDTE, return to userspace,
and recreate a TLB entry that uses the PTE in question. When the first,
stopped cpu continues it will change the PTE while it is still attached
on another cpu. The first cpu will do another IDTE shortly after the
modification of the PTE, which makes the race window quite short.

To fix this race the CPU that wants to attach the address space of a
user space thread needs to wait for the end of the PTE modification.
To do that the number of concurrent TLB flushers for an mm is tracked
in the upper 16 bits of the attach_count and finish_switch_mm is used
to wait for the end of the flush operation if required.
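
The attach_count encoding can be pictured with a few illustrative
helpers (the helper names below are made up for this description and
are not part of the patch, which open-codes the 0x10000 / 0xffff
arithmetic):

/*
 * attach_count layout:
 *   bits  0..15 - number of CPUs that have the mm attached
 *   bits 16..31 - number of CPUs in a lazy TLB flush critical phase
 */
#define ATTACH_COUNT_FLUSHER	0x10000
#define ATTACH_COUNT_MASK	0xffff

/* flusher side: announce the flush, check attachers, retract */
static inline int tlb_flush_enter(struct mm_struct *mm)
{
	return atomic_add_return(ATTACH_COUNT_FLUSHER,
				 &mm->context.attach_count) & ATTACH_COUNT_MASK;
}

static inline void tlb_flush_leave(struct mm_struct *mm)
{
	atomic_sub(ATTACH_COUNT_FLUSHER, &mm->context.attach_count);
}

/* attacher side: wait until no flusher is in its critical phase */
static inline void wait_for_tlb_flushers(struct mm_struct *mm)
{
	while (atomic_read(&mm->context.attach_count) >> 16)
		cpu_relax();
}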

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 arch/s390/include/asm/mmu_context.h |   35 +++++++++++++++++---
 arch/s390/include/asm/pgtable.h     |   60 +++++++++++++++++++++--------------
 arch/s390/include/asm/thread_info.h |    2 ++
 3 files changed, 69 insertions(+), 28 deletions(-)

diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 5d1f950..e91afeb 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 			     struct task_struct *tsk)
 {
-	cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
-	update_mm(next, tsk);
+	int cpu = smp_processor_id();
+
+	if (prev == next)
+		return;
+	if (atomic_inc_return(&next->context.attach_count) >> 16) {
+		/* Delay update_mm until all TLB flushes are done. */
+		set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
+	} else {
+		cpumask_set_cpu(cpu, mm_cpumask(next));
+		update_mm(next, tsk);
+		if (next->context.flush_mm)
+			/* Flush pending TLBs */
+			__tlb_flush_mm(next);
+	}
 	atomic_dec(&prev->context.attach_count);
 	WARN_ON(atomic_read(&prev->context.attach_count) < 0);
-	atomic_inc(&next->context.attach_count);
-	/* Check for TLBs not flushed yet */
-	__tlb_flush_mm_lazy(next);
+}
+
+#define finish_switch_mm finish_switch_mm
+static inline void finish_switch_mm(struct mm_struct *mm,
+				    struct task_struct *tsk)
+{
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
+		return;
+
+	while (atomic_read(&mm->context.attach_count) >> 16)
+		cpu_relax();
+
+	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
+	update_mm(mm, tsk);
+	if (mm->context.flush_mm)
+		__tlb_flush_mm(mm);
 }
 
 #define enter_lazy_tlb(mm,tsk)	do { } while (0)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2204400..fc4bb82 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1034,30 +1034,41 @@ static inline int ptep_test_and_clear_user_young(struct mm_struct *mm,
 
 static inline void __ptep_ipte(unsigned long address, pte_t *ptep)
 {
-	if (!(pte_val(*ptep) & _PAGE_INVALID)) {
+	unsigned long pto = (unsigned long) ptep;
+
 #ifndef CONFIG_64BIT
-		/* pto must point to the start of the segment table */
-		pte_t *pto = (pte_t *) (((unsigned long) ptep) & 0x7ffffc00);
-#else
-		/* ipte in zarch mode can do the math */
-		pte_t *pto = ptep;
+	/* pto in ESA mode must point to the start of the segment table */
+	pto &= 0x7ffffc00;
 #endif
-		asm volatile(
-			"	ipte	%2,%3"
-			: "=m" (*ptep) : "m" (*ptep),
-			  "a" (pto), "a" (address));
-	}
+	/* Invalidation + global TLB flush for the pte */
+	asm volatile(
+		"	ipte	%2,%3"
+		: "=m" (*ptep) : "m" (*ptep), "a" (pto), "a" (address));
+}
+
+static inline void ptep_flush_direct(struct mm_struct *mm,
+				     unsigned long address, pte_t *ptep)
+{
+	if (pte_val(*ptep) & _PAGE_INVALID)
+		return;
+	__ptep_ipte(address, ptep);
 }
 
 static inline void ptep_flush_lazy(struct mm_struct *mm,
 				   unsigned long address, pte_t *ptep)
 {
-	int active = (mm == current->active_mm) ? 1 : 0;
+	int active, count;
 
-	if (atomic_read(&mm->context.attach_count) > active)
-		__ptep_ipte(address, ptep);
-	else
+	if (pte_val(*ptep) & _PAGE_INVALID)
+		return;
+	active = (mm == current->active_mm) ? 1 : 0;
+	count = atomic_add_return(0x10000, &mm->context.attach_count);
+	if ((count & 0xffff) <= active) {
+		pte_val(*ptep) |= _PAGE_INVALID;
 		mm->context.flush_mm = 1;
+	} else
+		__ptep_ipte(address, ptep);
+	atomic_sub(0x10000, &mm->context.attach_count);
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
@@ -1074,7 +1085,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	}
 
 	pte = *ptep;
-	__ptep_ipte(addr, ptep);
+	ptep_flush_direct(vma->vm_mm, addr, ptep);
 	young = pte_young(pte);
 	pte = pte_mkold(pte);
 
@@ -1145,7 +1156,6 @@ static inline pte_t ptep_modify_prot_start(struct mm_struct *mm,
 
 	pte = *ptep;
 	ptep_flush_lazy(mm, address, ptep);
-	pte_val(*ptep) |= _PAGE_INVALID;
 
 	if (mm_has_pgste(mm)) {
 		pgste = pgste_update_all(&pte, pgste);
@@ -1182,7 +1192,7 @@ static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
 	}
 
 	pte = *ptep;
-	__ptep_ipte(address, ptep);
+	ptep_flush_direct(vma->vm_mm, address, ptep);
 	pte_val(*ptep) = _PAGE_INVALID;
 
 	if (mm_has_pgste(vma->vm_mm)) {
@@ -1263,7 +1273,7 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
 		pgste = pgste_ipte_notify(vma->vm_mm, address, ptep, pgste);
 	}
 
-	__ptep_ipte(address, ptep);
+	ptep_flush_direct(vma->vm_mm, address, ptep);
 
 	if (mm_has_pgste(vma->vm_mm)) {
 		pgste_set_pte(ptep, entry);
@@ -1447,12 +1457,16 @@ static inline pmd_t pmd_mkwrite(pmd_t pmd)
 static inline void pmdp_flush_lazy(struct mm_struct *mm,
 				   unsigned long address, pmd_t *pmdp)
 {
-	int active = (mm == current->active_mm) ? 1 : 0;
+	int active, count;
 
-	if ((atomic_read(&mm->context.attach_count) & 0xffff) > active)
-		__pmd_idte(address, pmdp);
-	else
+	active = (mm == current->active_mm) ? 1 : 0;
+	count = atomic_add_return(0x10000, &mm->context.attach_count);
+	if ((count & 0xffff) <= active) {
+		pmd_val(*pmdp) |= _SEGMENT_ENTRY_INVALID;
 		mm->context.flush_mm = 1;
+	} else
+		__pmd_idte(address, pmdp);
+	atomic_sub(0x10000, &mm->context.attach_count);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index eb5f64d..7991a11 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -81,6 +81,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
 #define TIF_SIGPENDING		2	/* signal pending */
 #define TIF_NEED_RESCHED	3	/* rescheduling necessary */
+#define TIF_TLB_WAIT		4	/* wait for TLB flush completion */
 #define TIF_PER_TRAP		6	/* deliver sigtrap on return to user */
 #define TIF_MCCK_PENDING	7	/* machine check handling is pending */
 #define TIF_SYSCALL_TRACE	8	/* syscall trace active */
@@ -96,6 +97,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
+#define _TIF_TLB_WAIT		(1<<TIF_TLB_WAIT)
 #define _TIF_PER_TRAP		(1<<TIF_PER_TRAP)
 #define _TIF_MCCK_PENDING	(1<<TIF_MCCK_PENDING)
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
-- 
1.7.9.5



* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13  8:16 ` [PATCH 1/2] sched/mm: add finish_switch_mm function Martin Schwidefsky
@ 2013-11-13 11:41   ` Peter Zijlstra
  2013-11-13 11:49     ` Martin Schwidefsky
  2013-11-13 12:19     ` Catalin Marinas
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2013-11-13 11:41 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, linux-kernel, Catalin Marinas

On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> The switch_mm function is called with the task_lock and/or with
> request queue lock. Add finish_switch_mm to allow an architecture
> to execute some code after the mm has been switched but without
> any locks held. One use case is the s390 architecture which will
> use this to wait for the completion of TLB flush operations.

This so reminds me of what finish_arch_post_lock_switch() was supposed
to do. See commit: 01f23e1630d9 ("sched/arch: Introduce the
finish_arch_post_lock_switch() scheduler callback").

Now you hook into more places; but maybe you and Catalin can come up
with something you both can use?

(preserved patch for Catalin)

> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> ---
>  include/linux/mmu_context.h |    6 ++++++
>  kernel/sched/core.c         |    7 +++++--
>  mm/mmu_context.c            |    3 +--
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
> index 70fffeb..0971c37 100644
> --- a/include/linux/mmu_context.h
> +++ b/include/linux/mmu_context.h
> @@ -1,9 +1,15 @@
>  #ifndef _LINUX_MMU_CONTEXT_H
>  #define _LINUX_MMU_CONTEXT_H
>  
> +#include <asm/mmu_context.h>
> +
>  struct mm_struct;
>  
>  void use_mm(struct mm_struct *mm);
>  void unuse_mm(struct mm_struct *mm);
>  
> +#ifndef finish_switch_mm
> +#define finish_switch_mm(mm, tsk) do { } while (0)
> +#endif
> +
>  #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1deccd7..89409cb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -32,7 +32,7 @@
>  #include <linux/init.h>
>  #include <linux/uaccess.h>
>  #include <linux/highmem.h>
> -#include <asm/mmu_context.h>
> +#include <linux/mmu_context.h>
>  #include <linux/interrupt.h>
>  #include <linux/capability.h>
>  #include <linux/completion.h>
> @@ -1996,6 +1996,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  	perf_event_task_sched_in(prev, current);
>  	finish_lock_switch(rq, prev);
>  	finish_arch_post_lock_switch();
> +	finish_switch_mm(current->mm, current);
>  
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)
> @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
>  
>  	BUG_ON(cpu_online(smp_processor_id()));
>  
> -	if (mm != &init_mm)
> +	if (mm != &init_mm) {
>  		switch_mm(mm, &init_mm, current);
> +		finish_switch_mm(&init_mm, current);
> +	}
>  	mmdrop(mm);
>  }
>  
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 8a8cd02..11b3d47 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -8,8 +8,6 @@
>  #include <linux/export.h>
>  #include <linux/sched.h>
>  
> -#include <asm/mmu_context.h>
> -
>  /*
>   * use_mm
>   *	Makes the calling kernel thread take on the specified
> @@ -31,6 +29,7 @@ void use_mm(struct mm_struct *mm)
>  	tsk->mm = mm;
>  	switch_mm(active_mm, mm, tsk);
>  	task_unlock(tsk);
> +	finish_switch_mm(mm, tsk);
>  
>  	if (active_mm != mm)
>  		mmdrop(active_mm);
> -- 
> 1.7.9.5
> 


* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13 11:41   ` Peter Zijlstra
@ 2013-11-13 11:49     ` Martin Schwidefsky
  2013-11-13 12:19     ` Catalin Marinas
  1 sibling, 0 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-13 11:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Catalin Marinas

On Wed, 13 Nov 2013 12:41:43 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > The switch_mm function is called with the task_lock and/or with
> > request queue lock. Add finish_switch_mm to allow an architecture
> > to execute some code after the mm has been switched but without
> > any locks held. One use case is the s390 architecture which will
> > use this to wait for the completion of TLB flush operations.
> 
> This so reminds me of what finish_arch_post_lock_switch() was supposed
> to do. See commit: 01f23e1630d9 ("sched/arch: Introduce the
> finish_arch_post_lock_switch() scheduler callback").
> 
> Now you hook into more places; but maybe you can Catalin can come up
> with something you both can use?

finish_task_switch() has the call to finish_arch_post_lock_switch(), but
idle_task_exit() and use_mm() do not have the call. If it is safe to add
finish_arch_post_lock_switch() there as well, I could use it for s390.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13 11:41   ` Peter Zijlstra
  2013-11-13 11:49     ` Martin Schwidefsky
@ 2013-11-13 12:19     ` Catalin Marinas
  2013-11-13 16:05       ` Martin Schwidefsky
  1 sibling, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-13 12:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Martin Schwidefsky, Ingo Molnar, linux-kernel

On Wed, Nov 13, 2013 at 11:41:43AM +0000, Peter Zijlstra wrote:
> On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > The switch_mm function is called with the task_lock and/or with
> > request queue lock. Add finish_switch_mm to allow an architecture
> > to execute some code after the mm has been switched but without
> > any locks held. One use case is the s390 architecture which will
> > use this to wait for the completion of TLB flush operations.

We have similar needs on arm and arm64 (full cache flushing where we
want interrupts enabled, or some IPIs for TLB tagging synchronisation).

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1deccd7..89409cb 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -32,7 +32,7 @@
> >  #include <linux/init.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/highmem.h>
> > -#include <asm/mmu_context.h>
> > +#include <linux/mmu_context.h>
> >  #include <linux/interrupt.h>
> >  #include <linux/capability.h>
> >  #include <linux/completion.h>
> > @@ -1996,6 +1996,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> >  	perf_event_task_sched_in(prev, current);
> >  	finish_lock_switch(rq, prev);
> >  	finish_arch_post_lock_switch();
> > +	finish_switch_mm(current->mm, current);

This could use the same hook.

> >  
> >  	fire_sched_in_preempt_notifiers(current);
> >  	if (mm)
> > @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
> >  
> >  	BUG_ON(cpu_online(smp_processor_id()));
> >  
> > -	if (mm != &init_mm)
> > +	if (mm != &init_mm) {
> >  		switch_mm(mm, &init_mm, current);
> > +		finish_switch_mm(&init_mm, current);
> > +	}
> >  	mmdrop(mm);
> >  }

Here finish_switch_mm() is called in the same context as switch_mm().
What we have on ARM via switch_mm() is to check for irqs_disabled() and,
if so, defer the actual switching via a flag until the
finish_arch_post_lock_switch() hook. But on ARM we only cared about the
interrupts being enabled.
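
The deferral pattern described here would look roughly like the sketch
below. This is only an illustration, not the actual arch/arm code;
MM_SWITCH_PENDING, context.flags and cpu_do_switch_mm_hw() are made-up
names:

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
			     struct task_struct *tsk)
{
	if (irqs_disabled()) {
		/* scheduler context: defer the real switch via a flag */
		set_bit(MM_SWITCH_PENDING, &next->context.flags);
		return;
	}
	cpu_do_switch_mm_hw(next);	/* the actual hardware switch */
}

#define finish_arch_post_lock_switch finish_arch_post_lock_switch
static inline void finish_arch_post_lock_switch(void)
{
	struct mm_struct *mm = current->mm;

	/* locks dropped, interrupts enabled again: do the deferred switch */
	if (mm && test_and_clear_bit(MM_SWITCH_PENDING, &mm->context.flags))
		cpu_do_switch_mm_hw(mm);
}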

> > diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> > index 8a8cd02..11b3d47 100644
> > --- a/mm/mmu_context.c
> > +++ b/mm/mmu_context.c
> > @@ -8,8 +8,6 @@
> >  #include <linux/export.h>
> >  #include <linux/sched.h>
> >  
> > -#include <asm/mmu_context.h>
> > -
> >  /*
> >   * use_mm
> >   *	Makes the calling kernel thread take on the specified
> > @@ -31,6 +29,7 @@ void use_mm(struct mm_struct *mm)
> >  	tsk->mm = mm;
> >  	switch_mm(active_mm, mm, tsk);
> >  	task_unlock(tsk);
> > +	finish_switch_mm(mm, tsk);

As above, for ARM we only care about interrupts being enabled, so it
didn't require a hook.

Is s390 switch_mm() ok with only interrupts being enabled but some locks
held?

-- 
Catalin


* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13 12:19     ` Catalin Marinas
@ 2013-11-13 16:05       ` Martin Schwidefsky
  2013-11-13 17:03         ` Catalin Marinas
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-13 16:05 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On Wed, 13 Nov 2013 12:19:09 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Wed, Nov 13, 2013 at 11:41:43AM +0000, Peter Zijlstra wrote:
> > On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > > The switch_mm function is called with the task_lock and/or with
> > > request queue lock. Add finish_switch_mm to allow an architecture
> > > to execute some code after the mm has been switched but without
> > > any locks held. One use case is the s390 architecture which will
> > > use this to wait for the completion of TLB flush operations.
> 
> We have similar needs on arm and arm64 (full cache flushing where we
> want interrupts enable or some IPIs for TLB tagging synchronisation).

On s390 we need to wait for the completion of a TLB flush.

> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 1deccd7..89409cb 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -32,7 +32,7 @@
> > >  #include <linux/init.h>
> > >  #include <linux/uaccess.h>
> > >  #include <linux/highmem.h>
> > > -#include <asm/mmu_context.h>
> > > +#include <linux/mmu_context.h>
> > >  #include <linux/interrupt.h>
> > >  #include <linux/capability.h>
> > >  #include <linux/completion.h>
> > > @@ -1996,6 +1996,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> > >  	perf_event_task_sched_in(prev, current);
> > >  	finish_lock_switch(rq, prev);
> > >  	finish_arch_post_lock_switch();
> > > +	finish_switch_mm(current->mm, current);
> 
> This could use the same hook.

Yes.
 
> > >  
> > >  	fire_sched_in_preempt_notifiers(current);
> > >  	if (mm)
> > > @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
> > >  
> > >  	BUG_ON(cpu_online(smp_processor_id()));
> > >  
> > > -	if (mm != &init_mm)
> > > +	if (mm != &init_mm) {
> > >  		switch_mm(mm, &init_mm, current);
> > > +		finish_switch_mm(&init_mm, current);
> > > +	}
> > >  	mmdrop(mm);
> > >  }
> 
> Here finish_switch_mm() is called in the same context with switch_mm().
> What we have on ARM via switch_mm() is to check for irqs_disabled() and
> if yes, defer the actual switching via a flag until the
> finish_arch_post_lock_switch() hook. But on ARM we only cared about the
> interrupts being enabled.

The guarantee s390 needs is that the rq-lock is not taken. What I have
seen with the wait loop in switch_mm is a deadlock: one CPU (#0) was
looping in switch_mm to wait for the TLB flush of another CPU (#1),
while CPU #1 got an interrupt that tried to wake up a task which
happened to be on the run-queue of CPU #0.

> > > diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> > > index 8a8cd02..11b3d47 100644
> > > --- a/mm/mmu_context.c
> > > +++ b/mm/mmu_context.c
> > > @@ -8,8 +8,6 @@
> > >  #include <linux/export.h>
> > >  #include <linux/sched.h>
> > >  
> > > -#include <asm/mmu_context.h>
> > > -
> > >  /*
> > >   * use_mm
> > >   *	Makes the calling kernel thread take on the specified
> > > @@ -31,6 +29,7 @@ void use_mm(struct mm_struct *mm)
> > >  	tsk->mm = mm;
> > >  	switch_mm(active_mm, mm, tsk);
> > >  	task_unlock(tsk);
> > > +	finish_switch_mm(mm, tsk);
> 
> As above, for ARM we only care about interrupts being enabled, so it
> didn't require a hook.
> 
> Is s390 switch_mm() ok with only interrupts being enabled but some locks
> held?

Interrupts on/off is not the problem for s390. 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-13  8:16 ` [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries Martin Schwidefsky
@ 2013-11-13 16:16   ` Catalin Marinas
  2013-11-14  8:10     ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-13 16:16 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> index 5d1f950..e91afeb 100644
> --- a/arch/s390/include/asm/mmu_context.h
> +++ b/arch/s390/include/asm/mmu_context.h
> @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
>  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
>                              struct task_struct *tsk)
>  {
> -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> -       update_mm(next, tsk);
> +       int cpu = smp_processor_id();
> +
> +       if (prev == next)
> +               return;
> +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> +               /* Delay update_mm until all TLB flushes are done. */
> +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> +       } else {
> +               cpumask_set_cpu(cpu, mm_cpumask(next));
> +               update_mm(next, tsk);
> +               if (next->context.flush_mm)
> +                       /* Flush pending TLBs */
> +                       __tlb_flush_mm(next);
> +       }
>         atomic_dec(&prev->context.attach_count);
>         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> -       atomic_inc(&next->context.attach_count);
> -       /* Check for TLBs not flushed yet */
> -       __tlb_flush_mm_lazy(next);
> +}
> +
> +#define finish_switch_mm finish_switch_mm
> +static inline void finish_switch_mm(struct mm_struct *mm,
> +                                   struct task_struct *tsk)
> +{
> +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> +               return;
> +
> +       while (atomic_read(&mm->context.attach_count) >> 16)
> +               cpu_relax();
> +
> +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> +       update_mm(mm, tsk);
> +       if (mm->context.flush_mm)
> +               __tlb_flush_mm(mm);
>  }

Some care is needed here with preemption (we had this on arm and I
think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
on a thread but you get preempted just before finish_switch_mm(). The
new thread has the same mm as the preempted one and switch_mm() exits
early without setting another flag. So finish_switch_mm() wouldn't do
anything, but you still switched to the new mm. The fix is to make the
flag per mm rather than per thread (see commit bdae73cd374e).

-- 
Catalin


* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13 16:05       ` Martin Schwidefsky
@ 2013-11-13 17:03         ` Catalin Marinas
  2013-11-14  8:00           ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-13 17:03 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On Wed, Nov 13, 2013 at 04:05:56PM +0000, Martin Schwidefsky wrote:
> On Wed, 13 Nov 2013 12:19:09 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Nov 13, 2013 at 11:41:43AM +0000, Peter Zijlstra wrote:
> > > On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > > >  	fire_sched_in_preempt_notifiers(current);
> > > >  	if (mm)
> > > > @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
> > > >  
> > > >  	BUG_ON(cpu_online(smp_processor_id()));
> > > >  
> > > > -	if (mm != &init_mm)
> > > > +	if (mm != &init_mm) {
> > > >  		switch_mm(mm, &init_mm, current);
> > > > +		finish_switch_mm(&init_mm, current);
> > > > +	}
> > > >  	mmdrop(mm);
> > > >  }
> > 
> > Here finish_switch_mm() is called in the same context with switch_mm().
> > What we have on ARM via switch_mm() is to check for irqs_disabled() and
> > if yes, defer the actual switching via a flag until the
> > finish_arch_post_lock_switch() hook. But on ARM we only cared about the
> > interrupts being enabled.
> 
> The guarantee s390 needs is that the rq-lock is not taken. What I have
> seen with the wait loop in switch_mm is a dead lock because one CPU #0
> was looping in switch_mm to wait for the TLB flush of another CPU #1.
> CPU #1 got an interrupt that tried to wake-up a task which happened to
> be on the run-queue of CPU #0.

I'm not familiar with the s390 code, so how's the waiting done? Is it
part of an on_each_cpu() call (that's what I got from smp_ptlb_all)?

-- 
Catalin


* Re: [PATCH 1/2] sched/mm: add finish_switch_mm function
  2013-11-13 17:03         ` Catalin Marinas
@ 2013-11-14  8:00           ` Martin Schwidefsky
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-14  8:00 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Peter Zijlstra, Ingo Molnar, linux-kernel

On Wed, 13 Nov 2013 17:03:58 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Wed, Nov 13, 2013 at 04:05:56PM +0000, Martin Schwidefsky wrote:
> > On Wed, 13 Nov 2013 12:19:09 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Nov 13, 2013 at 11:41:43AM +0000, Peter Zijlstra wrote:
> > > > On Wed, Nov 13, 2013 at 09:16:13AM +0100, Martin Schwidefsky wrote:
> > > > >  	fire_sched_in_preempt_notifiers(current);
> > > > >  	if (mm)
> > > > > @@ -4140,8 +4141,10 @@ void idle_task_exit(void)
> > > > >  
> > > > >  	BUG_ON(cpu_online(smp_processor_id()));
> > > > >  
> > > > > -	if (mm != &init_mm)
> > > > > +	if (mm != &init_mm) {
> > > > >  		switch_mm(mm, &init_mm, current);
> > > > > +		finish_switch_mm(&init_mm, current);
> > > > > +	}
> > > > >  	mmdrop(mm);
> > > > >  }
> > > 
> > > Here finish_switch_mm() is called in the same context with switch_mm().
> > > What we have on ARM via switch_mm() is to check for irqs_disabled() and
> > > if yes, defer the actual switching via a flag until the
> > > finish_arch_post_lock_switch() hook. But on ARM we only cared about the
> > > interrupts being enabled.
> > 
> > The guarantee s390 needs is that the rq-lock is not taken. What I have
> > seen with the wait loop in switch_mm is a dead lock because one CPU #0
> > was looping in switch_mm to wait for the TLB flush of another CPU #1.
> > CPU #1 got an interrupt that tried to wake-up a task which happened to
> > be on the run-queue of CPU #0.
> 
> I'm not familiar with the s390 code, so how's the waiting done? Is it
> part of an on_each_cpu() call (that's what I got from smp_ptlb_all)?
 
Ok, the long explanation: the requirement of the architecture is that a
PTE needs to be modified in a coordinated fashion if there is the
possibility that another CPU is using the mm at the same time.
Coordinated means that the modification has to be done with specific
instructions, IPTE and IDTE. These instructions make sure that other
CPUs can complete instructions that use the PTE before the PTE is made
invalid and the associated TLB is flushed.

The problem is that these instructions are slow; basically they do an
on_each_cpu() under the hood. Therefore the code tries to batch PTE
operations. For each PTE the code decides if it can do a batched/lazy
PTE update or if the expensive way with IPTE is required. Now if the
decision to use a batched PTE update has just been made while the PTE
is still valid, this can race with another CPU attaching the mm via
switch_mm(). Until the first CPU is done with the PTE modification the
second CPU may not use the mm yet.

The coordination between CPUs is done with mm->context.attach_count.
The upper 16 bits hold a flush counter, the lower 16 bits an attach
counter. Each attacher of an mm increases the lower 16 bits; this is
used by the flusher to decide between lazy and direct flushing. Each
flusher increases the upper 16 bits while it is in the critical phase
of the update. The attacher uses this to loop until the flush counter
is zero again. And this wait loop may not hold any locks, otherwise it
can deadlock.
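
In code terms the flusher side of this is ptep_flush_lazy() from patch
#2; roughly (simplified from the patch, with comments added):

static inline void ptep_flush_lazy(struct mm_struct *mm,
				   unsigned long address, pte_t *ptep)
{
	int active, count;

	if (pte_val(*ptep) & _PAGE_INVALID)
		return;
	active = (mm == current->active_mm) ? 1 : 0;
	/* announce this flusher in the upper 16 bits of attach_count */
	count = atomic_add_return(0x10000, &mm->context.attach_count);
	if ((count & 0xffff) <= active) {
		/* no other CPU has the mm attached: lazy flush is fine */
		pte_val(*ptep) |= _PAGE_INVALID;
		mm->context.flush_mm = 1;
	} else {
		/* another CPU is attached: coordinated flush with IPTE */
		__ptep_ipte(address, ptep);
	}
	/* leave the critical phase; attachers may proceed again */
	atomic_sub(0x10000, &mm->context.attach_count);
}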

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-13 16:16   ` Catalin Marinas
@ 2013-11-14  8:10     ` Martin Schwidefsky
  2013-11-14 13:22       ` Catalin Marinas
  2013-11-15  9:13       ` Martin Schwidefsky
  0 siblings, 2 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-14  8:10 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Wed, 13 Nov 2013 16:16:35 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > index 5d1f950..e91afeb 100644
> > --- a/arch/s390/include/asm/mmu_context.h
> > +++ b/arch/s390/include/asm/mmu_context.h
> > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> >                              struct task_struct *tsk)
> >  {
> > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > -       update_mm(next, tsk);
> > +       int cpu = smp_processor_id();
> > +
> > +       if (prev == next)
> > +               return;
> > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > +               /* Delay update_mm until all TLB flushes are done. */
> > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > +       } else {
> > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > +               update_mm(next, tsk);
> > +               if (next->context.flush_mm)
> > +                       /* Flush pending TLBs */
> > +                       __tlb_flush_mm(next);
> > +       }
> >         atomic_dec(&prev->context.attach_count);
> >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > -       atomic_inc(&next->context.attach_count);
> > -       /* Check for TLBs not flushed yet */
> > -       __tlb_flush_mm_lazy(next);
> > +}
> > +
> > +#define finish_switch_mm finish_switch_mm
> > +static inline void finish_switch_mm(struct mm_struct *mm,
> > +                                   struct task_struct *tsk)
> > +{
> > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > +               return;
> > +
> > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > +               cpu_relax();
> > +
> > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > +       update_mm(mm, tsk);
> > +       if (mm->context.flush_mm)
> > +               __tlb_flush_mm(mm);
> >  }
> 
> Some care is needed here with preemption (we had this on arm and I
> think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> on a thread but you get preempted just before finish_switch_mm(). The
> new thread has the same mm as the preempted on and switch_mm() exits
> early without setting another flag. So finish_switch_mm() wouldn't do
> anything but you still switched to the new mm. The fix is to make the
> flag per mm rather than thread (see commit bdae73cd374e).

Interesting. For s390 I need to make sure that each task attaching an
mm waits for the completion of concurrent TLB flush operations. If the
scheduler does not switch the mm I don't care, the mm is still attached.
For the s390 issue a TIF bit seems appropriate. But I have to add a
preempt_enable/preempt_disable pair to finish_switch_mm, otherwise the
task can get hit by preemption after the while loop.
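
The variant being considered here would look roughly like this (a
sketch only; as the follow-up later in the thread concludes, the
preempt_disable/preempt_enable pair turns out to be unnecessary):

#define finish_switch_mm finish_switch_mm
static inline void finish_switch_mm(struct mm_struct *mm,
				    struct task_struct *tsk)
{
	if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
		return;

	preempt_disable();
	/* wait for all concurrent flushers to leave the critical phase */
	while (atomic_read(&mm->context.attach_count) >> 16)
		cpu_relax();

	cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
	update_mm(mm, tsk);
	if (mm->context.flush_mm)
		__tlb_flush_mm(mm);
	preempt_enable();
}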

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-14  8:10     ` Martin Schwidefsky
@ 2013-11-14 13:22       ` Catalin Marinas
  2013-11-14 16:33         ` Martin Schwidefsky
  2013-11-15  9:13       ` Martin Schwidefsky
  1 sibling, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-14 13:22 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> On Wed, 13 Nov 2013 16:16:35 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > index 5d1f950..e91afeb 100644
> > > --- a/arch/s390/include/asm/mmu_context.h
> > > +++ b/arch/s390/include/asm/mmu_context.h
> > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > >                              struct task_struct *tsk)
> > >  {
> > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > -       update_mm(next, tsk);
> > > +       int cpu = smp_processor_id();
> > > +
> > > +       if (prev == next)
> > > +               return;
> > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > +               /* Delay update_mm until all TLB flushes are done. */
> > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > +       } else {
> > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > +               update_mm(next, tsk);
> > > +               if (next->context.flush_mm)
> > > +                       /* Flush pending TLBs */
> > > +                       __tlb_flush_mm(next);
> > > +       }
> > >         atomic_dec(&prev->context.attach_count);
> > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > -       atomic_inc(&next->context.attach_count);
> > > -       /* Check for TLBs not flushed yet */
> > > -       __tlb_flush_mm_lazy(next);
> > > +}
> > > +
> > > +#define finish_switch_mm finish_switch_mm
> > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > +                                   struct task_struct *tsk)
> > > +{
> > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > +               return;
> > > +
> > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > +               cpu_relax();
> > > +
> > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > +       update_mm(mm, tsk);
> > > +       if (mm->context.flush_mm)
> > > +               __tlb_flush_mm(mm);
> > >  }
> > 
> > Some care is needed here with preemption (we had this on arm and I
> > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > on a thread but you get preempted just before finish_switch_mm(). The
> > new thread has the same mm as the preempted on and switch_mm() exits
> > early without setting another flag. So finish_switch_mm() wouldn't do
> > anything but you still switched to the new mm. The fix is to make the
> > flag per mm rather than thread (see commit bdae73cd374e).
> 
> Interesting. For s390 I need to make sure that each task attaching an
> mm waits for the completion of concurrent TLB flush operations. If the
> scheduler does not switch the mm I don't care, the mm is still attached.

I assume the actual hardware mm switch happens via update_mm(). If you
have a context_switch() to a thread which requires an update_mm() but you
defer this until finish_switch_mm(), you may be preempted before the
hardware update. If the new context_switch() schedules a thread with the
same mm as the preempted one, you no longer call update_mm(). So the new
thread actually uses an old hardware mm.

-- 
Catalin


* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-14 13:22       ` Catalin Marinas
@ 2013-11-14 16:33         ` Martin Schwidefsky
  2013-11-15 10:44           ` Catalin Marinas
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-14 16:33 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Thu, 14 Nov 2013 13:22:23 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> > On Wed, 13 Nov 2013 16:16:35 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > 
> > > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > > index 5d1f950..e91afeb 100644
> > > > --- a/arch/s390/include/asm/mmu_context.h
> > > > +++ b/arch/s390/include/asm/mmu_context.h
> > > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > >                              struct task_struct *tsk)
> > > >  {
> > > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > > -       update_mm(next, tsk);
> > > > +       int cpu = smp_processor_id();
> > > > +
> > > > +       if (prev == next)
> > > > +               return;
> > > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > > +               /* Delay update_mm until all TLB flushes are done. */
> > > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > > +       } else {
> > > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > > +               update_mm(next, tsk);
> > > > +               if (next->context.flush_mm)
> > > > +                       /* Flush pending TLBs */
> > > > +                       __tlb_flush_mm(next);
> > > > +       }
> > > >         atomic_dec(&prev->context.attach_count);
> > > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > > -       atomic_inc(&next->context.attach_count);
> > > > -       /* Check for TLBs not flushed yet */
> > > > -       __tlb_flush_mm_lazy(next);
> > > > +}
> > > > +
> > > > +#define finish_switch_mm finish_switch_mm
> > > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > > +                                   struct task_struct *tsk)
> > > > +{
> > > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > > +               return;
> > > > +
> > > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > > +               cpu_relax();
> > > > +
> > > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > > +       update_mm(mm, tsk);
> > > > +       if (mm->context.flush_mm)
> > > > +               __tlb_flush_mm(mm);
> > > >  }
> > > 
> > > Some care is needed here with preemption (we had this on arm and I
> > > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > > on a thread but you get preempted just before finish_switch_mm(). The
> > > new thread has the same mm as the preempted on and switch_mm() exits
> > > early without setting another flag. So finish_switch_mm() wouldn't do
> > > anything but you still switched to the new mm. The fix is to make the
> > > flag per mm rather than thread (see commit bdae73cd374e).
> > 
> > Interesting. For s390 I need to make sure that each task attaching an
> > mm waits for the completion of concurrent TLB flush operations. If the
> > scheduler does not switch the mm I don't care, the mm is still attached.
> 
> I assume the actual hardware mm switch happens via update_mm(). If you
> have a context_switch() to a thread which requires an update_mm() but you
> defer this until finish_switch_mm(), you may be preempted before the
> hardware update. If the new context_switch() schedules a thread with the
> same mm as the preempted one, you no longer call update_mm(). So the new
> thread actually uses an old hardware mm.
 
If the code gets preempted between switch_mm() and finish_switch_mm()
the worst that can happen is that finish_switch_mm() is called twice.
If the preempted task is picked up again, the previous task running
on the CPU at that time will do the schedule() call, including the
switch_mm() and the finish_switch_mm(), before returning to the code
location where preemption interrupted it. I don't see how we could end
up with an incorrect mm.

But back to the original question: would it cause a problem for arm
if we add the two additional calls to finish_arch_post_lock_switch()
to idle_task_exit() and use_mm() ?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-14  8:10     ` Martin Schwidefsky
  2013-11-14 13:22       ` Catalin Marinas
@ 2013-11-15  9:13       ` Martin Schwidefsky
  1 sibling, 0 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-15  9:13 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Catalin Marinas, Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Thu, 14 Nov 2013 09:10:07 +0100
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Wed, 13 Nov 2013 16:16:35 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > index 5d1f950..e91afeb 100644
> > > --- a/arch/s390/include/asm/mmu_context.h
> > > +++ b/arch/s390/include/asm/mmu_context.h
> > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > >                              struct task_struct *tsk)
> > >  {
> > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > -       update_mm(next, tsk);
> > > +       int cpu = smp_processor_id();
> > > +
> > > +       if (prev == next)
> > > +               return;
> > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > +               /* Delay update_mm until all TLB flushes are done. */
> > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > +       } else {
> > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > +               update_mm(next, tsk);
> > > +               if (next->context.flush_mm)
> > > +                       /* Flush pending TLBs */
> > > +                       __tlb_flush_mm(next);
> > > +       }
> > >         atomic_dec(&prev->context.attach_count);
> > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > -       atomic_inc(&next->context.attach_count);
> > > -       /* Check for TLBs not flushed yet */
> > > -       __tlb_flush_mm_lazy(next);
> > > +}
> > > +
> > > +#define finish_switch_mm finish_switch_mm
> > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > +                                   struct task_struct *tsk)
> > > +{
> > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > +               return;
> > > +
> > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > +               cpu_relax();
> > > +
> > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > +       update_mm(mm, tsk);
> > > +       if (mm->context.flush_mm)
> > > +               __tlb_flush_mm(mm);
> > >  }
> > 
> > Some care is needed here with preemption (we had this on arm and I
> > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > on a thread but you get preempted just before finish_switch_mm(). The
> > new thread has the same mm as the preempted on and switch_mm() exits
> > early without setting another flag. So finish_switch_mm() wouldn't do
> > anything but you still switched to the new mm. The fix is to make the
> > flag per mm rather than thread (see commit bdae73cd374e).
> 
> Interesting. For s390 I need to make sure that each task attaching an
> mm waits for the completion of concurrent TLB flush operations. If the
> scheduler does not switch the mm I don't care, the mm is still attached.
> For the s390 issue a TIF bit seems appropriate. But I have to add an
> preempt_enable/preempt_disable pair to finish_switch_mm, otherwise the
> task can get hit by preemption after the while loop.
 
I almost committed a patch to add preempt_enable/preempt_disable when I
realized that it is not needed after all. If a preemptive schedule hits
in finish_switch_mm, a full switch_mm/finish_switch_mm pair will be done
when the task is picked up again by a CPU. The worst that can happen
is that the update_mm is done a second time, which is ok. All good :-)

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-14 16:33         ` Martin Schwidefsky
@ 2013-11-15 10:44           ` Catalin Marinas
  2013-11-15 11:10             ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-15 10:44 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Thu, Nov 14, 2013 at 04:33:59PM +0000, Martin Schwidefsky wrote:
> On Thu, 14 Nov 2013 13:22:23 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> > > On Wed, 13 Nov 2013 16:16:35 +0000
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > 
> > > > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > > > index 5d1f950..e91afeb 100644
> > > > > --- a/arch/s390/include/asm/mmu_context.h
> > > > > +++ b/arch/s390/include/asm/mmu_context.h
> > > > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > > > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > > >                              struct task_struct *tsk)
> > > > >  {
> > > > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > > > -       update_mm(next, tsk);
> > > > > +       int cpu = smp_processor_id();
> > > > > +
> > > > > +       if (prev == next)
> > > > > +               return;
> > > > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > > > +               /* Delay update_mm until all TLB flushes are done. */
> > > > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > > > +       } else {
> > > > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > > > +               update_mm(next, tsk);
> > > > > +               if (next->context.flush_mm)
> > > > > +                       /* Flush pending TLBs */
> > > > > +                       __tlb_flush_mm(next);
> > > > > +       }
> > > > >         atomic_dec(&prev->context.attach_count);
> > > > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > > > -       atomic_inc(&next->context.attach_count);
> > > > > -       /* Check for TLBs not flushed yet */
> > > > > -       __tlb_flush_mm_lazy(next);
> > > > > +}
> > > > > +
> > > > > +#define finish_switch_mm finish_switch_mm
> > > > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > > > +                                   struct task_struct *tsk)
> > > > > +{
> > > > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > > > +               return;
> > > > > +
> > > > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > > > +               cpu_relax();
> > > > > +
> > > > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > > > +       update_mm(mm, tsk);
> > > > > +       if (mm->context.flush_mm)
> > > > > +               __tlb_flush_mm(mm);
> > > > >  }
> > > > 
> > > > Some care is needed here with preemption (we had this on arm and I
> > > > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > > > on a thread but you get preempted just before finish_switch_mm(). The
> > > > new thread has the same mm as the preempted on and switch_mm() exits
> > > > early without setting another flag. So finish_switch_mm() wouldn't do
> > > > anything but you still switched to the new mm. The fix is to make the
> > > > flag per mm rather than thread (see commit bdae73cd374e).
> > > 
> > > Interesting. For s390 I need to make sure that each task attaching an
> > > mm waits for the completion of concurrent TLB flush operations. If the
> > > scheduler does not switch the mm I don't care, the mm is still attached.
> > 
> > I assume the actual hardware mm switch happens via update_mm(). If you
> > have a context_switch() to a thread which requires an update_mm() but you
> > defer this until finish_switch_mm(), you may be preempted before the
> > hardware update. If the new context_switch() schedules a thread with the
> > same mm as the preempted one, you no longer call update_mm(). So the new
> > thread actually uses an old hardware mm.
>  
> If the code gets preempted between switch_mm() and finish_switch_mm()
> the worst that can happen is that finish_switch_mm() is called twice.

Yes, it's called twice, but you only set the TIF_TLB_WAIT the first
time. Here's the scenario:

1. thread-A running with mm-A
2. context_switch() to thread-B1 causing a switch_mm(mm-B)
3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
   update_mm(mm-B). Hardware still using mm-A
4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
5. interrupt and preemption before finish_mm_switch(mm-B)
6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
   that thread-B1 and thread-B2 have the same mm-B)
7. switch_mm() as in this patch exits early because prev == next
8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
   for thread-B2, therefore no call to update_mm(mm-B)

So after point 8, you get thread-B2 running (and possibly returning to
user space) with mm-A. Do you see a problem here?

> But back to the original question: would it cause a problem for arm
> if we add the two additional calls to finish_arch_post_lock_switch()
> to idle_task_exit() and use_mm() ?

There shouldn't be any issue on ARM as we only flag the need for a
switch in switch_mm(). We may be able to remove the irqs_disabled()
check if we are always guaranteed the final call. But I'll follow up on
the first patch; I didn't get to read it in detail.

-- 
Catalin


* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 10:44           ` Catalin Marinas
@ 2013-11-15 11:10             ` Martin Schwidefsky
  2013-11-15 11:17               ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-15 11:10 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Fri, 15 Nov 2013 10:44:37 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Thu, Nov 14, 2013 at 04:33:59PM +0000, Martin Schwidefsky wrote:
> > On Thu, 14 Nov 2013 13:22:23 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > 
> > > On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> > > > On Wed, 13 Nov 2013 16:16:35 +0000
> > > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > 
> > > > > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > > > > index 5d1f950..e91afeb 100644
> > > > > > --- a/arch/s390/include/asm/mmu_context.h
> > > > > > +++ b/arch/s390/include/asm/mmu_context.h
> > > > > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > > > > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > > > >                              struct task_struct *tsk)
> > > > > >  {
> > > > > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > > > > -       update_mm(next, tsk);
> > > > > > +       int cpu = smp_processor_id();
> > > > > > +
> > > > > > +       if (prev == next)
> > > > > > +               return;
> > > > > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > > > > +               /* Delay update_mm until all TLB flushes are done. */
> > > > > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > > > > +       } else {
> > > > > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > > > > +               update_mm(next, tsk);
> > > > > > +               if (next->context.flush_mm)
> > > > > > +                       /* Flush pending TLBs */
> > > > > > +                       __tlb_flush_mm(next);
> > > > > > +       }
> > > > > >         atomic_dec(&prev->context.attach_count);
> > > > > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > > > > -       atomic_inc(&next->context.attach_count);
> > > > > > -       /* Check for TLBs not flushed yet */
> > > > > > -       __tlb_flush_mm_lazy(next);
> > > > > > +}
> > > > > > +
> > > > > > +#define finish_switch_mm finish_switch_mm
> > > > > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > > > > +                                   struct task_struct *tsk)
> > > > > > +{
> > > > > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > > > > +               return;
> > > > > > +
> > > > > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > > > > +               cpu_relax();
> > > > > > +
> > > > > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > > > > +       update_mm(mm, tsk);
> > > > > > +       if (mm->context.flush_mm)
> > > > > > +               __tlb_flush_mm(mm);
> > > > > >  }
> > > > > 
> > > > > Some care is needed here with preemption (we had this on arm and I
> > > > > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > > > > on a thread but you get preempted just before finish_switch_mm(). The
> > > > > new thread has the same mm as the preempted one and switch_mm() exits
> > > > > early without setting another flag. So finish_switch_mm() wouldn't do
> > > > > anything but you still switched to the new mm. The fix is to make the
> > > > > flag per mm rather than thread (see commit bdae73cd374e).
> > > > 
> > > > Interesting. For s390 I need to make sure that each task attaching an
> > > > mm waits for the completion of concurrent TLB flush operations. If the
> > > > scheduler does not switch the mm I don't care, the mm is still attached.
> > > 
> > > I assume the actual hardware mm switch happens via update_mm(). If you
> > > have a context_switch() to a thread which requires an update_mm() but you
> > > defer this until finish_switch_mm(), you may be preempted before the
> > > hardware update. If the new context_switch() schedules a thread with the
> > > same mm as the preempted one, you no longer call update_mm(). So the new
> > > thread actually uses an old hardware mm.
> >  
> > If the code gets preempted between switch_mm() and finish_switch_mm()
> > the worst that can happen is that finish_switch_mm() is called twice.
> 
> Yes, it's called twice, but you only set the TIF_TLB_WAIT the first
> time. Here's the scenario:
> 
> 1. thread-A running with mm-A
> 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
>    update_mm(mm-B). Hardware still using mm-A
> 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
> 5. interrupt and preemption before finish_mm_switch(mm-B)
> 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
>    that thread-B1 and thread-B2 have the same mm-B)
> 7. switch_mm() as in this patch exits early because prev == next
> 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
>    for thread-B2, therefore no call to update_mm(mm-B)
> 
> So after point 8, you get thread-B2 running (and possibly returning to
> user space) with mm-A. Do you see a problem here?

Oh, now I get it. Thanks for the patience, this is indeed a problem.
And I concur, a per-mm flag is the 'obvious' solution.

> > But back to the original question: would it cause a problem for arm
> > if we add the two additional calls to finish_arch_post_lock_switch()
> > to idle_task_exit() and use_mm() ?
> 
> There shouldn't be any issue on ARM as we only flag the need for switch
> in switch_mm(). We may be able to remove the irqs_disabled() check if we
> are always guaranteed the final call. But I'll follow up on the first
> patch, didn't get to read it in detail.
 
Ok, please do so. It would be good if we can use a common hook.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 11:10             ` Martin Schwidefsky
@ 2013-11-15 11:17               ` Martin Schwidefsky
  2013-11-15 11:57                 ` Catalin Marinas
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-15 11:17 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Catalin Marinas, Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Fri, 15 Nov 2013 12:10:00 +0100
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:

> On Fri, 15 Nov 2013 10:44:37 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> > On Thu, Nov 14, 2013 at 04:33:59PM +0000, Martin Schwidefsky wrote:
> > > On Thu, 14 Nov 2013 13:22:23 +0000
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > 
> > > > On Thu, Nov 14, 2013 at 08:10:07AM +0000, Martin Schwidefsky wrote:
> > > > > On Wed, 13 Nov 2013 16:16:35 +0000
> > > > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > > 
> > > > > > On 13 November 2013 08:16, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > > > > > > diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
> > > > > > > index 5d1f950..e91afeb 100644
> > > > > > > --- a/arch/s390/include/asm/mmu_context.h
> > > > > > > +++ b/arch/s390/include/asm/mmu_context.h
> > > > > > > @@ -48,13 +48,38 @@ static inline void update_mm(struct mm_struct *mm, struct task_struct *tsk)
> > > > > > >  static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> > > > > > >                              struct task_struct *tsk)
> > > > > > >  {
> > > > > > > -       cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
> > > > > > > -       update_mm(next, tsk);
> > > > > > > +       int cpu = smp_processor_id();
> > > > > > > +
> > > > > > > +       if (prev == next)
> > > > > > > +               return;
> > > > > > > +       if (atomic_inc_return(&next->context.attach_count) >> 16) {
> > > > > > > +               /* Delay update_mm until all TLB flushes are done. */
> > > > > > > +               set_tsk_thread_flag(tsk, TIF_TLB_WAIT);
> > > > > > > +       } else {
> > > > > > > +               cpumask_set_cpu(cpu, mm_cpumask(next));
> > > > > > > +               update_mm(next, tsk);
> > > > > > > +               if (next->context.flush_mm)
> > > > > > > +                       /* Flush pending TLBs */
> > > > > > > +                       __tlb_flush_mm(next);
> > > > > > > +       }
> > > > > > >         atomic_dec(&prev->context.attach_count);
> > > > > > >         WARN_ON(atomic_read(&prev->context.attach_count) < 0);
> > > > > > > -       atomic_inc(&next->context.attach_count);
> > > > > > > -       /* Check for TLBs not flushed yet */
> > > > > > > -       __tlb_flush_mm_lazy(next);
> > > > > > > +}
> > > > > > > +
> > > > > > > +#define finish_switch_mm finish_switch_mm
> > > > > > > +static inline void finish_switch_mm(struct mm_struct *mm,
> > > > > > > +                                   struct task_struct *tsk)
> > > > > > > +{
> > > > > > > +       if (!test_and_clear_tsk_thread_flag(tsk, TIF_TLB_WAIT))
> > > > > > > +               return;
> > > > > > > +
> > > > > > > +       while (atomic_read(&mm->context.attach_count) >> 16)
> > > > > > > +               cpu_relax();
> > > > > > > +
> > > > > > > +       cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
> > > > > > > +       update_mm(mm, tsk);
> > > > > > > +       if (mm->context.flush_mm)
> > > > > > > +               __tlb_flush_mm(mm);
> > > > > > >  }
> > > > > > 
> > > > > > Some care is needed here with preemption (we had this on arm and I
> > > > > > think we need a fix on arm64 as well). Basically you set TIF_TLB_WAIT
> > > > > > on a thread but you get preempted just before finish_switch_mm(). The
> > > > > > new thread has the same mm as the preempted one and switch_mm() exits
> > > > > > early without setting another flag. So finish_switch_mm() wouldn't do
> > > > > > anything but you still switched to the new mm. The fix is to make the
> > > > > > flag per mm rather than thread (see commit bdae73cd374e).
> > > > > 
> > > > > Interesting. For s390 I need to make sure that each task attaching an
> > > > > mm waits for the completion of concurrent TLB flush operations. If the
> > > > > scheduler does not switch the mm I don't care, the mm is still attached.
> > > > 
> > > > I assume the actual hardware mm switch happens via update_mm(). If you
> > > > have a context_switch() to a thread which requires an update_mm() but you
> > > > defer this until finish_switch_mm(), you may be preempted before the
> > > > hardware update. If the new context_switch() schedules a thread with the
> > > > same mm as the preempted one, you no longer call update_mm(). So the new
> > > > thread actually uses an old hardware mm.
> > >  
> > > If the code gets preempted between switch_mm() and finish_switch_mm()
> > > the worst that can happen is that finish_switch_mm() is called twice.
> > 
> > Yes, it's called twice, but you only set the TIF_TLB_WAIT the first
> > time. Here's the scenario:
> > 
> > 1. thread-A running with mm-A
> > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
> >    update_mm(mm-B). Hardware still using mm-A
> > 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
> > 5. interrupt and preemption before finish_mm_switch(mm-B)
> > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
> >    that thread-B1 and thread-B2 have the same mm-B)
> > 7. switch_mm() as in this patch exits early because prev == next
> > 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
> >    for thread-B2, therefore no call to update_mm(mm-B)
> > 
> > So after point 8, you get thread-B2 running (and possibly returning to
> > user space) with mm-A. Do you see a problem here?
> 
> Oh, now I get it. Thanks for the patience, this is indeed a problem.
> And I concur, a per-mm flag is the 'obvious' solution.

Having said that and looking at the code, I find this to be not as obvious
any more. If you have multiple cpus, using a per-mm flag can get you into
trouble:

1. cpu #1 calls switch_mm and finds that irqs are disabled.
   mm->context.switch_pending is set
2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
   mm->context.switch_pending is set again
3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
4. cpu #1 zeroes mm->switch_pending and calls cpu_switch_mm
5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
6. cpu #2 continues with the old mm

This is a race, no?
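
As a minimal user-space sketch of that interleaving (the struct, the
sketch_* helpers and hw_mm[] are made-up stand-ins, not the arm code),
cpu #2 ends up keeping the old mm:

#include <stdio.h>
#include <stdbool.h>

struct mm { bool switch_pending; };

static struct mm the_mm;
static const char *hw_mm[2] = { "old-mm", "old-mm" };	/* per-cpu hardware mm */

static void sketch_switch_mm(struct mm *mm)
{
	mm->switch_pending = true;		/* irqs disabled: defer the switch */
}

static void sketch_finish(struct mm *mm, int cpu)
{
	if (mm->switch_pending) {		/* read ... */
		mm->switch_pending = false;	/* ... then clear */
		hw_mm[cpu] = "new-mm";		/* cpu_switch_mm() stand-in */
	}
}

int main(void)
{
	sketch_switch_mm(&the_mm);	/* step 1: cpu #1 */
	sketch_switch_mm(&the_mm);	/* step 2: cpu #2 */
	sketch_finish(&the_mm, 0);	/* steps 3-4: cpu #1 clears the flag */
	sketch_finish(&the_mm, 1);	/* step 5: cpu #2 finds it already clear */

	printf("cpu #1: %s, cpu #2: %s\n", hw_mm[0], hw_mm[1]);	/* cpu #2: old-mm */
	return 0;
}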

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 11:17               ` Martin Schwidefsky
@ 2013-11-15 11:57                 ` Catalin Marinas
  2013-11-15 13:29                   ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-15 11:57 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Fri, Nov 15, 2013 at 11:17:36AM +0000, Martin Schwidefsky wrote:
> On Fri, 15 Nov 2013 12:10:00 +0100
> Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> 
> > On Fri, 15 Nov 2013 10:44:37 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > 1. thread-A running with mm-A
> > > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> > > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
> > >    update_mm(mm-B). Hardware still using mm-A
> > > 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
> > > 5. interrupt and preemption before finish_mm_switch(mm-B)
> > > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
> > >    that thread-B1 and thread-B2 have the same mm-B)
> > > 7. switch_mm() as in this patch exits early because prev == next
> > > 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
> > >    for thread-B2, therefore no call to update_mm(mm-B)
> > > 
> > > So after point 8, you get thread-B2 running (and possibly returning to
> > > user space) with mm-A. Do you see a problem here?
> > 
> > Oh, now I get it. Thanks for the patience, this is indeed a problem.
> > And I concur, a per-mm flag is the 'obvious' solution.
> 
> Having said that and looking at the code I find this to be not as obvious
> any more. If you have multiple cpus using a per-mm flag can get you into
> trouble:
> 
> 1. cpu #1 calls switch_mm and finds that irqs are disabled.
>    mm->context.switch_pending is set
> 2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
>    mm->context.switch_pending is set again
> 3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
> 4. cpu #1 zeroes mm->switch_pending and calls cpu_switch_mm
> 5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
> 6. cpu #2 continues with the old mm
> 
> This is a race, no?

Yes, but we only use this on ARMv5 and earlier and there is no SMP
support.

On arm64, however, I need to fix that, and you made a good point. In my
(not yet public) patch, the switch_pending flag is cleared after all the
IPIs have been acknowledged, but it needs some more thinking. A solution
could be to always do the cpu_switch_mm() in finish_mm_switch() without
any checks, but this requires that any switch_mm() call from the kernel
be paired with finish_mm_switch(). So your first patch comes in handy
(but I still need to figure out a quick arm64 fix for cc stable).
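
Roughly something like this, as a sketch only; the sketch_* names and
pending_mm are made up, and this is not the arm64 implementation:

#include <stdio.h>

struct mm_struct { const char *name; };

static struct mm_struct *pending_mm;	/* would be a per-cpu variable */
static struct mm_struct *hardware_mm;

static void sketch_switch_mm(struct mm_struct *prev, struct mm_struct *next)
{
	(void)prev;
	pending_mm = next;	/* just record it: no flag, no prev == next exit */
}

static void sketch_finish_mm_switch(void)
{
	/* unconditional hardware switch; only safe if every switch_mm()
	 * call site is paired with a finish_mm_switch() call */
	hardware_mm = pending_mm;
}

int main(void)
{
	struct mm_struct mm_b = { "mm-B" };

	sketch_switch_mm(NULL, &mm_b);
	sketch_finish_mm_switch();
	printf("hardware mm: %s\n", hardware_mm->name);
	return 0;
}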

-- 
Catalin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 11:57                 ` Catalin Marinas
@ 2013-11-15 13:29                   ` Martin Schwidefsky
  2013-11-15 13:46                     ` Catalin Marinas
  0 siblings, 1 reply; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-15 13:29 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Fri, 15 Nov 2013 11:57:01 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On Fri, Nov 15, 2013 at 11:17:36AM +0000, Martin Schwidefsky wrote:
> > On Fri, 15 Nov 2013 12:10:00 +0100
> > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > 
> > > On Fri, 15 Nov 2013 10:44:37 +0000
> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > > 1. thread-A running with mm-A
> > > > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> > > > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
> > > >    update_mm(mm-B). Hardware still using mm-A
> > > > 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
> > > > 5. interrupt and preemption before finish_mm_switch(mm-B)
> > > > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
> > > >    that thread-B1 and thread-B2 have the same mm-B)
> > > > 7. switch_mm() as in this patch exits early because prev == next
> > > > 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
> > > >    for thread-B2, therefore no call to update_mm(mm-B)
> > > > 
> > > > So after point 8, you get thread-B2 running (and possibly returning to
> > > > user space) with mm-A. Do you see a problem here?
> > > 
> > > Oh, now I get it. Thanks for the patience, this is indeed a problem.
> > > And I concur, a per-mm flag is the 'obvious' solution.
> > 
> > Having said that and looking at the code I find this to be not as obvious
> > any more. If you have multiple cpus using a per-mm flag can get you into
> > trouble:
> > 
> > 1. cpu #1 calls switch_mm and finds that irqs are disabled.
> >    mm->context.switch_pending is set
> > 2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
> >    mm->context.switch_pending is set again
> > 3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
> > 4. cpu #1 zeroes mm->switch_pending and calls cpu_switch_mm
> > 5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
> > 6. cpu #2 continues with the old mm
> > 
> > This is a race, no?
> 
> Yes, but we only use this on ARMv5 and earlier and there is no SMP
> support.
> 
> On arm64 however, I need to fix that and you made a good point. In my
> (not yet public) patch, the switch_pending is cleared after all the
> IPIs have been acknowledged but it needs some more thinking. A solution
> could be to always do the cpu_switch_mm() in finish_mm_switch() without
> any checks but this requires that any switch_mm() call from the kernel
> needs to be paired with finish_mm_switch(). So your first patch comes in
> handy (but I still need to figure out a quick arm64 fix for cc stable).
 
I am currently thinking about the following solution for s390: keep the
TIF_TLB_FLUSH bit per task, but do a preempt_disable() in switch_mm()
if the switch is incomplete. This pairs with a preempt_enable() in
finish_switch_mm() after update_mm() has been done.
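
As a rough user-space sketch of that idea (the preempt calls and
update_mm() are empty stand-ins and the names are illustrative only,
this is not the actual s390 patch):

#include <stdbool.h>

static void preempt_disable(void) { }		/* stand-in */
static void preempt_enable(void) { }		/* stand-in */

struct mm_struct { bool flush_pending; };
struct task_struct { bool tif_tlb_flush; };

static void update_mm(struct mm_struct *mm, struct task_struct *tsk)
{
	(void)mm; (void)tsk;	/* hardware mm switch would happen here */
}

static void sketch_switch_mm(struct mm_struct *next, struct task_struct *tsk)
{
	if (next->flush_pending) {
		tsk->tif_tlb_flush = true;	/* switch is incomplete ... */
		preempt_disable();		/* ... so block preemption until
						 * finish_switch_mm() completes it */
	} else {
		update_mm(next, tsk);
	}
}

static void sketch_finish_switch_mm(struct mm_struct *mm, struct task_struct *tsk)
{
	if (!tsk->tif_tlb_flush)
		return;
	tsk->tif_tlb_flush = false;
	update_mm(mm, tsk);
	preempt_enable();			/* pairs with switch_mm() above */
}

int main(void)
{
	struct mm_struct mm = { true };
	struct task_struct tsk = { false };

	sketch_switch_mm(&mm, &tsk);
	sketch_finish_switch_mm(&mm, &tsk);
	return 0;
}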

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 13:29                   ` Martin Schwidefsky
@ 2013-11-15 13:46                     ` Catalin Marinas
  2013-11-18  8:11                       ` Martin Schwidefsky
  0 siblings, 1 reply; 21+ messages in thread
From: Catalin Marinas @ 2013-11-15 13:46 UTC (permalink / raw)
  To: Martin Schwidefsky; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On 15 November 2013 13:29, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> On Fri, 15 Nov 2013 11:57:01 +0000
> Catalin Marinas <catalin.marinas@arm.com> wrote:
>
>> On Fri, Nov 15, 2013 at 11:17:36AM +0000, Martin Schwidefsky wrote:
>> > On Fri, 15 Nov 2013 12:10:00 +0100
>> > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>> >
>> > > On Fri, 15 Nov 2013 10:44:37 +0000
>> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
>> > > > 1. thread-A running with mm-A
>> > > > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
>> > > > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
>> > > >    update_mm(mm-B). Hardware still using mm-A
>> > > > 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
>> > > > 5. interrupt and preemption before finish_mm_switch(mm-B)
>> > > > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
>> > > >    that thread-B1 and thread-B2 have the same mm-B)
>> > > > 7. switch_mm() as in this patch exits early because prev == next
>> > > > 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
>> > > >    for thread-B2, therefore no call to update_mm(mm-B)
>> > > >
>> > > > So after point 8, you get thread-B2 running (and possibly returning to
>> > > > user space) with mm-A. Do you see a problem here?
>> > >
>> > > Oh, now I get it. Thanks for the patience, this is indeed a problem.
>> > > And I concur, a per-mm flag is the 'obvious' solution.
>> >
>> > Having said that and looking at the code I find this to be not as obvious
>> > any more. If you have multiple cpus using a per-mm flag can get you into
>> > trouble:
>> >
>> > 1. cpu #1 calls switch_mm and finds that irqs are disabled.
>> >    mm->context.switch_pending is set
>> > 2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
>> >    mm->context.switch_pending is set again
>> > 3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
>> > 4. cpu #1 zeroes mm->switch_pending and calls cpu_switch_mm
>> > 5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
>> > 6. cpu #2 continues with the old mm
>> >
>> > This is a race, no?
>>
>> Yes, but we only use this on ARMv5 and earlier and there is no SMP
>> support.
>>
>> On arm64 however, I need to fix that and you made a good point. In my
>> (not yet public) patch, the switch_pending is cleared after all the
>> IPIs have been acknowledged but it needs some more thinking. A solution
>> could be to always do the cpu_switch_mm() in finish_mm_switch() without
>> any checks but this requires that any switch_mm() call from the kernel
>> needs to be paired with finish_mm_switch(). So your first patch comes in
>> handy (but I still need to figure out a quick arm64 fix for cc stable).
>
> I am currently thinking about the following solution for s390: keep the
> TIF_TLB_FLUSH bit per task but do a preempt_disable() in switch_mm()
> if the switch is incomplete. This pairs with a preempt_enable() in
> finish_switch_mm() after the update_mm has been done.

That's the first thing I tried when I noticed the problem, but I got
weird kernel warnings with the preempt_disable()/preempt_enable() pair
spanning the scheduler unlocking. So it doesn't seem safe.

It may work if, instead of a simple flag, you use atomic_inc/dec for the mm flag.
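
One possible reading of that, sketched with C11 atomics standing in for
atomic_t; the names are made up and it glosses over which CPUs actually
deferred, but it shows why the interleaving quoted above no longer loses
cpu #2's switch:

#include <stdio.h>
#include <stdatomic.h>

struct mm { atomic_int switch_pending; };

static struct mm the_mm;			/* zero-initialized counter */
static const char *hw_mm[2] = { "old-mm", "old-mm" };

static void sketch_switch_mm(struct mm *mm)
{
	atomic_fetch_add(&mm->switch_pending, 1);	/* defer, and count it */
}

static void sketch_finish(struct mm *mm, int cpu)
{
	if (atomic_load(&mm->switch_pending)) {
		hw_mm[cpu] = "new-mm";			/* cpu_switch_mm() stand-in */
		atomic_fetch_sub(&mm->switch_pending, 1);
	}
}

int main(void)
{
	sketch_switch_mm(&the_mm);	/* cpu #1 defers */
	sketch_switch_mm(&the_mm);	/* cpu #2 defers */
	sketch_finish(&the_mm, 0);	/* counter is 2: cpu #1 switches, now 1 */
	sketch_finish(&the_mm, 1);	/* counter is 1: cpu #2 switches as well */

	printf("cpu #1: %s, cpu #2: %s\n", hw_mm[0], hw_mm[1]);
	return 0;
}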

-- 
Catalin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries
  2013-11-15 13:46                     ` Catalin Marinas
@ 2013-11-18  8:11                       ` Martin Schwidefsky
  0 siblings, 0 replies; 21+ messages in thread
From: Martin Schwidefsky @ 2013-11-18  8:11 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Ingo Molnar, Peter Zijlstra, Linux Kernel Mailing List

On Fri, 15 Nov 2013 13:46:07 +0000
Catalin Marinas <catalin.marinas@arm.com> wrote:

> On 15 November 2013 13:29, Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> > On Fri, 15 Nov 2013 11:57:01 +0000
> > Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> >> On Fri, Nov 15, 2013 at 11:17:36AM +0000, Martin Schwidefsky wrote:
> >> > On Fri, 15 Nov 2013 12:10:00 +0100
> >> > Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> >> >
> >> > > On Fri, 15 Nov 2013 10:44:37 +0000
> >> > > Catalin Marinas <catalin.marinas@arm.com> wrote:
> >> > > > 1. thread-A running with mm-A
> >> > > > 2. context_switch() to thread-B1 causing a switch_mm(mm-B)
> >> > > > 3. switch_mm(mm-B) sets thread-B1's TIF_TLB_WAIT but does _not_ call
> >> > > >    update_mm(mm-B). Hardware still using mm-A
> >> > > > 4. scheduler unlocks and is about to call finish_mm_switch(mm-B)
> >> > > > 5. interrupt and preemption before finish_mm_switch(mm-B)
> >> > > > 6. context_switch() to thread-B2 causing a switch_mm(mm-B) (note here
> >> > > >    that thread-B1 and thread-B2 have the same mm-B)
> >> > > > 7. switch_mm() as in this patch exits early because prev == next
> >> > > > 8. finish_mm_switch(mm-B) is indeed called but TIF_TLB_WAIT is not set
> >> > > >    for thread-B2, therefore no call to update_mm(mm-B)
> >> > > >
> >> > > > So after point 8, you get thread-B2 running (and possibly returning to
> >> > > > user space) with mm-A. Do you see a problem here?
> >> > >
> >> > > Oh, now I get it. Thanks for the patience, this is indeed a problem.
> >> > > And I concur, a per-mm flag is the 'obvious' solution.
> >> >
> >> > Having said that and looking at the code I find this to be not as obvious
> >> > any more. If you have multiple cpus using a per-mm flag can get you into
> >> > trouble:
> >> >
> >> > 1. cpu #1 calls switch_mm and finds that irqs are disabled.
> >> >    mm->context.switch_pending is set
> >> > 2. cpu #2 calls switch_mm for the same mm and finds that irqs are disabled.
> >> >    mm->context.switch_pending is set again
> >> > 3. cpu #1 reaches finish_arch_post_lock_switch and finds switch_pending == 1
> >> > 4. cpu #1 zeroes mm->switch_pending and calls cpu_switch_mm
> >> > 5. cpu #2 reaches finish_arch_post_lock_switch and finds switch_pending == 0
> >> > 6. cpu #2 continues with the old mm
> >> >
> >> > This is a race, no?
> >>
> >> Yes, but we only use this on ARMv5 and earlier and there is no SMP
> >> support.
> >>
> >> On arm64 however, I need to fix that and you made a good point. In my
> >> (not yet public) patch, the switch_pending is cleared after all the
> >> IPIs have been acknowledged but it needs some more thinking. A solution
> >> could be to always do the cpu_switch_mm() in finish_mm_switch() without
> >> any checks but this requires that any switch_mm() call from the kernel
> >> needs to be paired with finish_mm_switch(). So your first patch comes in
> >> handy (but I still need to figure out a quick arm64 fix for cc stable).
> >
> > I am currently thinking about the following solution for s390: keep the
> > TIF_TLB_FLUSH bit per task but do a preempt_disable() in switch_mm()
> > if the switch is incomplete. This pairs with a preempt_enable() in
> > finish_switch_mm() after the update_mm has been done.
> 
> That's the first thing I tried when I noticed the problem but I got
> weird kernel warnings with preempt_enable/disabling spanning across
> the scheduler unlocking. So doesn't seem safe.
> 
> It may work if instead a simple flag you use atomic_inc/dec for the mm flag.

I have not seen the kernel warnings because the detour over finish_switch_mm
is used only rarely. After forcing the detour I got the fallout: doing the
preempt_disable() in switch_mm() and preempt_enable() in finish_switch_mm()
does not work. But what does work is to copy the TIF_TLB_FLUSH bit in the
__switch_to function, just like TIF_MCCK_PENDING. That way the TIF_TLB_FLUSH
bit cannot get "hidden" by a preemptive schedule.
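
The idea as a rough user-space sketch (made-up helper and field names,
not the real __switch_to):

#include <stdio.h>
#include <stdbool.h>

struct task_struct { bool tif_tlb_flush; };

/* on a task switch, hand a still-pending mm switch on to the incoming task */
static void sketch_switch_to(struct task_struct *prev, struct task_struct *next)
{
	if (prev->tif_tlb_flush) {
		prev->tif_tlb_flush = false;
		next->tif_tlb_flush = true;	/* as done for TIF_MCCK_PENDING */
	}
}

int main(void)
{
	struct task_struct b1 = { true };	/* deferred switch still pending */
	struct task_struct b2 = { false };	/* preempting task, same mm */

	sketch_switch_to(&b1, &b2);		/* preemption in the problem window */
	printf("pending on B2: %d\n", b2.tif_tlb_flush);	/* prints 1 */
	return 0;
}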

The patch to use finish_arch_post_lock_switch instead of finish_switch_mm
would look like this:
--
>From c4d83a9ff8c6d5ca821bab24b5bd77b782a69819 Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <schwidefsky@de.ibm.com>
Date: Fri, 26 Oct 2012 17:17:44 +0200
Subject: [PATCH] sched/mm: call finish_arch_post_lock_switch in idle_task_exit
 and use_mm

The finish_arch_post_lock_switch function is called at the end of the
task switch after all locks have been released. In concept it is
paired with the switch_mm function, but the current code only does
the call in finish_task_switch. Add the call to idle_task_exit and
use_mm. One use case for the additional calls is s390, which will
use finish_arch_post_lock_switch to wait for the completion of
TLB flush operations.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
---
 include/linux/mmu_context.h | 6 ++++++
 kernel/sched/core.c         | 6 ++++--
 kernel/sched/sched.h        | 3 ---
 mm/mmu_context.c            | 3 +--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h
index 70fffeb..38f5550 100644
--- a/include/linux/mmu_context.h
+++ b/include/linux/mmu_context.h
@@ -1,9 +1,15 @@
 #ifndef _LINUX_MMU_CONTEXT_H
 #define _LINUX_MMU_CONTEXT_H
 
+#include <asm/mmu_context.h>
+
 struct mm_struct;
 
 void use_mm(struct mm_struct *mm);
 void unuse_mm(struct mm_struct *mm);
 
+#ifndef finish_arch_post_lock_switch
+# define finish_arch_post_lock_switch()	do { } while (0)
+#endif
+
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c180860..ffa234c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -32,7 +32,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
-#include <asm/mmu_context.h>
+#include <linux/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/capability.h>
 #include <linux/completion.h>
@@ -4154,8 +4154,10 @@ void idle_task_exit(void)
 
 	BUG_ON(cpu_online(smp_processor_id()));
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
 		switch_mm(mm, &init_mm, current);
+		finish_arch_post_lock_switch();
+	}
 	mmdrop(mm);
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 88c85b2..ad48db3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -850,9 +850,6 @@ static inline int task_running(struct rq *rq, struct task_struct *p)
 #ifndef finish_arch_switch
 # define finish_arch_switch(prev)	do { } while (0)
 #endif
-#ifndef finish_arch_post_lock_switch
-# define finish_arch_post_lock_switch()	do { } while (0)
-#endif
 
 #ifndef __ARCH_WANT_UNLOCKED_CTXSW
 static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 8a8cd02..56ecbbd 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -8,8 +8,6 @@
 #include <linux/export.h>
 #include <linux/sched.h>
 
-#include <asm/mmu_context.h>
-
 /*
  * use_mm
  *	Makes the calling kernel thread take on the specified
@@ -31,6 +29,7 @@ void use_mm(struct mm_struct *mm)
 	tsk->mm = mm;
 	switch_mm(active_mm, mm, tsk);
 	task_unlock(tsk);
+	finish_arch_post_lock_switch();
 
 	if (active_mm != mm)
 		mmdrop(active_mm);
-- 
1.8.3.4

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-11-18  8:11 UTC | newest]

Thread overview: 21+ messages
2013-11-13  8:16 [PATCH 0/2] sched: finish_switch_mm hook Martin Schwidefsky
2013-11-13  8:16 ` [PATCH 1/2] sched/mm: add finish_switch_mm function Martin Schwidefsky
2013-11-13 11:41   ` Peter Zijlstra
2013-11-13 11:49     ` Martin Schwidefsky
2013-11-13 12:19     ` Catalin Marinas
2013-11-13 16:05       ` Martin Schwidefsky
2013-11-13 17:03         ` Catalin Marinas
2013-11-14  8:00           ` Martin Schwidefsky
2013-11-13  8:16 ` [PATCH 2/2] s390/mm,tlb: race of lazy TLB flush vs. recreation of TLB entries Martin Schwidefsky
2013-11-13 16:16   ` Catalin Marinas
2013-11-14  8:10     ` Martin Schwidefsky
2013-11-14 13:22       ` Catalin Marinas
2013-11-14 16:33         ` Martin Schwidefsky
2013-11-15 10:44           ` Catalin Marinas
2013-11-15 11:10             ` Martin Schwidefsky
2013-11-15 11:17               ` Martin Schwidefsky
2013-11-15 11:57                 ` Catalin Marinas
2013-11-15 13:29                   ` Martin Schwidefsky
2013-11-15 13:46                     ` Catalin Marinas
2013-11-18  8:11                       ` Martin Schwidefsky
2013-11-15  9:13       ` Martin Schwidefsky
