[RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
@ 2020-02-03 20:17 Andrea Arcangeli
  2020-02-03 20:17 ` [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize " Andrea Arcangeli
  2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
  0 siblings, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-03 20:17 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini, Mark Salter
  Cc: linux-kernel, linux-mm, linux-arm-kernel

Hello everyone,

I've been testing the arm64 ARMv8 tlbi broadcast instruction and it
seems it doesn't scale in SMP, which opens the door for using similar
tricks to what alpha, ia64, mips, powerpc, sh, sparc are doing to
optimize TLB flushes for single threaded processes. This should be
even more beneficial in NUMA or multi socket systems where the "ASID"
and "vaddr" information has to cross a longer distance before the tlbi
broadcast instruction can be retired.

This mm_users logic seems non standard across different arches: every
arch does it in its own way. Not only the implementation is different,
but different arches are also trying to optimize different cases
through the mm_users checks in the arch code:

1) avoiding remote TLB flushes with mm_users <= 1

2) avoiding even local TLB flushes during exit_mmap->unmap_vmas when
   mm_users == 0

For now I only tried to implement 1) on arm64, but I'm left wondering
which other arches can achieve 2) and in turn which kernel code could
write to the very userland virtual addresses being unmapped during
exit_mmap, that would strictly require their flushing using the tlb
gather mechanism.

This is just a RFC to know if this is would be a viable optimization.
A basic microbenchmark is included in the commit header of the patch.

Thanks,
Andrea

Andrea Arcangeli (2):
  mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  arm64: tlb: skip tlbi broadcast for single threaded TLB flushes

 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  3 +-
 arch/arm64/include/asm/mmu_context.h | 10 +--
 arch/arm64/include/asm/tlbflush.h    | 91 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 13 +++-
 mm/mmu_context.c                     |  2 +
 6 files changed, 111 insertions(+), 10 deletions(-)

Some examples of the scattered status of 2) follows:

ia64:

==
flush_tlb_mm (struct mm_struct *mm)
{
	if (!mm)
		return;

	set_bit(mm->context, ia64_ctx.flushmap);
	mm->context = 0;

	if (atomic_read(&mm->mm_users) == 0)
		return;		/* happens as a result of exit_mmap() */
[..]
==

sparc:

==
void flush_tlb_all(void)
{
	/*
	 * Don't bother flushing if this address space is about to be
	 * destroyed.
	 */
	if (atomic_read(&current->mm->mm_users) == 0)
		return;
[..]
static void fix_range(struct mm_struct *mm, unsigned long start_addr,
		      unsigned long end_addr, int force)
{
	/*
	 * Don't bother flushing if this address space is about to be
	 * destroyed.
	 */
	if (atomic_read(&mm->mm_users) == 0)
		return;
[..]
==

arc:

==
noinline void local_flush_tlb_mm(struct mm_struct *mm)
{
	/*
	 * Small optimisation courtesy IA64
	 * flush_mm called during fork,exit,munmap etc, multiple times as well.
	 * Only for fork( ) do we need to move parent to a new MMU ctxt,
	 * all other cases are NOPs, hence this check.
	 */
	if (atomic_read(&mm->mm_users) == 0)
		return;
[..]
==

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-03 20:17 [RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes Andrea Arcangeli
@ 2020-02-03 20:17 ` Andrea Arcangeli
  2020-02-18 11:31   ` Michal Hocko
  2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
  1 sibling, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-03 20:17 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini, Mark Salter
  Cc: linux-kernel, linux-mm, linux-arm-kernel

alpha, ia64, mips, powerpc, sh, sparc are relying on a check on
mm->mm_users to know if they can skip some remote TLB flushes for
single threaded processes.

Most callers of use_mm() tend to invoke mmget_not_zero() or
get_task_mm() before use_mm() to ensure the mm will remain alive in
between use_mm() and unuse_mm().

Some callers however don't increase mm_users and they instead rely on
serialization in __mmput() to ensure the mm will remain alive in
between use_mm() and unuse_mm(). Not increasing mm_users during
use_mm() is however unsafe for aforementioned arch TLB flushes
optimizations. So either mmget()/mmput() should be added to the
problematic callers of use_mm()/unuse_mm() or we can embed them in
use_mm()/unuse_mm() which is more robust.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/mmu_context.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3e612ae748e9..ced0e1218c0f 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -30,6 +30,7 @@ void use_mm(struct mm_struct *mm)
 		mmgrab(mm);
 		tsk->active_mm = mm;
 	}
+	mmget(mm);
 	tsk->mm = mm;
 	switch_mm(active_mm, mm, tsk);
 	task_unlock(tsk);
@@ -57,6 +58,7 @@ void unuse_mm(struct mm_struct *mm)
 	task_lock(tsk);
 	sync_mm_rss(mm);
 	tsk->mm = NULL;
+	mmput(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-03 20:17 ` [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize " Andrea Arcangeli
@ 2020-02-18 11:31   ` Michal Hocko
  2020-02-18 18:56     ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-02-18 11:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini,
	Mark Salter, linux-mm, linux-kernel, linux-arm-kernel

On Mon 03-02-20 15:17:44, Andrea Arcangeli wrote:
> alpha, ia64, mips, powerpc, sh, sparc are relying on a check on
> mm->mm_users to know if they can skip some remote TLB flushes for
> single threaded processes.
> 
> Most callers of use_mm() tend to invoke mmget_not_zero() or
> get_task_mm() before use_mm() to ensure the mm will remain alive in
> between use_mm() and unuse_mm().
> 
> Some callers however don't increase mm_users and they instead rely on
> serialization in __mmput() to ensure the mm will remain alive in
> between use_mm() and unuse_mm(). Not increasing mm_users during
> use_mm() is however unsafe for aforementioned arch TLB flushes
> optimizations. So either mmget()/mmput() should be added to the
> problematic callers of use_mm()/unuse_mm() or we can embed them in
> use_mm()/unuse_mm() which is more robust.

I would prefer we do not do that because then the real owner of the mm
cannot really tear down the address space and the life time of it is
bound to a kernel thread doing the use_mm. This is undesirable I would
really prefer if the existing few users would use mmget only when they
really need to access mm.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/mmu_context.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 3e612ae748e9..ced0e1218c0f 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -30,6 +30,7 @@ void use_mm(struct mm_struct *mm)
>  		mmgrab(mm);
>  		tsk->active_mm = mm;
>  	}
> +	mmget(mm);
>  	tsk->mm = mm;
>  	switch_mm(active_mm, mm, tsk);
>  	task_unlock(tsk);
> @@ -57,6 +58,7 @@ void unuse_mm(struct mm_struct *mm)
>  	task_lock(tsk);
>  	sync_mm_rss(mm);
>  	tsk->mm = NULL;
> +	mmput(mm);
>  	/* active_mm is still 'mm' */
>  	enter_lazy_tlb(mm, tsk);
>  	task_unlock(tsk);
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-18 11:31   ` Michal Hocko
@ 2020-02-18 18:56     ` Andrea Arcangeli
  2020-02-19 11:58       ` Michal Hocko
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-18 18:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini,
	Mark Salter, linux-mm, linux-kernel, linux-arm-kernel

Hi Michal!

On Tue, Feb 18, 2020 at 12:31:03PM +0100, Michal Hocko wrote:
> On Mon 03-02-20 15:17:44, Andrea Arcangeli wrote:
> > alpha, ia64, mips, powerpc, sh, sparc are relying on a check on
> > mm->mm_users to know if they can skip some remote TLB flushes for
> > single threaded processes.
> > 
> > Most callers of use_mm() tend to invoke mmget_not_zero() or
> > get_task_mm() before use_mm() to ensure the mm will remain alive in
> > between use_mm() and unuse_mm().
> > 
> > Some callers however don't increase mm_users and they instead rely on
> > serialization in __mmput() to ensure the mm will remain alive in
> > between use_mm() and unuse_mm(). Not increasing mm_users during
> > use_mm() is however unsafe for aforementioned arch TLB flushes
> > optimizations. So either mmget()/mmput() should be added to the
> > problematic callers of use_mm()/unuse_mm() or we can embed them in
> > use_mm()/unuse_mm() which is more robust.
> 
> I would prefer we do not do that because then the real owner of the mm
> cannot really tear down the address space and the life time of it is
> bound to a kernel thread doing the use_mm. This is undesirable I would
> really prefer if the existing few users would use mmget only when they
> really need to access mm.

If the existing few users that don't already do the explicit mmget
will have to start doing it too, the end result will be exactly the
same that you described in your "cons" (lieftime of the mm will still
be up to who did mmget;use_mm and didn't call unuse_mm;mmput yet).

One reason to prefer adding the mmget to the callers to forget it,
would be to avoid an atomic op in use_mm (for those callers that
didn't forget it), but if anybody is doing use_mm in a fast path that
won't be very fast anyway so I didn't think this was worth the
risk. If that microoptimization in a slow path is the reason we should
add mmget to the callers that forgot it that would be fine with me
although I think it's risky because if already happened once and it
could happen again (and when it happens it only bits a few arches if
used with a few drivers).

On a side note the patch 2/2 should be dropped for now, I'm looking if
we can optimize away TLB-i broadcasts from multithreaded apps too.

Thanks,
Andrea

> 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> >  mm/mmu_context.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> > index 3e612ae748e9..ced0e1218c0f 100644
> > --- a/mm/mmu_context.c
> > +++ b/mm/mmu_context.c
> > @@ -30,6 +30,7 @@ void use_mm(struct mm_struct *mm)
> >  		mmgrab(mm);
> >  		tsk->active_mm = mm;
> >  	}
> > +	mmget(mm);
> >  	tsk->mm = mm;
> >  	switch_mm(active_mm, mm, tsk);
> >  	task_unlock(tsk);
> > @@ -57,6 +58,7 @@ void unuse_mm(struct mm_struct *mm)
> >  	task_lock(tsk);
> >  	sync_mm_rss(mm);
> >  	tsk->mm = NULL;
> > +	mmput(mm);
> >  	/* active_mm is still 'mm' */
> >  	enter_lazy_tlb(mm, tsk);
> >  	task_unlock(tsk);
> > 
> > 
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-18 18:56     ` Andrea Arcangeli
@ 2020-02-19 11:58       ` Michal Hocko
  2020-02-19 20:04         ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Michal Hocko @ 2020-02-19 11:58 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini,
	Mark Salter, linux-mm, linux-kernel, linux-arm-kernel

On Tue 18-02-20 13:56:18, Andrea Arcangeli wrote:
> Hi Michal!
> 
> On Tue, Feb 18, 2020 at 12:31:03PM +0100, Michal Hocko wrote:
> > On Mon 03-02-20 15:17:44, Andrea Arcangeli wrote:
> > > alpha, ia64, mips, powerpc, sh, sparc are relying on a check on
> > > mm->mm_users to know if they can skip some remote TLB flushes for
> > > single threaded processes.
> > > 
> > > Most callers of use_mm() tend to invoke mmget_not_zero() or
> > > get_task_mm() before use_mm() to ensure the mm will remain alive in
> > > between use_mm() and unuse_mm().
> > > 
> > > Some callers however don't increase mm_users and they instead rely on
> > > serialization in __mmput() to ensure the mm will remain alive in
> > > between use_mm() and unuse_mm(). Not increasing mm_users during
> > > use_mm() is however unsafe for aforementioned arch TLB flushes
> > > optimizations. So either mmget()/mmput() should be added to the
> > > problematic callers of use_mm()/unuse_mm() or we can embed them in
> > > use_mm()/unuse_mm() which is more robust.
> > 
> > I would prefer we do not do that because then the real owner of the mm
> > cannot really tear down the address space and the life time of it is
> > bound to a kernel thread doing the use_mm. This is undesirable I would
> > really prefer if the existing few users would use mmget only when they
> > really need to access mm.
> 
> If the existing few users that don't already do the explicit mmget
> will have to start doing it too, the end result will be exactly the
> same that you described in your "cons" (lieftime of the mm will still
> be up to who did mmget;use_mm and didn't call unuse_mm;mmput yet).

Well, they should use mmget only for moments when they access the
address space.

> One reason to prefer adding the mmget to the callers to forget it,
> would be to avoid an atomic op in use_mm (for those callers that
> didn't forget it), but if anybody is doing use_mm in a fast path that
> won't be very fast anyway so I didn't think this was worth the
> risk. If that microoptimization in a slow path is the reason we should
> add mmget to the callers that forgot it that would be fine with me
> although I think it's risky because if already happened once and it
> could happen again (and when it happens it only bits a few arches if
> used with a few drivers).

The primary concern really is that use_mm is usually not bound in time
and we do not want to pin the address space for such a long time without
any binding to a killable process.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize TLB flushes
  2020-02-19 11:58       ` Michal Hocko
@ 2020-02-19 20:04         ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-19 20:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini,
	Mark Salter, linux-mm, linux-kernel, linux-arm-kernel

On Wed, Feb 19, 2020 at 12:58:55PM +0100, Michal Hocko wrote:
> On Tue 18-02-20 13:56:18, Andrea Arcangeli wrote:
> > Hi Michal!
> > 
> > On Tue, Feb 18, 2020 at 12:31:03PM +0100, Michal Hocko wrote:
> > > On Mon 03-02-20 15:17:44, Andrea Arcangeli wrote:
> > > > alpha, ia64, mips, powerpc, sh, sparc are relying on a check on
> > > > mm->mm_users to know if they can skip some remote TLB flushes for
> > > > single threaded processes.
> > > > 
> > > > Most callers of use_mm() tend to invoke mmget_not_zero() or
> > > > get_task_mm() before use_mm() to ensure the mm will remain alive in
> > > > between use_mm() and unuse_mm().
> > > > 
> > > > Some callers however don't increase mm_users and they instead rely on
> > > > serialization in __mmput() to ensure the mm will remain alive in
> > > > between use_mm() and unuse_mm(). Not increasing mm_users during
> > > > use_mm() is however unsafe for aforementioned arch TLB flushes
> > > > optimizations. So either mmget()/mmput() should be added to the
> > > > problematic callers of use_mm()/unuse_mm() or we can embed them in
> > > > use_mm()/unuse_mm() which is more robust.
> > > 
> > > I would prefer we do not do that because then the real owner of the mm
> > > cannot really tear down the address space and the life time of it is
> > > bound to a kernel thread doing the use_mm. This is undesirable I would
> > > really prefer if the existing few users would use mmget only when they
> > > really need to access mm.
> > 
> > If the existing few users that don't already do the explicit mmget
> > will have to start doing it too, the end result will be exactly the
> > same that you described in your "cons" (lieftime of the mm will still
> > be up to who did mmget;use_mm and didn't call unuse_mm;mmput yet).
> 
> Well, they should use mmget only for moments when they access the
> address space.

I don't think so because mmget is a pure atomic_inc. How can you serialize
that against a tlb flush that just does an atomic_read() on mm_users?

At the very least you will have to invent something new slower than
mmget that adds some serialization against the TLB flush code and the
common code would need to learn to use that. And the question is how
much faster that can be than switch_mm() that use_mm already invokes.

It doesn't need to block but it needs a smp_mb() on both sides.

i.e. after atomic_inc(&mm_users) you need a smp_mb() and a
test_and_clear_bit and conditional local tlb flush. In the TLB flush
code you need to set the bit, smp_mb() and then atomic_read(&mm_users).

As things are now (no new mmget_local_tlb_flush() call in common code)
the mmget has to happen regardless before use_mm() so there's a slight
chance to serialize it in switch_mm arch code. Note I did put a
smp_mb() myself in the switch_mm path to make it safe in 2/2.

With your solution it'd be already possible to call mmput at any time
after using the mm and before calling unuse_mm. But then you'd have to
still call unuse_mm; mmget;use_mm again before you can use the mm
again so why not to call unuse_mm immediately instead of only mmput?

> > One reason to prefer adding the mmget to the callers to forget it,
> > would be to avoid an atomic op in use_mm (for those callers that
> > didn't forget it), but if anybody is doing use_mm in a fast path that
> > won't be very fast anyway so I didn't think this was worth the
> > risk. If that microoptimization in a slow path is the reason we should
> > add mmget to the callers that forgot it that would be fine with me
> > although I think it's risky because if already happened once and it
> > could happen again (and when it happens it only bits a few arches if
> > used with a few drivers).
> 
> The primary concern really is that use_mm is usually not bound in time
> and we do not want to pin the address space for such a long time without
> any binding to a killable process.

Almost all use_mm already do mmget before use_mm and mmput after
unuse_mm, or perhaps I wouldn't have to find about this by source
review only. So the cons you describe already affects the vast
majority of use_mm cases (the non buggy ones).

I see you concern for adding a "cons" to the minority of remaining
cases, but it's still better to introduce that "cons" than corrupting
userland memory.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-03 20:17 [RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes Andrea Arcangeli
  2020-02-03 20:17 ` [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize " Andrea Arcangeli
@ 2020-02-03 20:17 ` Andrea Arcangeli
  2020-02-10 17:51   ` Catalin Marinas
  2020-02-12 14:13   ` qi.fuli
  1 sibling, 2 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-03 20:17 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas, Jon Masters, Rafael Aquini, Mark Salter
  Cc: linux-kernel, linux-mm, linux-arm-kernel

With multiple NUMA nodes and multiple sockets, the tlbi broadcast
shall be delivered through the interconnects in turn increasing the
interconnect traffic and the latency of the tlbi broadcast instruction.

Even within a single NUMA node the latency of the tlbi broadcast
instruction increases almost linearly with the number of CPUs trying to
send tlbi broadcasts at the same time.

When the process is single threaded however we can achieve full SMP
scalability by skipping the tlbi broadcasting. Other arches already
deploy this optimization.

After the local TLB flush this however means the ASID context goes out
of sync in all CPUs except the local one. This can be tracked in the
mm_cpumask(mm): if the bit is set it means the asid context is stale
for that CPU. This results in an extra local ASID TLB flush only if a
single threaded process is migrated to a different CPU and only after a
TLB flush. No extra local TLB flush is needed for the common case of
single threaded processes context scheduling within the same CPU and for
multithreaded processes.

Skipping the tlbi instruction broadcasting is already implemented in
local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(),
flush_tlb_range() and flush_tlb_page() too.

Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same
time from 32 single threaded processes before the patch:

 Performance counter stats for './loop' (3 runs):

                 0      dummy

          2.121353 +- 0.000387 seconds time elapsed  ( +-  0.02% )

and with the patch applied:

 Performance counter stats for './loop' (3 runs):

                 0      dummy

         0.1197750 +- 0.0000827 seconds time elapsed  ( +-  0.07% )

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/arm64/include/asm/efi.h         |  2 +-
 arch/arm64/include/asm/mmu.h         |  3 +-
 arch/arm64/include/asm/mmu_context.h | 10 +--
 arch/arm64/include/asm/tlbflush.h    | 91 +++++++++++++++++++++++++++-
 arch/arm64/mm/context.c              | 13 +++-
 5 files changed, 109 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
index 44531a69d32b..5d9a1433d918 100644
--- a/arch/arm64/include/asm/efi.h
+++ b/arch/arm64/include/asm/efi.h
@@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
 
 static inline void efi_set_pgd(struct mm_struct *mm)
 {
-	__switch_mm(mm);
+	__switch_mm(mm, smp_processor_id());
 
 	if (system_uses_ttbr0_pan()) {
 		if (mm != current->active_mm) {
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index e4d862420bb4..1f84289d3e44 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -27,7 +27,8 @@ typedef struct {
  * ASID change and therefore doesn't need to reload the counter using
  * atomic64_read.
  */
-#define ASID(mm)	((mm)->context.id.counter & 0xffff)
+#define __ASID(asid)	((asid) & 0xffff)
+#define ASID(mm)	__ASID((mm)->context.id.counter)
 
 extern bool arm64_use_ng_mappings;
 
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 3827ff4040a3..5ec264297968 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -210,10 +210,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	update_saved_ttbr0(tsk, &init_mm);
 }
 
-static inline void __switch_mm(struct mm_struct *next)
+static inline void __switch_mm(struct mm_struct *next, unsigned int cpu)
 {
-	unsigned int cpu = smp_processor_id();
-
 	/*
 	 * init_mm.pgd does not contain any user mappings and it is always
 	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
@@ -230,8 +228,12 @@ static inline void
 switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	  struct task_struct *tsk)
 {
+	unsigned int cpu = smp_processor_id();
+
 	if (prev != next)
-		__switch_mm(next);
+		__switch_mm(next, cpu);
+	else if (cpumask_test_and_clear_cpu(cpu, mm_cpumask(next)))
+		local_flush_tlb_asid(atomic64_read(&next->context.id));
 
 	/*
 	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index bc3949064725..283f97af3fc5 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void)
 	isb();
 }
 
+static inline void local_flush_tlb_asid(unsigned long asid)
+{
+	asid = __TLBI_VADDR(0, __ASID(asid));
+	dsb(nshst);
+	__tlbi(aside1, asid);
+	__tlbi_user(aside1, asid);
+	dsb(nsh);
+}
+
 static inline void flush_tlb_all(void)
 {
 	dsb(ishst);
@@ -148,6 +157,27 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 {
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
 
+	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+		int cpu = get_cpu();
+
+		cpumask_setall(mm_cpumask(mm));
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
+
+		smp_mb();
+
+		if (atomic_read(&mm->mm_users) <= 1) {
+			dsb(nshst);
+			__tlbi(aside1, asid);
+			__tlbi_user(aside1, asid);
+			dsb(nsh);
+
+			put_cpu();
+			return;
+		}
+		put_cpu();
+	}
+
 	dsb(ishst);
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
@@ -167,7 +197,33 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
 static inline void flush_tlb_page(struct vm_area_struct *vma,
 				  unsigned long uaddr)
 {
-	flush_tlb_page_nosync(vma, uaddr);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
+
+	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+		int cpu = get_cpu();
+
+		cpumask_setall(mm_cpumask(mm));
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
+
+		smp_mb();
+
+		if (atomic_read(&mm->mm_users) <= 1) {
+			dsb(nshst);
+			__tlbi(vale1, addr);
+			__tlbi_user(vale1, addr);
+			dsb(nsh);
+
+			put_cpu();
+			return;
+		}
+		put_cpu();
+	}
+
+	dsb(ishst);
+	__tlbi(vale1is, addr);
+	__tlbi_user(vale1is, addr);
 	dsb(ish);
 }
 
@@ -181,14 +237,15 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 				     unsigned long start, unsigned long end,
 				     unsigned long stride, bool last_level)
 {
-	unsigned long asid = ASID(vma->vm_mm);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long asid = ASID(mm);
 	unsigned long addr;
 
 	start = round_down(start, stride);
 	end = round_up(end, stride);
 
 	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
-		flush_tlb_mm(vma->vm_mm);
+		flush_tlb_mm(mm);
 		return;
 	}
 
@@ -198,6 +255,34 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	start = __TLBI_VADDR(start, asid);
 	end = __TLBI_VADDR(end, asid);
 
+	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+		int cpu = get_cpu();
+
+		cpumask_setall(mm_cpumask(mm));
+		cpumask_clear_cpu(cpu, mm_cpumask(mm));
+
+		smp_mb();
+
+		if (atomic_read(&mm->mm_users) <= 1) {
+			dsb(nshst);
+			for (addr = start; addr < end; addr += stride) {
+				if (last_level) {
+					__tlbi(vale1, addr);
+					__tlbi_user(vale1, addr);
+				} else {
+					__tlbi(vae1, addr);
+					__tlbi_user(vae1, addr);
+				}
+			}
+			dsb(nsh);
+
+			put_cpu();
+			return;
+		}
+		put_cpu();
+	}
+
 	dsb(ishst);
 	for (addr = start; addr < end; addr += stride) {
 		if (last_level) {
diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
index 8ef73e89d514..b44459d64dff 100644
--- a/arch/arm64/mm/context.c
+++ b/arch/arm64/mm/context.c
@@ -198,6 +198,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 {
 	unsigned long flags;
 	u64 asid, old_active_asid;
+	int need_flush = 0;
 
 	if (system_supports_cnp())
 		cpu_set_reserved_ttbr0();
@@ -233,7 +234,8 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 		atomic64_set(&mm->context.id, asid);
 	}
 
-	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
+	need_flush = cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending);
+	if (need_flush)
 		local_flush_tlb_all();
 
 	atomic64_set(&per_cpu(active_asids, cpu), asid);
@@ -241,6 +243,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
 
 switch_mm_fastpath:
 
+	/*
+	 * Enforce CPU ordering between the mmget() in use_mm() and
+	 * the below cpumask_test_and_clear_cpu().
+	 */
+	smp_mb();
+	if (cpumask_test_and_clear_cpu(cpu, mm_cpumask(mm)) &&
+	    likely(!need_flush))
+		local_flush_tlb_asid(asid);
+
 	arm64_apply_bp_hardening();
 
 	/*


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
@ 2020-02-10 17:51   ` Catalin Marinas
  2020-02-10 20:14     ` Andrea Arcangeli
  2020-02-12 14:13   ` qi.fuli
  1 sibling, 1 reply; 13+ messages in thread
From: Catalin Marinas @ 2020-02-10 17:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Jon Masters, Rafael Aquini, Mark Salter,
	linux-kernel, linux-mm, linux-arm-kernel

Hi Andrea,

Thanks for having a go at this.

On Mon, Feb 03, 2020 at 03:17:45PM -0500, Andrea Arcangeli wrote:
> With multiple NUMA nodes and multiple sockets, the tlbi broadcast
> shall be delivered through the interconnects in turn increasing the
> interconnect traffic and the latency of the tlbi broadcast instruction.

There are better ways to handle this in hardware (snoop filters), so
hopefully those vendors will eventually learn.

> After the local TLB flush this however means the ASID context goes out
> of sync in all CPUs except the local one. This can be tracked in the
> mm_cpumask(mm): if the bit is set it means the asid context is stale
> for that CPU. This results in an extra local ASID TLB flush only if a
> single threaded process is migrated to a different CPU and only after a
> TLB flush. No extra local TLB flush is needed for the common case of
> single threaded processes context scheduling within the same CPU and for
> multithreaded processes.

Relying om mm_users is not sufficient AFAICT. Let's say on CPU0 you have
a kernel thread running with the previous user pgd and ASID set in
ttbr0_el1. The mm_users would still be 1 since only mm_count is
incremented in context_switch(). If the user thread now runs on CPU1, a
local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may
still walk (speculatively) the user page tables.

An example where this matters is a group of small pages converted to a
huge page. If CPU0 already has some TLB entries for small pages in the
group but, not being aware of a TLBI for the ptes in the range, may read
a block pmd entry (huge page) and we end up with a TLB conflict on CPU0
(CPU1 is fine since you do the local tlbi).

There are other examples where this could go wrong as the hardware may
keep intermediate pgtable entries in a walk cache. In the arm64 kernel
we rely on something the architecture calls break-before-make for any
page table updates and these need to be broadcast to other CPUs that may
potentially have an entry in their TLB.

It may be better if you used mm_cpumask to mark wherever an mm ever ran
than relying on mm_users.

> Skipping the tlbi instruction broadcasting is already implemented in
> local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(),
> flush_tlb_range() and flush_tlb_page() too.
> 
> Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same
> time from 32 single threaded processes before the patch:
> 
>  Performance counter stats for './loop' (3 runs):
> 
>                  0      dummy
> 
>           2.121353 +- 0.000387 seconds time elapsed  ( +-  0.02% )
> 
> and with the patch applied:
> 
>  Performance counter stats for './loop' (3 runs):
> 
>                  0      dummy
> 
>          0.1197750 +- 0.0000827 seconds time elapsed  ( +-  0.07% )

That's a pretty artificial test and it is indeed improved by this patch.
However, it would be nice to have some real-world scenarios where this
matters.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-10 17:51   ` Catalin Marinas
@ 2020-02-10 20:14     ` Andrea Arcangeli
  2020-02-11 14:00       ` Catalin Marinas
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-10 20:14 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Jon Masters, Rafael Aquini, Mark Salter,
	linux-kernel, linux-mm, linux-arm-kernel

Hello Catalin,

On Mon, Feb 10, 2020 at 05:51:06PM +0000, Catalin Marinas wrote:
> Relying om mm_users is not sufficient AFAICT. Let's say on CPU0 you have
> a kernel thread running with the previous user pgd and ASID set in
> ttbr0_el1. The mm_users would still be 1 since only mm_count is
> incremented in context_switch(). If the user thread now runs on CPU1, a
> local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may
> still walk (speculatively) the user page tables.
> 
> An example where this matters is a group of small pages converted to a
> huge page. If CPU0 already has some TLB entries for small pages in the
> group but, not being aware of a TLBI for the ptes in the range, may read
> a block pmd entry (huge page) and we end up with a TLB conflict on CPU0
> (CPU1 is fine since you do the local tlbi).
> 
> There are other examples where this could go wrong as the hardware may
> keep intermediate pgtable entries in a walk cache. In the arm64 kernel
> we rely on something the architecture calls break-before-make for any
> page table updates and these need to be broadcast to other CPUs that may
> potentially have an entry in their TLB.
> 
> It may be better if you used mm_cpumask to mark wherever an mm ever ran
> than relying on mm_users.

Agreed.

If we can use mm_cpumask to track where the mm ever run, then if I'm
not mistaken we could optimize also multithreaded processes in the
same way: if only one thread is running frequently and the others are
frequently sleeping, we could issue a single tlbi broadcast (modulo
invalidates of small virtual ranges).

In the meantime the below should be enough to address the concern you
raised of the proof of concept RFC patch.

I already experimented with mm_users == 1 earlier and it doesn't
change the benchmark results for the "best case" below.

(untested)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 772bbc45b867..a2d53b301f22 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -169,7 +169,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -177,7 +178,9 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(aside1, asid);
 			__tlbi_user(aside1, asid);
@@ -212,7 +215,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -220,7 +224,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(vale1, addr);
 			__tlbi_user(vale1, addr);
@@ -264,7 +270,8 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	end = __TLBI_VADDR(end, asid);
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1  &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -272,7 +279,9 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			for (addr = start; addr < end; addr += stride) {
 				if (last_level) {


> That's a pretty artificial test and it is indeed improved by this patch.
> However, it would be nice to have some real-world scenarios where this
> matters.

I don't know exactly how much we should rely on the hardware to snoop
the asid on NUMA. The hardware to fully optimize would need to
implement a replicated mm_cpumask bitflag for each asid and every CPU
would need to tell every other CPU which asid it is loading every time
it is loading it. Exactly what x86 does with mm_cpumask in software.

That is ideal, but is it an arch requirement to add the above in all
implementations?

The case I measured has a single socket so it's even simpler because
it could be optimized all in-core. Even with a single socket I'm not
sure what's going wrong in the chip: it felt like it's the engine that
does the broadcast that runs serially system wide and then all CPUs
have to wait on it.

Still your question if it'll make a difference in practice is a good
one and I don't have a sure answer yet. I suppose before doing more
benchmarking it's better to make a new version of this that uses
mm_cpumask to track where the asid was ever loaded as you suggested,
so that it will also optimize away tlbi broadcaasts from multithreaded
processes where only one thread is running frequently?

Thanks!
Andrea


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-10 20:14     ` Andrea Arcangeli
@ 2020-02-11 14:00       ` Catalin Marinas
  2020-02-12 12:57         ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Catalin Marinas @ 2020-02-11 14:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Will Deacon, Jon Masters, Rafael Aquini, Mark Salter,
	linux-kernel, linux-mm, linux-arm-kernel

Hi Andrea,

On Mon, Feb 10, 2020 at 03:14:11PM -0500, Andrea Arcangeli wrote:
> On Mon, Feb 10, 2020 at 05:51:06PM +0000, Catalin Marinas wrote:
> > It may be better if you used mm_cpumask to mark wherever an mm ever ran
> > than relying on mm_users.
> 
> Agreed.
> 
> If we can use mm_cpumask to track where the mm ever run, then if I'm
> not mistaken we could optimize also multithreaded processes in the
> same way: if only one thread is running frequently and the others are
> frequently sleeping, we could issue a single tlbi broadcast (modulo
> invalidates of small virtual ranges).

Possibly, though not sure how you'd detect such scenario.

> In the meantime the below should be enough to address the concern you
> raised of the proof of concept RFC patch.
> 
> I already experimented with mm_users == 1 earlier and it doesn't
> change the benchmark results for the "best case" below.
> 
> (untested)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 772bbc45b867..a2d53b301f22 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
[...]
> @@ -212,7 +215,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>  	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
>  
>  	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
> -	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
> +	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
> +	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
>  		int cpu = get_cpu();
>  
>  		cpumask_setall(mm_cpumask(mm));

I think there is another race here. IIUC, the assumption you make is
that when mm_users <= 1 && mm_count == 1, the only active user of this
pgd/ASID is on the CPU doing the TLBI. This is not the case for
try_to_unmap() where the above condition may be true but the active
thread on a different CPU won't notice the local TLBI.

> > That's a pretty artificial test and it is indeed improved by this patch.
> > However, it would be nice to have some real-world scenarios where this
> > matters.
[...]
> Still your question if it'll make a difference in practice is a good
> one and I don't have a sure answer yet. I suppose before doing more
> benchmarking it's better to make a new version of this that uses
> mm_cpumask to track where the asid was ever loaded as you suggested,
> so that it will also optimize away tlbi broadcasts from multithreaded
> processes where only one thread is running frequently?

I was actually curious what triggered this patch series, whether you've
seen a real use-case where the TLBI was a bottleneck.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-11 14:00       ` Catalin Marinas
@ 2020-02-12 12:57         ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2020-02-12 12:57 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Jon Masters, Rafael Aquini, Mark Salter,
	linux-kernel, linux-mm, linux-arm-kernel

Hello,

On Tue, Feb 11, 2020 at 02:00:25PM +0000, Catalin Marinas wrote:
> I think there is another race here. IIUC, the assumption you make is
> that when mm_users <= 1 && mm_count == 1, the only active user of this
> pgd/ASID is on the CPU doing the TLBI. This is not the case for
> try_to_unmap() where the above condition may be true but the active
> thread on a different CPU won't notice the local TLBI.

The "current->mm == mm" check is what shall prevent the above.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
  2020-02-10 17:51   ` Catalin Marinas
@ 2020-02-12 14:13   ` qi.fuli
  2020-02-12 15:26     ` Catalin Marinas
  1 sibling, 1 reply; 13+ messages in thread
From: qi.fuli @ 2020-02-12 14:13 UTC (permalink / raw)
  To: Andrea Arcangeli, Will Deacon, Catalin Marinas, Jon Masters,
	Rafael Aquini, Mark Salter
  Cc: linux-mm, linux-kernel, linux-arm-kernel

On 2/4/20 5:17 AM, Andrea Arcangeli wrote:
> With multiple NUMA nodes and multiple sockets, the tlbi broadcast
> shall be delivered through the interconnects in turn increasing the
> interconnect traffic and the latency of the tlbi broadcast instruction.
> 
> Even within a single NUMA node the latency of the tlbi broadcast
> instruction increases almost linearly with the number of CPUs trying to
> send tlbi broadcasts at the same time.
> 
> When the process is single threaded however we can achieve full SMP
> scalability by skipping the tlbi broadcasting. Other arches already
> deploy this optimization.
> 
> After the local TLB flush this however means the ASID context goes out
> of sync in all CPUs except the local one. This can be tracked in the
> mm_cpumask(mm): if the bit is set it means the asid context is stale
> for that CPU. This results in an extra local ASID TLB flush only if a
> single threaded process is migrated to a different CPU and only after a
> TLB flush. No extra local TLB flush is needed for the common case of
> single threaded processes context scheduling within the same CPU and for
> multithreaded processes.
> 
> Skipping the tlbi instruction broadcasting is already implemented in
> local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(),
> flush_tlb_range() and flush_tlb_page() too.
> 
> Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same
> time from 32 single threaded processes before the patch:
> 
>   Performance counter stats for './loop' (3 runs):
> 
>                   0      dummy
> 
>            2.121353 +- 0.000387 seconds time elapsed  ( +-  0.02% )
> 
> and with the patch applied:
> 
>   Performance counter stats for './loop' (3 runs):
> 
>                   0      dummy
> 
>           0.1197750 +- 0.0000827 seconds time elapsed  ( +-  0.07% )

Hi,

I have tested this patch on thunderX2 with Himeno benchmark[1] with 
LARGE calculation size. Here are the results.

   w/o patch:   MFLOPS : 1149.480174
   w/  patch:   MFLOPS : 1110.653003

In order to validate the effectivness of the patch, I ran a 
single-threded program, which calls mprotect() in a loop to issue the 
tlbi broadcast instruction on a CPU core. At the same time, I ran Himeno 
benchmark on another CPU core. The results are:

   w/o patch:   MFLOPS :  860.238792
   w/  patch:   MFLOPS : 1110.449666

Though Himeno benchmark is a microbenchmark, I hope it helps.

[1] http://accc.riken.jp/en/supercom/documents/himenobmt/

Thanks,
QI Fuli

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   arch/arm64/include/asm/efi.h         |  2 +-
>   arch/arm64/include/asm/mmu.h         |  3 +-
>   arch/arm64/include/asm/mmu_context.h | 10 +--
>   arch/arm64/include/asm/tlbflush.h    | 91 +++++++++++++++++++++++++++-
>   arch/arm64/mm/context.c              | 13 +++-
>   5 files changed, 109 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/efi.h b/arch/arm64/include/asm/efi.h
> index 44531a69d32b..5d9a1433d918 100644
> --- a/arch/arm64/include/asm/efi.h
> +++ b/arch/arm64/include/asm/efi.h
> @@ -131,7 +131,7 @@ static inline void efifb_setup_from_dmi(struct screen_info *si, const char *opt)
>   
>   static inline void efi_set_pgd(struct mm_struct *mm)
>   {
> -	__switch_mm(mm);
> +	__switch_mm(mm, smp_processor_id());
>   
>   	if (system_uses_ttbr0_pan()) {
>   		if (mm != current->active_mm) {
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index e4d862420bb4..1f84289d3e44 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -27,7 +27,8 @@ typedef struct {
>    * ASID change and therefore doesn't need to reload the counter using
>    * atomic64_read.
>    */
> -#define ASID(mm)	((mm)->context.id.counter & 0xffff)
> +#define __ASID(asid)	((asid) & 0xffff)
> +#define ASID(mm)	__ASID((mm)->context.id.counter)
>   
>   extern bool arm64_use_ng_mappings;
>   
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index 3827ff4040a3..5ec264297968 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -210,10 +210,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>   	update_saved_ttbr0(tsk, &init_mm);
>   }
>   
> -static inline void __switch_mm(struct mm_struct *next)
> +static inline void __switch_mm(struct mm_struct *next, unsigned int cpu)
>   {
> -	unsigned int cpu = smp_processor_id();
> -
>   	/*
>   	 * init_mm.pgd does not contain any user mappings and it is always
>   	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
> @@ -230,8 +228,12 @@ static inline void
>   switch_mm(struct mm_struct *prev, struct mm_struct *next,
>   	  struct task_struct *tsk)
>   {
> +	unsigned int cpu = smp_processor_id();
> +
>   	if (prev != next)
> -		__switch_mm(next);
> +		__switch_mm(next, cpu);
> +	else if (cpumask_test_and_clear_cpu(cpu, mm_cpumask(next)))
> +		local_flush_tlb_asid(atomic64_read(&next->context.id));
>   
>   	/*
>   	 * Update the saved TTBR0_EL1 of the scheduled-in task as the previous
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index bc3949064725..283f97af3fc5 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -136,6 +136,15 @@ static inline void local_flush_tlb_all(void)
>   	isb();
>   }
>   
> +static inline void local_flush_tlb_asid(unsigned long asid)
> +{
> +	asid = __TLBI_VADDR(0, __ASID(asid));
> +	dsb(nshst);
> +	__tlbi(aside1, asid);
> +	__tlbi_user(aside1, asid);
> +	dsb(nsh);
> +}
> +
>   static inline void flush_tlb_all(void)
>   {
>   	dsb(ishst);
> @@ -148,6 +157,27 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>   {
>   	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
>   
> +	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
> +	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
> +		int cpu = get_cpu();
> +
> +		cpumask_setall(mm_cpumask(mm));
> +		cpumask_clear_cpu(cpu, mm_cpumask(mm));
> +
> +		smp_mb();
> +
> +		if (atomic_read(&mm->mm_users) <= 1) {
> +			dsb(nshst);
> +			__tlbi(aside1, asid);
> +			__tlbi_user(aside1, asid);
> +			dsb(nsh);
> +
> +			put_cpu();
> +			return;
> +		}
> +		put_cpu();
> +	}
> +
>   	dsb(ishst);
>   	__tlbi(aside1is, asid);
>   	__tlbi_user(aside1is, asid);
> @@ -167,7 +197,33 @@ static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>   static inline void flush_tlb_page(struct vm_area_struct *vma,
>   				  unsigned long uaddr)
>   {
> -	flush_tlb_page_nosync(vma, uaddr);
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
> +
> +	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
> +	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
> +		int cpu = get_cpu();
> +
> +		cpumask_setall(mm_cpumask(mm));
> +		cpumask_clear_cpu(cpu, mm_cpumask(mm));
> +
> +		smp_mb();
> +
> +		if (atomic_read(&mm->mm_users) <= 1) {
> +			dsb(nshst);
> +			__tlbi(vale1, addr);
> +			__tlbi_user(vale1, addr);
> +			dsb(nsh);
> +
> +			put_cpu();
> +			return;
> +		}
> +		put_cpu();
> +	}
> +
> +	dsb(ishst);
> +	__tlbi(vale1is, addr);
> +	__tlbi_user(vale1is, addr);
>   	dsb(ish);
>   }
>   
> @@ -181,14 +237,15 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>   				     unsigned long start, unsigned long end,
>   				     unsigned long stride, bool last_level)
>   {
> -	unsigned long asid = ASID(vma->vm_mm);
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long asid = ASID(mm);
>   	unsigned long addr;
>   
>   	start = round_down(start, stride);
>   	end = round_up(end, stride);
>   
>   	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
> -		flush_tlb_mm(vma->vm_mm);
> +		flush_tlb_mm(mm);
>   		return;
>   	}
>   
> @@ -198,6 +255,34 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
>   	start = __TLBI_VADDR(start, asid);
>   	end = __TLBI_VADDR(end, asid);
>   
> +	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
> +	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
> +		int cpu = get_cpu();
> +
> +		cpumask_setall(mm_cpumask(mm));
> +		cpumask_clear_cpu(cpu, mm_cpumask(mm));
> +
> +		smp_mb();
> +
> +		if (atomic_read(&mm->mm_users) <= 1) {
> +			dsb(nshst);
> +			for (addr = start; addr < end; addr += stride) {
> +				if (last_level) {
> +					__tlbi(vale1, addr);
> +					__tlbi_user(vale1, addr);
> +				} else {
> +					__tlbi(vae1, addr);
> +					__tlbi_user(vae1, addr);
> +				}
> +			}
> +			dsb(nsh);
> +
> +			put_cpu();
> +			return;
> +		}
> +		put_cpu();
> +	}
> +
>   	dsb(ishst);
>   	for (addr = start; addr < end; addr += stride) {
>   		if (last_level) {
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index 8ef73e89d514..b44459d64dff 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -198,6 +198,7 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>   {
>   	unsigned long flags;
>   	u64 asid, old_active_asid;
> +	int need_flush = 0;
>   
>   	if (system_supports_cnp())
>   		cpu_set_reserved_ttbr0();
> @@ -233,7 +234,8 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>   		atomic64_set(&mm->context.id, asid);
>   	}
>   
> -	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
> +	need_flush = cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending);
> +	if (need_flush)
>   		local_flush_tlb_all();
>   
>   	atomic64_set(&per_cpu(active_asids, cpu), asid);
> @@ -241,6 +243,15 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>   
>   switch_mm_fastpath:
>   
> +	/*
> +	 * Enforce CPU ordering between the mmget() in use_mm() and
> +	 * the below cpumask_test_and_clear_cpu().
> +	 */
> +	smp_mb();
> +	if (cpumask_test_and_clear_cpu(cpu, mm_cpumask(mm)) &&
> +	    likely(!need_flush))
> +		local_flush_tlb_asid(asid);
> +
>   	arm64_apply_bp_hardening();
>   
>   	/*
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
  2020-02-12 14:13   ` qi.fuli
@ 2020-02-12 15:26     ` Catalin Marinas
  0 siblings, 0 replies; 13+ messages in thread
From: Catalin Marinas @ 2020-02-12 15:26 UTC (permalink / raw)
  To: qi.fuli
  Cc: Andrea Arcangeli, Will Deacon, Jon Masters, Rafael Aquini,
	Mark Salter, linux-mm, linux-kernel, linux-arm-kernel

On Wed, Feb 12, 2020 at 02:13:56PM +0000, qi.fuli@fujitsu.com wrote:
> On 2/4/20 5:17 AM, Andrea Arcangeli wrote:
> > With multiple NUMA nodes and multiple sockets, the tlbi broadcast
> > shall be delivered through the interconnects in turn increasing the
> > interconnect traffic and the latency of the tlbi broadcast instruction.
> > 
> > Even within a single NUMA node the latency of the tlbi broadcast
> > instruction increases almost linearly with the number of CPUs trying to
> > send tlbi broadcasts at the same time.
> > 
> > When the process is single threaded however we can achieve full SMP
> > scalability by skipping the tlbi broadcasting. Other arches already
> > deploy this optimization.
> > 
> > After the local TLB flush this however means the ASID context goes out
> > of sync in all CPUs except the local one. This can be tracked in the
> > mm_cpumask(mm): if the bit is set it means the asid context is stale
> > for that CPU. This results in an extra local ASID TLB flush only if a
> > single threaded process is migrated to a different CPU and only after a
> > TLB flush. No extra local TLB flush is needed for the common case of
> > single threaded processes context scheduling within the same CPU and for
> > multithreaded processes.
> > 
> > Skipping the tlbi instruction broadcasting is already implemented in
> > local_flush_tlb_all(), this patch only extends it to flush_tlb_mm(),
> > flush_tlb_range() and flush_tlb_page() too.
> > 
> > Here's the result of 32 CPUs (ARMv8 Ampere) running mprotect at the same
> > time from 32 single threaded processes before the patch:
> > 
> >   Performance counter stats for './loop' (3 runs):
> > 
> >                   0      dummy
> > 
> >            2.121353 +- 0.000387 seconds time elapsed  ( +-  0.02% )
> > 
> > and with the patch applied:
> > 
> >   Performance counter stats for './loop' (3 runs):
> > 
> >                   0      dummy
> > 
> >           0.1197750 +- 0.0000827 seconds time elapsed  ( +-  0.07% )
> 
> I have tested this patch on thunderX2 with Himeno benchmark[1] with 
> LARGE calculation size. Here are the results.
> 
>    w/o patch:   MFLOPS : 1149.480174
>    w/  patch:   MFLOPS : 1110.653003
> 
> In order to validate the effectivness of the patch, I ran a 
> single-threded program, which calls mprotect() in a loop to issue the 
> tlbi broadcast instruction on a CPU core. At the same time, I ran Himeno 
> benchmark on another CPU core. The results are:
> 
>    w/o patch:   MFLOPS :  860.238792
>    w/  patch:   MFLOPS : 1110.449666
> 
> Though Himeno benchmark is a microbenchmark, I hope it helps.

It doesn't really help. What if you have a two-thread program calling
mprotect() in a loop? IOW, how is this relevant to real-world scenarios?

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2020-02-19 20:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-03 20:17 [RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes Andrea Arcangeli
2020-02-03 20:17 ` [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize " Andrea Arcangeli
2020-02-18 11:31   ` Michal Hocko
2020-02-18 18:56     ` Andrea Arcangeli
2020-02-19 11:58       ` Michal Hocko
2020-02-19 20:04         ` Andrea Arcangeli
2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
2020-02-10 17:51   ` Catalin Marinas
2020-02-10 20:14     ` Andrea Arcangeli
2020-02-11 14:00       ` Catalin Marinas
2020-02-12 12:57         ` Andrea Arcangeli
2020-02-12 14:13   ` qi.fuli
2020-02-12 15:26     ` Catalin Marinas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).