linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrea Arcangeli <aarcange@redhat.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Jon Masters <jcm@jonmasters.org>,
	Rafael Aquini <aquini@redhat.com>,
	Mark Salter <msalter@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes
Date: Mon, 10 Feb 2020 15:14:11 -0500	[thread overview]
Message-ID: <20200210201411.GC3699@redhat.com> (raw)
In-Reply-To: <20200210175106.GA27215@arrakis.emea.arm.com>

Hello Catalin,

On Mon, Feb 10, 2020 at 05:51:06PM +0000, Catalin Marinas wrote:
> Relying om mm_users is not sufficient AFAICT. Let's say on CPU0 you have
> a kernel thread running with the previous user pgd and ASID set in
> ttbr0_el1. The mm_users would still be 1 since only mm_count is
> incremented in context_switch(). If the user thread now runs on CPU1, a
> local tlbi would only invalidate the TLBs on CPU1. However, CPU0 may
> still walk (speculatively) the user page tables.
> 
> An example where this matters is a group of small pages converted to a
> huge page. If CPU0 already has some TLB entries for small pages in the
> group but, not being aware of a TLBI for the ptes in the range, may read
> a block pmd entry (huge page) and we end up with a TLB conflict on CPU0
> (CPU1 is fine since you do the local tlbi).
> 
> There are other examples where this could go wrong as the hardware may
> keep intermediate pgtable entries in a walk cache. In the arm64 kernel
> we rely on something the architecture calls break-before-make for any
> page table updates and these need to be broadcast to other CPUs that may
> potentially have an entry in their TLB.
> 
> It may be better if you used mm_cpumask to mark wherever an mm ever ran
> than relying on mm_users.

Agreed.

If we can use mm_cpumask to track where the mm ever run, then if I'm
not mistaken we could optimize also multithreaded processes in the
same way: if only one thread is running frequently and the others are
frequently sleeping, we could issue a single tlbi broadcast (modulo
invalidates of small virtual ranges).

In the meantime the below should be enough to address the concern you
raised of the proof of concept RFC patch.

I already experimented with mm_users == 1 earlier and it doesn't
change the benchmark results for the "best case" below.

(untested)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 772bbc45b867..a2d53b301f22 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -169,7 +169,8 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	unsigned long asid = __TLBI_VADDR(0, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -177,7 +178,9 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(aside1, asid);
 			__tlbi_user(aside1, asid);
@@ -212,7 +215,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	unsigned long addr = __TLBI_VADDR(uaddr, ASID(mm));
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1 &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -220,7 +224,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			__tlbi(vale1, addr);
 			__tlbi_user(vale1, addr);
@@ -264,7 +270,8 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 	end = __TLBI_VADDR(end, asid);
 
 	/* avoid TLB-i broadcast to remote NUMA nodes if it's a local flush */
-	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1) {
+	if (current->mm == mm && atomic_read(&mm->mm_users) <= 1  &&
+	    (system_uses_ttbr0_pan() || atomic_read(&mm->mm_count) == 1)) {
 		int cpu = get_cpu();
 
 		cpumask_setall(mm_cpumask(mm));
@@ -272,7 +279,9 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 
 		smp_mb();
 
-		if (atomic_read(&mm->mm_users) <= 1) {
+		if (atomic_read(&mm->mm_users) <= 1 &&
+		    (system_uses_ttbr0_pan() ||
+		     atomic_read(&mm->mm_count) == 1)) {
 			dsb(nshst);
 			for (addr = start; addr < end; addr += stride) {
 				if (last_level) {


> That's a pretty artificial test and it is indeed improved by this patch.
> However, it would be nice to have some real-world scenarios where this
> matters.

I don't know exactly how much we should rely on the hardware to snoop
the asid on NUMA. The hardware to fully optimize would need to
implement a replicated mm_cpumask bitflag for each asid and every CPU
would need to tell every other CPU which asid it is loading every time
it is loading it. Exactly what x86 does with mm_cpumask in software.

That is ideal, but is it an arch requirement to add the above in all
implementations?

The case I measured has a single socket so it's even simpler because
it could be optimized all in-core. Even with a single socket I'm not
sure what's going wrong in the chip: it felt like it's the engine that
does the broadcast that runs serially system wide and then all CPUs
have to wait on it.

Still your question if it'll make a difference in practice is a good
one and I don't have a sure answer yet. I suppose before doing more
benchmarking it's better to make a new version of this that uses
mm_cpumask to track where the asid was ever loaded as you suggested,
so that it will also optimize away tlbi broadcaasts from multithreaded
processes where only one thread is running frequently?

Thanks!
Andrea



  reply	other threads:[~2020-02-10 20:14 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-03 20:17 [RFC] [PATCH 0/2] arm64: tlb: skip tlbi broadcast for single threaded TLB flushes Andrea Arcangeli
2020-02-03 20:17 ` [PATCH 1/2] mm: use_mm: fix for arches checking mm_users to optimize " Andrea Arcangeli
2020-02-18 11:31   ` Michal Hocko
2020-02-18 18:56     ` Andrea Arcangeli
2020-02-19 11:58       ` Michal Hocko
2020-02-19 20:04         ` Andrea Arcangeli
2020-02-03 20:17 ` [PATCH 2/2] arm64: tlb: skip tlbi broadcast for single threaded " Andrea Arcangeli
2020-02-10 17:51   ` Catalin Marinas
2020-02-10 20:14     ` Andrea Arcangeli [this message]
2020-02-11 14:00       ` Catalin Marinas
2020-02-12 12:57         ` Andrea Arcangeli
2020-02-12 14:13   ` qi.fuli
2020-02-12 15:26     ` Catalin Marinas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200210201411.GC3699@redhat.com \
    --to=aarcange@redhat.com \
    --cc=aquini@redhat.com \
    --cc=catalin.marinas@arm.com \
    --cc=jcm@jonmasters.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=msalter@redhat.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).