* [RFC 0/2] Enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH on POWER
@ 2017-11-01 10:17 Anshuman Khandual
  2017-11-01 10:17 ` [RFC 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Anshuman Khandual
  2017-11-01 10:17 ` [RFC 2/2] powerpc/mm: Enable deferred flushing of TLB during reclaim Anshuman Khandual
  0 siblings, 2 replies; 4+ messages in thread
From: Anshuman Khandual @ 2017-11-01 10:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, aneesh.kumar, npiggin

From: Anshuman Khandual <Khandual@linux.vnet.ibm.com>

Batched TLB flushing in the reclaim path has been around for a couple of
years now and is enabled on the x86 platform. The idea is to batch up
multiple page TLB invalidation requests and then completely flush all
those CPUs which might have cached a TLB entry for any of the unmapped
pages, instead of sending IPIs and flushing out individual pages each
time reclaim unmaps one page. This has the potential to improve
performance for certain types of workloads under memory pressure,
provided some conditions related to the costs of individual page TLB
invalidation, CPU wide TLB invalidation, system wide TLB invalidation,
TLB reload, IPIs etc are met.
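
To make the contrast concrete, here is a minimal sketch (mine, not taken
from the series): flush_tlb_page(), mm_cpumask() and cpumask_or() are
existing kernel interfaces, while the two helpers themselves are
hypothetical stand-ins for what reclaim effectively does per unmapped page.

#include <linux/cpumask.h>
#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Unbatched: every unmapped page pays for a remote shootdown right away. */
static void unmap_one_page_unbatched(struct vm_area_struct *vma, unsigned long addr)
{
	flush_tlb_page(vma, addr);	/* shoots down this translation on the CPUs that may cache it */
}

/* Batched: only remember which CPUs may need flushing; flush once, later. */
static void unmap_one_page_batched(struct mm_struct *mm, struct cpumask *pending)
{
	cpumask_or(pending, pending, mm_cpumask(mm));
}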

Please refer to commit 72b252aed5 ("mm: send one IPI per CPU to TLB flush
all entries after unmapping pages") from Mel Gorman for more details on
how it can impact the performance of various workloads. This enablement
improves performance for the original test case 'case-lru-file-mmap-read'
from the vm-scalability bucket, but only from a system time perspective.

time ./run case-lru-file-mmap-read

Without the patch:

real    4m20.364s
user    102m52.492s
sys     433m26.190s

With the patch:

real    4m15.942s	(-  1.69%)
user    111m16.662s	(+  7.55%)
sys     382m35.202s	(- 11.73%)

Parallel kernel compilation does not show any performance improvement or
degradation with this patch; the difference remains within the margin of
error.

Without the patch:

real    1m13.850s
user    39m21.803s
sys     2m43.362s

With the patch:

real    1m14.481s	(+ 0.85%)
user    39m27.409s	(+ 0.23%)
sys     2m44.656s	(+ 0.79%)

The implementation batches up multiple struct mm during reclaim and keeps
accumulating the superset of all those struct mm's cpu masks, i.e. the
CPUs which might hold a TLB entry that needs to be invalidated. Then a
local, struct mm wide invalidation is performed on every CPU in that
accumulated cpu mask for all of the batched struct mm (see the sketch
below). Please review and let me know if there is a better way to do
this. Thank you.
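
A minimal sketch of that accumulation and flush step (illustrative only,
not the actual patch; batched_shootdown() and flush_local_tlb() are
hypothetical helpers, the cpumask and smp interfaces are real):

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>

static void flush_local_tlb(void *unused)
{
	/* the arch specific local TLB flush for the batched mm would go here */
}

static void batched_shootdown(struct mm_struct **mms, int nr, struct cpumask *mask)
{
	int i;

	cpumask_clear(mask);
	for (i = 0; i < nr; i++)	/* superset of every batched mm's CPUs */
		cpumask_or(mask, mask, mm_cpumask(mms[i]));

	/* one round of IPIs for the whole batch instead of one per page */
	smp_call_function_many(mask, flush_local_tlb, NULL, 1);
}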

Anshuman Khandual (2):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
  powerpc/mm: Enable deferred flushing of TLB during reclaim

 arch/powerpc/Kconfig                |  1 +
 arch/powerpc/include/asm/tlbbatch.h | 30 +++++++++++++++++++++++
 arch/powerpc/include/asm/tlbflush.h |  3 +++
 arch/powerpc/mm/tlb-radix.c         | 49 +++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/tlbflush.h     | 12 +++++++++
 mm/rmap.c                           |  9 +------
 6 files changed, 96 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/include/asm/tlbbatch.h

-- 
1.8.3.1

* [RFC 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
  2017-11-01 10:17 [RFC 0/2] Enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH on POWER Anshuman Khandual
@ 2017-11-01 10:17 ` Anshuman Khandual
  2017-11-01 10:17 ` [RFC 2/2] powerpc/mm: Enable deferred flushing of TLB during reclaim Anshuman Khandual
  1 sibling, 0 replies; 4+ messages in thread
From: Anshuman Khandual @ 2017-11-01 10:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, aneesh.kumar, npiggin

The entire scheme of deferred TLB flushing in the reclaim path rests on
the assumption that the cost of refilling TLB entries is less than that
of flushing out individual entries by sending IPIs to remote CPUs. But
architectures can have different ways of evaluating that. Hence, apart
from checking TTU_BATCH_FLUSH in the TTU flags, the rest of the decision
should be architecture specific.
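
For architectures that have no cheaper way to decide, a generic fallback
could simply keep the existing x86-style check, i.e. defer whenever any
CPU other than the current one may hold entries. A sketch of that idea
(my illustration, not part of this patch; the #ifndef guard and the
placement in a generic header are assumptions):

#ifndef arch_tlbbatch_should_defer
static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
{
	bool defer;
	int cpu = get_cpu();

	/* defer only if a remote CPU may be caching entries for this mm */
	defer = cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids;
	put_cpu();

	return defer;
}
#endif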

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index c4aed0d..5875f2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -366,6 +366,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index b874c47..bfbfe92 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -627,17 +627,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
1.8.3.1

* [RFC 2/2] powerpc/mm: Enable deferred flushing of TLB during reclaim
  2017-11-01 10:17 [RFC 0/2] Enable ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH on POWER Anshuman Khandual
  2017-11-01 10:17 ` [RFC 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer() Anshuman Khandual
@ 2017-11-01 10:17 ` Anshuman Khandual
  2017-11-02  4:02   ` Anshuman Khandual
  1 sibling, 1 reply; 4+ messages in thread
From: Anshuman Khandual @ 2017-11-01 10:17 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mpe, aneesh.kumar, npiggin

Deferred flushing can only be enabled on POWER9 DD2.0 processors onwards,
because prior versions of POWER9 and earlier hash table based POWER
processors do the TLB flush in the ptep_get_and_clear() function itself,
which prevents batching and the eventual flush completion later on.
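
That runtime requirement boils down to a single check, sketched below
(illustrative only; ppc_can_defer_tlb_flush() is a hypothetical name,
radix_enabled() and CPU_FTR_POWER9_DD1 are the real interfaces the patch
uses):

static inline bool ppc_can_defer_tlb_flush(void)
{
	/* radix MMU only, and not the POWER9 DD1 (pre-DD2.0) parts */
	return radix_enabled() && !cpu_has_feature(CPU_FTR_POWER9_DD1);
}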

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/Kconfig                |  1 +
 arch/powerpc/include/asm/tlbbatch.h | 30 +++++++++++++++++++++++
 arch/powerpc/include/asm/tlbflush.h |  3 +++
 arch/powerpc/mm/tlb-radix.c         | 49 +++++++++++++++++++++++++++++++++++++
 4 files changed, 83 insertions(+)
 create mode 100644 arch/powerpc/include/asm/tlbbatch.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 809c468..f06b565 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -230,6 +230,7 @@ config PPC
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select VIRT_TO_BUS			if !PPC64
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if (PPC64 && PPC_BOOK3S)
 	#
 	# Please keep this list sorted alphabetically.
 	#
diff --git a/arch/powerpc/include/asm/tlbbatch.h b/arch/powerpc/include/asm/tlbbatch.h
new file mode 100644
index 0000000..fc762ef
--- /dev/null
+++ b/arch/powerpc/include/asm/tlbbatch.h
@@ -0,0 +1,30 @@
+#ifndef _ARCH_POWERPC_TLBBATCH_H
+#define _ARCH_POWERPC_TLBBATCH_H
+
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+
+#define MAX_BATCHED_MM 1024
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * Each bit set is a CPU that potentially has a
+	 * TLB entry for one of the PFNs being flushed.
+	 * A CPU set here will locally flush every one
+	 * of the deferred struct mm below.
+	 */
+	struct cpumask cpumask;
+
+	/* All the deferred struct mm */
+	struct mm_struct *mm[MAX_BATCHED_MM];
+	unsigned long int nr_mm;
+	
+};
+
+extern bool arch_tlbbatch_should_defer(struct mm_struct *mm);
+extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+extern void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
+					struct mm_struct *mm);
+#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
+#endif /* _ARCH_POWERPC_TLBBATCH_H */
diff --git a/arch/powerpc/include/asm/tlbflush.h b/arch/powerpc/include/asm/tlbflush.h
index 13dbcd4..2041923 100644
--- a/arch/powerpc/include/asm/tlbflush.h
+++ b/arch/powerpc/include/asm/tlbflush.h
@@ -20,6 +20,9 @@
  */
 #ifdef __KERNEL__
 
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_PPC_MMU_NOHASH
 /*
  * TLB flushing for software loaded TLB chips
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index b3e849c..506e7ed 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -12,6 +12,8 @@
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
 #include <linux/memblock.h>
+#include <linux/mutex.h>
+#include <linux/smp.h>
 
 #include <asm/ppc-opcode.h>
 #include <asm/tlb.h>
@@ -519,3 +521,50 @@ extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(radix_kvm_prefetch_workaround);
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
+
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+static void clear_tlb(void *data)
+{
+	struct arch_tlbflush_unmap_batch *batch = data;
+	int i;
+
+	WARN_ON(!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1));
+
+	for (i = 0; i < batch->nr_mm; i++) {
+		if (batch->mm[i])
+			radix__local_flush_tlb_mm(batch->mm[i]);
+	}
+}
+
+void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	WARN_ON(!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1));
+
+	smp_call_function_many(&batch->cpumask, clear_tlb, batch, 1);
+	batch->nr_mm = 0;
+	cpumask_clear(&batch->cpumask);
+}
+
+void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, struct mm_struct *mm)
+{
+	WARN_ON(!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1));
+
+	/* Record this mm unless the batch array is already full */
+	if (batch->nr_mm < MAX_BATCHED_MM)
+		batch->mm[batch->nr_mm++] = mm;
+	else
+		pr_err("Deferred TLB flush: missed a struct mm\n");
+	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+}
+
+bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	if (!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1))
+		return false;
+
+	if (!mm_is_thread_local(mm))
+		return true;
+
+	return false;
+}
+#endif
-- 
1.8.3.1

* Re: [RFC 2/2] powerpc/mm: Enable deferred flushing of TLB during reclaim
  2017-11-01 10:17 ` [RFC 2/2] powerpc/mm: Enable deferred flushing of TLB during reclaim Anshuman Khandual
@ 2017-11-02  4:02   ` Anshuman Khandual
  0 siblings, 0 replies; 4+ messages in thread
From: Anshuman Khandual @ 2017-11-02  4:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: aneesh.kumar, npiggin

On 11/01/2017 03:47 PM, Anshuman Khandual wrote:
> [ full patch trimmed, clear_tlb() from tlb-radix.c quoted below ]
> 
> +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> +static void clear_tlb(void *data)
> +{
> +	struct arch_tlbflush_unmap_batch *batch = data;
> +	int i;
> +
> +	WARN_ON(!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1));
> +
> +	for (i = 0; i < batch->nr_mm; i++) {
> +		if (batch->mm[i])
> +			radix__local_flush_tlb_mm(batch->mm[i]);
> +	}
> +}

Instead of flushing each affected 'struct mm' on the CPU, we can just
flush the CPU's entire TLB for the given partition (LPID). But this does
not really give any improvement over the flushing mechanism described
above.

diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index 506e7ed..b0eb218 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -525,15 +525,8 @@ extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 static void clear_tlb(void *data)
 {
-       struct arch_tlbflush_unmap_batch *batch = data;
-       int i;
-
        WARN_ON(!radix_enabled() || cpu_has_feature(CPU_FTR_POWER9_DD1));
-
-       for (i = 0; i < batch->nr_mm; i++) {
-               if (batch->mm[i])
-                       radix__local_flush_tlb_mm(batch->mm[i]);
-       }
+       cur_cpu_spec->flush_tlb(TLB_INVAL_SCOPE_LPID);
 }

$time ./run case-lru-file-mmap-read

real    4m15.766s
user    108m6.967s
sys     393m15.152s
