* [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-03 14:37 Aneesh Kumar K.V
  2021-08-04  5:14 ` Nicholas Piggin
  0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2021-08-03 14:37 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, npiggin

With shared mappings, even though we are unmapping a large range, the kernel
will force a TLB flush with the ptl lock held to avoid the race mentioned in
commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
This results in the kernel issuing a high number of TLB flushes even for a large
range. This can be improved by making sure the kernel switches to a pid-based
flush when it is unmapping a 2M range.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..21d0f098e43b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
  * invalidating a full PID, so it has a far lower threshold to change from
  * individual page flushes to full-pid flushes.
  */
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
+static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
 static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
@@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	if (fullmm)
 		flush_pid = true;
 	else if (type == FLUSH_TYPE_GLOBAL)
-		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
 	else
 		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
 	/*
@@ -1335,7 +1335,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	if (fullmm)
 		flush_pid = true;
 	else if (type == FLUSH_TYPE_GLOBAL)
-		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
 	else
 		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
 
@@ -1505,7 +1505,7 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
 			continue;
 
 		nr_pages = (end - start) >> def->shift;
-		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
 
 		/*
 		 * If the number of pages spanning the range is above
-- 
2.31.1



* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-03 14:37 [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE Aneesh Kumar K.V
@ 2021-08-04  5:14 ` Nicholas Piggin
  2021-08-04  6:39     ` Nicholas Piggin
  2021-08-04  6:59   ` Michael Ellerman
  0 siblings, 2 replies; 12+ messages in thread
From: Nicholas Piggin @ 2021-08-04  5:14 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe

Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
> With shared mapping, even though we are unmapping a large range, the kernel
> will force a TLB flush with ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switch to pid based flush if the
> kernel is unmapping a 2M range.

It would be good to have a bit more description here.

In any patch that changes a heuristic like this, I would like to see 
some justification or reasoning that could be refuted or used as a 
supporting argument if we ever wanted to change the heuristic later.
Ideally with some of the obvious downsides listed as well.

This "improves" things here, but what if it hurt things elsewhere, how 
would we come in later and decide to change it back?

THP flushes for example, I think now they'll do PID flushes (if they 
have to be broadcast, which they will tend to be when khugepaged does
them). So now that might increase jitter for THP and cause it to be a
loss for more workloads.

So where do you notice this? What's the benefit?

> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>   * invalidating a full PID, so it has a far lower threshold to change from
>   * individual page flushes to full-pid flushes.
>   */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>  
>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>  	if (fullmm)
>  		flush_pid = true;
>  	else if (type == FLUSH_TYPE_GLOBAL)
> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;

Arguably >= is nicer than > here, but this shouldn't be in the same 
patch as the value change.

>  	else
>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;

And it should change everything to be consistent. Although I'm not sure 
it's worth changing even though I highly doubt any administrator would
be tweaking this.

Thanks,
Nick

>  	/*
> @@ -1335,7 +1335,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
>  	if (fullmm)
>  		flush_pid = true;
>  	else if (type == FLUSH_TYPE_GLOBAL)
> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>  	else
>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>  
> @@ -1505,7 +1505,7 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
>  			continue;
>  
>  		nr_pages = (end - start) >> def->shift;
> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>  
>  		/*
>  		 * If the number of pages spanning the range is above
> -- 
> 2.31.1
> 
> 


* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-04  5:14 ` Nicholas Piggin
@ 2021-08-04  6:39     ` Nicholas Piggin
  2021-08-04  6:59   ` Michael Ellerman
  1 sibling, 0 replies; 12+ messages in thread
From: Nicholas Piggin @ 2021-08-04  6:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe
  Cc: Will Deacon, Peter Zijlstra, linux-arch, linux-mm

Excerpts from Nicholas Piggin's message of August 4, 2021 3:14 pm:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
> 
> It would be good to have a bit more description here.
> 
> In any patch that changes a heuristic like this, I would like to see 
> some justification or reasoning that could be refuted or used as a 
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
> 
> This "improves" things here, but what if it hurt things elsewhere, how 
> would we come in later and decide to change it back?
> 
> THP flushes for example, I think now they'll do PID flushes (if they 
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
> 
> So where do you notice this? What's the benefit?

For that matter, I wonder if we shouldn't do something like this 
(untested) so the low level batch flush has visibility to the high
level flush range.

x86 could use this too AFAIKS, just needs to pass the range a bit
further down, but in practice I'm not sure it would ever really
matter for them because the 2MB level exceeds the single page flush
ceiling for 4k pages unlike powerpc with 64k pages. But in corner
cases where the unmap crossed a bunch of small vmas or the ceiling
was increased then in theory it could be of use.

Subject: [PATCH v1] mm/mmu_gather: provide archs with the entire range that is
 to be flushed, not just the particular gather

This allows archs to optimise flushing heuristics better, in the face of
flush operations forcing smaller flush batches. For example, an
architecture may choose a more costly per-page invalidation for small
ranges of pages with the assumption that the full TLB flush cost would
be more expensive in terms of refills. However if a very large range is
forced into flushing small ranges, the faster full-process flush may
have been the better choice.

---
 arch/powerpc/mm/book3s64/radix_tlb.c | 33 ++++++++++++++++------------
 fs/exec.c                            |  3 ++-
 include/asm-generic/tlb.h            |  9 ++++++++
 include/linux/mm_types.h             |  3 ++-
 mm/hugetlb.c                         |  2 +-
 mm/madvise.c                         |  6 ++---
 mm/memory.c                          |  4 ++--
 mm/mmap.c                            |  2 +-
 mm/mmu_gather.c                      | 10 ++++++---
 mm/oom_kill.c                        |  2 +-
 10 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..e1072d85d72e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1110,12 +1110,13 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
 
 static inline void __radix__flush_tlb_range(struct mm_struct *mm,
-					    unsigned long start, unsigned long end)
+					    unsigned long start, unsigned long end,
+					    unsigned long entire_range)
 {
 	unsigned long pid;
 	unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
 	unsigned long page_size = 1UL << page_shift;
-	unsigned long nr_pages = (end - start) >> page_shift;
+	unsigned long entire_nr_pages = entire_range >> page_shift;
 	bool fullmm = (end == TLB_FLUSH_ALL);
 	bool flush_pid, flush_pwc = false;
 	enum tlb_flush_type type;
@@ -1133,9 +1134,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	if (fullmm)
 		flush_pid = true;
 	else if (type == FLUSH_TYPE_GLOBAL)
-		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+		flush_pid = entire_nr_pages > tlb_single_page_flush_ceiling;
 	else
-		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+		flush_pid = entire_nr_pages > tlb_local_single_page_flush_ceiling;
 	/*
 	 * full pid flush already does the PWC flush. if it is not full pid
 	 * flush check the range is more than PMD and force a pwc flush
@@ -1220,7 +1221,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
 		return radix__flush_hugetlb_tlb_range(vma, start, end);
 #endif
 
-	__radix__flush_tlb_range(vma->vm_mm, start, end);
+	__radix__flush_tlb_range(vma->vm_mm, start, end, end - start);
 }
 EXPORT_SYMBOL(radix__flush_tlb_range);
 
@@ -1278,6 +1279,11 @@ void radix__flush_all_lpid_guest(unsigned int lpid)
 	_tlbie_lpid_guest(lpid, RIC_FLUSH_ALL);
 }
 
+static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
+				unsigned long start, unsigned long end,
+				unsigned long entire_range,
+				int psize, bool also_pwc);
+
 void radix__tlb_flush(struct mmu_gather *tlb)
 {
 	int psize = 0;
@@ -1285,6 +1291,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 	int page_size = tlb->page_size;
 	unsigned long start = tlb->start;
 	unsigned long end = tlb->end;
+	unsigned long entire_range = tlb->entire_end - tlb->entire_start;
 
 	/*
 	 * if page size is not something we understand, do a full mm flush
@@ -1301,21 +1308,19 @@ void radix__tlb_flush(struct mmu_gather *tlb)
 		else
 			radix__flush_all_mm(mm);
 	} else {
-		if (!tlb->freed_tables)
-			radix__flush_tlb_range_psize(mm, start, end, psize);
-		else
-			radix__flush_tlb_pwc_range_psize(mm, start, end, psize);
+		__radix__flush_tlb_range_psize(mm, start, end, entire_range, psize, tlb->freed_tables);
 	}
 }
 
 static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 				unsigned long start, unsigned long end,
+				unsigned long entire_range,
 				int psize, bool also_pwc)
 {
 	unsigned long pid;
 	unsigned int page_shift = mmu_psize_defs[psize].shift;
 	unsigned long page_size = 1UL << page_shift;
-	unsigned long nr_pages = (end - start) >> page_shift;
+	unsigned long entire_nr_pages = entire_range >> page_shift;
 	bool fullmm = (end == TLB_FLUSH_ALL);
 	bool flush_pid;
 	enum tlb_flush_type type;
@@ -1335,9 +1340,9 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	if (fullmm)
 		flush_pid = true;
 	else if (type == FLUSH_TYPE_GLOBAL)
-		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+		flush_pid = entire_nr_pages > tlb_single_page_flush_ceiling;
 	else
-		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+		flush_pid = entire_nr_pages > tlb_local_single_page_flush_ceiling;
 
 	if (!mmu_has_feature(MMU_FTR_GTSE) && type == FLUSH_TYPE_GLOBAL) {
 		unsigned long tgt = H_RPTI_TARGET_CMMU;
@@ -1381,13 +1386,13 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
 				  unsigned long end, int psize)
 {
-	return __radix__flush_tlb_range_psize(mm, start, end, psize, false);
+	return __radix__flush_tlb_range_psize(mm, start, end, end - start, psize, false);
 }
 
 void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
 				      unsigned long end, int psize)
 {
-	__radix__flush_tlb_range_psize(mm, start, end, psize, true);
+	__radix__flush_tlb_range_psize(mm, start, end, end - start, psize, true);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..c769c12bdf56 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -705,11 +705,11 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 		return -ENOMEM;
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
+		tlb_gather_mmu(&tlb, mm, new_end, old_end);
 		free_pgd_range(&tlb, new_end, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
 	} else {
@@ -719,6 +719,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 		 * have constraints on va-space that make this illegal (IA64) -
 		 * for the others its just a little faster.
 		 */
+		tlb_gather_mmu(&tlb, mm, old_start, old_end);
 		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
 	}
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..857fd83af695 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -256,6 +256,15 @@ struct mmu_gather {
 	struct mmu_table_batch	*batch;
 #endif
 
+	/*
+	 * This is the range of the "entire" logical flush
+	 * operation being performed. It does not relate to
+	 * the current batch to flush, but it can inform
+	 * heuristics that choose the best flushing strategy.
+	 */
+	unsigned long		entire_start;
+	unsigned long		entire_end;
+
 	unsigned long		start;
 	unsigned long		end;
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b7cb46..9d2ff06b574c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -599,7 +599,8 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 }
 
 struct mmu_gather;
-extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
+extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+				unsigned long start, unsigned long end);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d5221d..e41106eb4df7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4458,7 +4458,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mmu_gather tlb;
 
-	tlb_gather_mmu(&tlb, vma->vm_mm);
+	tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
 	tlb_finish_mmu(&tlb);
 }
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..b3634672aeb9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -508,7 +508,7 @@ static long madvise_cold(struct vm_area_struct *vma,
 		return -EINVAL;
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
 	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
 	tlb_finish_mmu(&tlb);
 
@@ -561,7 +561,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
 		return 0;
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
+	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
 	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
 	tlb_finish_mmu(&tlb);
 
@@ -726,7 +726,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 				range.start, range.end);
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
+	tlb_gather_mmu(&tlb, mm, range.start, range.end);
 	update_hiwater_rss(mm);
 
 	mmu_notifier_invalidate_range_start(&range);
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..61c303e84baf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1647,7 +1647,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				start, start + size);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
+	tlb_gather_mmu(&tlb, vma->vm_mm, range.start, range.end);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
@@ -1674,7 +1674,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
 				address, address + size);
-	tlb_gather_mmu(&tlb, vma->vm_mm);
+	tlb_gather_mmu(&tlb, vma->vm_mm, address, range.end);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	unmap_single_vma(&tlb, vma, address, range.end, details);
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..f2808febde40 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2672,7 +2672,7 @@ static void unmap_region(struct mm_struct *mm,
 	struct mmu_gather tlb;
 
 	lru_add_drain();
-	tlb_gather_mmu(&tlb, mm);
+	tlb_gather_mmu(&tlb, mm, start, end);
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end);
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 1b9837419bf9..863a5bd7e650 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -250,9 +250,12 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
 }
 
 static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+			     unsigned long start, unsigned long end,
 			     bool fullmm)
 {
 	tlb->mm = mm;
+	tlb->entire_start = start;
+	tlb->entire_end = end;
 	tlb->fullmm = fullmm;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -281,9 +284,10 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
  * Called to initialize an (on-stack) mmu_gather structure for page-table
  * tear-down from @mm.
  */
-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
+void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+		    unsigned long start, unsigned long end)
 {
-	__tlb_gather_mmu(tlb, mm, false);
+	__tlb_gather_mmu(tlb, mm, start, end, false);
 }
 
 /**
@@ -299,7 +303,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
  */
 void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
 {
-	__tlb_gather_mmu(tlb, mm, true);
+	__tlb_gather_mmu(tlb, mm, 0, ~0UL, true);
 }
 
 /**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..bfcc9cbdfb20 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -545,7 +545,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 			mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
 						vma, mm, vma->vm_start,
 						vma->vm_end);
-			tlb_gather_mmu(&tlb, mm);
+			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 			if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
 				tlb_finish_mmu(&tlb);
 				ret = false;
-- 
2.23.0



* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-04  5:14 ` Nicholas Piggin
  2021-08-04  6:39     ` Nicholas Piggin
@ 2021-08-04  6:59   ` Michael Ellerman
  1 sibling, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2021-08-04  6:59 UTC (permalink / raw)
  To: Nicholas Piggin, Aneesh Kumar K.V, linuxppc-dev

Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>
> It would be good to have a bit more description here.
>
> In any patch that changes a heuristic like this, I would like to see 
> some justification or reasoning that could be refuted or used as a 
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
>
> This "improves" things here, but what if it hurt things elsewhere, how 
> would we come in later and decide to change it back?
>
> THP flushes for example, I think now they'll do PID flushes (if they 
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
>
> So where do you notice this? What's the benefit?

Ack. Needs some numbers and supporting evidence.

>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>   * invalidating a full PID, so it has a far lower threshold to change from
>>   * individual page flushes to full-pid flushes.
>>   */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>  
>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>  	if (fullmm)
>>  		flush_pid = true;
>>  	else if (type == FLUSH_TYPE_GLOBAL)
>> -		flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> +		flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>
> Arguably >= is nicer than > here, but this shouldn't be in the same 
> patch as the value change.
>
>>  	else
>>  		flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> And it should change everything to be consistent. Although I'm not sure 
> it's worth changing even though I highly doubt any administrator would
> be tweaking this.

This made me look at how an administrator tweaks these thresholds, and
AFAICS the answer is "recompile the kernel"?

It looks like x86 have a debugfs file for tlb_single_page_flush_ceiling,
but we don't. I guess we meant to copy that but never did?

So at the moment both thresholds could just be #defines.

Making them tweakable at runtime would be nice, it would give us an
escape hatch if we ever hit a workload in production that wants a
different value.
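
A minimal sketch of what that could look like in radix_tlb.c (untested,
and assuming the existing powerpc_debugfs_root dentry plus
debugfs_create_ulong(); the file names and permissions are just
illustrative):

static int __init radix_tlb_debugfs_init(void)
{
	/*
	 * Expose both ceilings under /sys/kernel/debug/powerpc/ so they
	 * can be tuned at runtime rather than by recompiling.
	 */
	debugfs_create_ulong("tlb_single_page_flush_ceiling", 0600,
			     powerpc_debugfs_root,
			     &tlb_single_page_flush_ceiling);
	debugfs_create_ulong("tlb_local_single_page_flush_ceiling", 0600,
			     powerpc_debugfs_root,
			     &tlb_local_single_page_flush_ceiling);
	return 0;
}
device_initcall(radix_tlb_debugfs_init);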

cheers


* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-04  6:39     ` Nicholas Piggin
@ 2021-08-04  7:34       ` Peter Zijlstra
  -1 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2021-08-04  7:34 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Aneesh Kumar K.V, linuxppc-dev, mpe, Will Deacon, linux-arch, linux-mm

On Wed, Aug 04, 2021 at 04:39:44PM +1000, Nicholas Piggin wrote:
> For that matter, I wonder if we shouldn't do something like this 
> (untested) so the low level batch flush has visibility to the high
> level flush range.
> 
> x86 could use this too AFAIKS, just needs to pass the range a bit
> further down, but in practice I'm not sure it would ever really
> matter for them because the 2MB level exceeds the single page flush
> ceiling for 4k pages unlike powerpc with 64k pages. But in corner
> cases where the unmap crossed a bunch of small vmas or the ceiling
> was increased then in theory it could be of use.

x86 can do single invalidates for 2M pages if that's the only type
encountered in the range, at which point we can do 33*2M = 66M, which is
well below the 1G range for the next level of huge.

> Subject: [PATCH v1] mm/mmu_gather: provide archs with the entire range that is
>  to be flushed, not just the particular gather
> 
> This allows archs to optimise flushing heuristics better, in the face of
> flush operations forcing smaller flush batches. For example, an
> architecture may choose a more costly per-page invalidation for small
> ranges of pages with the assumption that the full TLB flush cost would
> be more expensive in terms of refills. However if a very large range is
> forced into flushing small ranges, the faster full-process flush may
> have been the better choice.

What is splitting up the flushes for you? That page_size power-special?


* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-12 13:20   ` Aneesh Kumar K.V
@ 2021-08-16  7:03     ` Michael Ellerman
  0 siblings, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2021-08-16  7:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Puvichakravarthy Ramachandran; +Cc: linuxppc-dev, npiggin

"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 8/12/21 6:19 PM, Michael Ellerman wrote:
>> "Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>>>> With shared mapping, even though we are unmapping a large range, the kernel
>>>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>>> range. This can be improved by making sure the kernel switch to pid based flush if the
>>>> kernel is unmapping a 2M range.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>>   arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> index aefc100d79a7..21d0f098e43b 100644
>>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>>    * invalidating a full PID, so it has a far lower threshold to change > from
>>>>    * individual page flushes to full-pid flushes.
>>>>    */
>>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>>   static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = POWER9_TLB_SETS_RADIX * 2;
>>>>
>>>>   static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > mm_struct *mm,
>>>>        if (fullmm)
>>>>                flush_pid = true;
>>>>        else if (type == FLUSH_TYPE_GLOBAL)
>>>> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>>> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>>        else
>>>>                flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>>
>>> Additional details on the test environment. This was tested on a 2 Node/8
>>> socket Power10 system.
>>> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>>   Rate of work: = 176
>>> #           time             counts unit events
>>>       1.029206442         4198594519      cycles
>>>       1.029206442         2458254252      instructions              # 0.59 insn per cycle
>>>       1.029206442         3004031488      PM_EXEC_STALL
>>>       1.029206442         1798186036      PM_EXEC_STALL_TLBIE
>>>   Rate of work: = 181
>>>       2.054288539         4183883450      cycles
>>>       2.054288539         2472178171      instructions              # 0.59 insn per cycle
>>>       2.054288539         3014609313      PM_EXEC_STALL
>>>       2.054288539         1797851642      PM_EXEC_STALL_TLBIE
>>>   Rate of work: = 180
>>>       3.078306883         4171250717      cycles
>>>       3.078306883         2468341094      instructions              # 0.59 insn per cycle
>>>       3.078306883         2993036205      PM_EXEC_STALL
>>>       3.078306883         1798181890      PM_EXEC_STALL_TLBIE
>>> .
>>> .
>>>
>>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>> 34
>>>
>>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>>   Rate of work: = 313
>>> #           time             counts unit events
>>>       1.030310506         4206071143      cycles
>>>       1.030310506         4314716958      instructions              # 1.03 insn per cycle
>>>       1.030310506         2157762167      PM_EXEC_STALL
>>>       1.030310506          110825573      PM_EXEC_STALL_TLBIE
>>>   Rate of work: = 322
>>>       2.056034068         4331745630      cycles
>>>       2.056034068         4531658304      instructions              # 1.05 insn per cycle
>>>       2.056034068         2288971361      PM_EXEC_STALL
>>>       2.056034068          111267927      PM_EXEC_STALL_TLBIE
>>>   Rate of work: = 321
>>>       3.081216434         4327050349      cycles
>>>       3.081216434         4379679508      instructions              # 1.01 insn per cycle
>>>       3.081216434         2252602550      PM_EXEC_STALL
>>>       3.081216434          110974887      PM_EXEC_STALL_TLBIE
>> 
>> 
>> What is the tlbie test actually doing?
>> 
>> Does it do anything to measure the cost of refilling after the full mm flush?
>
> That is essentially
>
> for ()
> {
>    shmat()
>    fillshm()
>    shmdt()
>
> }
>
> for a 256MB range. So it is not really a fair benchmark because it 
> doesn't take into account the impact of throwing away the full pid 
> translation. But even then the TLBIE stalls is an important data point?

Choosing the ceiling is a trade-off, and this test only measures one
side of the trade-off.

It tells us that the actual time taken to execute the full flush is less
than doing 32 individual flushes, but that's not the full story.

To decide I think we need some numbers for some more "real" workloads,
to at least see that there's no change, or preferably some improvement.

Another interesting test might be to do the shmat/fillshm/shmdt, and
then chase some pointers to provoke TLB misses. Then we could work out
the relative cost of TLB misses vs the time to do the flush.
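
Rough sketch of the pointer-chase half (illustrative only; the side
buffer size, the 64K stride and the timing are assumptions, and it
would be run before and after the shmat/fillshm/shmdt loop so the two
walk times can be compared):

#include <time.h>

#define CHASE_SZ	(64UL << 20)	/* side buffer, not the shm segment */
#define STRIDE		(64UL * 1024)	/* one hop per 64K page */

/* Lay out one pointer per page so every hop needs a fresh translation. */
static void **build_chain(char *buf)
{
	unsigned long i, n = CHASE_SZ / STRIDE;

	for (i = 0; i < n; i++)
		*(void **)(buf + i * STRIDE) =
			(i + 1 < n) ? buf + (i + 1) * STRIDE : NULL;
	return (void **)buf;
}

/* Walk the chain; the loads feed the loop condition so they can't be
 * optimised away. Returns elapsed nanoseconds. */
static unsigned long chase(void **p)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	while (p)
		p = *p;
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) * 1000000000UL +
	       (b.tv_nsec - a.tv_nsec);
}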

cheers


* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-12 12:49 ` Michael Ellerman
@ 2021-08-12 13:20   ` Aneesh Kumar K.V
  2021-08-16  7:03     ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2021-08-12 13:20 UTC (permalink / raw)
  To: Michael Ellerman, Puvichakravarthy Ramachandran; +Cc: linuxppc-dev, npiggin

On 8/12/21 6:19 PM, Michael Ellerman wrote:
> "Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>>> With shared mapping, even though we are unmapping a large range, the kernel
>>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>> range. This can be improved by making sure the kernel switch to pid based flush if the
>>> kernel is unmapping a 2M range.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>>   arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> index aefc100d79a7..21d0f098e43b 100644
>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>    * invalidating a full PID, so it has a far lower threshold to change > from
>>>    * individual page flushes to full-pid flushes.
>>>    */
>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>   static unsigned long tlb_local_single_page_flush_ceiling __read_mostly > = POWER9_TLB_SETS_RADIX * 2;
>>>
>>>   static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct > mm_struct *mm,
>>>        if (fullmm)
>>>                flush_pid = true;
>>>        else if (type == FLUSH_TYPE_GLOBAL)
>>> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>        else
>>>                flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>
>> Additional details on the test environment. This was tested on a 2 Node/8
>> socket Power10 system.
>> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>   Rate of work: = 176
>> #           time             counts unit events
>>       1.029206442         4198594519      cycles
>>       1.029206442         2458254252      instructions              # 0.59 insn per cycle
>>       1.029206442         3004031488      PM_EXEC_STALL
>>       1.029206442         1798186036      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 181
>>       2.054288539         4183883450      cycles
>>       2.054288539         2472178171      instructions              # 0.59 insn per cycle
>>       2.054288539         3014609313      PM_EXEC_STALL
>>       2.054288539         1797851642      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 180
>>       3.078306883         4171250717      cycles
>>       3.078306883         2468341094      instructions              # 0.59 insn per cycle
>>       3.078306883         2993036205      PM_EXEC_STALL
>>       3.078306883         1798181890      PM_EXEC_STALL_TLBIE
>> .
>> .
>>
>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>> 34
>>
>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>>   Rate of work: = 313
>> #           time             counts unit events
>>       1.030310506         4206071143      cycles
>>       1.030310506         4314716958      instructions              # 1.03 insn per cycle
>>       1.030310506         2157762167      PM_EXEC_STALL
>>       1.030310506          110825573      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 322
>>       2.056034068         4331745630      cycles
>>       2.056034068         4531658304      instructions              # 1.05 insn per cycle
>>       2.056034068         2288971361      PM_EXEC_STALL
>>       2.056034068          111267927      PM_EXEC_STALL_TLBIE
>>   Rate of work: = 321
>>       3.081216434         4327050349      cycles
>>       3.081216434         4379679508      instructions              # 1.01 insn per cycle
>>       3.081216434         2252602550      PM_EXEC_STALL
>>       3.081216434          110974887      PM_EXEC_STALL_TLBIE
> 
> 
> What is the tlbie test actually doing?
> 
> Does it do anything to measure the cost of refilling after the full mm flush?
> 


That is essentially

for ()
{
   shmat()
   fillshm()
   shmdt()

}

for a 256MB range. So it is not really a fair benchmark because it 
doesn't take into account the impact of throwing away the full pid 
translation. But even then the TLBIE stalls are an important data point?
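
For reference, a rough standalone version of that loop (illustrative
only, error handling omitted; the real tlbie test also takes
iteration/CPU/thread arguments):

#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define SHM_SZ	(256UL << 20)	/* 256MB shared mapping */

static void shm_cycle(int iterations)
{
	int i, shmid = shmget(IPC_PRIVATE, SHM_SZ, IPC_CREAT | 0600);

	for (i = 0; i < iterations; i++) {
		char *p = shmat(shmid, NULL, 0);

		memset(p, 0xa5, SHM_SZ);	/* fillshm(): touch every page */
		shmdt(p);			/* unmap -> the flushes under test */
	}
	shmctl(shmid, IPC_RMID, NULL);
}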

-aneesh




* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
  2021-08-06  7:56 Puvichakravarthy Ramachandran
@ 2021-08-12 12:49 ` Michael Ellerman
  2021-08-12 13:20   ` Aneesh Kumar K.V
  0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2021-08-12 12:49 UTC (permalink / raw)
  To: Puvichakravarthy Ramachandran, aneesh.kumar; +Cc: linuxppc-dev, npiggin

"Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>> 
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c > b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>   * invalidating a full PID, so it has a far lower threshold to change from
>>   * individual page flushes to full-pid flushes.
>>   */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>> 
>>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>       if (fullmm)
>>               flush_pid = true;
>>       else if (type == FLUSH_TYPE_GLOBAL)
>> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>       else
>>               flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> Additional details on the test environment. This was tested on a 2 Node/8 
> socket Power10 system.
> The LPAR had 105 cores and the LPAR spanned across all the sockets. 
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 176 
> #           time             counts unit events
>      1.029206442         4198594519      cycles  
>      1.029206442         2458254252      instructions              # 0.59 insn per cycle 
>      1.029206442         3004031488      PM_EXEC_STALL   
>      1.029206442         1798186036      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 181 
>      2.054288539         4183883450      cycles  
>      2.054288539         2472178171      instructions              # 0.59 insn per cycle 
>      2.054288539         3014609313      PM_EXEC_STALL   
>      2.054288539         1797851642      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 180 
>      3.078306883         4171250717      cycles  
>      3.078306883         2468341094      instructions              # 0.59 insn per cycle 
>      3.078306883         2993036205      PM_EXEC_STALL   
>      3.078306883         1798181890      PM_EXEC_STALL_TLBIE   
> .
> . 
>
> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
> 34
>
> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>
> # perf stat -I 1000 -a -e cycles,instructions -e 
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
>  Rate of work: = 313 
> #           time             counts unit events
>      1.030310506         4206071143      cycles  
>      1.030310506         4314716958      instructions              # 1.03 insn per cycle 
>      1.030310506         2157762167      PM_EXEC_STALL   
>      1.030310506          110825573      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 322 
>      2.056034068         4331745630      cycles  
>      2.056034068         4531658304      instructions              # 1.05 insn per cycle 
>      2.056034068         2288971361      PM_EXEC_STALL   
>      2.056034068          111267927      PM_EXEC_STALL_TLBIE   
>  Rate of work: = 321 
>      3.081216434         4327050349      cycles  
>      3.081216434         4379679508      instructions              # 1.01 insn per cycle 
>      3.081216434         2252602550      PM_EXEC_STALL   
>      3.081216434          110974887      PM_EXEC_STALL_TLBIE   


What is the tlbie test actually doing?

Does it do anything to measure the cost of refilling after the full mm flush?

cheers

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-06  7:56 Puvichakravarthy Ramachandran
  2021-08-12 12:49 ` Michael Ellerman
  0 siblings, 1 reply; 12+ messages in thread
From: Puvichakravarthy Ramachandran @ 2021-08-06  7:56 UTC (permalink / raw)
  To: aneesh.kumar; +Cc: linuxppc-dev, npiggin

> With shared mapping, even though we are unmapping a large range, the kernel
> will force a TLB flush with ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switch to pid based flush if the
> kernel is unmapping a 2M range.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>   * invalidating a full PID, so it has a far lower threshold to change from
>   * individual page flushes to full-pid flushes.
>   */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
> 
>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>       if (fullmm)
>               flush_pid = true;
>       else if (type == FLUSH_TYPE_GLOBAL)
> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>       else
>               flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;

Additional details on the test environment. This was tested on a 2 Node/8 
socket Power10 system.
The LPAR had 105 cores and the LPAR spanned across all the sockets. 

# perf stat -I 1000 -a -e cycles,instructions -e 
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
 Rate of work: = 176 
#           time             counts unit events
     1.029206442         4198594519      cycles  
     1.029206442         2458254252      instructions              # 0.59 insn per cycle 
     1.029206442         3004031488      PM_EXEC_STALL   
     1.029206442         1798186036      PM_EXEC_STALL_TLBIE   
 Rate of work: = 181 
     2.054288539         4183883450      cycles  
     2.054288539         2472178171      instructions              # 0.59 insn per cycle 
     2.054288539         3014609313      PM_EXEC_STALL   
     2.054288539         1797851642      PM_EXEC_STALL_TLBIE   
 Rate of work: = 180 
     3.078306883         4171250717      cycles  
     3.078306883         2468341094      instructions              # 0.59 insn per cycle 
     3.078306883         2993036205      PM_EXEC_STALL   
     3.078306883         1798181890      PM_EXEC_STALL_TLBIE   
.
. 

# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
34

# echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling

# perf stat -I 1000 -a -e cycles,instructions -e 
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e 
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1  -t 1
 Rate of work: = 313 
#           time             counts unit events
     1.030310506         4206071143      cycles  
     1.030310506         4314716958      instructions              # 1.03 insn per cycle 
     1.030310506         2157762167      PM_EXEC_STALL   
     1.030310506          110825573      PM_EXEC_STALL_TLBIE   
 Rate of work: = 322 
     2.056034068         4331745630      cycles  
     2.056034068         4531658304      instructions              # 1.05 insn per cycle 
     2.056034068         2288971361      PM_EXEC_STALL   
     2.056034068          111267927      PM_EXEC_STALL_TLBIE   
 Rate of work: = 321 
     3.081216434         4327050349      cycles  
     3.081216434         4379679508      instructions              # 1.01 insn per cycle 
     3.081216434         2252602550      PM_EXEC_STALL   
     3.081216434          110974887      PM_EXEC_STALL_TLBIE   
.
.
 

Regards,
Puvichakravarthy Ramachandran





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-06  5:22 Puvichakravarthy Ramachandran
  0 siblings, 0 replies; 12+ messages in thread
From: Puvichakravarthy Ramachandran @ 2021-08-06  5:22 UTC (permalink / raw)
  To: aneesh.kumar; +Cc: linuxppc-dev, npiggin

> With shared mapping, even though we are unmapping a large range, the kernel
> will force a TLB flush with ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switch to pid based flush if the
> kernel is unmapping a 2M range.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>   * invalidating a full PID, so it has a far lower threshold to change from
>   * individual page flushes to full-pid flushes.
>   */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>  static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
> 
>  static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>       if (fullmm)
>               flush_pid = true;
>       else if (type == FLUSH_TYPE_GLOBAL)
> -             flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> +             flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>       else
>               flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
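
For context on the value 32: assuming the usual 64K base page size, where
PMD_SIZE on radix is 2MB, a PMD-sized unmap spans exactly 2MB / 64KB = 32
pages, so lowering the ceiling to 32 and switching the comparison to >= lets
that case take the PID flush path. A small userspace sketch of the decision
(illustrative only, constants assume a 64K page size) is:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	16			/* 64K base pages */
#define PMD_SIZE	(2UL << 20)		/* 2MB on radix with 64K pages */

static unsigned long tlb_single_page_flush_ceiling = 32;

/* Mirrors the flush_pid decision for FLUSH_TYPE_GLOBAL in the hunk above. */
static bool use_pid_flush(unsigned long start, unsigned long end)
{
	unsigned long nr_pages = (end - start) >> PAGE_SHIFT;

	return nr_pages >= tlb_single_page_flush_ceiling;
}

int main(void)
{
	/* A full PMD (2MB) unmap now upgrades to a PID flush ... */
	printf("2MB -> PID flush? %d\n", use_pid_flush(0, PMD_SIZE));
	/* ... while anything smaller still gets per-page tlbie. */
	printf("1MB -> PID flush? %d\n", use_pid_flush(0, PMD_SIZE / 2));
	return 0;
}

With the old ceiling of 33 and a strict >, the same 32-page unmap would have
issued 32 individual tlbie operations instead.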

I evaluated the patches from Aneesh with a micro benchmark which does 
shmat and shmdt of a 256 MB segment.
The higher the rate of work, the better the performance. With a value of 32, 
we match the performance of GTSE=off. This was evaluated on a SLES15 SP3 kernel.


# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling 
32

# perf stat -I 1000 -a -e powerpc:tlbie,r30058 ./tlbie -i 5 -c 1 -t 1
 Rate of work: = 311 
#           time             counts unit events
     1.013131404              50939      powerpc:tlbie   
     1.013131404              50658      r30058  
 Rate of work: = 318 
     2.026957019              51520      powerpc:tlbie   
     2.026957019              51481      r30058  
 Rate of work: = 318 
     3.038884431              51485      powerpc:tlbie   
     3.038884431              51461      r30058  
 Rate of work: = 318 
     4.051483926              51485      powerpc:tlbie   
     4.051483926              51520      r30058  
 Rate of work: = 318 
     5.063635713              48577      powerpc:tlbie   
     5.063635713              48347      r30058  
 
# echo 34 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling 

# perf stat -I 1000 -a -e powerpc:tlbie,r30058 ./tlbie -i 5 -c 1 -t 1
 Rate of work: = 174 
#           time             counts unit events
     1.012672696             721471      powerpc:tlbie   
     1.012672696             726491      r30058  
 Rate of work: = 177 
     2.026348661             737460      powerpc:tlbie   
     2.026348661             736138      r30058  
 Rate of work: = 178 
     3.037932122             737460      powerpc:tlbie   
     3.037932122             737460      r30058  
 Rate of work: = 178 
     4.050198819             737044      powerpc:tlbie   
     4.050198819             737460      r30058  
 Rate of work: = 177 
     5.062400776             692832      powerpc:tlbie   
     5.062400776             688319      r30058          


Regards,
Puvichakravarthy Ramachandran




^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread

Thread overview: 12+ messages
2021-08-03 14:37 [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE Aneesh Kumar K.V
2021-08-04  5:14 ` Nicholas Piggin
2021-08-04  6:39   ` Nicholas Piggin
2021-08-04  6:39     ` Nicholas Piggin
2021-08-04  7:34     ` Peter Zijlstra
2021-08-04  7:34       ` Peter Zijlstra
2021-08-04  6:59   ` Michael Ellerman
2021-08-06  5:22 Puvichakravarthy Ramachandran
2021-08-06  7:56 Puvichakravarthy Ramachandran
2021-08-12 12:49 ` Michael Ellerman
2021-08-12 13:20   ` Aneesh Kumar K.V
2021-08-16  7:03     ` Michael Ellerman
