* [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-03 14:37 Aneesh Kumar K.V
2021-08-04 5:14 ` Nicholas Piggin
0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2021-08-03 14:37 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, npiggin
With shared mappings, even though we are unmapping a large range, the kernel
will force a TLB flush with the ptl lock held to avoid the race mentioned in
commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
This results in the kernel issuing a high number of TLB flushes even for a large
range. This can be improved by making sure the kernel switches to a pid-based
flush when it is unmapping a 2M (or larger) range.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..21d0f098e43b 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
* invalidating a full PID, so it has a far lower threshold to change from
* individual page flushes to full-pid flushes.
*/
-static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
+static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
static inline void __radix__flush_tlb_range(struct mm_struct *mm,
@@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
/*
@@ -1335,7 +1335,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
else
flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
@@ -1505,7 +1505,7 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
continue;
nr_pages = (end - start) >> def->shift;
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
/*
* If the number of pages spanning the range is above
--
2.31.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-03 14:37 [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE Aneesh Kumar K.V
@ 2021-08-04 5:14 ` Nicholas Piggin
2021-08-04 6:39 ` Nicholas Piggin
2021-08-04 6:59 ` Michael Ellerman
0 siblings, 2 replies; 12+ messages in thread
From: Nicholas Piggin @ 2021-08-04 5:14 UTC (permalink / raw)
To: Aneesh Kumar K.V, linuxppc-dev, mpe
Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
> With shared mappings, even though we are unmapping a large range, the kernel
> will force a TLB flush with the ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switches to a pid-based
> flush when it is unmapping a 2M (or larger) range.
It would be good to have a bit more description here.
In any patch that changes a heuristic like this, I would like to see
some justification or reasoning that could be refuted or used as a
supporting argument if we ever wanted to change the heuristic later.
Ideally with some of the obvious downsides listed as well.
This "improves" things here, but what if it hurts things elsewhere? How
would we come in later and decide to change it back?
THP flushes for example, I think now they'll do PID flushes (if they
have to be broadcast, which they will tend to be when khugepaged does
them). So now that might increase jitter for THP and cause it to be a
loss for more workloads.
So where do you notice this? What's the benefit?
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
> * invalidating a full PID, so it has a far lower threshold to change from
> * individual page flushes to full-pid flushes.
> */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>
> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> if (fullmm)
> flush_pid = true;
> else if (type == FLUSH_TYPE_GLOBAL)
> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
Arguably >= is nicer than > here, but this shouldn't be in the same
patch as the value change.
> else
> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
And it should change everything to be consistent. Although I'm not sure
it's worth changing at all, since I highly doubt any administrator would
be tweaking this.
Thanks,
Nick
> /*
> @@ -1335,7 +1335,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
> if (fullmm)
> flush_pid = true;
> else if (type == FLUSH_TYPE_GLOBAL)
> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
> else
> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> @@ -1505,7 +1505,7 @@ void do_h_rpt_invalidate_prt(unsigned long pid, unsigned long lpid,
> continue;
>
> nr_pages = (end - start) >> def->shift;
> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>
> /*
> * If the number of pages spanning the range is above
> --
> 2.31.1
>
>
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-04 5:14 ` Nicholas Piggin
@ 2021-08-04 6:39 ` Nicholas Piggin
2021-08-04 6:59 ` Michael Ellerman
1 sibling, 0 replies; 12+ messages in thread
From: Nicholas Piggin @ 2021-08-04 6:39 UTC (permalink / raw)
To: Aneesh Kumar K.V, linuxppc-dev, mpe
Cc: Will Deacon, Peter Zijlstra, linux-arch, linux-mm
Excerpts from Nicholas Piggin's message of August 4, 2021 3:14 pm:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mappings, even though we are unmapping a large range, the kernel
>> will force a TLB flush with the ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts").
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switches to a pid-based
>> flush when it is unmapping a 2M (or larger) range.
>
> It would be good to have a bit more description here.
>
> In any patch that changes a heuristic like this, I would like to see
> some justification or reasoning that could be refuted or used as a
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
>
> This "improves" things here, but what if it hurts things elsewhere? How
> would we come in later and decide to change it back?
>
> THP flushes for example, I think now they'll do PID flushes (if they
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
>
> So where do you notice this? What's the benefit?
For that matter, I wonder if we shouldn't do something like this
(untested) so the low-level batch flush has visibility to the high-level
flush range.
x86 could use this too AFAIKS, just needs to pass the range a bit
further down, but in practice I'm not sure it would ever really
matter for them because the 2MB level exceeds the single page flush
ceiling for 4k pages unlike powerpc with 64k pages. But in corner
cases where the unmap crossed a bunch of small vmas or the ceiling
was increased then in theory it could be of use.
Subject: [PATCH v1] mm/mmu_gather: provide archs with the entire range that is
to be flushed, not just the particular gather
This allows archs to optimise flushing heuristics better, in the face of
flush operations forcing smaller flush batches. For example, an
architecture may choose a more costly per-page invalidation for small
ranges of pages with the assumption that the full TLB flush cost would
be more expensive in terms of refills. However if a very large range is
forced into flushing small ranges, the faster full-process flush may
have been the better choice.
---
arch/powerpc/mm/book3s64/radix_tlb.c | 33 ++++++++++++++++------------
fs/exec.c | 3 ++-
include/asm-generic/tlb.h | 9 ++++++++
include/linux/mm_types.h | 3 ++-
mm/hugetlb.c | 2 +-
mm/madvise.c | 6 ++---
mm/memory.c | 4 ++--
mm/mmap.c | 2 +-
mm/mmu_gather.c | 10 ++++++---
mm/oom_kill.c | 2 +-
10 files changed, 47 insertions(+), 27 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index aefc100d79a7..e1072d85d72e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1110,12 +1110,13 @@ static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
static inline void __radix__flush_tlb_range(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ unsigned long entire_range)
{
unsigned long pid;
unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
unsigned long page_size = 1UL << page_shift;
- unsigned long nr_pages = (end - start) >> page_shift;
+ unsigned long entire_nr_pages = entire_range >> page_shift;
bool fullmm = (end == TLB_FLUSH_ALL);
bool flush_pid, flush_pwc = false;
enum tlb_flush_type type;
@@ -1133,9 +1134,9 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = entire_nr_pages > tlb_single_page_flush_ceiling;
else
- flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+ flush_pid = entire_nr_pages > tlb_local_single_page_flush_ceiling;
/*
* full pid flush already does the PWC flush. if it is not full pid
* flush check the range is more than PMD and force a pwc flush
@@ -1220,7 +1221,7 @@ void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
return radix__flush_hugetlb_tlb_range(vma, start, end);
#endif
- __radix__flush_tlb_range(vma->vm_mm, start, end);
+ __radix__flush_tlb_range(vma->vm_mm, start, end, end - start);
}
EXPORT_SYMBOL(radix__flush_tlb_range);
@@ -1278,6 +1279,11 @@ void radix__flush_all_lpid_guest(unsigned int lpid)
_tlbie_lpid_guest(lpid, RIC_FLUSH_ALL);
}
+static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ unsigned long entire_range,
+ int psize, bool also_pwc);
+
void radix__tlb_flush(struct mmu_gather *tlb)
{
int psize = 0;
@@ -1285,6 +1291,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
int page_size = tlb->page_size;
unsigned long start = tlb->start;
unsigned long end = tlb->end;
+ unsigned long entire_range = tlb->entire_end - tlb->entire_start;
/*
* if page size is not something we understand, do a full mm flush
@@ -1301,21 +1308,19 @@ void radix__tlb_flush(struct mmu_gather *tlb)
else
radix__flush_all_mm(mm);
} else {
- if (!tlb->freed_tables)
- radix__flush_tlb_range_psize(mm, start, end, psize);
- else
- radix__flush_tlb_pwc_range_psize(mm, start, end, psize);
+ __radix__flush_tlb_range_psize(mm, start, end, entire_range, psize, tlb->freed_tables);
}
}
static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
unsigned long start, unsigned long end,
+ unsigned long entire_range,
int psize, bool also_pwc)
{
unsigned long pid;
unsigned int page_shift = mmu_psize_defs[psize].shift;
unsigned long page_size = 1UL << page_shift;
- unsigned long nr_pages = (end - start) >> page_shift;
+ unsigned long entire_nr_pages = entire_range >> page_shift;
bool fullmm = (end == TLB_FLUSH_ALL);
bool flush_pid;
enum tlb_flush_type type;
@@ -1335,9 +1340,9 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
if (fullmm)
flush_pid = true;
else if (type == FLUSH_TYPE_GLOBAL)
- flush_pid = nr_pages > tlb_single_page_flush_ceiling;
+ flush_pid = entire_nr_pages > tlb_single_page_flush_ceiling;
else
- flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
+ flush_pid = entire_nr_pages > tlb_local_single_page_flush_ceiling;
if (!mmu_has_feature(MMU_FTR_GTSE) && type == FLUSH_TYPE_GLOBAL) {
unsigned long tgt = H_RPTI_TARGET_CMMU;
@@ -1381,13 +1386,13 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
unsigned long end, int psize)
{
- return __radix__flush_tlb_range_psize(mm, start, end, psize, false);
+ return __radix__flush_tlb_range_psize(mm, start, end, end - start, psize, false);
}
void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
unsigned long end, int psize)
{
- __radix__flush_tlb_range_psize(mm, start, end, psize, true);
+ __radix__flush_tlb_range_psize(mm, start, end, end - start, psize, true);
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/fs/exec.c b/fs/exec.c
index 38f63451b928..c769c12bdf56 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -705,11 +705,11 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
return -ENOMEM;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
if (new_end > old_start) {
/*
* when the old and new regions overlap clear from new_end.
*/
+ tlb_gather_mmu(&tlb, mm, new_end, old_end);
free_pgd_range(&tlb, new_end, old_end, new_end,
vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
} else {
@@ -719,6 +719,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
* have constraints on va-space that make this illegal (IA64) -
* for the others its just a little faster.
*/
+ tlb_gather_mmu(&tlb, mm, old_start, old_end);
free_pgd_range(&tlb, old_start, old_end, new_end,
vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
}
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..857fd83af695 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -256,6 +256,15 @@ struct mmu_gather {
struct mmu_table_batch *batch;
#endif
+ /*
+ * This is the range of the "entire" logical flush
+ * operation being performed. It does not relate to
+ * the current batch to flush, but it can inform
+ * heuristics that choose the best flushing strategy.
+ */
+ unsigned long entire_start;
+ unsigned long entire_end;
+
unsigned long start;
unsigned long end;
/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 52bbd2b7cb46..9d2ff06b574c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -599,7 +599,8 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
}
struct mmu_gather;
-extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
+extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfc940d5221d..e41106eb4df7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4458,7 +4458,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
{
struct mmu_gather tlb;
- tlb_gather_mmu(&tlb, vma->vm_mm);
+ tlb_gather_mmu(&tlb, vma->vm_mm, start, end);
__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
tlb_finish_mmu(&tlb);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d3d348b17f4..b3634672aeb9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -508,7 +508,7 @@ static long madvise_cold(struct vm_area_struct *vma,
return -EINVAL;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb);
@@ -561,7 +561,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
return 0;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb);
@@ -726,7 +726,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
range.start, range.end);
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, range.start, range.end);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(&range);
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..61c303e84baf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1647,7 +1647,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
start, start + size);
- tlb_gather_mmu(&tlb, vma->vm_mm);
+ tlb_gather_mmu(&tlb, vma->vm_mm, range.start, range.end);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
@@ -1674,7 +1674,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
lru_add_drain();
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
address, address + size);
- tlb_gather_mmu(&tlb, vma->vm_mm);
+ tlb_gather_mmu(&tlb, vma->vm_mm, address, range.end);
update_hiwater_rss(vma->vm_mm);
mmu_notifier_invalidate_range_start(&range);
unmap_single_vma(&tlb, vma, address, range.end, details);
diff --git a/mm/mmap.c b/mm/mmap.c
index ca54d36d203a..f2808febde40 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2672,7 +2672,7 @@ static void unmap_region(struct mm_struct *mm,
struct mmu_gather tlb;
lru_add_drain();
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, start, end);
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end);
free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 1b9837419bf9..863a5bd7e650 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -250,9 +250,12 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
}
static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end,
bool fullmm)
{
tlb->mm = mm;
+ tlb->entire_start = start;
+ tlb->entire_end = end;
tlb->fullmm = fullmm;
#ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -281,9 +284,10 @@ static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
* Called to initialize an (on-stack) mmu_gather structure for page-table
* tear-down from @mm.
*/
-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
+void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+ unsigned long start, unsigned long end)
{
- __tlb_gather_mmu(tlb, mm, false);
+ __tlb_gather_mmu(tlb, mm, start, end, false);
}
/**
@@ -299,7 +303,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
*/
void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
{
- __tlb_gather_mmu(tlb, mm, true);
+ __tlb_gather_mmu(tlb, mm, 0, ~0UL, true);
}
/**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..bfcc9cbdfb20 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -545,7 +545,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
vma, mm, vma->vm_start,
vma->vm_end);
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
tlb_finish_mmu(&tlb);
ret = false;
--
2.23.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
{
- __tlb_gather_mmu(tlb, mm, false);
+ __tlb_gather_mmu(tlb, mm, start, end, false);
}
/**
@@ -299,7 +303,7 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm)
*/
void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
{
- __tlb_gather_mmu(tlb, mm, true);
+ __tlb_gather_mmu(tlb, mm, 0, ~0UL, true);
}
/**
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c729a4c4a1ac..bfcc9cbdfb20 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -545,7 +545,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0,
vma, mm, vma->vm_start,
vma->vm_end);
- tlb_gather_mmu(&tlb, mm);
+ tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
tlb_finish_mmu(&tlb);
ret = false;
--
2.23.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-04 5:14 ` Nicholas Piggin
2021-08-04 6:39 ` Nicholas Piggin
@ 2021-08-04 6:59 ` Michael Ellerman
1 sibling, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2021-08-04 6:59 UTC (permalink / raw)
To: Nicholas Piggin, Aneesh Kumar K.V, linuxppc-dev
Nicholas Piggin <npiggin@gmail.com> writes:
> Excerpts from Aneesh Kumar K.V's message of August 4, 2021 12:37 am:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>
> It would be good to have a bit more description here.
>
> In any patch that changes a heuristic like this, I would like to see
> some justification or reasoning that could be refuted or used as a
> supporting argument if we ever wanted to change the heuristic later.
> Ideally with some of the obvious downsides listed as well.
>
> This "improves" things here, but what if it hurt things elsewhere, how
> would we come in later and decide to change it back?
>
> THP flushes for example, I think now they'll do PID flushes (if they
> have to be broadcast, which they will tend to be when khugepaged does
> them). So now that might increase jitter for THP and cause it to be a
> loss for more workloads.
>
> So where do you notice this? What's the benefit?
Ack. Needs some numbers and supporting evidence.
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>> * invalidating a full PID, so it has a far lower threshold to change from
>> * individual page flushes to full-pid flushes.
>> */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>
>> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> if (fullmm)
>> flush_pid = true;
>> else if (type == FLUSH_TYPE_GLOBAL)
>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>
> Arguably >= is nicer than > here, but this shouldn't be in the same
> patch as the value change.
>
>> else
>> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> And it should change everything to be consistent. Although I'm not sure
> it's worth changing even though I highly doubt any administrator would
> be tweaking this.
This made me look at how an administrator tweaks these thresholds, and
AFAICS the answer is "recompile the kernel"?
It looks like x86 have a debugfs file for tlb_single_page_flush_ceiling,
but we don't. I guess we meant to copy that but never did?
So at the moment both thresholds could just be #defines.
Making them tweakable at runtime would be nice, it would give us an
escape hatch if we ever hit a workload in production that wants a
different value.
cheers
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-04 6:39 ` Nicholas Piggin
@ 2021-08-04 7:34 ` Peter Zijlstra
-1 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2021-08-04 7:34 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Aneesh Kumar K.V, linuxppc-dev, mpe, Will Deacon, linux-arch, linux-mm
On Wed, Aug 04, 2021 at 04:39:44PM +1000, Nicholas Piggin wrote:
> For that matter, I wonder if we shouldn't do something like this
> (untested) so the low level batch flush has visibility to the high
> level flush range.
>
> x86 could use this too AFAIKS, just needs to pass the range a bit
> further down, but in practice I'm not sure it would ever really
> matter for them because the 2MB level exceeds the single page flush
> ceiling for 4k pages unlike powerpc with 64k pages. But in corner
> cases where the unmap crossed a bunch of small vmas or the ceiling
> was increased then in theory it could be of use.
x86 can do single invalidates for 2M pages if that's the only type
encountered in the range, at which point we can do 33*2M = 66M, which is
well below the 1G range for the next level of huge.
> Subject: [PATCH v1] mm/mmu_gather: provide archs with the entire range that is
> to be flushed, not just the particular gather
>
> This allows archs to optimise flushing heuristics better, in the face of
> flush operations forcing smaller flush batches. For example, an
> architecture may choose a more costly per-page invalidation for small
> ranges of pages with the assumption that the full TLB flush cost would
> be more expensive in terms of refills. However if a very large range is
> forced into flushing small ranges, the faster full-process flush may
> have been the better choice.
What is splitting up the flushes for you? That page_size power-special?
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-12 13:20 ` Aneesh Kumar K.V
@ 2021-08-16 7:03 ` Michael Ellerman
0 siblings, 0 replies; 12+ messages in thread
From: Michael Ellerman @ 2021-08-16 7:03 UTC (permalink / raw)
To: Aneesh Kumar K.V, Puvichakravarthy Ramachandran; +Cc: linuxppc-dev, npiggin
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> On 8/12/21 6:19 PM, Michael Ellerman wrote:
>> "Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>>>> With shared mapping, even though we are unmapping a large range, the kernel
>>>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>>> range. This can be improved by making sure the kernel switch to pid based flush if the
>>>> kernel is unmapping a 2M range.
>>>>
>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> ---
>>>> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> index aefc100d79a7..21d0f098e43b 100644
>>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>>> * invalidating a full PID, so it has a far lower threshold to change from
>>>> * individual page flushes to full-pid flushes.
>>>> */
>>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>>> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>>>
>>>> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>>> if (fullmm)
>>>> flush_pid = true;
>>>> else if (type == FLUSH_TYPE_GLOBAL)
>>>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>>> else
>>>> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>>
>>> Additional details on the test environment. This was tested on a 2 Node/8
>>> socket Power10 system.
>>> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 176
>>> # time counts unit events
>>> 1.029206442 4198594519 cycles
>>> 1.029206442 2458254252 instructions # 0.59 insn per cycle
>>> 1.029206442 3004031488 PM_EXEC_STALL
>>> 1.029206442 1798186036 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 181
>>> 2.054288539 4183883450 cycles
>>> 2.054288539 2472178171 instructions # 0.59 insn per cycle
>>> 2.054288539 3014609313 PM_EXEC_STALL
>>> 2.054288539 1797851642 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 180
>>> 3.078306883 4171250717 cycles
>>> 3.078306883 2468341094 instructions # 0.59 insn per cycle
>>> 3.078306883 2993036205 PM_EXEC_STALL
>>> 3.078306883 1798181890 PM_EXEC_STALL_TLBIE
>>> .
>>> .
>>>
>>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>> 34
>>>
>>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>>
>>> # perf stat -I 1000 -a -e cycles,instructions -e
>>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>>> Rate of work: = 313
>>> # time counts unit events
>>> 1.030310506 4206071143 cycles
>>> 1.030310506 4314716958 instructions # 1.03 insn per cycle
>>> 1.030310506 2157762167 PM_EXEC_STALL
>>> 1.030310506 110825573 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 322
>>> 2.056034068 4331745630 cycles
>>> 2.056034068 4531658304 instructions # 1.05 insn per cycle
>>> 2.056034068 2288971361 PM_EXEC_STALL
>>> 2.056034068 111267927 PM_EXEC_STALL_TLBIE
>>> Rate of work: = 321
>>> 3.081216434 4327050349 cycles
>>> 3.081216434 4379679508 instructions # 1.01 insn per cycle
>>> 3.081216434 2252602550 PM_EXEC_STALL
>>> 3.081216434 110974887 PM_EXEC_STALL_TLBIE
>>
>>
>> What is the tlbie test actually doing?
>>
>> Does it do anything to measure the cost of refilling after the full mm flush?
>
> That is essentially
>
> for ()
> {
> shmat()
> fillshm()
> shmdt()
>
> }
>
> for a 256MB range. So it is not really a fair benchmark because it
> doesn't take into account the impact of throwing away the full pid
> translation. But even then the TLBIE stalls is an important data point?
Choosing the ceiling is a trade-off, and this test only measures one
side of the trade-off.
It tells us that the actual time taken to execute the full flush is less
than doing 32 individual flushes, but that's not the full story.
To decide I think we need some numbers for some more "real" workloads,
to at least see that there's no change, or preferably some improvement.
Another interesting test might be to do the shmat/fillshm/shmdt, and
then chase some pointers to provoke TLB misses. Then we could work out
the relative cost of TLB misses vs the time to do the flush.
cheers
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-12 12:49 ` Michael Ellerman
@ 2021-08-12 13:20 ` Aneesh Kumar K.V
2021-08-16 7:03 ` Michael Ellerman
0 siblings, 1 reply; 12+ messages in thread
From: Aneesh Kumar K.V @ 2021-08-12 13:20 UTC (permalink / raw)
To: Michael Ellerman, Puvichakravarthy Ramachandran; +Cc: linuxppc-dev, npiggin
On 8/12/21 6:19 PM, Michael Ellerman wrote:
> "Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>>> With shared mapping, even though we are unmapping a large range, the kernel
>>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>>> This results in the kernel issuing a high number of TLB flushes even for a large
>>> range. This can be improved by making sure the kernel switch to pid based flush if the
>>> kernel is unmapping a 2M range.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> ---
>>> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> index aefc100d79a7..21d0f098e43b 100644
>>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>>> * invalidating a full PID, so it has a far lower threshold to change from
>>> * individual page flushes to full-pid flushes.
>>> */
>>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>>> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>>
>>> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>>> if (fullmm)
>>> flush_pid = true;
>>> else if (type == FLUSH_TYPE_GLOBAL)
>>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>>> else
>>> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>>
>> Additional details on the test environment. This was tested on a 2 Node/8
>> socket Power10 system.
>> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>> Rate of work: = 176
>> # time counts unit events
>> 1.029206442 4198594519 cycles
>> 1.029206442 2458254252 instructions # 0.59 insn per cycle
>> 1.029206442 3004031488 PM_EXEC_STALL
>> 1.029206442 1798186036 PM_EXEC_STALL_TLBIE
>> Rate of work: = 181
>> 2.054288539 4183883450 cycles
>> 2.054288539 2472178171 instructions # 0.59 insn per cycle
>> 2.054288539 3014609313 PM_EXEC_STALL
>> 2.054288539 1797851642 PM_EXEC_STALL_TLBIE
>> Rate of work: = 180
>> 3.078306883 4171250717 cycles
>> 3.078306883 2468341094 instructions # 0.59 insn per cycle
>> 3.078306883 2993036205 PM_EXEC_STALL
>> 3.078306883 1798181890 PM_EXEC_STALL_TLBIE
>> .
>> .
>>
>> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>> 34
>>
>> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>>
>> # perf stat -I 1000 -a -e cycles,instructions -e
>> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
>> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
>> Rate of work: = 313
>> # time counts unit events
>> 1.030310506 4206071143 cycles
>> 1.030310506 4314716958 instructions # 1.03 insn per cycle
>> 1.030310506 2157762167 PM_EXEC_STALL
>> 1.030310506 110825573 PM_EXEC_STALL_TLBIE
>> Rate of work: = 322
>> 2.056034068 4331745630 cycles
>> 2.056034068 4531658304 instructions # 1.05 insn per cycle
>> 2.056034068 2288971361 PM_EXEC_STALL
>> 2.056034068 111267927 PM_EXEC_STALL_TLBIE
>> Rate of work: = 321
>> 3.081216434 4327050349 cycles
>> 3.081216434 4379679508 instructions # 1.01 insn per cycle
>> 3.081216434 2252602550 PM_EXEC_STALL
>> 3.081216434 110974887 PM_EXEC_STALL_TLBIE
>
>
> What is the tlbie test actually doing?
>
> Does it do anything to measure the cost of refilling after the full mm flush?
>
That is essentially
for ()
{
shmat()
fillshm()
shmdt()
}
for a 256MB range. So it is not really a fair benchmark because it
doesn't take into account the impact of throwing away the full pid
translation. But even then the TLBIE stalls is an important data point?
-aneesh
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
2021-08-06 7:56 Puvichakravarthy Ramachandran
@ 2021-08-12 12:49 ` Michael Ellerman
2021-08-12 13:20 ` Aneesh Kumar K.V
0 siblings, 1 reply; 12+ messages in thread
From: Michael Ellerman @ 2021-08-12 12:49 UTC (permalink / raw)
To: Puvichakravarthy Ramachandran, aneesh.kumar; +Cc: linuxppc-dev, npiggin
"Puvichakravarthy Ramachandran" <puvichakravarthy@in.ibm.com> writes:
>> With shared mapping, even though we are unmapping a large range, the kernel
>> will force a TLB flush with ptl lock held to avoid the race mentioned in
>> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
>> This results in the kernel issuing a high number of TLB flushes even for a large
>> range. This can be improved by making sure the kernel switch to pid based flush if the
>> kernel is unmapping a 2M range.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
>> 1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
>> index aefc100d79a7..21d0f098e43b 100644
>> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
>> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
>> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
>> * invalidating a full PID, so it has a far lower threshold to change from
>> * individual page flushes to full-pid flushes.
>> */
>> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
>> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>>
>> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>> if (fullmm)
>> flush_pid = true;
>> else if (type == FLUSH_TYPE_GLOBAL)
>> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
>> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
>> else
>> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
>
> Additional details on the test environment. This was tested on a 2 Node/8
> socket Power10 system.
> The LPAR had 105 cores and the LPAR spanned across all the sockets.
>
> # perf stat -I 1000 -a -e cycles,instructions -e
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
> Rate of work: = 176
> # time counts unit events
> 1.029206442 4198594519 cycles
> 1.029206442 2458254252 instructions # 0.59 insn per cycle
> 1.029206442 3004031488 PM_EXEC_STALL
> 1.029206442 1798186036 PM_EXEC_STALL_TLBIE
> Rate of work: = 181
> 2.054288539 4183883450 cycles
> 2.054288539 2472178171 instructions # 0.59 insn per cycle
> 2.054288539 3014609313 PM_EXEC_STALL
> 2.054288539 1797851642 PM_EXEC_STALL_TLBIE
> Rate of work: = 180
> 3.078306883 4171250717 cycles
> 3.078306883 2468341094 instructions # 0.59 insn per cycle
> 3.078306883 2993036205 PM_EXEC_STALL
> 3.078306883 1798181890 PM_EXEC_STALL_TLBIE
> .
> .
>
> # cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
> 34
>
> # echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
>
> # perf stat -I 1000 -a -e cycles,instructions -e
> "{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
> "{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
> Rate of work: = 313
> # time counts unit events
> 1.030310506 4206071143 cycles
> 1.030310506 4314716958 instructions # 1.03 insn per cycle
> 1.030310506 2157762167 PM_EXEC_STALL
> 1.030310506 110825573 PM_EXEC_STALL_TLBIE
> Rate of work: = 322
> 2.056034068 4331745630 cycles
> 2.056034068 4531658304 instructions # 1.05 insn per cycle
> 2.056034068 2288971361 PM_EXEC_STALL
> 2.056034068 111267927 PM_EXEC_STALL_TLBIE
> Rate of work: = 321
> 3.081216434 4327050349 cycles
> 3.081216434 4379679508 instructions # 1.01 insn per cycle
> 3.081216434 2252602550 PM_EXEC_STALL
> 3.081216434 110974887 PM_EXEC_STALL_TLBIE
What is the tlbie test actually doing?
Does it do anything to measure the cost of refilling after the full mm flush?
cheers
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-06 7:56 Puvichakravarthy Ramachandran
2021-08-12 12:49 ` Michael Ellerman
0 siblings, 1 reply; 12+ messages in thread
From: Puvichakravarthy Ramachandran @ 2021-08-06 7:56 UTC (permalink / raw)
To: aneesh.kumar; +Cc: linuxppc-dev, npiggin
> With shared mapping, even though we are unmapping a large range, the kernel
> will force a TLB flush with ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switch to pid based flush if the
> kernel is unmapping a 2M range.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
> * invalidating a full PID, so it has a far lower threshold to change from
> * individual page flushes to full-pid flushes.
> */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>
> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> if (fullmm)
> flush_pid = true;
> else if (type == FLUSH_TYPE_GLOBAL)
> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
> else
> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
Additional details on the test environment. This was tested on a 2 Node/8
socket Power10 system.
The LPAR had 105 cores and the LPAR spanned across all the sockets.
# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
Rate of work: = 176
# time counts unit events
1.029206442 4198594519 cycles
1.029206442 2458254252 instructions # 0.59 insn per cycle
1.029206442 3004031488 PM_EXEC_STALL
1.029206442 1798186036 PM_EXEC_STALL_TLBIE
Rate of work: = 181
2.054288539 4183883450 cycles
2.054288539 2472178171 instructions # 0.59 insn per cycle
2.054288539 3014609313 PM_EXEC_STALL
2.054288539 1797851642 PM_EXEC_STALL_TLBIE
Rate of work: = 180
3.078306883 4171250717 cycles
3.078306883 2468341094 instructions # 0.59 insn per cycle
3.078306883 2993036205 PM_EXEC_STALL
3.078306883 1798181890 PM_EXEC_STALL_TLBIE
.
.
# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
34
# echo 32 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
# perf stat -I 1000 -a -e cycles,instructions -e
"{cpu/config=0x030008,name=PM_EXEC_STALL/}" -e
"{cpu/config=0x02E01C,name=PM_EXEC_STALL_TLBIE/}" ./tlbie -i 10 -c 1 -t 1
Rate of work: = 313
# time counts unit events
1.030310506 4206071143 cycles
1.030310506 4314716958 instructions # 1.03 insn per cycle
1.030310506 2157762167 PM_EXEC_STALL
1.030310506 110825573 PM_EXEC_STALL_TLBIE
Rate of work: = 322
2.056034068 4331745630 cycles
2.056034068 4531658304 instructions # 1.05 insn per cycle
2.056034068 2288971361 PM_EXEC_STALL
2.056034068 111267927 PM_EXEC_STALL_TLBIE
Rate of work: = 321
3.081216434 4327050349 cycles
3.081216434 4379679508 instructions # 1.01 insn per cycle
3.081216434 2252602550 PM_EXEC_STALL
3.081216434 110974887 PM_EXEC_STALL_TLBIE
.
.
Regards,
Puvichakravarthy Ramachandran
* Re: [RFC PATCH] powerpc/book3s64/radix: Upgrade va tlbie to PID tlbie if we cross PMD_SIZE
@ 2021-08-06 5:22 Puvichakravarthy Ramachandran
0 siblings, 0 replies; 12+ messages in thread
From: Puvichakravarthy Ramachandran @ 2021-08-06 5:22 UTC (permalink / raw)
To: aneesh.kumar; +Cc: linuxppc-dev, npiggin
> With shared mapping, even though we are unmapping a large range, the kernel
> will force a TLB flush with ptl lock held to avoid the race mentioned in
> commit 1cf35d47712d ("mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts")
> This results in the kernel issuing a high number of TLB flushes even for a large
> range. This can be improved by making sure the kernel switch to pid based flush if the
> kernel is unmapping a 2M range.
>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
> arch/powerpc/mm/book3s64/radix_tlb.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index aefc100d79a7..21d0f098e43b 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -1106,7 +1106,7 @@ EXPORT_SYMBOL(radix__flush_tlb_kernel_range);
> * invalidating a full PID, so it has a far lower threshold to change from
> * individual page flushes to full-pid flushes.
> */
> -static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
> +static unsigned long tlb_single_page_flush_ceiling __read_mostly = 32;
> static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
>
> static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> @@ -1133,7 +1133,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
> if (fullmm)
> flush_pid = true;
> else if (type == FLUSH_TYPE_GLOBAL)
> - flush_pid = nr_pages > tlb_single_page_flush_ceiling;
> + flush_pid = nr_pages >= tlb_single_page_flush_ceiling;
> else
> flush_pid = nr_pages > tlb_local_single_page_flush_ceiling;
I evaluated the patches from Aneesh with a microbenchmark which does
shmat/shmdt of a 256 MB segment. The higher the rate of work, the better
the performance. With a value of 32, we match the performance of
GTSE=off. This was evaluated on a SLES15 SP3 kernel.
# cat /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
32
# perf stat -I 1000 -a -e powerpc:tlbie,r30058 ./tlbie -i 5 -c 1 t 1
Rate of work: = 311
# time counts unit events
1.013131404 50939 powerpc:tlbie
1.013131404 50658 r30058
Rate of work: = 318
2.026957019 51520 powerpc:tlbie
2.026957019 51481 r30058
Rate of work: = 318
3.038884431 51485 powerpc:tlbie
3.038884431 51461 r30058
Rate of work: = 318
4.051483926 51485 powerpc:tlbie
4.051483926 51520 r30058
Rate of work: = 318
5.063635713 48577 powerpc:tlbie
5.063635713 48347 r30058
# echo 34 > /sys/kernel/debug/powerpc/tlb_single_page_flush_ceiling
# perf stat -I 1000 -a -e powerpc:tlbie,r30058 ./tlbie -i 5 -c 1 t 1
Rate of work: = 174
# time counts unit events
1.012672696 721471 powerpc:tlbie
1.012672696 726491 r30058
Rate of work: = 177
2.026348661 737460 powerpc:tlbie
2.026348661 736138 r30058
Rate of work: = 178
3.037932122 737460 powerpc:tlbie
3.037932122 737460 r30058
Rate of work: = 178
4.050198819 737044 powerpc:tlbie
4.050198819 737460 r30058
Rate of work: = 177
5.062400776 692832 powerpc:tlbie
5.062400776 688319 r30058
Regards,
Puvichakravarthy Ramachandran