* [PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run
From: Leonardo Bras @ 2019-09-16 20:55 UTC
  To: linuxppc-dev, linux-kernel
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Aneesh Kumar K.V, Christophe Leroy,
	Andrew Morton, Mike Rapoport, Nicholas Piggin, Thomas Gleixner,
	Jason Gunthorpe, Vlastimil Babka, Dan Williams,
	Christoph Lameter, Tobin C. Harding, Jann Horn,
	Jesper Dangaard Brouer, Souptick Joarder, Ralph Campbell,
	Andrey Ryabinin, Davidlohr Bueso

If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk
of memory (496GB) mapped with THP, it takes an average of 275 seconds, which
can cause a lot of problems for the workload (in the qemu case, the guest
stays locked up for that whole time).

While tracking down the source of this problem, I found that most of this
time is spent in serialize_against_pte_lookup(). This function spends a lot
of time in smp_call_function_many() if more than a couple of CPUs are
running the user process. Since it has to run once for every mapped THP,
it takes a very long time for large amounts of memory.

According to the documentation, serialize_against_pte_lookup() is needed to
prevent the pmd_t to pte_t casting inside find_current_mm_pte() from
happening concurrently with the rest of the functions it is called from.
It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].

So, from what I could understand, if there is no find_current_mm_pte()
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I propose
a counter that keeps track of how many find_current_mm_pte() calls are
currently running; if there are none, smp_call_function_many() can simply
be skipped.

On my workload (qemu), I saw munmap's time drop from 275 seconds to 418 ms.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>

---
I need help from more experienced people to understand whether this is
really a valid improvement, and whether mm_struct is the best place for
such a counter.

Thanks!
---
 arch/powerpc/include/asm/pte-walk.h | 3 +++
 arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
 include/linux/mm_types.h            | 1 +
 3 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
index 33fa5dd8ee6a..3b82cb3bd563 100644
--- a/arch/powerpc/include/asm/pte-walk.h
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 {
 	pte_t *pte;
 
+	atomic64_inc(&current->mm->lockless_ptlookup_count);
+
 	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
 	VM_WARN(pgdir != current->mm->pgd,
 		"%s lock less page table lookup called on wrong mm\n", __func__);
@@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
 	if (hshift)
 		WARN_ON(*hshift);
 #endif
+	atomic64_dec(&current->mm->lockless_ptlookup_count);
 	return pte;
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..8f6fc2f80071 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,6 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
+	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
+		return;
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6a7a1083b6fb..97fb2545e967 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -518,6 +518,7 @@ struct mm_struct {
 #endif
 	} __randomize_layout;
 
+	atomic64_t lockless_ptlookup_count;
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
-- 
2.20.1



* Re: [PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run
From: Aneesh Kumar K.V @ 2019-09-17  2:56 UTC
  To: Leonardo Bras, linuxppc-dev, linux-kernel
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Christophe Leroy, Andrew Morton, Mike Rapoport, Nicholas Piggin,
	Thomas Gleixner, Jason Gunthorpe, Vlastimil Babka, Dan Williams,
	Christoph Lameter, Tobin C. Harding, Jann Horn,
	Jesper Dangaard Brouer, Souptick Joarder, Ralph Campbell,
	Andrey Ryabinin, Davidlohr Bueso

On 9/17/19 2:25 AM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk
> of memory (496GB) mapped with THP, it takes an average of 275 seconds, which
> can cause a lot of problems for the workload (in the qemu case, the guest
> stays locked up for that whole time).
> 
> While tracking down the source of this problem, I found that most of this
> time is spent in serialize_against_pte_lookup(). This function spends a lot
> of time in smp_call_function_many() if more than a couple of CPUs are
> running the user process. Since it has to run once for every mapped THP,
> it takes a very long time for large amounts of memory.
> 
> According to the documentation, serialize_against_pte_lookup() is needed to
> prevent the pmd_t to pte_t casting inside find_current_mm_pte() from
> happening concurrently with the rest of the functions it is called from.
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].
> 
> So, from what I could understand, if there is no find_current_mm_pte()
> running, there is no need to call serialize_against_pte_lookup().
> 
> So, to avoid the cost of running serialize_against_pte_lookup(), I propose
> a counter that keeps track of how many find_current_mm_pte() calls are
> currently running; if there are none, smp_call_function_many() can simply
> be skipped.
> 
> On my workload (qemu), I saw munmap's time drop from 275 seconds to 418 ms.
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>

We could possibly avoid that serialization for a full task exit unmap,
i.e. when tlb->fullmm == 1. But that won't help the Qemu case, because it
does an unmap of the guest ram range for which tlb->fullmm != 1.
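
Roughly like this, as a completely untested sketch (plumbing the fullmm
information down to this function is the part that would still need doing):

void serialize_against_pte_lookup(struct mm_struct *mm, bool fullmm)
{
	smp_mb();
	/*
	 * On a full task exit unmap no thread of this mm should still be
	 * in the middle of a lockless page table walk, so there would be
	 * nothing for the IPI broadcast below to serialize against.
	 */
	if (fullmm)
		return;
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}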

> 
> ---
> I need help from more experienced people to understand whether this is
> really a valid improvement, and whether mm_struct is the best place for
> such a counter.
> 
> Thanks!
> ---
>   arch/powerpc/include/asm/pte-walk.h | 3 +++
>   arch/powerpc/mm/book3s64/pgtable.c  | 2 ++
>   include/linux/mm_types.h            | 1 +
>   3 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
> index 33fa5dd8ee6a..3b82cb3bd563 100644
> --- a/arch/powerpc/include/asm/pte-walk.h
> +++ b/arch/powerpc/include/asm/pte-walk.h
> @@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>   {
>   	pte_t *pte;
>   
> +	atomic64_inc(&current->mm->lockless_ptlookup_count);
> +
>   	VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
>   	VM_WARN(pgdir != current->mm->pgd,
>   		"%s lock less page table lookup called on wrong mm\n", __func__);
> @@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
>   	if (hshift)
>   		WARN_ON(*hshift);
>   #endif
> +	atomic64_dec(&current->mm->lockless_ptlookup_count);
>   	return pte;
>   }



That is not correct. We need to keep the count updated across the
local_irq_disable()/local_irq_enable() pair. That is, we probably want a
variant like start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk(). Also,
there is hash_page_mm, which we need to make sure is protected with
respect to pmd collapse.
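
Something along these lines — a completely untested sketch (the names are
made up, and it just reuses this patch's counter as-is):

static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
	atomic64_inc(&mm->lockless_ptlookup_count);
	/*
	 * Make the increment visible to serialize_against_pte_lookup()
	 * before this CPU dereferences any page table entry; pairs with
	 * the smp_mb() there.
	 */
	smp_mb();
	local_irq_disable();
}

static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	local_irq_enable();
	/* Keep the count elevated until the walk is really finished. */
	smp_mb();
	atomic64_dec(&mm->lockless_ptlookup_count);
}

That way hash_page_mm and the other lockless walkers keep the count
elevated for as long as they may dereference the pmd, not just while
find_current_mm_pte() itself runs.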

>   
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..8f6fc2f80071 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -95,6 +95,8 @@ static void do_nothing(void *unused)
>   void serialize_against_pte_lookup(struct mm_struct *mm)
>   {
>   	smp_mb();
> +	if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
> +		return;
>   	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>   }
>   
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..97fb2545e967 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -518,6 +518,7 @@ struct mm_struct {
>   #endif
>   	} __randomize_layout;
>   
> +	atomic64_t lockless_ptlookup_count;
>   	/*
>   	 * The mm_cpumask needs to be at the end of mm_struct, because it
>   	 * is dynamically sized based on nr_cpu_ids.
> 

Move that to the ppc64-specific mm_context_t.


-aneesh


* Re: [PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup() really needs to run
From: Leonardo Bras @ 2019-09-17 21:21 UTC
  To: Aneesh Kumar K.V, linuxppc-dev, linux-kernel
  Cc: Christoph Lameter, Davidlohr Bueso, Jann Horn,
	Jesper Dangaard Brouer, Dan Williams, Ralph Campbell,
	Nicholas Piggin, Mike Rapoport, Jason Gunthorpe, Paul Mackerras,
	Tobin C. Harding, Andrey Ryabinin, Andrew Morton,
	Thomas Gleixner, Vlastimil Babka, Souptick Joarder

Hello Aneesh, thanks for the feedback!

On Tue, 2019-09-17 at 08:26 +0530, Aneesh Kumar K.V wrote:
> We could possibly avoid that serialization for a full task exit unmap,
> i.e. when tlb->fullmm == 1. But that won't help the Qemu case, because it
> does an unmap of the guest ram range for which tlb->fullmm != 1.
Yes, in qemu we can hot-add memory to guests using memory that qemu
allocated with mmap(). If we want/need to free this memory to other
processes (or guests), we remove it from the guest and do a plain
munmap(). This happens without exiting qemu or pausing the guest.


> That is not correct. We need to keep the count updated across the
> local_irq_disable()/local_irq_enable() pair.
So, by that you mean incrementing the counter before
local_irq_{disable,save} and decrementing it only after
local_irq_{enable,restore}?

> That is, we probably want a variant like
> start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk().
And by that, you mean replacing the current local_irq_{disable,save} with
a start_lockless_pgtbl_walk() (which does the increment plus the irq
disable), and likewise replacing local_irq_{enable,restore} with
end_lockless_pgtbl_walk()?
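
Something like this at the call sites, then? (A hypothetical fragment,
just to make sure I got the idea):

	unsigned int hshift;
	pte_t *ptep, pte;

	/* was: local_irq_save(flags); */
	start_lockless_pgtbl_walk(mm);

	ptep = find_current_mm_pte(mm->pgd, ea, NULL, &hshift);
	if (ptep)
		pte = READ_ONCE(*ptep);

	/* was: local_irq_restore(flags); */
	end_lockless_pgtbl_walk(mm);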

> Also, there is hash_page_mm, which we need to make sure is protected
> with respect to pmd collapse.
I did not understand what I need to do here.
Is this a new thing to protect?
If so, I need to better understand how to protect it.
If not, I would like to understand how the current behavior protects
it.

> Move that to the ppc64-specific mm_context_t.
Ok, fixed! I will send that in v2.

Best regards,
Leonardo Bras


