linux-mm.kvack.org archive mirror
* [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks
@ 2019-09-27 23:39 Leonardo Bras
  2019-09-27 23:39 ` [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks Leonardo Bras
                   ` (10 more replies)
  0 siblings, 11 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:39 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause a lot of problems for the workload (in the qemu
case, the guest hangs for that long).

Tracking down the source of this slowdown, I found that most of this
time is spent in serialize_against_pte_lookup(). This function spends a
lot of time in smp_call_function_many() when more than a couple of CPUs
are running the user process. Since it has to be done for every mapped
THP, it takes a very long time for large amounts of memory.

According to the documentation, serialize_against_pte_lookup() is needed
to prevent the pmd_t to pte_t casting inside find_current_mm_pte(), or
any other lockless pagetable walk, from happening concurrently with THP
splitting/collapsing.

It does so by running do_nothing() on each CPU in mm->cpu_bitmap[] and
waiting for all of them to complete.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and each CPU can only service the IPI after re-enabling them,
serialize_against_pte_lookup() returns only after any walk already in
flight has finished.
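
For reference, the whole slow path boils down to this (the function body
appears as context in patch 11 below):

void serialize_against_pte_lookup(struct mm_struct *mm)
{
	smp_mb();
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
}

The wait=1 argument makes smp_call_function_many() block until
do_nothing() has run on every CPU in the mask.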

So, from what I could understand, if there is no lockless pagetable walk
running, there is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many lockless pagetable walks
(e.g. find_current_mm_pte()) are currently running; if there are none,
smp_call_function_many() can simply be skipped.

The related functions are (a minimal usage sketch follows the list):
start_lockless_pgtbl_walk(mm)
        Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
        Insert after the end of any lockless pgtable walk
        (mostly after the last use of the returned ptep)
running_lockless_pgtbl_walk(mm)
        Returns the number of lockless pgtable walks running
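
A minimal usage sketch (a hypothetical walker, for illustration only;
the real call sites are converted in patches 03-09):

/* Hypothetical example, not part of the series; mirrors addr_to_pfn() */
static unsigned long example_lockless_walk(struct mm_struct *mm,
					   unsigned long addr)
{
	unsigned long flags, pfn = ULONG_MAX;
	pte_t *ptep;

	start_lockless_pgtbl_walk(mm);
	local_irq_save(flags);	/* still needed; the counter complements it */

	ptep = __find_linux_pte(mm->pgd, addr, NULL, NULL);
	if (ptep && pte_present(*ptep))
		pfn = pte_pfn(*ptep);	/* last use of the returned ptep */

	local_irq_restore(flags);
	end_lockless_pgtbl_walk(mm);

	return pfn;
}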

On my workload (qemu), munmap() time dropped from 275 seconds to 418 ms.

Also, I documented some lockless pagetable walks where it's not
necessary to keep track, given they work on init_mm or on a guest pgd,
neither of which is a target of THP split/collapse.

Changes since v3:
 Adds memory barriers to {start,end}_lockless_pgtbl_walk()
 Explains (in comments) why some lockless pgtbl walks don't need
	local_irq_disable (real mode + MSR_EE=0)
 Explains (in comments) places where the counting method is not needed
	(guest pgd, which is not touched by THP)
 Fixes some misplaced local_irq_restore() calls
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
 Rebased to v5.3
 Adds support for __get_user_pages_fast
 Adds usage description to *_lockless_pgtbl_walk()
 Better style for dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/


Leonardo Bras (11):
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
    walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
    walk
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/mmu.h     |  3 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h |  5 ++
 arch/powerpc/kernel/mce_power.c              | 13 ++++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c       | 30 ++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c          |  3 ++
 arch/powerpc/kvm/book3s_hv_nested.c          | 22 ++++++++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          | 18 ++++++--
 arch/powerpc/kvm/e500_mmu_host.c             |  6 ++-
 arch/powerpc/mm/book3s64/hash_tlb.c          |  2 +
 arch/powerpc/mm/book3s64/hash_utils.c        | 12 ++++-
 arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
 arch/powerpc/mm/book3s64/pgtable.c           | 48 +++++++++++++++++++-
 arch/powerpc/perf/callchain.c                |  5 +-
 include/asm-generic/pgtable.h                | 15 ++++++
 mm/gup.c                                     |  8 ++++
 16 files changed, 180 insertions(+), 13 deletions(-)

-- 
2.20.1




* [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
@ 2019-09-27 23:39 ` Leonardo Bras
  2019-09-29 22:40   ` John Hubbard
  2019-09-27 23:39 ` [PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions " Leonardo Bras
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:39 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

It's necessary to monitor lockless pagetable walks, in order to avoid doing
THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of CPUs are used by the process.

In order to speed up some cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening in the process.

This method does not replace the current irq-based method; it works as a
complement to skip unnecessary waiting.

start_lockless_pgtbl_walk(mm)
	Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
	Insert after the end of any lockless pgtable walk
	(Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
	Returns the number of lockless pgtable walks running

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
 arch/powerpc/mm/book3s64/pgtable.c       | 45 ++++++++++++++++++++++++
 3 files changed, 49 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 23b83d3593e2..13b006e7dde4 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -116,6 +116,9 @@ typedef struct {
 	/* Number of users of the external (Nest) MMU */
 	atomic_t copros;
 
+	/* Number of running instances of lockless pagetable walk */
+	atomic_t lockless_pgtbl_walk_count;
+
 	struct hash_mm_context *hash_context;
 
 	unsigned long vdso_base;
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index 2d0cb5ba9a47..3dd01c0ca5be 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 #endif
 	atomic_set(&mm->context.active_cpus, 0);
 	atomic_set(&mm->context.copros, 0);
+	atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
 
 	return 0;
 }
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..6ba6195bff1b 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte()
+ */
+void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
+	/* Avoid reordering, to guarantee that the increment happens before
+	 * any part of the lockless pagetable walk that follows it.
+	 */
+	smp_mb();
+}
+EXPORT_SYMBOL(start_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ */
+void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	/* Avoid reordering, to guarantee that it only decrements after the
+	 * last use of the ptep returned by the lockless pagetable walk.
+	 */
+	smp_mb();
+	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1




* [PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
  2019-09-27 23:39 ` [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks Leonardo Bras
@ 2019-09-27 23:39 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:39 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

There is a need to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them.

Some methods rely on local_irq_{save,restore}, but that can be slow in
cases where a lot of CPUs are used by the process.

In order to speed up these cases, I propose a refcount-based approach that
counts the number of lockless pagetable walks happening in the process.

Given that there are lockless pagetable walks in generic code, it's
necessary to create dummy functions for archs that won't use this
approach.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 include/asm-generic/pgtable.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 75d9d68a6de7..0831475e72d3 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1172,6 +1172,21 @@ static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return 0;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.20.1




* [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
  2019-09-27 23:39 ` [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks Leonardo Bras
  2019-09-27 23:39 ` [PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-30 11:09   ` Kirill A. Shutemov
  2019-09-30 21:51   ` John Hubbard
  2019-09-27 23:40 ` [PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
monitor against THP split/collapse with the counting method, it's necessary
to bound it with {start,end}_lockless_pgtbl_walk.

Dummy functions are provided, so it is not going to add any overhead on
archs that don't use this method.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 mm/gup.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7105c829cf44 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
 {
+	struct mm_struct *mm;
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
@@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
+		mm = current->mm;
+		start_lockless_pgtbl_walk(mm);
 		local_irq_save(flags);
 		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
 		local_irq_restore(flags);
+		end_lockless_pgtbl_walk(mm);
 	}
 
 	return nr;
@@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
 	unsigned long addr, len, end;
+	struct mm_struct *mm;
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
+		mm = current->mm;
+		start_lockless_pgtbl_walk(mm);
 		local_irq_disable();
 		gup_pgd_range(addr, end, gup_flags, pages, &nr);
 		local_irq_enable();
+		end_lockless_pgtbl_walk(mm);
 		ret = nr;
 	}
 
-- 
2.20.1




* [PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (2 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 05/11] powerpc/perf: " Leonardo Bras
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
addr_to_pfn().

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kernel/mce_power.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index a814d2dfb5b0..0f2f87da4cd1 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -27,6 +27,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
 	pte_t *ptep;
 	unsigned long flags;
+	unsigned long pfn;
 	struct mm_struct *mm;
 
 	if (user_mode(regs))
@@ -34,15 +35,21 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 	else
 		mm = &init_mm;
 
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	if (mm == current->mm)
 		ptep = find_current_mm_pte(mm->pgd, addr, NULL, NULL);
 	else
 		ptep = find_init_mm_pte(addr, NULL);
-	local_irq_restore(flags);
+
 	if (!ptep || pte_special(*ptep))
-		return ULONG_MAX;
-	return pte_pfn(*ptep);
+		pfn = ULONG_MAX;
+	else
+		pfn = pte_pfn(*ptep);
+
+	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
+	return pfn;
 }
 
 /* flush SLBs and reload */
-- 
2.20.1




* [PATCH v4 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (3 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
read_user_stack_slow.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/perf/callchain.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..9d76194a2a8f 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -113,16 +113,18 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	int ret = -EFAULT;
 	pgd_t *pgdir;
 	pte_t *ptep, pte;
+	struct mm_struct *mm = current->mm;
 	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
 	unsigned long pfn, flags;
 	void *kaddr;
 
-	pgdir = current->mm->pgd;
+	pgdir = mm->pgd;
 	if (!pgdir)
 		return -EFAULT;
 
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
 	if (!ptep)
@@ -146,6 +148,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	ret = 0;
 err_out:
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 	return ret;
 }
 
-- 
2.20.1




* [PATCH v4 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (4 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 05/11] powerpc/perf: " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 07/11] powerpc/kvm/e500: " Leonardo Bras
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all hash-related functions
that do lockless pagetable walks.

hash_page_mm: Adds a comment explaining that there is no need for
local_irq_disable/save, given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/hash_tlb.c   |  2 ++
 arch/powerpc/mm/book3s64/hash_utils.c | 12 +++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..5e5213c3f7c4 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -209,6 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	 * to being hashed). This is not the most performance oriented
 	 * way to do things but is fine for our needs here.
 	 */
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	arch_enter_lazy_mmu_mode();
 	for (; start < end; start += PAGE_SIZE) {
@@ -230,6 +231,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	}
 	arch_leave_lazy_mmu_mode();
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 }
 
 void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index b8ad14bb1170..8615fab87c43 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1321,7 +1321,11 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
 #endif /* CONFIG_PPC_64K_PAGES */
 
-	/* Get PTE and page size from page tables */
+	/* Get PTE and page size from page tables:
+	 * Called from the DataAccess interrupt (data_access_common: 0x300),
+	 * so interrupts are disabled here.
+	 */
+	start_lockless_pgtbl_walk(mm);
 	ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
@@ -1438,6 +1442,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 	DBG_LOW(" -> rc=%d\n", rc);
 
 bail:
+	end_lockless_pgtbl_walk(mm);
 	exception_exit(prev_state);
 	return rc;
 }
@@ -1547,10 +1552,12 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 	vsid = get_user_vsid(&mm->context, ea, ssize);
 	if (!vsid)
 		return;
+
 	/*
 	 * Hash doesn't like irqs. Walking linux page table with irq disabled
 	 * saves us from holding multiple locks.
 	 */
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 
 	/*
@@ -1597,6 +1604,7 @@ void hash_preload(struct mm_struct *mm, unsigned long ea,
 				   pte_val(*ptep));
 out_exit:
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 }
 
 #ifdef CONFIG_PPC_MEM_KEYS
@@ -1613,11 +1621,13 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
 	if (!mm || !mm->pgd)
 		return 0;
 
+	start_lockless_pgtbl_walk(mm);
 	local_irq_save(flags);
 	ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
 	if (ptep)
 		pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
 	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm);
 
 	return pkey;
 }
-- 
2.20.1




* [PATCH v4 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (5 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
kvmppc_e500_shadow_map().

Fixes the place where local_irq_restore() is called: previously, if ptep
was NULL, local_irq_restore() would never be called.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/e500_mmu_host.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 321db0fdb9db..a0be76ad8d14 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -473,6 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	 * We are holding kvm->mmu_lock so a notifier invalidate
 	 * can't run hence pfn won't change.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	local_irq_save(flags);
 	ptep = find_linux_pte(pgdir, hva, NULL, NULL);
 	if (ptep) {
@@ -481,15 +482,18 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		if (pte_present(pte)) {
 			wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
 				MAS2_WIMGE_MASK;
-			local_irq_restore(flags);
 		} else {
 			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm);
 			pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
 					   __func__, (long)gfn, pfn);
 			ret = -EINVAL;
 			goto out;
 		}
 	}
+	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(kvm->mm);
+
 	kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
 	kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
-- 
2.20.1




* [PATCH v4 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (6 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 07/11] powerpc/kvm/e500: " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all book3s_hv related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, due to the guest pgd not being a target of THP
collapse/split, or due to being called from real mode with MSR_EE=0.

kvmppc_do_h_enter: Fixes where local_irq_restore() must be placed (after
the last usage of ptep).

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 18 ++++++++++++++----
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 735e0ac6f5b2..5a641b559de7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;
 
-	/* Find the pte */
+	/* Find the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/*
 	 * If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;
 
-	/* Find and invalidate the pte */
+	/* Find and invalidate the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/* Don't spuriously invalidate ptes if the pfn has changed */
 	if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
 	int shift;
 
 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
 	/* See if can find translation in our partition scoped tables for L1 */
 	pte = __pte(0);
 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..2076a7ac230a 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -252,6 +252,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	 * If we had a page table table change after lookup, we would
 	 * retry via mmu_notifier_retry.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	if (!realmode)
 		local_irq_save(irq_flags);
 	/*
@@ -287,8 +288,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			pa |= gpa & ~PAGE_MASK;
 		}
 	}
-	if (!realmode)
-		local_irq_restore(irq_flags);
 
 	ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
 	ptel |= pa;
@@ -311,6 +310,9 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
 		ptel |= HPTE_R_M;
 	}
+	if (!realmode)
+		local_irq_restore(irq_flags);
+	end_lockless_pgtbl_walk(kvm->mm);
 
 	/* Find and lock the HPTEG slot to use */
  do_insert:
@@ -885,11 +887,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
 	/* Translate to host virtual address */
 	hva = __gfn_to_hva_memslot(memslot, gfn);
 
-	/* Try to find the host pte for that virtual address */
+	/* Try to find the host pte for that virtual address:
+	 * Called from hcall_real_table (real mode + MSR_EE=0), so
+	 * interrupts are disabled here.
+	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-	if (!ptep)
+	if (!ptep) {
+		end_lockless_pgtbl_walk(kvm->mm);
 		return H_TOO_HARD;
+	}
 	pte = kvmppc_read_update_linux_pte(ptep, writing);
+	end_lockless_pgtbl_walk(kvm->mm);
+
 	if (!pte_present(pte))
 		return H_TOO_HARD;
 
-- 
2.20.1




* [PATCH v4 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (7 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, due to the guest pgd not being a target of THP
collapse/split, or due to being called from real mode with MSR_EE=0.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 30 ++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  3 +++
 3 files changed, 35 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..fcd3dad1297f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -620,6 +620,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			 * We need to protect against page table destruction
 			 * hugepage split and collapse.
 			 */
+			start_lockless_pgtbl_walk(kvm->mm);
 			local_irq_save(flags);
 			ptep = find_current_mm_pte(current->mm->pgd,
 						   hva, NULL, NULL);
@@ -629,6 +630,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 					write_ok = 1;
 			}
 			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm);
 		}
 	}
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9b374b9838fa 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,6 +813,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 * Read the PTE from the process' radix tree and use that
 	 * so we get the shift and attribute bits.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	local_irq_disable();
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
 	/*
@@ -821,12 +822,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 */
 	if (!ptep) {
 		local_irq_enable();
+		end_lockless_pgtbl_walk(kvm->mm);
 		if (page)
 			put_page(page);
 		return RESUME_GUEST;
 	}
 	pte = *ptep;
 	local_irq_enable();
+	end_lockless_pgtbl_walk(kvm->mm);
 
 	/* If we're logging dirty pages, always map single pages */
 	large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +975,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 				 kvm->arch.lpid);
+
 	return 0;				
 }
 
@@ -989,6 +998,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	int ref = 0;
 	unsigned long old, *rmapp;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1027,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned int shift;
 	int ref = 0;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
 		ref = 1;
@@ -1030,6 +1049,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	int ret = 0;
 	unsigned long old, *rmapp;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
 		ret = 1;
@@ -1046,6 +1070,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 					       1UL << shift);
 		spin_unlock(&kvm->mmu_lock);
 	}
+
 	return ret;
 }
 
@@ -1085,6 +1110,11 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
 	gpa = memslot->base_gfn << PAGE_SHIFT;
 	spin_lock(&kvm->mmu_lock);
 	for (n = memslot->npages; n; --n) {
+		/*
+		 * We are walking the secondary (partition-scoped) page table here.
+		 * We can do this without disabling irq because the Linux MM
+		 * subsystem doesn't do THP splits and collapses on this tree.
+		 */
 		ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 		if (ptep && pte_present(*ptep))
 			kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index b4f20f13b860..376d069a92dd 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -431,6 +431,7 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
 		unsigned long ua, unsigned long *phpa)
 {
+	struct kvm *kvm = vcpu->kvm;
 	pte_t *ptep, pte;
 	unsigned shift = 0;
 
@@ -443,10 +444,12 @@ static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
 	 * to exit which will agains result in the below page table walk
 	 * to finish.
 	 */
+	start_lockless_pgtbl_walk(kvm->mm);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, ua, NULL, &shift);
 	if (!ptep || !pte_present(*ptep))
 		return -ENXIO;
 	pte = *ptep;
+	end_lockless_pgtbl_walk(kvm->mm);
 
 	if (!shift)
 		shift = PAGE_SHIFT;
-- 
2.20.1




* [PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (8 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  2019-09-27 23:40 ` [PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Enables the count-based monitoring method for lockless pagetable walks
on PowerPC book3s_64.

Other architectures/platforms fall back to the generic dummy functions.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 8308f32e9782..eb9b26a4a483 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1370,5 +1370,10 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
+void start_lockless_pgtbl_walk(struct mm_struct *mm);
+void end_lockless_pgtbl_walk(struct mm_struct *mm);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
-- 
2.20.1




* [PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
  2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
                   ` (9 preceding siblings ...)
  2019-09-27 23:40 ` [PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk Leonardo Bras
@ 2019-09-27 23:40 ` Leonardo Bras
  10 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-27 23:40 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Andrew Morton, Dan Williams, Nicholas Piggin,
	Mahesh Salgaonkar, Allison Randal, Thomas Gleixner,
	Ganesh Goudar, Mike Rapoport, YueHaibing, Greg Kroah-Hartman,
	Ira Weiny, Jason Gunthorpe, John Hubbard, Keith Busch

Skips the slow part of serialize_against_pte_lookup() if there is no
lockless pagetable walk currently running.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 6ba6195bff1b..27966481f2a6 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
-	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+	if (running_lockless_pgtbl_walk(mm))
+		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
 /*
-- 
2.20.1




* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-27 23:39 ` [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks Leonardo Bras
@ 2019-09-29 22:40   ` John Hubbard
  2019-09-29 23:17     ` John Hubbard
  2019-09-30 15:14     ` Leonardo Bras
  0 siblings, 2 replies; 26+ messages in thread
From: John Hubbard @ 2019-09-29 22:40 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Allison Randal,
	Thomas Gleixner, Ganesh Goudar, Mike Rapoport, YueHaibing,
	Greg Kroah-Hartman, Ira Weiny, Jason Gunthorpe, Keith Busch

On 9/27/19 4:39 PM, Leonardo Bras wrote:
> It's necessary to monitor lockless pagetable walks, in order to avoid doing
> THP splitting/collapsing during them.
> 
> Some methods rely on local_irq_{save,restore}, but that can be slow in
> cases where a lot of CPUs are used by the process.
> 
> In order to speed up some cases, I propose a refcount-based approach that
> counts the number of lockless pagetable walks happening in the process.
> 
> This method does not replace the current irq-based method; it works as a
> complement to skip unnecessary waiting.
> 
> start_lockless_pgtbl_walk(mm)
> 	Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk(mm)
> 	Insert after the end of any lockless pgtable walk
> 	(Mostly after the ptep is last used)
> running_lockless_pgtbl_walk(mm)
> 	Returns the number of lockless pgtable walks running
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>   arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
>   arch/powerpc/mm/book3s64/mmu_context.c   |  1 +
>   arch/powerpc/mm/book3s64/pgtable.c       | 45 ++++++++++++++++++++++++
>   3 files changed, 49 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
> index 23b83d3593e2..13b006e7dde4 100644
> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> @@ -116,6 +116,9 @@ typedef struct {
>   	/* Number of users of the external (Nest) MMU */
>   	atomic_t copros;
>   
> +	/* Number of running instances of lockless pagetable walk */
> +	atomic_t lockless_pgtbl_walk_count;
> +
>   	struct hash_mm_context *hash_context;
>   
>   	unsigned long vdso_base;
> diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
> index 2d0cb5ba9a47..3dd01c0ca5be 100644
> --- a/arch/powerpc/mm/book3s64/mmu_context.c
> +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> @@ -200,6 +200,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>   #endif
>   	atomic_set(&mm->context.active_cpus, 0);
>   	atomic_set(&mm->context.copros, 0);
> +	atomic_set(&mm->context.lockless_pgtbl_walk_count, 0);
>   
>   	return 0;
>   }
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..6ba6195bff1b 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -98,6 +98,51 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
>   	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
>   }
>   
> +/*
> + * Counting method to monitor lockless pagetable walks:
> + * Uses start_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
> + * number of lockless pgtable walks happening, and
> + * running_lockless_pgtbl_walk to return this value.
> + */
> +
> +/* start_lockless_pgtbl_walk: Must be inserted before a function call that does
> + *   lockless pagetable walks, such as __find_linux_pte()
> + */
> +void start_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	atomic_inc(&mm->context.lockless_pgtbl_walk_count);
> +	/* Avoid reordering, to guarantee that the increment happens before
> +	 * any part of the lockless pagetable walk that follows it.
> +	 */
> +	smp_mb();
> +}
> +EXPORT_SYMBOL(start_lockless_pgtbl_walk);
> +
> +/*
> + * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
> + *   returned by a lockless pagetable walk, such as __find_linux_pte()
> + */
> +void end_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	/* Avoid reordering, to guarantee that it only decrements after the
> +	 * last use of the ptep returned by the lockless pagetable walk.
> +	 */
> +	smp_mb();
> +	atomic_dec(&mm->context.lockless_pgtbl_walk_count);
> +}
> +EXPORT_SYMBOL(end_lockless_pgtbl_walk);
> +
> +/*
> + * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
> + *   currently running. If it returns 0, there is no running pagetable walk, and
> + *   THP split/collapse can be safely done. This can be used to avoid more
> + *   expensive approaches like serialize_against_pte_lookup()
> + */
> +int running_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	return atomic_read(&mm->context.lockless_pgtbl_walk_count);
> +}
> +
>   /*
>    * We use this to invalidate a pmdp entry before switching from a
>    * hugepte to regular pmd entry.
> 

Hi, Leonardo,

Can we please do it as shown below, instead (compile-tested only)?

This addresses all of the comments that I was going to make about structure
of this patch, which are:

* The lockless synch is tricky, so it should be encapsulated in function
   calls if possible.

* This is really a core mm function, so don't hide it away in arch layers.
   (If you're changing mm/ files, that's a big hint.)

* Other things need parts of this: gup.c needs the memory barriers; IMHO you'll
   be fixing a pre-existing, theoretical (we've never seen bug reports) problem.

* The documentation needs to accurately explain what's going on here.

(Not shown: one or more of the PPC Kconfig files should select
LOCKLESS_PAGE_TABLE_WALK_TRACKING.)
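
For instance (just a sketch, the exact Kconfig hookup is up to the arch
maintainers):

config PPC_BOOK3S_64
	# hypothetical placement; any PPC config using the counter works
	select LOCKLESS_PAGE_TABLE_WALK_TRACKING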

So:


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 294a67b94147..c9e5defb4d7e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1541,6 +1541,9 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
  			struct task_struct *task, bool bypass_rlim);
  
+void register_lockless_page_table_walker(unsigned long *flags);
+void deregister_lockless_page_table_walker(unsigned long *flags);
+
  /* Container for pinned pfns / pages */
  struct frame_vector {
  	unsigned int nr_allocated;	/* Number of frames we have space for */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5183e0d77dfa..83b7930a995f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,16 @@ struct mm_struct {
  		 */
  		atomic_t mm_count;
  
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+		/*
+		 * Number of callers who are doing a lockless walk of the
+		 * page tables. Typically arches might enable this in order to
+		 * help optimize performance, by possibly avoiding expensive
+		 * IPIs at the wrong times.
+		 */
+		atomic_t lockless_pgtbl_nr_walkers;
+#endif
+
  #ifdef CONFIG_MMU
  		atomic_long_t pgtables_bytes;	/* PTE page table pages */
  #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..1cf58f668fe1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
  config ARCH_HAS_HUGEPD
  	bool
  
+config LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	bool "Tracking (and optimization) of lockless page table walkers"
+	default n
+
+	help
+	  Maintain a reference count of active lockless page table
+	  walkers. This adds 4 bytes to struct mm size, and two atomic
+	  operations to calls such as get_user_pages_fast(). Some
+	  architectures can optimize page table operations if this
+	  is enabled.
+
  endmenu
diff --git a/mm/gup.c b/mm/gup.c
index 60c3915c8ee6..7b1be8ed1e8f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2302,6 +2302,62 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
  }
  #endif
  
+/*
+ * register_lockless_page_table_walker() - start a lockless page table walk
+ *
+ * @flags: for saving and restoring irq state
+ *
+ * Lockless page table walking still requires synchronization against freeing
+ * of the page tables, and against splitting of huge pages. This is done by
+ * interacting with interrupts, as first described in the struct mmu_table_batch
+ * comments in include/asm-generic/tlb.h.
+ *
+ * In order to do the right thing, code that walks page tables in the style of
+ * get_user_pages_fast() should call register_lockless_page_table_walker()
+ * before starting the walk, and deregister_lockless_page_table_walker() upon
+ * finishing.
+ */
+void register_lockless_page_table_walker(unsigned long *flags)
+{
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_inc(&current->mm->lockless_pgtbl_nr_walkers);
+#endif
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. In order for that to work,
+	 * interrupts must also be disabled during the lockless page table
+	 * walk. That's because the deleting or splitting involves flushing
+	 * TLBs, which in turn issues interrupts, which will block here.
+	 * However, without memory barriers, the page tables could be
+	 * read speculatively outside of interrupt disabling.
+	 */
+	smp_mb();
+	local_irq_save(*flags);
+}
+EXPORT_SYMBOL_GPL(register_lockless_page_table_walker);
+
+/*
+ * deregister_lockless_page_table_walker() - finish a lockless page table walk
+ *
+ * This is the complement to register_lockless_page_table_walker().
+ *
+ * @flags: for saving and restoring irq state
+ */
+void deregister_lockless_page_table_walker(unsigned long *flags)
+{
+	local_irq_restore(*flags);
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. See the comments in
+ * register_lockless_page_table_walker() for details.
+	 */
+	smp_mb();
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_dec(&current->mm->lockless_pgtbl_nr_walkers);
+#endif
+}
+EXPORT_SYMBOL_GPL(deregister_lockless_page_table_walker);
+
  /*
   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
   * the regular GUP.
@@ -2341,9 +2397,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
  
  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
  	    gup_fast_permitted(start, end)) {
-		local_irq_save(flags);
+		register_lockless_page_table_walker(&flags);
  		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
-		local_irq_restore(flags);
+		deregister_lockless_page_table_walker(&flags);
  	}
  
  	return nr;
@@ -2392,7 +2448,7 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
  int get_user_pages_fast(unsigned long start, int nr_pages,
  			unsigned int gup_flags, struct page **pages)
  {
-	unsigned long addr, len, end;
+	unsigned long addr, len, end, flags;
  	int nr = 0, ret = 0;
  
  	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
@@ -2410,9 +2466,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
  
  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
  	    gup_fast_permitted(start, end)) {
-		local_irq_disable();
+		register_lockless_page_table_walker(&flags);
  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
-		local_irq_enable();
+		deregister_lockless_page_table_walker(&flags);
  		ret = nr;
  	}
  


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-29 22:40   ` John Hubbard
@ 2019-09-29 23:17     ` John Hubbard
  2019-09-30 15:14     ` Leonardo Bras
  1 sibling, 0 replies; 26+ messages in thread
From: John Hubbard @ 2019-09-29 23:17 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Allison Randal,
	Thomas Gleixner, Ganesh Goudar, Mike Rapoport, YueHaibing,
	Greg Kroah-Hartman, Ira Weiny, Jason Gunthorpe, Keith Busch

On 9/29/19 3:40 PM, John Hubbard wrote:
> On 9/27/19 4:39 PM, Leonardo Bras wrote:
...
> +config LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +    bool "Tracking (and optimization) of lockless page table walkers"
> +    default n
> +
> +    help
> +      Maintain a reference count of active lockless page table
> +      walkers. This adds 4 bytes to struct mm size, and two atomic
> +      operations to calls such as get_user_pages_fast(). Some
> +      architectures can optimize page table operations if this
> +      is enabled.
> +
>   endmenu

Actually, the above should be an internal-only config option (PPC arch can
auto-select it), so just:

+config LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	bool

...because it's entirely up to other code (as opposed to other people)
as to whether this should be selected.
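
For example, one of the powerpc Kconfig entries would grow a select
line, roughly like this (the file and config entry below are only my
guess at a plausible spot):

# arch/powerpc/platforms/Kconfig.cputype, under the existing
# PPC_BOOK3S_64 entry:
	select LOCKLESS_PAGE_TABLE_WALK_TRACKING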

I got carried away. :)

thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-09-27 23:40 ` [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
@ 2019-09-30 11:09   ` Kirill A. Shutemov
  2019-09-30 14:27     ` Leonardo Bras
  2019-09-30 21:51   ` John Hubbard
  1 sibling, 1 reply; 26+ messages in thread
From: Kirill A. Shutemov @ 2019-09-30 11:09 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Allison Randal,
	Thomas Gleixner, Ganesh Goudar, Mike Rapoport, YueHaibing,
	Greg Kroah-Hartman, Ira Weiny, Jason Gunthorpe, John Hubbard,
	Keith Busch

On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
     ^ typo

-- 
 Kirill A. Shutemov



* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-09-30 11:09   ` Kirill A. Shutemov
@ 2019-09-30 14:27     ` Leonardo Bras
  0 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-09-30 14:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Allison Randal,
	Thomas Gleixner, Ganesh Goudar, Mike Rapoport, YueHaibing,
	Greg Kroah-Hartman, Ira Weiny, Jason Gunthorpe, John Hubbard,
	Keith Busch


On Mon, 2019-09-30 at 14:09 +0300, Kirill A. Shutemov wrote:
> On Fri, Sep 27, 2019 at 08:40:00PM -0300, Leonardo Bras wrote:
> > As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
>      ^ typo
> 
Fixed, thanks!



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-29 22:40   ` John Hubbard
  2019-09-29 23:17     ` John Hubbard
@ 2019-09-30 15:14     ` Leonardo Bras
  2019-09-30 17:57       ` John Hubbard
  1 sibling, 1 reply; 26+ messages in thread
From: Leonardo Bras @ 2019-09-30 15:14 UTC (permalink / raw)
  To: John Hubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Keith Busch, Thomas Gleixner, Arnd Bergmann, Greg Kroah-Hartman,
	YueHaibing, Nicholas Piggin, Mike Rapoport, Mahesh Salgaonkar,
	Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V, Ganesh Goudar,
	Andrew Morton, Ira Weiny, Dan Williams, Allison Randal


On Sun, 2019-09-29 at 15:40 -0700, John Hubbard wrote:
> Hi, Leonardo,

Hello John, thanks for the feedback.

> Can we please do it as shown below, instead (compile-tested only)?
> 
> This addresses all of the comments that I was going to make about structure
> of this patch, which are:
> 
> * The lockless synch is tricky, so it should be encapsulated in function
>    calls if possible.

As I said before, there are cases where this function is called from
'real mode' in powerpc, which doesn't disable irqs and may behave
unpredictably if we do. So, encapsulating the irq disable in this
function can be a bad choice.

Of course, if we really need that, we can add a bool parameter to the
function to choose about disabling/enabling irqs.
> 
> * This is really a core mm function, so don't hide it away in arch layers.
>    (If you're changing mm/ files, that's a big hint.)

My idea here is to let the arch decide on how this 'register' is going
to work, as archs may have different needs (in powerpc for example, we
can't always disable irqs, since we may be in realmode).

Maybe we can create a generic function instead of a dummy, and let it
be replaced in case the arch needs to do so.

> * Other things need parts of this: gup.c needs the memory barriers; IMHO you'll
>    be fixing a pre-existing, theoretical (we've never seen bug reports) problem.

Humm, you are right. Here I would suggest adding the barrier to the
generic function.

> * The documentation needs to accurately explain what's going on here.

Yes, my documentation was probably not good enough due to my lack of
experience with memory barriers. (I learnt about using them last week,
and tried to come up with the best solution.)

> (Not shown: one or more of the PPC Kconfig files should select
> LOCKLESS_PAGE_TABLE_WALK_TRACKING.)

The way it works today is to define it in the platform's pgtable.h. I
agree that using Kconfig may be a better solution, since it makes this
config more visible to enable/disable.

Thanks for the feedback,

Leonardo Bras



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-30 15:14     ` Leonardo Bras
@ 2019-09-30 17:57       ` John Hubbard
  2019-09-30 18:42         ` Leonardo Bras
  0 siblings, 1 reply; 26+ messages in thread
From: John Hubbard @ 2019-09-30 17:57 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Keith Busch, Thomas Gleixner, Arnd Bergmann, Greg Kroah-Hartman,
	YueHaibing, Nicholas Piggin, Mike Rapoport, Mahesh Salgaonkar,
	Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V, Ganesh Goudar,
	Andrew Morton, Ira Weiny, Dan Williams, Allison Randal

On 9/30/19 8:14 AM, Leonardo Bras wrote:
> On Sun, 2019-09-29 at 15:40 -0700, John Hubbard wrote:
>> Hi, Leonardo,
> 
> Hello John, thanks for the feedback.
> 
>> Can we please do it as shown below, instead (compile-tested only)?
>>
>> This addresses all of the comments that I was going to make about structure
>> of this patch, which are:
>>
>> * The lockless synch is tricky, so it should be encapsulated in function
>>     calls if possible.
> 
> As I said before, there are cases where this function is called from
> 'real mode' in powerpc, which doesn't disable irqs and may behave
> unpredictably if we do. So, encapsulating the irq disable in this
> function can be a bad choice.

You still haven't explained how this works in that case. So far, the
synchronization we've discussed has depended upon interrupt disabling
as part of the solution, in order to hold off page splitting and page
table freeing.

Simply skipping that means that an additional mechanism is required...which
btw might involve a new, ppc-specific routine, so maybe this is going to end
up pretty close to what I pasted in after all...

> 
> Of course, if we really need that, we can add a bool parameter to the
> function to choose about disabling/enabling irqs.
>>
>> * This is really a core mm function, so don't hide it away in arch layers.
>>     (If you're changing mm/ files, that's a big hint.)
> 
> My idea here is to let the arch decide on how this 'register' is going
> to work, as archs may have different needs (in powerpc for example, we
> can't always disable irqs, since we may be in realmode).
> 
> Maybe we can create a generic function instead of a dummy, and let it
> be replaced in case the arch needs to do so.

Yes, that might be what we need, if it turns out that ppc can't use this
approach (although let's see about that).


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-30 17:57       ` John Hubbard
@ 2019-09-30 18:42         ` Leonardo Bras
  2019-09-30 21:47           ` John Hubbard
  0 siblings, 1 reply; 26+ messages in thread
From: Leonardo Bras @ 2019-09-30 18:42 UTC (permalink / raw)
  To: John Hubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Arnd Bergmann, Greg Kroah-Hartman, YueHaibing, Nicholas Piggin,
	Mike Rapoport, Keith Busch, Jason Gunthorpe, Paul Mackerras,
	Aneesh Kumar K.V, Allison Randal, Mahesh Salgaonkar,
	Ganesh Goudar, Thomas Gleixner, Ira Weiny, Andrew Morton,
	Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1903 bytes --]

On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
> > As I said before, there are cases where this function is called from
> > 'real mode' in powerpc, which doesn't disable irqs and may behave
> > unpredictably if we do. So, encapsulating the irq disable in this
> > function can be a bad choice.
> 
> You still haven't explained how this works in that case. So far, the
> synchronization we've discussed has depended upon interrupt disabling
> as part of the solution, in order to hold off page splitting and page
> table freeing.

The irqs are already disabled by another mechanism (hw): MSR_EE=0.
So, serialize will work as expected.

> Simply skipping that means that an additional mechanism is required...which
> btw might involve a new, ppc-specific routine, so maybe this is going to end
> up pretty close to what I pasted in after all...
> > Of course, if we really need that, we can add a bool parameter to the
> > function to choose about disabling/enabling irqs.
> > > * This is really a core mm function, so don't hide it away in arch layers.
> > >     (If you're changing mm/ files, that's a big hint.)
> > 
> > My idea here is to let the arch decide on how this 'register' is going
> > to work, as archs may have different needs (in powerpc for example, we
> > can't always disable irqs, since we may be in realmode).
> > 
> > Maybe we can create a generic function instead of a dummy, and let it
> > be replaced in case the arch needs to do so.
> 
> Yes, that might be what we need, if it turns out that ppc can't use this
> approach (although let's see about that).
> 

I initially used the dummy approach because I did not see anything like
serialize in other archs. 

I mean, even if I put some generic function here, if there is no
function to use the 'lockless_pgtbl_walk_count', it becomes pure
overhead.

> 
> thanks,

Thank you!



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-30 18:42         ` Leonardo Bras
@ 2019-09-30 21:47           ` John Hubbard
  2019-10-01 18:39             ` Leonardo Bras
  0 siblings, 1 reply; 26+ messages in thread
From: John Hubbard @ 2019-09-30 21:47 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Arnd Bergmann, Greg Kroah-Hartman, YueHaibing, Nicholas Piggin,
	Mike Rapoport, Keith Busch, Jason Gunthorpe, Paul Mackerras,
	Aneesh Kumar K.V, Allison Randal, Mahesh Salgaonkar,
	Ganesh Goudar, Thomas Gleixner, Ira Weiny, Andrew Morton,
	Dan Williams

On 9/30/19 11:42 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
>>> As I said before, there are cases where this function is called from
>>> 'real mode' in powerpc, which doesn't disable irqs and may behave
>>> unpredictably if we do. So, encapsulating the irq disable in this
>>> function can be a bad choice.
>>
>> You still haven't explained how this works in that case. So far, the
>> synchronization we've discussed has depended upon interrupt disabling
>> as part of the solution, in order to hold off page splitting and page
>> table freeing.
> 
> The irqs are already disabled by another mechanism (hw): MSR_EE=0.
> So, serialize will work as expected.

I get that they're disabled. But will this interlock with the code that
issues IPIs?? Because it's not just disabling interrupts that matters, but
rather, synchronizing with the code (TLB flushing) that *happens* to 
require issuing IPIs, which in turn interact with disabling interrupts.

So I'm still not seeing how that could work here, unless there is something
interesting about the smp_call_function_many() on ppc with MSR_EE=0 mode...?

> 
>> Simply skipping that means that an additional mechanism is required...which
>> btw might involve a new, ppc-specific routine, so maybe this is going to end
>> up pretty close to what I pasted in after all...
>>> Of course, if we really need that, we can add a bool parameter to the
>>> function to choose about disabling/enabling irqs.
>>>> * This is really a core mm function, so don't hide it away in arch layers.
>>>>     (If you're changing mm/ files, that's a big hint.)
>>>
>>> My idea here is to let the arch decide on how this 'register' is going
>>> to work, as archs may have different needs (in powerpc for example, we
>>> can't always disable irqs, since we may be in realmode).

Yes, the tension there is that a) some things are per-arch, and b) it's easy 
to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
that.

So, I would like core mm/ functions that guide the way, but the interrupt
behavior complicates it. I think your original passing of just the mm_struct
is probably the right balance, assuming that I'm wrong about interrupts.


>>>
>>> Maybe we can create a generic function instead of a dummy, and let it
>>> be replaced in case the arch needs to do so.
>>
>> Yes, that might be what we need, if it turns out that ppc can't use this
>> approach (although let's see about that).
>>
> 
> I initially used the dummy approach because I did not see anything like
> serialize in other archs. 
> 
> I mean, even if I put some generic function here, if there is no
> function to use the 'lockless_pgtbl_walk_count', it becomes pure
> overhead.
> 

Not really: the memory barrier is required in all cases, and this code
would be good I think:

+void register_lockless_pgtable_walker(struct mm_struct *mm)
+{
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+       atomic_inc(&mm->lockless_pgtbl_nr_walkers);
+#endif
+       /*
+        * This memory barrier pairs with any code that is either trying to
+        * delete page tables, or split huge pages.
+        */
+       smp_mb();
+}
+EXPORT_SYMBOL_GPL(register_lockless_pgtable_walker);
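
(The matching deregister isn't shown above; a minimal sketch, mirroring
the same barrier, would be:)

+void deregister_lockless_pgtable_walker(struct mm_struct *mm)
+{
+       /*
+        * Pairs with the smp_mb() in register_lockless_pgtable_walker():
+        * page table reads must complete before the walker count is seen
+        * to drop.
+        */
+       smp_mb();
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+       atomic_dec(&mm->lockless_pgtbl_nr_walkers);
+#endif
+}
+EXPORT_SYMBOL_GPL(deregister_lockless_pgtable_walker);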

And this is the same as your original patch, with just a minor name change:

@@ -2341,9 +2395,11 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
        if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
            gup_fast_permitted(start, end)) {
+               register_lockless_pgtable_walker(current->mm);
                local_irq_save(flags);
                gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
                local_irq_restore(flags);
+               deregister_lockless_pgtable_walker(current->mm);


Btw, hopefully minor note: it also looks like there's a number of changes in the same 
area that conflict, for example:

    commit d9101bfa6adc ("powerpc/mm/mce: Keep irqs disabled during lockless 
         page table walk") <Aneesh Kumar K.V> (Thu, 19 Sep 2019)

...so it would be good to rebase this onto 5.4-rc1, now that that's here.


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-09-27 23:40 ` [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
  2019-09-30 11:09   ` Kirill A. Shutemov
@ 2019-09-30 21:51   ` John Hubbard
  2019-10-01 17:56     ` Leonardo Bras
  1 sibling, 1 reply; 26+ messages in thread
From: John Hubbard @ 2019-09-30 21:51 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
	Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Allison Randal,
	Thomas Gleixner, Ganesh Goudar, Mike Rapoport, YueHaibing,
	Greg Kroah-Hartman, Ira Weiny, Jason Gunthorpe, Keith Busch

On 9/27/19 4:40 PM, Leonardo Bras wrote:
> As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
> monitor against THP split/collapse with the couting method, it's necessary

s/couting/counting/

> to bound it with {start,end}_lockless_pgtbl_walk.
> 
> There are dummy functions, so it is not going to add any overhead on archs
> that don't use this method.
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>  mm/gup.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 98f13ab37bac..7105c829cf44 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
>  {
> +	struct mm_struct *mm;

I don't think that this local variable adds any value, so let's not use it.
Similar point in a few other patches too.

>  	unsigned long len, end;
>  	unsigned long flags;
>  	int nr = 0;
> @@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
> +		mm = current->mm;
> +		start_lockless_pgtbl_walk(mm);
>  		local_irq_save(flags);
>  		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
>  		local_irq_restore(flags);
> +		end_lockless_pgtbl_walk(mm);
>  	}
>  
>  	return nr;
> @@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages)
>  {
>  	unsigned long addr, len, end;
> +	struct mm_struct *mm;

Same here.

>  	int nr = 0, ret = 0;
>  
>  	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> @@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  
>  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>  	    gup_fast_permitted(start, end)) {
> +		mm = current->mm;
> +		start_lockless_pgtbl_walk(mm);

Minor: I'd like to rename this register_lockless_pgtable_walker().

>  		local_irq_disable();
>  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
>  		local_irq_enable();
> +		end_lockless_pgtbl_walk(mm);

...and deregister_lockless_pgtable_walker().


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-09-30 21:51   ` John Hubbard
@ 2019-10-01 17:56     ` Leonardo Bras
  2019-10-01 19:04       ` John Hubbard
  0 siblings, 1 reply; 26+ messages in thread
From: Leonardo Bras @ 2019-10-01 17:56 UTC (permalink / raw)
  To: John Hubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Keith Busch, Thomas Gleixner, Arnd Bergmann, Greg Kroah-Hartman,
	YueHaibing, Nicholas Piggin, Mike Rapoport, Mahesh Salgaonkar,
	Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V, Ganesh Goudar,
	Andrew Morton, Ira Weiny, Dan Williams, Allison Randal


On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
> On 9/27/19 4:40 PM, Leonardo Bras wrote:
> > As decribed, gup_pgd_range is a lockless pagetable walk. So, in order to
> > monitor against THP split/collapse with the couting method, it's necessary
> 
> s/couting/counting/
> 

Thanks, fixed for v5.

> > to bound it with {start,end}_lockless_pgtbl_walk.
> > 
> > There are dummy functions, so it is not going to add any overhead on archs
> > that don't use this method.
> > 
> > Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> > ---
> >  mm/gup.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 98f13ab37bac..7105c829cf44 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
> >  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  			  struct page **pages)
> >  {
> > +	struct mm_struct *mm;
> 
> I don't think that this local variable adds any value, so let's not use it.
> Similar point in a few other patches too.

It avoids one dereference of current->mm, which is a small performance gain.

> 
> >  	unsigned long len, end;
> >  	unsigned long flags;
> >  	int nr = 0;
> > @@ -2352,9 +2353,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >  
> >  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
> >  	    gup_fast_permitted(start, end)) {
> > +		mm = current->mm;
> > +		start_lockless_pgtbl_walk(mm);
> >  		local_irq_save(flags);
> >  		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
> >  		local_irq_restore(flags);
> > +		end_lockless_pgtbl_walk(mm);
> >  	}
> >  
> >  	return nr;
> > @@ -2404,6 +2408,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >  			unsigned int gup_flags, struct page **pages)
> >  {
> >  	unsigned long addr, len, end;
> > +	struct mm_struct *mm;
> 
> Same here.
> 
> >  	int nr = 0, ret = 0;
> >  
> >  	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> > @@ -2421,9 +2426,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> >  
> >  	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
> >  	    gup_fast_permitted(start, end)) {
> > +		mm = current->mm;
> > +		start_lockless_pgtbl_walk(mm);
> 
> Minor: I'd like to rename this register_lockless_pgtable_walker().
> 
> >  		local_irq_disable();
> >  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
> >  		local_irq_enable();
> > +		end_lockless_pgtbl_walk(mm);
> 
> ...and deregister_lockless_pgtable_walker().
> 

I have no problem changing the name, but I don't think register/deregister
are good terms for this. 

I would rather use start/finish, begin/end, and so on. Register sounds
like something more complicated than what we are trying to achieve
here. 

> 
> thanks,

Thank you!



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-09-30 21:47           ` John Hubbard
@ 2019-10-01 18:39             ` Leonardo Bras
  2019-10-01 18:52               ` John Hubbard
  0 siblings, 1 reply; 26+ messages in thread
From: Leonardo Bras @ 2019-10-01 18:39 UTC (permalink / raw)
  To: John Hubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Dan Williams, Arnd Bergmann, Greg Kroah-Hartman,
	Mahesh Salgaonkar, YueHaibing, Nicholas Piggin, Mike Rapoport,
	Keith Busch, Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V,
	Ganesh Goudar, Thomas Gleixner, Ira Weiny, Andrew Morton,
	Allison Randal


On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
> On 9/30/19 11:42 AM, Leonardo Bras wrote:
> > On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
> > > > As I said before, there are cases where this function is called from
> > > > 'real mode' in powerpc, which doesn't disable irqs and may behave
> > > > unpredictably if we do. So, encapsulating the irq disable in this
> > > > function can be a bad choice.
> > > 
> > > You still haven't explained how this works in that case. So far, the
> > > synchronization we've discussed has depended upon interrupt disabling
> > > as part of the solution, in order to hold off page splitting and page
> > > table freeing.
> > 
> > The irqs are already disabled by another mechanism (hw): MSR_EE=0.
> > So, serialize will work as expected.
> 
> I get that they're disabled. But will this interlock with the code that
> issues IPIs?? Because it's not just disabling interrupts that matters, but
> rather, synchronizing with the code (TLB flushing) that *happens* to 
> require issuing IPIs, which in turn interact with disabling interrupts.
> 
> So I'm still not seeing how that could work here, unless there is something
> interesting about the smp_call_function_many() on ppc with MSR_EE=0 mode...?
> 

I am failing to understand the issue.
I mean, smp_call_function_many() will issue an IPI to each CPU in
the cpumask and wait for it to run before returning.
If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
the IPI will not run on that CPU, and the wait will block the calling
thread until interrupts are enabled again.

Could you please point out the issue there?
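
Just to make concrete what I am expecting to happen (a sketch of the
pattern, not the actual powerpc code; wait_for_lockless_walkers() is a
made-up name for illustration):

static void do_nothing(void *unused)
{
}

/*
 * With wait == true, smp_call_function_many() returns only after every
 * CPU in the mask has run do_nothing(). A CPU that has interrupts
 * disabled (local_irq_disable() or MSR_EE=0) services the IPI only
 * after re-enabling them, i.e. only after leaving the lockless walk.
 */
static void wait_for_lockless_walkers(struct mm_struct *mm)
{
	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, true);
}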

> > > Simply skipping that means that an additional mechanism is required...which
> > > btw might involve a new, ppc-specific routine, so maybe this is going to end
> > > up pretty close to what I pasted in after all...
> > > > Of course, if we really need that, we can add a bool parameter to the
> > > > function to choose about disabling/enabling irqs.
> > > > > * This is really a core mm function, so don't hide it away in arch layers.
> > > > >     (If you're changing mm/ files, that's a big hint.)
> > > > 
> > > > My idea here is to let the arch decide on how this 'register' is going
> > > > to work, as archs may have different needs (in powerpc for example, we
> > > > can't always disable irqs, since we may be in realmode).
> 
> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
> that.
> 
> So, I would like core mm/ functions that guide the way, but the interrupt
> behavior complicates it. I think your original passing of just the mm_struct
> is probably the right balance, assuming that I'm wrong about interrupts.
> 

I think, for the generic function, including {en,dis}abling the
interrupt is fine. I mean, if disabling the interrupt is the generic
behavior, it's ok.
I will just make sure to explain that the interrupt {en,dis}abling is
part of the sync process. If an arch doesn't like it, it can write a
specific function that does the sync in a better way (and define
__HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to override the generic
function).

In this case, the generic function would also include the ifdef'ed
atomic inc and the memory barrier.
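
Something like this rough sketch (names follow the discussion above;
nothing here is final):

#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER
static inline void start_lockless_pgtbl_walk(struct mm_struct *mm)
{
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	atomic_inc(&mm->lockless_pgtbl_nr_walkers);
#endif
	/* Pairs with the barrier on the serializing side. */
	smp_mb();
	/* Generic behavior: irq disabling is part of the sync. */
	local_irq_disable();
}

static inline void end_lockless_pgtbl_walk(struct mm_struct *mm)
{
	local_irq_enable();
	smp_mb();
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	atomic_dec(&mm->lockless_pgtbl_nr_walkers);
#endif
}
#endif /* __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER */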

> 
> > > > Maybe we can create a generic function instead of a dummy, and let it
> > > > be replaced in case the arch needs to do so.
> > > 
> > > Yes, that might be what we need, if it turns out that ppc can't use this
> > > approach (although let's see about that).
> > > 
> > 
> > I initially used the dummy approach because I did not see anything like
> > serialize in other archs. 
> > 
> > I mean, even if I put some generic function here, if there is no
> > function to use the 'lockless_pgtbl_walk_count', it becomes pure
> > overhead.
> > 
> 
> Not really: the memory barrier is required in all cases, and this code
> would be good I think:
> 
> +void register_lockless_pgtable_walker(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +       atomic_inc(&mm->lockless_pgtbl_nr_walkers);
> +#endif
> +       /*
> +        * This memory barrier pairs with any code that is either trying to
> +        * delete page tables, or split huge pages.
> +        */
> +       smp_mb();
> +}
> +EXPORT_SYMBOL_GPL(register_lockless_pgtable_walker);
> 
> And this is the same as your original patch, with just a minor name change:
> 
> @@ -2341,9 +2395,11 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  
>         if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
>             gup_fast_permitted(start, end)) {
> +               register_lockless_pgtable_walker(current->mm);
>                 local_irq_save(flags);
>                 gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
>                 local_irq_restore(flags);
> +               deregister_lockless_pgtable_walker(current->mm);
> 
> 
> Btw, hopefully minor note: it also looks like there's a number of changes in the same 
> area that conflict, for example:
> 
>     commit d9101bfa6adc ("powerpc/mm/mce: Keep irqs disabled during lockless 
>          page table walk") <Aneesh Kumar K.V> (Thu, 19 Sep 2019)
> 
> ...so it would be good to rebase this onto 5.4-rc1, now that that's here.
> 

Yeap, agree. Already rebased on top of v5.4-rc1.

> 
> thanks,

Thank you!



* Re: [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-01 18:39             ` Leonardo Bras
@ 2019-10-01 18:52               ` John Hubbard
  0 siblings, 0 replies; 26+ messages in thread
From: John Hubbard @ 2019-10-01 18:52 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Dan Williams, Arnd Bergmann, Greg Kroah-Hartman,
	Mahesh Salgaonkar, YueHaibing, Nicholas Piggin, Mike Rapoport,
	Keith Busch, Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V,
	Ganesh Goudar, Thomas Gleixner, Ira Weiny, Andrew Morton,
	Allison Randal

On 10/1/19 11:39 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 14:47 -0700, John Hubbard wrote:
>> On 9/30/19 11:42 AM, Leonardo Bras wrote:
>>> On Mon, 2019-09-30 at 10:57 -0700, John Hubbard wrote:
...
> 
> I am failing to understand the issue.
> I mean, smp_call_function_many() will issue an IPI to each CPU in
> the cpumask and wait for it to run before returning.
> If interrupts are disabled (either by MSR_EE=0 or local_irq_disable),
> the IPI will not run on that CPU, and the wait will block the calling
> thread until interrupts are enabled again.
> 
> Could you please point out the issue there?

The biggest problem here is evidently my not knowing much about ppc. :) 
So if that's how it behaves, then all is well; sorry it took me a while
to understand the MSR_EE=0 behavior.

> 
>>>> Simply skipping that means that an additional mechanism is required...which
>>>> btw might involve a new, ppc-specific routine, so maybe this is going to end
>>>> up pretty close to what I pasted in after all...
>>>>> Of course, if we really need that, we can add a bool parameter to the
>>>>> function to choose about disabling/enabling irqs.
>>>>>> * This is really a core mm function, so don't hide it away in arch layers.
>>>>>>     (If you're changing mm/ files, that's a big hint.)
>>>>>
>>>>> My idea here is to let the arch decide on how this 'register' is going
>>>>> to work, as archs may have different needs (in powerpc for example, we
>>>>> can't always disable irqs, since we may be in realmode).
>>
>> Yes, the tension there is that a) some things are per-arch, and b) it's easy 
>> to get it wrong. The commit below (d9101bfa6adc) is IMHO a perfect example of
>> that.
>>
>> So, I would like core mm/ functions that guide the way, but the interrupt
>> behavior complicates it. I think your original passing of just the mm_struct
>> is probably the right balance, assuming that I'm wrong about interrupts.
>>
> 
> I think, for the generic function, including {en,dis}abling the
> interrupt is fine. I mean, if disabling the interrupt is the generic
> behavior, it's ok.
> I will just make sure to explain that the interrupt {en,dis}abling is
> part of the sync process. If an arch doesn't like it, it can write a
> specific function that does the sync in a better way (and define
> __HAVE_ARCH_LOCKLESS_PGTBL_WALK_COUNTER to override the generic
> function).
> 

Tentatively, that sounds good. We still end up with the counter variable
directly in struct mm_struct, and the generic function in mm/gup.c 
(or mm/somewhere), where it's easy to find and see what's going on. sure.
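
Something like this, I'd imagine (sketch; the field placement is just
my assumption):

/* in struct mm_struct (include/linux/mm_types.h): */
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	/* Number of lockless page table walks currently in flight: */
	atomic_t lockless_pgtbl_nr_walkers;
#endif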

thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-10-01 17:56     ` Leonardo Bras
@ 2019-10-01 19:04       ` John Hubbard
  2019-10-01 19:40         ` Leonardo Bras
  0 siblings, 1 reply; 26+ messages in thread
From: John Hubbard @ 2019-10-01 19:04 UTC (permalink / raw)
  To: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Keith Busch, Thomas Gleixner, Arnd Bergmann, Greg Kroah-Hartman,
	YueHaibing, Nicholas Piggin, Mike Rapoport, Mahesh Salgaonkar,
	Jason Gunthorpe, Paul Mackerras, Aneesh Kumar K.V, Ganesh Goudar,
	Andrew Morton, Ira Weiny, Dan Williams, Allison Randal

On 10/1/19 10:56 AM, Leonardo Bras wrote:
> On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
>> On 9/27/19 4:40 PM, Leonardo Bras wrote:
...
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 98f13ab37bac..7105c829cf44 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
>>>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>>  			  struct page **pages)
>>>  {
>>> +	struct mm_struct *mm;
>>
>> I don't think that this local variable adds any value, so let's not use it.
>> Similar point in a few other patches too.
> 
> It avoids one dereference of current->mm, which is a small performance gain.
> 

No, it isn't. :) 

Longer answer: at this level (by which I mean, "wrote the C code, haven't looked
at the generated asm yet, and haven't done a direct perf test yet"), none of us
C programmers are entitled to imagine that we can second guess both the compiler 
and the CPU well enough to claim that declaring a local pointer variable on the
stack will even *affect* performance, much less know which way it will go!

The compiler at -O2 will *absolutely* optimize away any local variables that
it doesn't need.

And that leads to how kernel programmers routinely decide about that kind of 
variable: "does the variable's added clarity compensate for the extra visual 
noise and for the need to manage the variable?"

Here, and in most (all?) other points in the patchset where you've added an
mm local variable, the answer is no.


>>
...	start_lockless_pgtbl_walk(mm);
>>
>> Minor: I'd like to rename this register_lockless_pgtable_walker().
>>
>>>  		local_irq_disable();
>>>  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
>>>  		local_irq_enable();
>>> +		end_lockless_pgtbl_walk(mm);
>>
>> ...and deregister_lockless_pgtable_walker().
>>
> 
> I have no problem changing the name, but I don't think register/deregister
> are good terms for this. 
> 
> I would rather use start/finish, begin/end, and so on. Register sounds
> like something more complicated than what we are trying to achieve
> here. 
> 

OK, well, I don't want to bikeshed on naming more than I usually do, and 
what you have is reasonable, so I'll leave that alone. :)

thanks,
-- 
John Hubbard
NVIDIA




* Re: [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-10-01 19:04       ` John Hubbard
@ 2019-10-01 19:40         ` Leonardo Bras
  0 siblings, 0 replies; 26+ messages in thread
From: Leonardo Bras @ 2019-10-01 19:40 UTC (permalink / raw)
  To: John Hubbard, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Arnd Bergmann, Greg Kroah-Hartman, YueHaibing, Nicholas Piggin,
	Mike Rapoport, Keith Busch, Jason Gunthorpe, Paul Mackerras,
	Aneesh Kumar K.V, Allison Randal, Mahesh Salgaonkar,
	Ganesh Goudar, Thomas Gleixner, Ira Weiny, Andrew Morton,
	Dan Williams


On Tue, 2019-10-01 at 12:04 -0700, John Hubbard wrote:
> On 10/1/19 10:56 AM, Leonardo Bras wrote:
> > On Mon, 2019-09-30 at 14:51 -0700, John Hubbard wrote:
> > > On 9/27/19 4:40 PM, Leonardo Bras wrote:
> ...
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index 98f13ab37bac..7105c829cf44 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -2325,6 +2325,7 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
> > > >  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > > >  			  struct page **pages)
> > > >  {
> > > > +	struct mm_struct *mm;
> > > 
> > > I don't think that this local variable adds any value, so let's not use it.
> > > Similar point in a few other patches too.
> > 
> > It avoids one dereference of current->mm, which is a small performance gain.
> > 
> 
> No, it isn't. :) 
> 
> Longer answer: at this level (by which I mean, "wrote the C code, haven't looked
> at the generated asm yet, and haven't done a direct perf test yet"), none of us
> C programmers are entitled to imagine that we can second guess both the compiler 
> and the CPU well enough to claim that declaring a local pointer variable on the
> stack will even *affect* performance, much less know which way it will go!
> 

I did this based on how costly 'current' can be, and I noticed a
reduction in assembly size most of the time (powerpc).
But I get what you mean; maybe the (possible) performance gain isn't
worth the extra work.

> The compiler at -O2 will *absolutely* optimize away any local variables that
> it doesn't need.
> 
> And that leads to how kernel programmers routinely decide about that kind of 
> variable: "does the variable's added clarity compensate for the extra visual 
> noise and for the need to manage the variable?"

That's a good way to decide it. :)

> 
> Here, and in most (all?) other points in the patchset where you've added an
> mm local variable, the answer is no.
> 

Well, IMHO it's cleaner that way. But I get that other people may
disagree. 

> 
> ...	start_lockless_pgtbl_walk(mm);
> > > Minor: I'd like to rename this register_lockless_pgtable_walker().
> > > 
> > > >  		local_irq_disable();
> > > >  		gup_pgd_range(addr, end, gup_flags, pages, &nr);
> > > >  		local_irq_enable();
> > > > +		end_lockless_pgtbl_walk(mm);
> > > 
> > > ...and deregister_lockless_pgtable_walker().
> > > 
> > 
> > I have no problem changing the name, but I don't think register/deregister
> > are good terms for this. 
> > 
> > I would rather use start/finish, begin/end, and so on. Register sounds
> > like something more complicated than what we are trying to achieve
> > here. 
> > 
> 
> OK, well, I don't want to bikeshed on naming more than I usually do, and 
> what you have is reasonable, so I'll leave that alone. :)
> 
> thanks,

Thanks for the feedback,




end of thread

Thread overview: 26+ messages
2019-09-27 23:39 [PATCH v4 00/11] Introduces new count-based method for monitoring lockless pagetable walks Leonardo Bras
2019-09-27 23:39 ` [PATCH v4 01/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks Leonardo Bras
2019-09-29 22:40   ` John Hubbard
2019-09-29 23:17     ` John Hubbard
2019-09-30 15:14     ` Leonardo Bras
2019-09-30 17:57       ` John Hubbard
2019-09-30 18:42         ` Leonardo Bras
2019-09-30 21:47           ` John Hubbard
2019-10-01 18:39             ` Leonardo Bras
2019-10-01 18:52               ` John Hubbard
2019-09-27 23:39 ` [PATCH v4 02/11] asm-generic/pgtable: Adds dummy functions " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
2019-09-30 11:09   ` Kirill A. Shutemov
2019-09-30 14:27     ` Leonardo Bras
2019-09-30 21:51   ` John Hubbard
2019-10-01 17:56     ` Leonardo Bras
2019-10-01 19:04       ` John Hubbard
2019-10-01 19:40         ` Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 05/11] powerpc/perf: " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 07/11] powerpc/kvm/e500: " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 10/11] powerpc/book3s_64: Enables counting method to monitor lockless pgtbl walk Leonardo Bras
2019-09-27 23:40 ` [PATCH v4 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
