Linux-mm Archive on lore.kernel.org
* [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
@ 2019-10-03  1:33 Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
chunk of memory (496GB) mapped with THP, it takes an average of 275
seconds, which can cause serious problems for the workload (in the qemu
case, the guest locks up for that long).

While tracking down the source of this problem, I found that most of
this time is spent in serialize_against_pte_lookup(). This function
takes a long time in smp_call_function_many() if more than a couple of
CPUs are running the user process, and since it has to happen for every
THP mapped, it takes a very long time for large amounts of memory.

According to the docs, serialize_against_pte_lookup() is needed to keep
the pmd_t to pte_t casting inside find_current_mm_pte(), or any other
lockless pagetable walk, from happening concurrently with THP
splitting/collapsing.

It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[],
which only completes after interrupts are re-enabled on that CPU.
Since interrupts are (usually) disabled during a lockless pagetable
walk, and serialize_against_pte_lookup() only returns after every CPU
has re-enabled interrupts, the walk is protected.
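This synchronization principle can be modeled in user space. The sketch
below is only an illustration of the described mechanism, not kernel
code: "interrupts disabled" becomes an atomic flag, and the pending IPI
becomes a spin on that flag (walker_body, serialize_against_walker and
run_demo are hypothetical names).

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_bool irqs_disabled;
static atomic_int walk_done;

static void *walker_body(void *arg)
{
	(void)arg;
	/* ... the lockless pagetable walk happens here ... */
	atomic_store(&walk_done, 1);
	atomic_store(&irqs_disabled, 0);	/* local_irq_restore() */
	return NULL;
}

static void serialize_against_walker(void)
{
	/* The do_nothing() IPI only completes once the target CPU has
	 * interrupts enabled again; model that as spinning on the flag. */
	while (atomic_load(&irqs_disabled))
		;
}

int run_demo(void)
{
	pthread_t t;
	int done;

	atomic_store(&irqs_disabled, 1);	/* walker did local_irq_save() */
	pthread_create(&t, NULL, walker_body, NULL);
	serialize_against_walker();
	done = atomic_load(&walk_done);		/* walk is guaranteed finished */
	pthread_join(t, NULL);
	return done;
}
```

Because the "serializer" cannot return while the flag is set, it is
guaranteed to observe the finished walk, mirroring why the IPI round
trip protects against concurrent walkers.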

From what I can tell, if no lockless pagetable walk is running, there
is no need to call serialize_against_pte_lookup().

So, to avoid the cost of running serialize_against_pte_lookup(), I
propose a counter that keeps track of how many lockless pagetable walks
(e.g. find_current_mm_pte()) are currently running; if there are none,
smp_call_function_many() is simply skipped.

The related functions are:
begin_lockless_pgtbl_walk()
        Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk()
        Insert after the end of any lockless pgtable walk
        (Mostly after the ptep is last used)
running_lockless_pgtbl_walk()
        Returns the number of lockless pgtable walks running
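The counting idea above can be sketched in plain C11 user-space code.
This is only an illustration under assumptions: mm_t stands in for
struct mm_struct, and need_serialize() is a hypothetical helper showing
the skip decision, not part of the patchset's API.

```c
#include <stdatomic.h>

/* mm_t is an illustrative stand-in for struct mm_struct */
typedef struct {
	atomic_int lockless_pgtbl_walkers;
} mm_t;

void begin_lockless_pgtbl_walk(mm_t *mm)
{
	/* Announce the walk before any pagetable entry is read */
	atomic_fetch_add(&mm->lockless_pgtbl_walkers, 1);
}

void end_lockless_pgtbl_walk(mm_t *mm)
{
	/* The ptep obtained by the walk must not be used after this */
	atomic_fetch_sub(&mm->lockless_pgtbl_walkers, 1);
}

int running_lockless_pgtbl_walk(mm_t *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walkers);
}

/* THP split/collapse side: only pay for the IPI round trip
 * (serialize_against_pte_lookup) when someone is actually walking. */
int need_serialize(mm_t *mm)
{
	return running_lockless_pgtbl_walk(mm) != 0;
}
```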

On my workload (qemu), I saw munmap's time drop from 275 seconds to
418ms.

I also documented some lockless pagetable walks where tracking is not
necessary, given that they work on init_mm or on a guest pgd.

The patchset works by concentrating all the steps needed to begin/end a
lockless pagetable walk in the above functions, and then adding a config
option to enable tracking these walks using the counting method.

Changes since v4:
 Rebased on top of v5.4-rc1
 Declared real generic functions instead of dummies
 start_lockless_pgtbl_walk renamed to begin_lockless_pgtbl_walk
 Interrupt {dis,en}able is now inside of {begin,end}_lockless_pgtbl_walk
 Power implementation has option to not {dis,en}able interrupt
 More documentation inside the functions.
 Some irq mask variables renamed
 Removed some proxy mm_structs
 Few typos fixed

Changes since v3:
 Explain (comments) why some lockless pgtbl walks don't need
	local_irq_disable (real mode + MSR_EE=0)
 Explain (comments) places where counting method is not needed (guest pgd,
	which is not touched by THP)
 Fixes some misplaced local_irq_restore()
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=132417

Changes since v2:
 Rebased to v5.3
 Adds support on __get_user_pages_fast
 Adds usage description to *_lockless_pgtbl_walk()
 Better style to dummy functions
 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131839

Changes since v1:
 Isolated atomic operations in functions *_lockless_pgtbl_walk()
 Fixed behavior of decrementing before last ptep was used
 Link: http://patchwork.ozlabs.org/patch/1163093/


Leonardo Bras (11):
  asm-generic/pgtable: Adds generic functions to monitor lockless
    pgtable walks
  powerpc/mm: Adds counting method to monitor lockless pgtable walks
  mm/gup: Applies counting method to monitor gup_pgd_range
  powerpc/mce_power: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
    walks
  powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
    pgtbl walks
  powerpc/kvm/book3s_64: Applies counting method to monitor lockless
    pgtbl walks
  mm/Kconfig: Adds config option to track lockless pagetable walks
  powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing

 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/kernel/mce_power.c              |   6 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |   6 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c       |  34 +++++-
 arch/powerpc/kvm/book3s_64_vio_hv.c          |   7 +-
 arch/powerpc/kvm/book3s_hv_nested.c          |  22 +++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  32 ++---
 arch/powerpc/kvm/e500_mmu_host.c             |   9 +-
 arch/powerpc/mm/book3s64/hash_tlb.c          |   6 +-
 arch/powerpc/mm/book3s64/hash_utils.c        |  27 +++--
 arch/powerpc/mm/book3s64/pgtable.c           | 120 ++++++++++++++++++-
 arch/powerpc/perf/callchain.c                |   6 +-
 include/asm-generic/pgtable.h                |  58 +++++++++
 include/linux/mm_types.h                     |  11 ++
 kernel/fork.c                                |   3 +
 mm/Kconfig                                   |  11 ++
 mm/gup.c                                     |  10 +-
 17 files changed, 325 insertions(+), 52 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
@ 2019-10-03  1:33 ` Leonardo Bras
  2019-10-03  7:11   ` Peter Zijlstra
  2019-10-03  1:33 ` [PATCH v5 02/11] powerpc/mm: Adds counting method " Leonardo Bras
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

It's necessary to monitor lockless pagetable walks in order to avoid
doing THP splitting/collapsing while they run.

Some methods rely on irq enable/disable, but that can be slow in cases
where many CPUs are used by the process, given that all these CPUs have
to run an IPI.

To speed up some cases, I propose a refcount-based approach that counts
the number of lockless pagetable walks happening in the process. If
this count is zero, the irq-oriented method is skipped.

Given that there are lockless pagetable walks in generic code, it's
necessary to provide documented generic functions that should be enough
for most archs, while staying open to arch-specific implementations.

This method does not exclude the current irq-oriented method. It works as a
complement to skip unnecessary waiting.

begin_lockless_pgtbl_walk(mm)
        Insert before starting any lockless pgtable walk
end_lockless_pgtbl_walk(mm)
        Insert after the end of any lockless pgtable walk
        (Mostly after the ptep is last used)
running_lockless_pgtbl_walk(mm)
        Returns the number of lockless pgtable walks running
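The intended call pattern on the walker side can be sketched with a
user-space mock. This is a sketch under stated assumptions: "interrupts"
are modeled as a plain flag, the mm is reduced to the walker counter,
and all mock_* names are hypothetical, not the patch's symbols.

```c
#include <stdatomic.h>

static int irqs_on = 1;		/* models the CPU's interrupt state */
static atomic_int walkers;	/* models mm->lockless_pgtbl_walkers */

unsigned long mock_begin_lockless_pgtbl_walk(void)
{
	unsigned long irq_mask;

	atomic_fetch_add(&walkers, 1);		/* tracking (if enabled) */
	irq_mask = (unsigned long)irqs_on;	/* local_irq_save(irq_mask) */
	irqs_on = 0;
	return irq_mask;
}

void mock_end_lockless_pgtbl_walk(unsigned long irq_mask)
{
	irqs_on = (int)irq_mask;		/* local_irq_restore(irq_mask) */
	atomic_fetch_sub(&walkers, 1);		/* tracking (if enabled) */
}

/* A walker brackets the pagetable access, keeping every use of the
 * returned ptep strictly inside the begin/end pair. */
int mock_walk(void)
{
	unsigned long irq_mask = mock_begin_lockless_pgtbl_walk();
	int during = !irqs_on && atomic_load(&walkers) == 1;
	/* ptep = find_current_mm_pte(...); use ptep only here */
	mock_end_lockless_pgtbl_walk(irq_mask);
	return during;
}
```

Note how the irq mask returned by begin is threaded through to end, so
the caller's previous interrupt state is restored exactly.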

While the config option is not set, tracking is disabled and these
functions only do what was already needed for lockless pagetable walks
(disabling interrupts). A memory barrier was also added just to make
sure there is no speculative read outside the interrupt-disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 include/asm-generic/pgtable.h | 58 +++++++++++++++++++++++++++++++++++
 include/linux/mm_types.h      | 11 +++++++
 kernel/fork.c                 |  3 ++
 3 files changed, 72 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 818691846c90..3043ea9812d5 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	unsigned long irq_mask;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 */
+	local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+
+static inline void end_lockless_pgtbl_walk(struct mm_struct *mm,
+					   unsigned long irq_mask)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 */
+	local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+
+static inline int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so it falls back to sync method */
+	return 1;
+}
+#endif
+
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2222fa795284..277462f0b4fd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -521,6 +521,17 @@ struct mm_struct {
 		struct work_struct async_put_work;
 	} __randomize_layout;
 
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	/*
+	 * Number of callers who are doing a lockless walk of the
+	 * page tables. Typically arches might enable this in order to
+	 * help optimize performance, by possibly avoiding expensive
+	 * IPIs at the wrong times.
+	 */
+	atomic_t lockless_pgtbl_walkers;
+
+#endif
+
 	/*
 	 * The mm_cpumask needs to be at the end of mm_struct, because it
 	 * is dynamically sized based on nr_cpu_ids.
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f416126..2cbca867f5a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1029,6 +1029,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
 	mm_init_uprobes_state(mm);
 
+#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	atomic_set(&mm->lockless_pgtbl_walkers, 0);
+#endif
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
 		mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
-- 
2.20.1




* [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-08 15:11   ` Christopher Lameter
  2019-10-03  1:33 ` [PATCH v5 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

It's necessary to monitor lockless pagetable walks in order to avoid
doing THP splitting/collapsing while they run.

On powerpc, we need to do some lockless pagetable walks from functions
that already have interrupts disabled, especially from real mode with
MSR[EE]=0.

In these contexts, disabling/enabling interrupts can be very troublesome.

So, this arch-specific implementation features functions with an extra
argument that allows interrupt enable/disable to be skipped:
__begin_lockless_pgtbl_walk() and __end_lockless_pgtbl_walk().

Functions similar to the generic ones are also exported; they call the
above functions with parameter {dis,en}able_irq = true.
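The two-level API can be sketched in user space. This is only an
illustration: the irq_saves counter stands in for local_irq_save(), and
the realmode helper name is hypothetical; none of this is the actual
powerpc implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int walkers;	/* models mm->lockless_pgtbl_walkers */
static int irq_saves;		/* counts how often "irqs" were touched */

unsigned long __begin_lockless_pgtbl_walk(bool disable_irq)
{
	unsigned long irq_mask = 0;

	atomic_fetch_add(&walkers, 1);
	if (disable_irq)
		irq_saves++;	/* stands in for local_irq_save(irq_mask) */
	return irq_mask;
}

/* Generic-looking wrapper: always disables irqs */
unsigned long begin_lockless_pgtbl_walk(void)
{
	return __begin_lockless_pgtbl_walk(true);
}

/* Real-mode caller (MSR[EE]=0): irqs are already off, so skip them */
unsigned long begin_lockless_pgtbl_walk_realmode(void)
{
	return __begin_lockless_pgtbl_walk(false);
}
```

The end-side pair (__end_lockless_pgtbl_walk with an enable_irq flag)
mirrors this shape in reverse.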

While the config option is not set, tracking is disabled and these
functions only do what was already needed for lockless pagetable walks
(disabling interrupts). A memory barrier was also added just to make
sure there is no speculative read outside the interrupt-disabled area.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |   9 ++
 arch/powerpc/mm/book3s64/pgtable.c           | 117 +++++++++++++++++++
 2 files changed, 126 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b01624e5c467..8330b35cd28d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1372,5 +1372,14 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 	return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
 }
 
+#define __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm);
+unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+					  bool disable_irq);
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask);
+void __end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask,
+			       bool enable_irq);
+int running_lockless_pgtbl_walk(struct mm_struct *mm);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 75483b40fcb1..ae557fdce9a3 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -98,6 +98,123 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
 	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
+/*
+ * Counting method to monitor lockless pagetable walks:
+ * Uses begin_lockless_pgtbl_walk and end_lockless_pgtbl_walk to track the
+ * number of lockless pgtable walks happening, and
+ * running_lockless_pgtbl_walk to return this value.
+ */
+
+/* __begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version allows setting disable_irq=false, so irqs are not touched, which
+ *   is quite useful when interrupts are already disabled (like real mode)
+ */
+
+inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
+						 bool disable_irq)
+{
+	unsigned long irq_mask = 0;
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_inc(&mm->lockless_pgtbl_walkers);
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (disable_irq)
+		local_irq_save(irq_mask);
+
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	return irq_mask;
+}
+EXPORT_SYMBOL(__begin_lockless_pgtbl_walk);
+
+/* begin_lockless_pgtbl_walk: Must be inserted before a function call that does
+ *   lockless pagetable walks, such as __find_linux_pte().
+ * This version is used by generic code, and always disables irqs (disable_irq=true)
+ */
+unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	return __begin_lockless_pgtbl_walk(mm, true);
+}
+EXPORT_SYMBOL(begin_lockless_pgtbl_walk);
+
+/*
+ * __end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version allows setting enable_irq=false, so irqs are not touched, which
+ *   is quite useful when interrupts are already disabled (like real mode)
+ */
+inline void __end_lockless_pgtbl_walk(struct mm_struct *mm,
+				      unsigned long irq_mask, bool enable_irq)
+{
+	/*
+	 * This memory barrier pairs with any code that is either trying to
+	 * delete page tables, or split huge pages. Without this barrier,
+	 * the page tables could be read speculatively outside of interrupt
+	 * disabling or reference counting.
+	 */
+	smp_mb();
+
+	/*
+	 * Interrupts must be disabled during the lockless page table walk.
+	 * That's because the deleting or splitting involves flushing TLBs,
+	 * which in turn issues interrupts, that will block when disabled.
+	 *
+	 * When this function is called from realmode with MSR[EE=0],
+	 * it's not needed to touch irq, since it's already disabled.
+	 */
+	if (enable_irq)
+		local_irq_restore(irq_mask);
+
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		atomic_dec(&mm->lockless_pgtbl_walkers);
+}
+EXPORT_SYMBOL(__end_lockless_pgtbl_walk);
+
+/*
+ * end_lockless_pgtbl_walk: Must be inserted after the last use of a pointer
+ *   returned by a lockless pagetable walk, such as __find_linux_pte()
+ * This version is used by generic code, and always re-enables irqs (enable_irq=true)
+ */
+
+void end_lockless_pgtbl_walk(struct mm_struct *mm, unsigned long irq_mask)
+{
+	__end_lockless_pgtbl_walk(mm, irq_mask, true);
+}
+EXPORT_SYMBOL(end_lockless_pgtbl_walk);
+
+/*
+ * running_lockless_pgtbl_walk: Returns the number of lockless pagetable walks
+ *   currently running. If it returns 0, there is no running pagetable walk, and
+ *   THP split/collapse can be safely done. This can be used to avoid more
+ *   expensive approaches like serialize_against_pte_lookup()
+ */
+int running_lockless_pgtbl_walk(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
+		return atomic_read(&mm->lockless_pgtbl_walkers);
+
+	/* If disabled, must return > 0, so callers fall back to the sync
+	 * method (serialize_against_pte_lookup)
+	 */
+	return 1;
+}
+EXPORT_SYMBOL(running_lockless_pgtbl_walk);
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
-- 
2.20.1




* [PATCH v5 03/11] mm/gup: Applies counting method to monitor gup_pgd_range
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 02/11] powerpc/mm: Adds counting method " Leonardo Bras
@ 2019-10-03  1:33 ` Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

As described, gup_pgd_range is a lockless pagetable walk, so in order
to guard it against THP split/collapse with the counting method, it's
necessary to wrap it in {begin,end}_lockless_pgtbl_walk.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 mm/gup.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 23a9f9c9d377..52e53b4f39d8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2319,7 +2319,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
 {
 	unsigned long len, end;
-	unsigned long flags;
+	unsigned long irq_mask;
 	int nr = 0;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2345,9 +2345,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
-		local_irq_save(flags);
+		irq_mask = begin_lockless_pgtbl_walk(current->mm);
 		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
-		local_irq_restore(flags);
+		end_lockless_pgtbl_walk(current->mm, irq_mask);
 	}
 
 	return nr;
@@ -2414,9 +2414,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
-		local_irq_disable();
+		begin_lockless_pgtbl_walk(current->mm);
 		gup_pgd_range(addr, end, gup_flags, pages, &nr);
-		local_irq_enable();
+		end_lockless_pgtbl_walk(current->mm, IRQS_ENABLED);
 		ret = nr;
 	}
 
-- 
2.20.1




* [PATCH v5 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (2 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
@ 2019-10-03  1:33 ` Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 05/11] powerpc/perf: " Leonardo Bras
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
addr_to_pfn().

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kernel/mce_power.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 1cbf7f1a4e3d..7f888fb6f65c 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -29,7 +29,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 {
 	pte_t *ptep;
 	unsigned int shift;
-	unsigned long pfn, flags;
+	unsigned long pfn, irq_mask;
 	struct mm_struct *mm;
 
 	if (user_mode(regs))
@@ -37,7 +37,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 	else
 		mm = &init_mm;
 
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(mm);
 	ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
 
 	if (!ptep || pte_special(*ptep)) {
@@ -53,7 +53,7 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 	}
 
 out:
-	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm, irq_mask);
 	return pfn;
 }
 
-- 
2.20.1




* [PATCH v5 05/11] powerpc/perf: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (3 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
read_user_stack_slow.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask,
so it doesn't lose its meaning now that it's not passed directly to
local_irq_* functions.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/perf/callchain.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index c84bbd4298a0..43e49d8be344 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -116,14 +116,14 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
-	unsigned long pfn, flags;
+	unsigned long pfn, irq_mask;
 	void *kaddr;
 
 	pgdir = current->mm->pgd;
 	if (!pgdir)
 		return -EFAULT;
 
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(current->mm);
 	ptep = find_current_mm_pte(pgdir, addr, NULL, &shift);
 	if (!ptep)
 		goto err_out;
@@ -145,7 +145,7 @@ static int read_user_stack_slow(void __user *ptr, void *buf, int nb)
 	memcpy(buf, kaddr + offset, nb);
 	ret = 0;
 err_out:
-	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(current->mm, irq_mask);
 	return ret;
 }
 
-- 
2.20.1




* [PATCH v5 06/11] powerpc/mm/book3s64/hash: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (4 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 05/11] powerpc/perf: " Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 07/11] powerpc/kvm/e500: " Leonardo Bras
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all hash-related functions
that do lockless pagetable walks.

hash_page_mm: Adds a comment explaining that there is no need for
local_irq_disable/save, given that it is only called from the DataAccess
interrupt, so interrupts are already disabled.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask, so
it doesn't lose meaning now that it's not directly passed to the local_irq_*
functions.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/hash_tlb.c   |  6 +++---
 arch/powerpc/mm/book3s64/hash_utils.c | 27 +++++++++++++++++----------
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index 4a70d8dd39cd..b0ef67d8c88a 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -194,7 +194,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 {
 	bool is_thp;
 	int hugepage_shift;
-	unsigned long flags;
+	unsigned long irq_mask;
 
 	start = _ALIGN_DOWN(start, PAGE_SIZE);
 	end = _ALIGN_UP(end, PAGE_SIZE);
@@ -209,7 +209,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 	 * to being hashed). This is not the most performance oriented
 	 * way to do things but is fine for our needs here.
 	 */
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; start < end; start += PAGE_SIZE) {
 		pte_t *ptep = find_current_mm_pte(mm->pgd, start, &is_thp,
@@ -229,7 +229,7 @@ void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
 			hpte_need_flush(mm, start, ptep, pte, hugepage_shift);
 	}
 	arch_leave_lazy_mmu_mode();
-	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm, irq_mask);
 }
 
 void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 6c123760164e..7a01a12a19bb 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1313,12 +1313,16 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 		ea &= ~((1ul << mmu_psize_defs[psize].shift) - 1);
 #endif /* CONFIG_PPC_64K_PAGES */
 
-	/* Get PTE and page size from page tables */
+	/* Get PTE and page size from page tables :
+	 * Called in from DataAccess interrupt (data_access_common: 0x300),
+	 * interrupts are disabled here.
+	 */
+	__begin_lockless_pgtbl_walk(mm, false);
 	ptep = find_linux_pte(pgdir, ea, &is_thp, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
 		rc = 1;
-		goto bail;
+		goto bail_pgtbl_walk;
 	}
 
 	/* Add _PAGE_PRESENT to the required access perm */
@@ -1331,7 +1335,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 	if (!check_pte_access(access, pte_val(*ptep))) {
 		DBG_LOW(" no access !\n");
 		rc = 1;
-		goto bail;
+		goto bail_pgtbl_walk;
 	}
 
 	if (hugeshift) {
@@ -1355,7 +1359,7 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 		if (current->mm == mm)
 			check_paca_psize(ea, mm, psize, user_region);
 
-		goto bail;
+		goto bail_pgtbl_walk;
 	}
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -1429,6 +1433,8 @@ int hash_page_mm(struct mm_struct *mm, unsigned long ea,
 #endif
 	DBG_LOW(" -> rc=%d\n", rc);
 
+bail_pgtbl_walk:
+	__end_lockless_pgtbl_walk(mm, 0, false);
 bail:
 	exception_exit(prev_state);
 	return rc;
@@ -1517,7 +1523,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
 	unsigned long vsid;
 	pgd_t *pgdir;
 	pte_t *ptep;
-	unsigned long flags;
+	unsigned long irq_mask;
 	int rc, ssize, update_flags = 0;
 	unsigned long access = _PAGE_PRESENT | _PAGE_READ | (is_exec ? _PAGE_EXEC : 0);
 
@@ -1539,11 +1545,12 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
 	vsid = get_user_vsid(&mm->context, ea, ssize);
 	if (!vsid)
 		return;
+
 	/*
 	 * Hash doesn't like irqs. Walking linux page table with irq disabled
 	 * saves us from holding multiple locks.
 	 */
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(mm);
 
 	/*
 	 * THP pages use update_mmu_cache_pmd. We don't do
@@ -1588,7 +1595,7 @@ static void hash_preload(struct mm_struct *mm, unsigned long ea,
 				   mm_ctx_user_psize(&mm->context),
 				   pte_val(*ptep));
 out_exit:
-	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm, irq_mask);
 }
 
 /*
@@ -1651,16 +1658,16 @@ u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
 {
 	pte_t *ptep;
 	u16 pkey = 0;
-	unsigned long flags;
+	unsigned long irq_mask;
 
 	if (!mm || !mm->pgd)
 		return 0;
 
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(mm);
 	ptep = find_linux_pte(mm->pgd, address, NULL, NULL);
 	if (ptep)
 		pkey = pte_to_pkey_bits(pte_val(READ_ONCE(*ptep)));
-	local_irq_restore(flags);
+	end_lockless_pgtbl_walk(mm, irq_mask);
 
 	return pkey;
 }
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 07/11] powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (5 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring lockless pgtable walks on
kvmppc_e500_shadow_map().

Fixes the place where local_irq_restore() is called: previously, if ptep
was NULL, local_irq_restore() would never be called.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask, so
it doesn't lose meaning now that it's not directly passed to the local_irq_*
functions.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/e500_mmu_host.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 321db0fdb9db..36f07c6a5f10 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -336,7 +336,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	pte_t *ptep;
 	unsigned int wimg = 0;
 	pgd_t *pgdir;
-	unsigned long flags;
+	unsigned long irq_mask;
 
 	/* used to check for invalidations in progress */
 	mmu_seq = kvm->mmu_notifier_seq;
@@ -473,7 +473,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 	 * We are holding kvm->mmu_lock so a notifier invalidate
 	 * can't run hence pfn won't change.
 	 */
-	local_irq_save(flags);
+	irq_mask = begin_lockless_pgtbl_walk(kvm->mm);
 	ptep = find_linux_pte(pgdir, hva, NULL, NULL);
 	if (ptep) {
 		pte_t pte = READ_ONCE(*ptep);
@@ -481,15 +481,16 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		if (pte_present(pte)) {
 			wimg = (pte_val(pte) >> PTE_WIMGE_SHIFT) &
 				MAS2_WIMGE_MASK;
-			local_irq_restore(flags);
 		} else {
-			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm, irq_mask);
 			pr_err_ratelimited("%s: pte not present: gfn %lx,pfn %lx\n",
 					   __func__, (long)gfn, pfn);
 			ret = -EINVAL;
 			goto out;
 		}
 	}
+	end_lockless_pgtbl_walk(kvm->mm, irq_mask);
+
 	kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
 	kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 08/11] powerpc/kvm/book3s_hv: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (6 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 07/11] powerpc/kvm/e500: " Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all book3s_hv related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they are called from realmode with MSR_EE = 0.

kvmppc_do_h_enter: Fixes the placement of local_irq_restore(): it must come
after the last use of ptep.

Given that some of these functions can be called in real mode, and others
always are, we use __{begin,end}_lockless_pgtbl_walk so we can decide when
to disable interrupts.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_nested.c | 22 ++++++++++++++++++--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 32 ++++++++++++++++-------------
 2 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index cdf30c6eaf54..89944c699fd6 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -803,7 +803,11 @@ static void kvmhv_update_nest_rmap_rc(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;
 
-	/* Find the pte */
+	/* Find the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/*
 	 * If the pte is present and the pfn is still the same, update the pte.
@@ -853,7 +857,11 @@ static void kvmhv_remove_nest_rmap(struct kvm *kvm, u64 n_rmap,
 	if (!gp)
 		return;
 
-	/* Find and invalidate the pte */
+	/* Find and invalidate the pte:
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	/* Don't spuriously invalidate ptes if the pfn has changed */
 	if (ptep && pte_present(*ptep) && ((pte_val(*ptep) & mask) == hpa))
@@ -921,6 +929,11 @@ static bool kvmhv_invalidate_shadow_pte(struct kvm_vcpu *vcpu,
 	int shift;
 
 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the nested guest (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(gp->shadow_pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
@@ -1362,6 +1375,11 @@ static long int __kvmhv_nested_page_fault(struct kvm_run *run,
 	/* See if can find translation in our partition scoped tables for L1 */
 	pte = __pte(0);
 	spin_lock(&kvm->mmu_lock);
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (!shift)
 		shift = PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 220305454c23..a8be42f5be1e 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -210,7 +210,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	pte_t *ptep;
 	unsigned int writing;
 	unsigned long mmu_seq;
-	unsigned long rcbits, irq_flags = 0;
+	unsigned long rcbits, irq_mask = 0;
 
 	if (kvm_is_radix(kvm))
 		return H_FUNCTION;
@@ -252,12 +252,8 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 	 * If we had a page table change after lookup, we would
 	 * retry via mmu_notifier_retry.
 	 */
-	if (!realmode)
-		local_irq_save(irq_flags);
-	/*
-	 * If called in real mode we have MSR_EE = 0. Otherwise
-	 * we disable irq above.
-	 */
+	irq_mask = __begin_lockless_pgtbl_walk(kvm->mm, !realmode);
+
 	ptep = __find_linux_pte(pgdir, hva, NULL, &hpage_shift);
 	if (ptep) {
 		pte_t pte;
@@ -272,8 +268,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		 * to <= host page size, if host is using hugepage
 		 */
 		if (host_pte_size < psize) {
-			if (!realmode)
-				local_irq_restore(flags);
+			__end_lockless_pgtbl_walk(kvm->mm, irq_mask, !realmode);
 			return H_PARAMETER;
 		}
 		pte = kvmppc_read_update_linux_pte(ptep, writing);
@@ -287,8 +282,6 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 			pa |= gpa & ~PAGE_MASK;
 		}
 	}
-	if (!realmode)
-		local_irq_restore(irq_flags);
 
 	ptel &= HPTE_R_KEY | HPTE_R_PP0 | (psize-1);
 	ptel |= pa;
@@ -302,8 +295,10 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 
 	/* If we had host pte mapping then check WIMG */
 	if (ptep && !hpte_cache_flags_ok(ptel, is_ci)) {
-		if (is_ci)
+		if (is_ci) {
+			__end_lockless_pgtbl_walk(kvm->mm, irq_mask, !realmode);
 			return H_PARAMETER;
+		}
 		/*
 		 * Allow guest to map emulated device memory as
 		 * uncacheable, but actually make it cacheable.
@@ -311,6 +306,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
 		ptel &= ~(HPTE_R_W|HPTE_R_I|HPTE_R_G);
 		ptel |= HPTE_R_M;
 	}
+	__end_lockless_pgtbl_walk(kvm->mm, irq_mask, !realmode);
 
 	/* Find and lock the HPTEG slot to use */
  do_insert:
@@ -907,11 +903,19 @@ static int kvmppc_get_hpa(struct kvm_vcpu *vcpu, unsigned long gpa,
 	/* Translate to host virtual address */
 	hva = __gfn_to_hva_memslot(memslot, gfn);
 
-	/* Try to find the host pte for that virtual address */
+	/* Try to find the host pte for that virtual address :
+	 * Called by hcall_real_table (real mode + MSR_EE=0)
+	 * Interrupts are disabled here.
+	 */
+	__begin_lockless_pgtbl_walk(kvm->mm, false);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
-	if (!ptep)
+	if (!ptep) {
+		__end_lockless_pgtbl_walk(kvm->mm, 0, false);
 		return H_TOO_HARD;
+	}
 	pte = kvmppc_read_update_linux_pte(ptep, writing);
+	__end_lockless_pgtbl_walk(kvm->mm, 0, false);
+
 	if (!pte_present(pte))
 		return H_TOO_HARD;
 
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 09/11] powerpc/kvm/book3s_64: Applies counting method to monitor lockless pgtbl walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (7 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
@ 2019-10-03  1:33 ` " Leonardo Bras
  2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Applies the counting-based method for monitoring all book3s_64-related
functions that do lockless pagetable walks.

Adds comments explaining that some lockless pagetable walks don't need
protection, either because the guest pgd is not a target of THP
collapse/split, or because they are called from realmode with MSR_EE = 0.

Given that some of these functions are always called in realmode, we use
__{begin,end}_lockless_pgtbl_walk so we can decide when to disable
interrupts.

local_irq_{save,restore} is already inside {begin,end}_lockless_pgtbl_walk,
so there is no need to repeat it here.

The variable that saves the irq mask was renamed from flags to irq_mask, so
it doesn't lose meaning now that it's not directly passed to the local_irq_*
functions.

There is also a function that uses local_irq_{en,dis}able; there, the return
value of begin_lockless_pgtbl_walk() is ignored, and IRQS_ENABLED is passed
to end_lockless_pgtbl_walk() to mimic the effect of local_irq_enable().

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  6 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 34 +++++++++++++++++++++++---
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  7 +++++-
 3 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 9a75f0e1933b..d8f374c979f5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -615,12 +615,12 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 		/* if the guest wants write access, see if that is OK */
 		if (!writing && hpte_is_writable(r)) {
 			pte_t *ptep, pte;
-			unsigned long flags;
+			unsigned long irq_mask;
 			/*
 			 * We need to protect against page table destruction
 			 * hugepage split and collapse.
 			 */
-			local_irq_save(flags);
+			irq_mask = begin_lockless_pgtbl_walk(kvm->mm);
 			ptep = find_current_mm_pte(current->mm->pgd,
 						   hva, NULL, NULL);
 			if (ptep) {
@@ -628,7 +628,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 				if (__pte_write(pte))
 					write_ok = 1;
 			}
-			local_irq_restore(flags);
+			end_lockless_pgtbl_walk(kvm->mm, irq_mask);
 		}
 	}
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..8ba9742e2fc8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -813,20 +813,20 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 	 * Read the PTE from the process' radix tree and use that
 	 * so we get the shift and attribute bits.
 	 */
-	local_irq_disable();
+	begin_lockless_pgtbl_walk(kvm->mm);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
 	/*
 	 * If the PTE disappeared temporarily due to a THP
 	 * collapse, just return and let the guest try again.
 	 */
 	if (!ptep) {
-		local_irq_enable();
+		end_lockless_pgtbl_walk(kvm->mm, IRQS_ENABLED);
 		if (page)
 			put_page(page);
 		return RESUME_GUEST;
 	}
 	pte = *ptep;
-	local_irq_enable();
+	end_lockless_pgtbl_walk(kvm->mm, IRQS_ENABLED);
 
 	/* If we're logging dirty pages, always map single pages */
 	large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -972,10 +972,16 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned long gpa = gfn << PAGE_SHIFT;
 	unsigned int shift;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep))
 		kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 				 kvm->arch.lpid);
+
 	return 0;				
 }
 
@@ -989,6 +995,11 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	int ref = 0;
 	unsigned long old, *rmapp;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
 		old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1024,11 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	unsigned int shift;
 	int ref = 0;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_young(*ptep))
 		ref = 1;
@@ -1030,6 +1046,11 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 	int ret = 0;
 	unsigned long old, *rmapp;
 
+	/*
+	 * We are walking the secondary (partition-scoped) page table here.
+	 * We can do this without disabling irq because the Linux MM
+	 * subsystem doesn't do THP splits and collapses on this tree.
+	 */
 	ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 	if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
 		ret = 1;
@@ -1046,6 +1067,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 					       1UL << shift);
 		spin_unlock(&kvm->mmu_lock);
 	}
+
 	return ret;
 }
 
@@ -1085,6 +1107,12 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
 	gpa = memslot->base_gfn << PAGE_SHIFT;
 	spin_lock(&kvm->mmu_lock);
 	for (n = memslot->npages; n; --n) {
+		/*
+		 * We are walking the secondary (partition-scoped) page table
+		 * here.
+		 * We can do this without disabling irq because the Linux MM
+		 * subsystem doesn't do THP splits and collapses on this tree.
+		 */
 		ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
 		if (ptep && pte_present(*ptep))
 			kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index ab6eeb8e753e..0091c3d0ce89 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -441,6 +441,7 @@ long kvmppc_rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
 		unsigned long ua, unsigned long *phpa)
 {
+	struct kvm *kvm = vcpu->kvm;
 	pte_t *ptep, pte;
 	unsigned shift = 0;
 
@@ -453,10 +454,14 @@ static long kvmppc_rm_ua_to_hpa(struct kvm_vcpu *vcpu,
 	 * to exit which will again result in the below page table walk
 	 * to finish.
 	 */
+	__begin_lockless_pgtbl_walk(kvm->mm, false);
 	ptep = __find_linux_pte(vcpu->arch.pgdir, ua, NULL, &shift);
-	if (!ptep || !pte_present(*ptep))
+	if (!ptep || !pte_present(*ptep)) {
+		__end_lockless_pgtbl_walk(kvm->mm, 0, false);
 		return -ENXIO;
+	}
 	pte = *ptep;
+	__end_lockless_pgtbl_walk(kvm->mm, 0, false);
 
 	if (!shift)
 		shift = PAGE_SHIFT;
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (8 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
@ 2019-10-03  1:33 ` Leonardo Bras
  2019-10-03  2:08   ` Qian Cai
  2019-10-03  7:44   ` Peter Zijlstra
  2019-10-03  1:33 ` [PATCH v5 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
  2019-10-03  7:29 ` [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Peter Zijlstra
  11 siblings, 2 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Adds the config option LOCKLESS_PAGE_TABLE_WALK_TRACKING, making it possible
to enable lockless pagetable walk tracking directly from the kernel config.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 mm/Kconfig | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb51..00f487a0122f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
 config ARCH_HAS_HUGEPD
 	bool
 
+config LOCKLESS_PAGE_TABLE_WALK_TRACKING
+	bool "Track (and optimize) lockless page table walks"
+	default n
+
+	help
+	  Maintain a reference count of active lockless page table
+	  walkers. This adds 4 bytes to struct mm size, and two atomic
+	  operations to calls such as get_user_pages_fast(). Some
+	  architectures can optimize lockless page table operations if
+	  this is enabled.
+
 endmenu
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH v5 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (9 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
@ 2019-10-03  1:33 ` Leonardo Bras
  2019-10-03  7:29 ` [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Peter Zijlstra
  11 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03  1:33 UTC (permalink / raw)
  To: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm
  Cc: Leonardo Bras, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

Skips the slow part of serialize_against_pte_lookup() if there is no
lockless pagetable walk currently running.

Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/pgtable.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index ae557fdce9a3..0fef9400f210 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -95,7 +95,8 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
 	smp_mb();
-	smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+	if (running_lockless_pgtbl_walk(mm))
+		smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
 /*
-- 
2.20.1



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
@ 2019-10-03  2:08   ` Qian Cai
  2019-10-03 19:04     ` Leonardo Bras
  2019-10-03  7:44   ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Qian Cai @ 2019-10-03  2:08 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch



> On Oct 2, 2019, at 9:36 PM, Leonardo Bras <leonardo@linux.ibm.com> wrote:
> 
> Adds config option LOCKLESS_PAGE_TABLE_WALK_TRACKING to make possible
> enabling tracking lockless pagetable walks directly from kernel config.

Can’t this name and all those new *lockless* function names be shorter? There are many functions named with *_locked, so how about dropping lockless at all, i.e., PAGE_TABLE_WALK_TRACKING blah blah?

> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
> mm/Kconfig | 11 +++++++++++
> 1 file changed, 11 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a5dae9a7eb51..00f487a0122f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
> config ARCH_HAS_HUGEPD
>    bool
> 
> +config LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +    bool "Track (and optimize) lockless page table walks"
> +    default n
> +
> +    help
> +      Maintain a reference count of active lockless page table
> +      walkers. This adds 4 bytes to struct mm size, and two atomic
> +      operations to calls such as get_user_pages_fast(). Some
> +      architectures can optimize lockless page table operations if
> +      this is enabled.
> +
> endmenu
> -- 
> 2.20.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
@ 2019-10-03  7:11   ` Peter Zijlstra
  2019-10-03 11:51     ` Peter Zijlstra
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-03  7:11 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 818691846c90..3043ea9812d5 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
>  #endif
>  #endif
>  
> +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> +{
> +	unsigned long irq_mask;
> +
> +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> +		atomic_inc(&mm->lockless_pgtbl_walkers);

This will not work for file backed THP. Also, this is a fairly serious
contention point all on its own.

> +	/*
> +	 * Interrupts must be disabled during the lockless page table walk.
> +	 * That's because the deleting or splitting involves flushing TLBs,
> +	 * which in turn issues interrupts, that will block when disabled.
> +	 */
> +	local_irq_save(irq_mask);
> +
> +	/*
> +	 * This memory barrier pairs with any code that is either trying to
> +	 * delete page tables, or split huge pages. Without this barrier,
> +	 * the page tables could be read speculatively outside of interrupt
> +	 * disabling.
> +	 */
> +	smp_mb();

I don't think this is something smp_mb() can guarantee. smp_mb() is
defined to order memory accesses, in this case the store of the old
flags vs whatever comes after this.

It cannot (in general) order against completion of prior instructions,
like clearing the interrupt-enable flag.

Possibly you want barrier_nospec().

> +
> +	return irq_mask;
> +}


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
                   ` (10 preceding siblings ...)
  2019-10-03  1:33 ` [PATCH v5 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
@ 2019-10-03  7:29 ` Peter Zijlstra
  2019-10-03 20:36   ` Leonardo Bras
  11 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-03  7:29 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the workload (in the
> qemu case, the guest will lock up for this time).
> 
> Trying to find the source of this bug, I found that most of this time
> is spent in serialize_against_pte_lookup(). This function takes a lot
> of time in smp_call_function_many() if there is more than a couple of
> CPUs running the user process. Since this has to happen for every
> mapped THP, it will take a very long time for large amounts of memory.
> 
> By the docs, serialize_against_pte_lookup() is needed to prevent the
> pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> pagetable walk, from happening concurrently with THP
> splitting/collapsing.
> 
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[]
> after interrupts are re-enabled.
> Since interrupts are (usually) disabled during a lockless pagetable
> walk, and serialize_against_pte_lookup() will only return after
> interrupts are enabled, the walk is protected.

This is something entirely specific to Power, you shouldn't be touching
generic code at all.

Also, I'm not sure I understand things properly.

So serialize_against_pte_lookup() wants to wait for all currently
out-standing __find_linux_pte() instances (which are very similar to
gup_fast).

It seems to want to do this before flushing the THP TLB for some reason;
why? Should not THP observe the normal page table freeing rules, which
include an RCU-like grace period like this already?

Why is THP special here? This doesn't seem adequately explained.

Also, specifically to munmap(), this seems entirely superfluous,
munmap() uses the normal page-table freeing code and should be entirely
fine without additional waiting.

Furthermore, Power never accurately tracks mm_cpumask(), so using that
makes the whole thing more expensive than it needs to be. Also, I
suppose that is buggered vs file backed THP.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
  2019-10-03  2:08   ` Qian Cai
@ 2019-10-03  7:44   ` Peter Zijlstra
  2019-10-03 20:40     ` Leonardo Bras
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-03  7:44 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On Wed, Oct 02, 2019 at 10:33:24PM -0300, Leonardo Bras wrote:
> Adds config option LOCKLESS_PAGE_TABLE_WALK_TRACKING to make it
> possible to enable tracking of lockless pagetable walks directly from
> the kernel config.
> 
> Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com>
> ---
>  mm/Kconfig | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a5dae9a7eb51..00f487a0122f 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -736,4 +736,15 @@ config ARCH_HAS_PTE_SPECIAL
>  config ARCH_HAS_HUGEPD
>  	bool
>  
> +config LOCKLESS_PAGE_TABLE_WALK_TRACKING
> +	bool "Track (and optimize) lockless page table walks"
> +	default n
> +
> +	help
> +	  Maintain a reference count of active lockless page table
> +	  walkers. This adds 4 bytes to struct mm size, and two atomic
> +	  operations to calls such as get_user_pages_fast(). Some
> +	  architectures can optimize lockless page table operations if
> +	  this is enabled.

This shouldn't be a user visible option at all. Either the arch needs
it and selects it or not.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03  7:11   ` Peter Zijlstra
@ 2019-10-03 11:51     ` Peter Zijlstra
  2019-10-03 20:40       ` John Hubbard
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-03 11:51 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > index 818691846c90..3043ea9812d5 100644
> > --- a/include/asm-generic/pgtable.h
> > +++ b/include/asm-generic/pgtable.h
> > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> >  #endif
> >  #endif
> >  
> > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > +{
> > +	unsigned long irq_mask;
> > +
> > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> 
> This will not work for file backed THP. Also, this is a fairly serious
> contention point all on its own.

Kiryl says we have tmpfs-thp; this would be broken vs that, as would
your (PowerPC) use of mm_cpumask() for that IPI.

> > +	/*
> > +	 * Interrupts must be disabled during the lockless page table walk.
> > +	 * That's because the deleting or splitting involves flushing TLBs,
> > +	 * which in turn issues interrupts, that will block when disabled.
> > +	 */
> > +	local_irq_save(irq_mask);
> > +
> > +	/*
> > +	 * This memory barrier pairs with any code that is either trying to
> > +	 * delete page tables, or split huge pages. Without this barrier,
> > +	 * the page tables could be read speculatively outside of interrupt
> > +	 * disabling.
> > +	 */
> > +	smp_mb();
> 
> I don't think this is something smp_mb() can guarantee. smp_mb() is
> defined to order memory accesses, in this case the store of the old
> flags vs whatever comes after this.
> 
> It cannot (in general) order against completion of prior instructions,
> like clearing the interrupt-enable flag.
> 
> Possibly you want barrier_nospec().

I'm still really confused about this barrier. It just doesn't make
sense.

If an interrupt happens before the local_irq_disable()/save(), then it
will discard any and all speculation that would be in progress to handle
the exception.

If there isn't an interrupt (or it happens after disable) it is
irrelevant.

Specifically, that serialize-IPI thing wants to ensure in-progress
lookups are complete, and I can't find a scenario where
local_irq_disable/enable() needs additional help vs IPIs. The moment an
interrupt lands it kills speculation and forces things into
program-order.

Did you perhaps want something like:

	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
		atomic_inc(&foo);
		smp_mb__after_atomic();
	}

	...

	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
		smp_mb__before_atomic();
		atomic_dec(&foo);
	}

To ensure everything happens inside of the increment?

And I still think all that wrong, you really shouldn't need to wait on
munmap().


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03  2:08   ` Qian Cai
@ 2019-10-03 19:04     ` Leonardo Bras
  2019-10-03 19:08       ` Leonardo Bras
  0 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 19:04 UTC (permalink / raw)
  To: Qian Cai
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 438 bytes --]

On Wed, 2019-10-02 at 22:08 -0400, Qian Cai wrote:
> Can’t this name and all those new *lockless* function names be shorter? 
> There are many functions named with *_locked, so how about dropping 
> lockless altogether, i.e., PAGE_TABLE_WALK_TRACKING blah blah?

Thanks for the feedback!

Well, in this case it only tracks the 'lockless pagetable walks'. In
this approach, the 'locked pagetable walks' don't need to be tracked.



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03 19:04     ` Leonardo Bras
@ 2019-10-03 19:08       ` Leonardo Bras
  0 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 19:08 UTC (permalink / raw)
  To: Qian Cai
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch

[-- Attachment #1: Type: text/plain, Size: 620 bytes --]

On Thu, 2019-10-03 at 16:04 -0300, Leonardo Bras wrote:
> On Wed, 2019-10-02 at 22:08 -0400, Qian Cai wrote:
> > Can’t this name and all those new *lockless* function names be shorter? 
> > There are many functions named with *_locked, so how about dropping 
> > lockless altogether, i.e., PAGE_TABLE_WALK_TRACKING blah blah?
> 
> Thanks for the feedback!
> 
> Well, in this case it only tracks the 'lockless pagetable walks'. In
> this approach, the 'locked pagetable walks' don't need to be tracked.

So, using PAGE_TABLE_WALK_TRACKING would not be very accurate (as said
before, there are also the locked ones).

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-03  7:29 ` [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Peter Zijlstra
@ 2019-10-03 20:36   ` Leonardo Bras
  2019-10-03 20:49     ` John Hubbard
  2019-10-04 11:42     ` Peter Zijlstra
  0 siblings, 2 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 20:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

[-- Attachment #1: Type: text/plain, Size: 4020 bytes --]

Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote:
> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
> > If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> > chunk of memory (496GB) mapped with THP, it takes an average of 275
> > seconds, which can cause a lot of problems for the workload (in the
> > qemu case, the guest will lock up for this time).
> > 
> > Trying to find the source of this bug, I found that most of this time
> > is spent in serialize_against_pte_lookup(). This function takes a lot
> > of time in smp_call_function_many() if there is more than a couple of
> > CPUs running the user process. Since this has to happen for every
> > mapped THP, it will take a very long time for large amounts of memory.
> > 
> > By the docs, serialize_against_pte_lookup() is needed to prevent the
> > pmd_t to pte_t casting inside find_current_mm_pte(), or any lockless
> > pagetable walk, from happening concurrently with THP
> > splitting/collapsing.
> > 
> > It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[]
> > after interrupts are re-enabled.
> > Since interrupts are (usually) disabled during a lockless pagetable
> > walk, and serialize_against_pte_lookup() will only return after
> > interrupts are enabled, the walk is protected.
> 
> This is something entirely specific to Power, you shouldn't be touching
> generic code at all.

Up to v4, I was declaring dummy functions so it would not mess with
other archs: http://patchwork.ozlabs.org/patch/1168779/

But it was recommended that I create a generic function that could
guide the way for other archs: http://patchwork.ozlabs.org/patch/1168775/

The idea was to concentrate all the routines for beginning/ending
lockless pagetable walks in these functions, and call them instead of
irq_disable/irq_enable.
Then it was easy to place the refcount-based tracking in these
functions. It is only enabled if the config option selects it.

> 
> Also, I'm not sure I understand things properly.
> 
> So serialize_against_pte_lookup() wants to wait for all currently
> out-standing __find_linux_pte() instances (which are very similar to
> gup_fast).
> 
> It seems to want to do this before flushing the THP TLB for some
> reason; why? Should not THP observe the normal page table freeing
> rules, which include an RCU-like grace period like this already?
> 
> Why is THP special here? This doesn't seem adequately explained.

"It's necessary to monitor lockless pagetable walks, in order to avoid
doing THP splitting/collapsing during them."

If there is a THP split/collapse during the lockless pagetable walk,
the returned ptep can be pointing to an invalid pte.

To avoid that, the pmd is updated, and then
serialize_against_pte_lookup() is run. It runs do_nothing() on every
CPU in cpu_mask.

So, after all CPUs finish running do_nothing(), there is a guarantee
that any 'lockless pagetable walk' still in progress is running on top
of an updated version of this pmd, and so collapsing/splitting the THP
is safe.

> 
> Also, specifically to munmap(), this seems entirely superfluous,
> munmap() uses the normal page-table freeing code and should be entirely
> fine without additional waiting.

To be honest, I remember it being needed in the munmap case, but I
really don't remember the details. I will take a deeper look and come
back with an answer.

> Furthermore, Power never accurately tracks mm_cpumask(), so using that
> makes the whole thing more expensive than it needs to be. Also, I
> suppose that is buggered vs file backed THP.

The accuracy of mm_cpumask is above my knowledge right now. =)

I agree that it's too expensive to do that. That's why I suggested this
method, which can check whether any 'lockless pagetable walk' is
running before trying to serialize. It reduced the waiting time a lot
for large amounts of memory. (more details in the cover letter)

Best regards,

Leonardo Brás

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks
  2019-10-03  7:44   ` Peter Zijlstra
@ 2019-10-03 20:40     ` Leonardo Bras
  0 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 20:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

[-- Attachment #1: Type: text/plain, Size: 226 bytes --]

On Thu, 2019-10-03 at 09:44 +0200, Peter Zijlstra wrote:
> This shouldn't be a user visible option at all. Either the arch needs
> it and selects it or not.

You are right. I will do that on v6.
Thanks for the feedback!

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03 11:51     ` Peter Zijlstra
@ 2019-10-03 20:40       ` John Hubbard
  2019-10-04 11:24         ` Peter Zijlstra
  2019-10-03 21:24       ` Leonardo Bras
  2019-10-05  8:35       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 37+ messages in thread
From: John Hubbard @ 2019-10-03 20:40 UTC (permalink / raw)
  To: Peter Zijlstra, Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	Keith Busch, Paul McKenney

On 10/3/19 4:51 AM, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
...
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 

Hi Peter,

So, would that imply that it's correct to apply approximately the following
patch:

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1adbb8a371c7..cf41eff37e24 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2099,9 +2099,9 @@ INTERRUPT DISABLING FUNCTIONS
 -----------------------------
 
 Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
-(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
-barriers are required in such a situation, they must be provided from some
-other means.
+(RELEASE equivalent) will act as full memory barriers. This is because, for
+all supported CPU architectures, interrupt arrival causes all speculative
+memory accesses to be discarded.
 
?

We're also discussing this over in [1] ("mm: don't expose non-hugetlb page to
fast gup prematurely"), so I'm adding Paul to this thread here as well.

[1] https://lore.kernel.org/r/20191002092447.GC9320@quack2.suse.cz

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-03 20:36   ` Leonardo Bras
@ 2019-10-03 20:49     ` John Hubbard
  2019-10-03 21:38       ` Leonardo Bras
  2019-10-04 11:42     ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: John Hubbard @ 2019-10-03 20:49 UTC (permalink / raw)
  To: Leonardo Bras, Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Christian Brauner, Greg Kroah-Hartman, linux-kernel,
	Logan Gunthorpe, Souptick Joarder, Andrew Morton, linuxppc-dev,
	Roman Gushchin, Kirill A. Shutemov, Al Viro

On 10/3/19 1:36 PM, Leonardo Bras wrote:
> On Thu, 2019-10-03 at 09:29 +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:14PM -0300, Leonardo Bras wrote:
...
>> This is something entirely specific to Power, you shouldn't be touching
>> generic code at all.
> 
> Up to v4, I was declaring dummy functions so it would not mess up with
> other archs: http://patchwork.ozlabs.org/patch/1168779/
> 
> But I was recommended to create a generic function that could guide the
> way to archs: http://patchwork.ozlabs.org/patch/1168775/

Yes. And to clarify, I was assuming that the changes to mm/gup.c were 
required in order to accomplish your goals. Given that assumption, I 
wanted the generic code to be "proper", and that's what that feedback
is about.

If you can somehow do it entirely as an arch-specific thing, then probably
that's even better. Although the other questions about file-backed THP
make it sound like some rethinking across the board is required now.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03 11:51     ` Peter Zijlstra
  2019-10-03 20:40       ` John Hubbard
@ 2019-10-03 21:24       ` Leonardo Bras
  2019-10-04 11:28         ` Peter Zijlstra
  2019-10-05  8:35       ` Aneesh Kumar K.V
  2 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 21:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

[-- Attachment #1: Type: text/plain, Size: 3835 bytes --]

Hello Peter, thanks for the feedback!

On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > index 818691846c90..3043ea9812d5 100644
> > > --- a/include/asm-generic/pgtable.h
> > > +++ b/include/asm-generic/pgtable.h
> > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > >  #endif
> > >  #endif
> > >  
> > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > > +{
> > > +	unsigned long irq_mask;
> > > +
> > > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > 
> > This will not work for file backed THP. Also, this is a fairly serious
> > contention point all on its own.
> 
> Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> your (PowerPC) use of mm_cpumask() for that IPI.

Could you please explain it?
I mean, why does this break tmpfs-thp?

Also, why is mm_cpumask() also broken?

> 
> > > +	/*
> > > +	 * Interrupts must be disabled during the lockless page table walk.
> > > +	 * That's because the deleting or splitting involves flushing TLBs,
> > > +	 * which in turn issues interrupts, that will block when disabled.
> > > +	 */
> > > +	local_irq_save(irq_mask);
> > > +
> > > +	/*
> > > +	 * This memory barrier pairs with any code that is either trying to
> > > +	 * delete page tables, or split huge pages. Without this barrier,
> > > +	 * the page tables could be read speculatively outside of interrupt
> > > +	 * disabling.
> > > +	 */
> > > +	smp_mb();
> > 
> > I don't think this is something smp_mb() can guarantee. smp_mb() is
> > defined to order memory accesses, in this case the store of the old
> > flags vs whatever comes after this.
> > 
> > It cannot (in generic) order against completion of prior instructions,
> > like clearing the interrupt enabled flags.
> > 
> > Possibly you want barrier_nospec().
> 
> I'm still really confused about this barrier. It just doesn't make
> sense.
> 
> If an interrupt happens before the local_irq_disable()/save(), then it
> will discard any and all speculation that would be in progress to handle
> the exception.
> 
> If there isn't an interrupt (or it happens after disable) it is
> irrelevant.
> 
> Specifically, that serialize-IPI thing wants to ensure in-progress
> lookups are complete, and I can't find a scenario where
> local_irq_disable/enable() needs additional help vs IPIs. The moment an
> interrupt lands it kills speculation and forces things into
> program-order.
> 
> Did you perhaps want something like:
> 
> 	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
> 		atomic_inc(&foo);
> 		smp_mb__after_atomic();
> 	}
> 
> 	...
> 
> 	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING)) {
> 		smp_mb__before_atomic();
> 		atomic_dec(&foo);
> 	}
> 
> To ensure everything happens inside of the increment?
> 

I need to rethink this barrier, but yes, I think that's it.
It's how it was in v4.

I changed it because I thought it would be better this way. Well,
it was probably a mistake on my part.
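
For reference, a userspace C11 sketch of the pairing Peter suggests above
(names are hypothetical; atomic_thread_fence() stands in for the kernel's
smp_mb__after_atomic()/smp_mb__before_atomic()):

```c
#include <stdatomic.h>

/* Hypothetical model: the walker count is incremented before the walk's
 * memory accesses and decremented after them, with full fences so the
 * whole walk is contained "inside" the non-zero counter window. */
static atomic_int lockless_walkers;

void begin_lockless_walk(void)
{
	atomic_fetch_add_explicit(&lockless_walkers, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* smp_mb__after_atomic() */
}

void end_lockless_walk(void)
{
	atomic_thread_fence(memory_order_seq_cst); /* smp_mb__before_atomic() */
	atomic_fetch_sub_explicit(&lockless_walkers, 1, memory_order_relaxed);
}

int current_walkers(void)
{
	return atomic_load_explicit(&lockless_walkers, memory_order_relaxed);
}
```

This only models the ordering intent; the real kernel helpers would also
save/restore interrupts around the walk.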

> And I still think all that wrong, you really shouldn't need to wait on
> munmap().

That is something I need to better understand. I mean, before coming
up with this patch, I thought exactly this: do not serialize on munmap.

But along the way I was convinced it would not work on munmap. I need
to recall why, and if it was wrong to assume this, re-think the whole
solution.

Best regards,

Leonardo Brás


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-03 20:49     ` John Hubbard
@ 2019-10-03 21:38       ` Leonardo Bras
  0 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-03 21:38 UTC (permalink / raw)
  To: John Hubbard, Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Christian Brauner, Greg Kroah-Hartman, linux-kernel,
	Logan Gunthorpe, Souptick Joarder, Andrew Morton, linuxppc-dev,
	Roman Gushchin, Kirill A. Shutemov, Al Viro


On Thu, 2019-10-03 at 13:49 -0700, John Hubbard wrote:
> Yes. And to clarify, I was assuming that the changes to mm/gup.c were 
> required in order to accomplish your goals. Given that assumption, I 
> wanted the generic code to be "proper", and that's what that feedback
> is about.

You assumed right. In my counting approach it's necessary to count all
'lockless pagetable walks', including the ones in generic code.

And I think that, even without the counting approach, it was a good way
to focus this 'lockless pagetable walk' routine in a single place.

> 
> Although the other questions about file-backed THP
> make it sound like some rethinking across the board is required now.

Yeap, I need to better understand how the file-backed THP problem
works.

Thanks,

Leonardo Brás


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03 20:40       ` John Hubbard
@ 2019-10-04 11:24         ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-04 11:24 UTC (permalink / raw)
  To: John Hubbard
  Cc: Leonardo Bras, linuxppc-dev, linux-kernel, kvm-ppc, linux-arch,
	linux-mm, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Arnd Bergmann, Aneesh Kumar K.V,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	Keith Busch, Paul McKenney

On Thu, Oct 03, 2019 at 01:40:38PM -0700, John Hubbard wrote:
> On 10/3/19 4:51 AM, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> >> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> ...
> > 
> > I'm still really confused about this barrier. It just doesn't make
> > sense.
> > 
> > If an interrupt happens before the local_irq_disable()/save(), then it
> > will discard any and all speculation that would be in progress to handle
> > the exception.
> > 
> 
> Hi Peter,
> 
> So, would that imply that it's correct to apply approximately the following
> patch:
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 1adbb8a371c7..cf41eff37e24 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -2099,9 +2099,9 @@ INTERRUPT DISABLING FUNCTIONS
>  -----------------------------
>  
>  Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> -(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> -barriers are required in such a situation, they must be provided from some
> -other means.
> +(RELEASE equivalent) will act as full memory barriers. This is because, for
> +all supported CPU architectures, interrupt arrival causes all speculative
> +memory accesses to be discarded.
>  
> ?

No, you're misunderstanding. They imply nothing of the sort.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03 21:24       ` Leonardo Bras
@ 2019-10-04 11:28         ` Peter Zijlstra
  2019-10-09 18:09           ` Leonardo Bras
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-04 11:28 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

On Thu, Oct 03, 2019 at 06:24:07PM -0300, Leonardo Bras wrote:
> Hello Peter, thanks for the feedback!
> 
> On Thu, 2019-10-03 at 13:51 +0200, Peter Zijlstra wrote:
> > On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
> > > > diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> > > > index 818691846c90..3043ea9812d5 100644
> > > > --- a/include/asm-generic/pgtable.h
> > > > +++ b/include/asm-generic/pgtable.h
> > > > @@ -1171,6 +1171,64 @@ static inline bool arch_has_pfn_modify_check(void)
> > > >  #endif
> > > >  #endif
> > > >  
> > > > +#ifndef __HAVE_ARCH_LOCKLESS_PGTBL_WALK_CONTROL
> > > > +static inline unsigned long begin_lockless_pgtbl_walk(struct mm_struct *mm)
> > > > +{
> > > > +	unsigned long irq_mask;
> > > > +
> > > > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > > > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > > 
> > > This will not work for file backed THP. Also, this is a fairly serious
> > > contention point all on its own.
> > 
> > Kiryl says we have tmpfs-thp, this would be broken vs that, as would
> > your (PowerPC) use of mm_cpumask() for that IPI.
> 
> Could you please explain it?
> I mean, why does this break tmpfs-thp?
> Also, why is mm_cpumask() broken?

Because shared pages are not bound by a mm; or does it not share the thp
state between mappings?

> > And I still think all that wrong, you really shouldn't need to wait on
> > munmap().
> 
> That is something I need to better understand. I mean, before coming
> up with this patch, I thought exactly this: do not serialize on munmap.
> 
> But along the way I was convinced it would not work on munmap. I need
> to recall why, and if it was wrong to assume this, re-think the whole
> solution.

And once you (re)figure it out, please write it down. It is a crucial
bit of the puzzle and needs to be part of the Changelogs.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-03 20:36   ` Leonardo Bras
  2019-10-03 20:49     ` John Hubbard
@ 2019-10-04 11:42     ` Peter Zijlstra
  2019-10-04 12:57       ` Peter Zijlstra
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-04 11:42 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

On Thu, Oct 03, 2019 at 05:36:31PM -0300, Leonardo Bras wrote:

> > Also, I'm not sure I understand things properly.
> > 
> > So serialize_against_pte_lookup() wants to wait for all currently
> > out-standing __find_linux_pte() instances (which are very similar to
> > gup_fast).
> > 
> > It seems to want to do this before flushing the THP TLB for some reason;
> > why? Should THP not observe the normal page-table freeing rules, which
> > already include an RCU-like grace period like this one?
> > 
> > Why is THP special here? This doesn't seem adequately explained.
> 
> "It's necessary to monitor lockless pagetable walks, in order to avoid
> doing THP splitting/collapsing during them."
> 
> If there is a THP split/collapse during the lockless pagetable walk,
> the returned ptep can end up pointing to an invalid pte. 

So the whole premise of lockless page-table walks (gup_fast) is that it
can work on in-flux page-tables. Specifically gup_fast() never returns
PTEs, only struct page *, and only if it can increment the page
refcount.

In order to enable this, page-table pages are RCU(-like) freed, such
that even if we access page-tables that have (concurrently) been
unlinked, we'll not UaF (see asm-generic/tlb.h, the comment at
HAVE_RCU_TABLE_FREE). IOW, the worst case is not getting a struct page
*.

I really don't see how THP splitting/collapsing is special here, either
we see the PMD and find a struct page * or we see a PTE and find the
same struct page * (compound page head).

The only thing that needs to be guaranteed is that both PTEs and PMD
page-tables are valid. Is this not so?
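
As an illustration of that contract, here is a hypothetical userspace C
reduction of the refcount-only protocol (this models try_get_page(), not
the kernel's real implementation): the walker never trusts the entry it
read; it only tries to elevate a refcount, and a page already being freed
makes that fail cleanly.

```c
#include <stdatomic.h>

struct page {
	atomic_int refcount;	/* 0 means the page is being freed */
};

/* Try to take a reference: succeeds only if the count is still > 0.
 * This is the only guarantee a lockless walker relies on. */
int try_get_page(struct page *p)
{
	int c = atomic_load(&p->refcount);

	while (c > 0) {
		/* CAS loop: bump the count only if nobody freed it meanwhile */
		if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1))
			return 1;	/* we hold a reference now */
	}
	return 0;	/* page was already on its way out */
}
```

Whether the walker found the page through a PMD or through a split-out
PTE, the same struct page is reached, which is Peter's point above.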

> To avoid that, the pmd is updated, then serialize_against_pte_lookup()
> is run. It runs do_nothing() on every cpu in cpu_mask. 
> 
> So, after all cpus finish running do_nothing(), there is a guarantee
> that any ongoing 'lockless pagetable walk' is running on top of an
> updated version of this pmd, and so collapsing/splitting THP is
> safe.

But why would it matter?! It would find the same struct page * through
either version of the page-tables. *confused*

> > Also, specifically to munmap(), this seems entirely superfluous,
> > munmap() uses the normal page-table freeing code and should be entirely
> > fine without additional waiting.
> 
> To be honest, I remember it being needed in the munmap case, but I
> really don't remember the details. I will take a deeper look and come
> back with an answer. 

munmap() does normal mmu_gather page-table teardown; the THP PMD should
be RCU-like freed just like any other PMD, which should be perfectly
safe vs lockless page-table walks.

If you can find anything there that isn't right, please explain that in
detail and we'll need to look hard at fixing _that_.

> > Furthermore, Power never accurately tracks mm_cpumask(), so using that
> > makes the whole thing more expensive than it needs to be. Also, I
> > suppose that is buggered vs file backed THP.
> 
> The accuracy of mm_cpumask() is above my knowledge right now. =)

Basically PowerPC only ever sets bits in there, unlike x86 that also
clears bits (at expense, but it's worth it for us).



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks
  2019-10-04 11:42     ` Peter Zijlstra
@ 2019-10-04 12:57       ` Peter Zijlstra
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Zijlstra @ 2019-10-04 12:57 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro

On Fri, Oct 04, 2019 at 01:42:36PM +0200, Peter Zijlstra wrote:
> If you can find anything there that isn't right, please explain that in
> detail and we'll need to look hard at fixing _that_.

Also, I can't imagine Nick is happy with 128 CPUs banging on that atomic
counter, esp. since atomics are horrifically slow on Power.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-03 11:51     ` Peter Zijlstra
  2019-10-03 20:40       ` John Hubbard
  2019-10-03 21:24       ` Leonardo Bras
@ 2019-10-05  8:35       ` Aneesh Kumar K.V
  2019-10-08 14:47         ` Kirill A. Shutemov
  2 siblings, 1 reply; 37+ messages in thread
From: Aneesh Kumar K.V @ 2019-10-05  8:35 UTC (permalink / raw)
  To: Peter Zijlstra, Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On 10/3/19 5:21 PM, Peter Zijlstra wrote:
> On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:


....


> 
> And I still think all that wrong, you really shouldn't need to wait on
> munmap().
> 

I do have a patch that does something like that.


+#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
+static inline pmd_t pmdp_huge_get_and_clear_full(struct mm_struct *mm,
+						 unsigned long address, pmd_t *pmdp,
+						 int full)
+{
+	bool serialize = true;
+	/*
+	 * We don't need to serialize against a lockless page table walk if
+	 * we are clearing the pmd due to task exit. For regular munmap, we
+	 * still need to serialize due to the possibility of MADV_DONTNEED
+	 * running parallel to a page fault which can convert a THP pte entry
+	 * to a pointer to a level 4 table.
+	 * Here MADV_DONTNEED is removing the THP entry and the fault is
+	 * filling a level 4 pte.
+	 */
+	if (full == 1)
+		serialize = false;
+	return __pmdp_huge_get_and_clear(mm, address, pmdp, serialize);
  }


If it is a fullmm flush we can skip that serialize, but for everything
else we need to serialize. MADV_DONTNEED is another case. I haven't sent
this yet, because I was trying to look at what it takes to switch that
MADV variant to take mmap_sem in write mode.

MADV_DONTNEED has caused us multiple issues due to the fact that it can
run in parallel to a page fault. I am not sure whether we have a
known/noticeable performance gain in allowing that with mmap_sem held
in read mode.




-aneesh


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-05  8:35       ` Aneesh Kumar K.V
@ 2019-10-08 14:47         ` Kirill A. Shutemov
  0 siblings, 0 replies; 37+ messages in thread
From: Kirill A. Shutemov @ 2019-10-08 14:47 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Peter Zijlstra, Leonardo Bras, linuxppc-dev, linux-kernel,
	kvm-ppc, linux-arch, linux-mm, Benjamin Herrenschmidt,
	Paul Mackerras, Michael Ellerman, Arnd Bergmann,
	Christophe Leroy, Nicholas Piggin, Andrew Morton,
	Mahesh Salgaonkar, Reza Arbab, Santosh Sivaraj, Balbir Singh,
	Thomas Gleixner, Greg Kroah-Hartman, Mike Rapoport,
	Allison Randal, Jason Gunthorpe, Dan Williams, Vlastimil Babka,
	Christoph Lameter, Logan Gunthorpe, Andrey Ryabinin,
	Alexey Dobriyan, Souptick Joarder, Mathieu Desnoyers,
	Ralph Campbell, Jesper Dangaard Brouer, Jann Horn,
	Davidlohr Bueso, Ingo Molnar, Christian Brauner, Michal Hocko,
	Elena Reshetova, Roman Gushchin, Andrea Arcangeli, Al Viro,
	Dmitry V. Levin, Jérôme Glisse, Song Liu,
	Bartlomiej Zolnierkiewicz, Ira Weiny, Kirill A. Shutemov,
	John Hubbard, Keith Busch

On Sat, Oct 05, 2019 at 02:05:29PM +0530, Aneesh Kumar K.V wrote:
> MADV_DONTNEED has caused us multiple issues due to the fact that it can run
> in parallel to page fault. I am not sure whether we have a known/noticeable
> performance gain in allowing that with mmap_sem held in read mode.

Yes we do. MADV_DONTNEED is used a lot by userspace memory allocators,
and it would be a very noticeable performance regression if we switched
it to down_write(mmap_sem).

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-03  1:33 ` [PATCH v5 02/11] powerpc/mm: Adds counting method " Leonardo Bras
@ 2019-10-08 15:11   ` Christopher Lameter
  2019-10-08 17:13     ` Leonardo Bras
  0 siblings, 1 reply; 37+ messages in thread
From: Christopher Lameter @ 2019-10-08 15:11 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: linuxppc-dev, linux-kernel, kvm-ppc, linux-arch, linux-mm,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy,
	Nicholas Piggin, Andrew Morton, Mahesh Salgaonkar, Reza Arbab,
	Santosh Sivaraj, Balbir Singh, Thomas Gleixner,
	Greg Kroah-Hartman, Mike Rapoport, Allison Randal,
	Jason Gunthorpe, Dan Williams, Vlastimil Babka, Logan Gunthorpe,
	Andrey Ryabinin, Alexey Dobriyan, Souptick Joarder,
	Mathieu Desnoyers, Ralph Campbell, Jesper Dangaard Brouer,
	Jann Horn, Davidlohr Bueso, Peter Zijlstra (Intel),
	Ingo Molnar, Christian Brauner, Michal Hocko, Elena Reshetova,
	Roman Gushchin, Andrea Arcangeli, Al Viro, Dmitry V. Levin,
	Jérôme Glisse, Song Liu, Bartlomiej Zolnierkiewicz,
	Ira Weiny, Kirill A. Shutemov, John Hubbard, Keith Busch



On Wed, 2 Oct 2019, Leonardo Bras wrote:

> +
> +inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
> +						 bool disable_irq)
> +{
> +	unsigned long irq_mask = 0;
> +
> +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> +		atomic_inc(&mm->lockless_pgtbl_walkers);
> +

You are creating contention on a single exclusive cacheline. Doesn't
this defeat the whole purpose of the lockless page table walk? Wouldn't
using mmap_sem or so cause the same performance regression?


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-08 15:11   ` Christopher Lameter
@ 2019-10-08 17:13     ` Leonardo Bras
  2019-10-08 17:43       ` Christopher Lameter
  0 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-08 17:13 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Song Liu, Michal Hocko, Peter Zijlstra (Intel),
	Dmitry V. Levin, Keith Busch, linux-mm, Paul Mackerras,
	Christian Brauner, Ira Weiny, Dan Williams, Elena Reshetova,
	linux-arch, Santosh Sivaraj, Davidlohr Bueso, Aneesh Kumar K.V,
	Bartlomiej Zolnierkiewicz, Mike Rapoport, Jason Gunthorpe,
	Vlastimil Babka, Mahesh Salgaonkar, Andrey Ryabinin,
	Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli, Ralph Campbell,
	Arnd Bergmann, Jann Horn, John Hubbard, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Greg Kroah-Hartman, linux-kernel, Logan Gunthorpe,
	Souptick Joarder, Andrew Morton, linuxppc-dev, Roman Gushchin,
	Kirill A. Shutemov, Al Viro


On Tue, 2019-10-08 at 15:11 +0000, Christopher Lameter wrote:
> 
> On Wed, 2 Oct 2019, Leonardo Bras wrote:
> 
> > +
> > +inline unsigned long __begin_lockless_pgtbl_walk(struct mm_struct *mm,
> > +						 bool disable_irq)
> > +{
> > +	unsigned long irq_mask = 0;
> > +
> > +	if (IS_ENABLED(CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING))
> > +		atomic_inc(&mm->lockless_pgtbl_walkers);
> > +
> 
> You are creating contention on a single exclusive cacheline. Doesn't
> this defeat the whole purpose of the lockless page table walk? Wouldn't
> using mmap_sem or so cause the same performance regression?

Sorry, I did not understand that question.
I mean, this is just a refcount and never takes a lock.


FYI: This function was updated as follows, and will be in v6:

#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	atomic_inc(&mm->lockless_pgtbl_walkers);
#endif
	smp_mb();

IS_ENABLED() doesn't work here if CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
is not defined: the mm member lockless_pgtbl_walkers doesn't exist in
that case, so referencing it causes a build error.
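
A minimal reduction of why the #ifdef is needed (hypothetical, cut down
from the real structures): IS_ENABLED() only evaluates to 0, so both
branches of the `if` must still compile, and a conditionally-declared
struct member breaks that.

```c
/* With the option defined, the member exists and the counter works;
 * with it undefined, any textual reference to the member, even inside
 * `if (0)`, would fail to compile, hence the #ifdef around the access. */
#define CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING 1

struct mm_struct {
	int map_count;		/* unrelated member, always present */
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	int lockless_pgtbl_walkers;
#endif
};

int begin_lockless_pgtbl_walk(struct mm_struct *mm)
{
#ifdef CONFIG_LOCKLESS_PAGE_TABLE_WALK_TRACKING
	return ++mm->lockless_pgtbl_walkers;
#else
	return 0;
#endif
}
```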


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-08 17:13     ` Leonardo Bras
@ 2019-10-08 17:43       ` Christopher Lameter
  2019-10-08 18:02         ` Leonardo Bras
  0 siblings, 1 reply; 37+ messages in thread
From: Christopher Lameter @ 2019-10-08 17:43 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Song Liu, Michal Hocko, Peter Zijlstra (Intel),
	Dmitry V. Levin, Keith Busch, linux-mm, Paul Mackerras,
	Christian Brauner, Ira Weiny, Dan Williams, Elena Reshetova,
	linux-arch, Santosh Sivaraj, Davidlohr Bueso, Aneesh Kumar K.V,
	Bartlomiej Zolnierkiewicz, Mike Rapoport, Jason Gunthorpe,
	Vlastimil Babka, Mahesh Salgaonkar, Andrey Ryabinin,
	Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli, Ralph Campbell,
	Arnd Bergmann, Jann Horn, John Hubbard, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Greg Kroah-Hartman, linux-kernel, Logan Gunthorpe,
	Souptick Joarder, Andrew Morton, linuxppc-dev, Roman Gushchin,
	Kirill A. Shutemov, Al Viro

On Tue, 8 Oct 2019, Leonardo Bras wrote:

> > You are creating contention on a single exclusive cacheline. Doesn't
> > this defeat the whole purpose of the lockless page table walk? Wouldn't
> > using mmap_sem or so cause the same performance regression?
>
> Sorry, I did not understand that question.
> I mean, this is just a refcount and never takes a lock.

Locks also use atomic operations like a refcount increment. Both require
the cacheline to be in exclusive state. So the impact is very similar.
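
One common way around that contention (a sketch only, with hypothetical
names; this is not what the patch does) is to shard the counter into
per-CPU slots, so walkers touch different cachelines and only the rare
serializing side pays the cost of summing them:

```c
#include <stdatomic.h>

#define NR_SLOTS 8	/* stand-in for nr_cpu_ids */

/* One counter per slot; each walker bumps its own slot, avoiding the
 * exclusive-state cacheline ping-pong of a single shared atomic.
 * (A real implementation would also pad each slot to a cacheline.) */
static atomic_long walkers[NR_SLOTS];

void begin_walk(int cpu)
{
	atomic_fetch_add(&walkers[cpu % NR_SLOTS], 1);
}

void end_walk(int cpu)
{
	atomic_fetch_sub(&walkers[cpu % NR_SLOTS], 1);
}

/* The serializing (slow) side pays for the sharding by summing. */
long total_walkers(void)
{
	long sum = 0;

	for (int i = 0; i < NR_SLOTS; i++)
		sum += atomic_load(&walkers[i]);
	return sum;
}
```

This is the same trade-off the kernel's percpu counters make: fast-path
updates stay local, and only a global read has to visit every slot.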


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-08 17:43       ` Christopher Lameter
@ 2019-10-08 18:02         ` Leonardo Bras
  2019-10-08 18:27           ` Christopher Lameter
  0 siblings, 1 reply; 37+ messages in thread
From: Leonardo Bras @ 2019-10-08 18:02 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: Song Liu, Michal Hocko, Peter Zijlstra (Intel),
	Dmitry V. Levin, Keith Busch, linux-mm, Paul Mackerras,
	Christian Brauner, Ira Weiny, Dan Williams, Elena Reshetova,
	linux-arch, Santosh Sivaraj, Davidlohr Bueso, Aneesh Kumar K.V,
	Bartlomiej Zolnierkiewicz, Mike Rapoport, Jason Gunthorpe,
	Vlastimil Babka, Mahesh Salgaonkar, Andrey Ryabinin,
	Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli, Ralph Campbell,
	Arnd Bergmann, Jann Horn, John Hubbard, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Greg Kroah-Hartman, linux-kernel, Logan Gunthorpe,
	Souptick Joarder, Andrew Morton, linuxppc-dev, Roman Gushchin,
	Kirill A. Shutemov, Al Viro


On Tue, 2019-10-08 at 17:43 +0000, Christopher Lameter wrote:
> On Tue, 8 Oct 2019, Leonardo Bras wrote:
> 
> > > You are creating contention on a single exclusive cacheline. Doesn't
> > > this defeat the whole purpose of the lockless page table walk? Wouldn't
> > > using mmap_sem or so cause the same performance regression?
> > 
> > Sorry, I did not understand that question.
> > I mean, this is just a refcount and never takes a lock.
> 
> Locks also use atomic operations like a refcount increment. Both require
> the cacheline to be in exclusive state. So the impact is very similar.

Thanks for explaining. :)

So you say that the performance impact of using my approach is the same
as using locks? (supposing the lock never waits)

So, do 'lockless pagetable walks' exist only for the sake of better
performance? 

I thought they existed to enable doing pagetable walks in states where
locking was not safe.

Is that right?

Thanks!
Leonardo Brás,



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 02/11] powerpc/mm: Adds counting method to monitor lockless pgtable walks
  2019-10-08 18:02         ` Leonardo Bras
@ 2019-10-08 18:27           ` Christopher Lameter
  0 siblings, 0 replies; 37+ messages in thread
From: Christopher Lameter @ 2019-10-08 18:27 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Song Liu, Michal Hocko, Peter Zijlstra (Intel),
	Dmitry V. Levin, Keith Busch, linux-mm, Paul Mackerras,
	Christian Brauner, Ira Weiny, Dan Williams, Elena Reshetova,
	linux-arch, Santosh Sivaraj, Davidlohr Bueso, Aneesh Kumar K.V,
	Bartlomiej Zolnierkiewicz, Mike Rapoport, Jason Gunthorpe,
	Vlastimil Babka, Mahesh Salgaonkar, Andrey Ryabinin,
	Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli, Ralph Campbell,
	Arnd Bergmann, Jann Horn, John Hubbard, Jesper Dangaard Brouer,
	Nicholas Piggin, Jérôme Glisse, Mathieu Desnoyers,
	kvm-ppc, Thomas Gleixner, Reza Arbab, Allison Randal,
	Greg Kroah-Hartman, linux-kernel, Logan Gunthorpe,
	Souptick Joarder, Andrew Morton, linuxppc-dev, Roman Gushchin,
	Kirill A. Shutemov, Al Viro



On Tue, 8 Oct 2019, Leonardo Bras wrote:

> So you say that the performance impact of using my approach is the same
> as using locks? (supposing the lock never waits)
>
> So, do 'lockless pagetable walks' exist only for the sake of better
> performance?

I thought that was the major motivation here.

> I thought they existed to enable doing pagetable walks in states where
> locking was not safe.

That sounds profoundly dangerous. Usually locking makes things safe to
access. Accesses without locking require a lot of care.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks
  2019-10-04 11:28         ` Peter Zijlstra
@ 2019-10-09 18:09           ` Leonardo Bras
  0 siblings, 0 replies; 37+ messages in thread
From: Leonardo Bras @ 2019-10-09 18:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Song Liu, Michal Hocko, Dmitry V. Levin, Keith Busch, linux-mm,
	Paul Mackerras, Christoph Lameter, Ira Weiny, Dan Williams,
	Elena Reshetova, linux-arch, Santosh Sivaraj, Davidlohr Bueso,
	Aneesh Kumar K.V, Bartlomiej Zolnierkiewicz, Mike Rapoport,
	Jason Gunthorpe, Vlastimil Babka, Mahesh Salgaonkar,
	Andrey Ryabinin, Alexey Dobriyan, Ingo Molnar, Andrea Arcangeli,
	Ralph Campbell, Arnd Bergmann, Jann Horn, John Hubbard,
	Jesper Dangaard Brouer, Nicholas Piggin, Jérôme Glisse,
	Mathieu Desnoyers, kvm-ppc, Thomas Gleixner, Reza Arbab,
	Allison Randal, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Logan Gunthorpe, Souptick Joarder, Andrew Morton,
	linuxppc-dev, Roman Gushchin, Kirill A. Shutemov, Al Viro


On Fri, 2019-10-04 at 13:28 +0200, Peter Zijlstra wrote:
> > Could you please explain it?
> > I mean, why does this break tmpfs-thp?
> > Also, why is mm_cpumask() broken?
> 
> Because shared pages are not bound by a mm; or does it not share the thp
> state between mappings?

From what I could understand, even though the memory is shared, the
mapping may differ between processes (i.e. the same physical memory
that is mapped as a hugepage in process A can be mapped as a lot of
small pages in process B).

Did I miss something here?  

> And once you (re)figure it out, please write it down. It is a crucial
> bit of the puzzle and needs to be part of the Changelogs.

I am still investing time studying this. More on this later :)

Thanks!


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, back to index

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-03  1:33 [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 01/11] asm-generic/pgtable: Adds generic functions to monitor lockless pgtable walks Leonardo Bras
2019-10-03  7:11   ` Peter Zijlstra
2019-10-03 11:51     ` Peter Zijlstra
2019-10-03 20:40       ` John Hubbard
2019-10-04 11:24         ` Peter Zijlstra
2019-10-03 21:24       ` Leonardo Bras
2019-10-04 11:28         ` Peter Zijlstra
2019-10-09 18:09           ` Leonardo Bras
2019-10-05  8:35       ` Aneesh Kumar K.V
2019-10-08 14:47         ` Kirill A. Shutemov
2019-10-03  1:33 ` [PATCH v5 02/11] powerpc/mm: Adds counting method " Leonardo Bras
2019-10-08 15:11   ` Christopher Lameter
2019-10-08 17:13     ` Leonardo Bras
2019-10-08 17:43       ` Christopher Lameter
2019-10-08 18:02         ` Leonardo Bras
2019-10-08 18:27           ` Christopher Lameter
2019-10-03  1:33 ` [PATCH v5 03/11] mm/gup: Applies counting method to monitor gup_pgd_range Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 04/11] powerpc/mce_power: Applies counting method to monitor lockless pgtbl walks Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 05/11] powerpc/perf: " Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 06/11] powerpc/mm/book3s64/hash: " Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 07/11] powerpc/kvm/e500: " Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 08/11] powerpc/kvm/book3s_hv: " Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 09/11] powerpc/kvm/book3s_64: " Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 10/11] mm/Kconfig: Adds config option to track lockless pagetable walks Leonardo Bras
2019-10-03  2:08   ` Qian Cai
2019-10-03 19:04     ` Leonardo Bras
2019-10-03 19:08       ` Leonardo Bras
2019-10-03  7:44   ` Peter Zijlstra
2019-10-03 20:40     ` Leonardo Bras
2019-10-03  1:33 ` [PATCH v5 11/11] powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing Leonardo Bras
2019-10-03  7:29 ` [PATCH v5 00/11] Introduces new count-based method for tracking lockless pagetable walks Peter Zijlstra
2019-10-03 20:36   ` Leonardo Bras
2019-10-03 20:49     ` John Hubbard
2019-10-03 21:38       ` Leonardo Bras
2019-10-04 11:42     ` Peter Zijlstra
2019-10-04 12:57       ` Peter Zijlstra
