* [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier
@ 2018-06-26 17:31 Rik van Riel
  2018-06-26 17:31 ` [PATCH 1/6] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault, songliubraving

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 1.9% of a 48 CPU system during a netperf run. Digging
into the profile, the atomic operations in cpumask_clear_cpu and
cpumask_set_cpu are responsible for about half of that CPU use.

However, the CPUs running netperf are simply switching back and
forth between netperf and the idle task, which would not require any
changes to the mm_cpumask if lazy TLB mode were used.

Additionally, the init_mm cpumask ends up being the most heavily
contended one in the system, for no reason at all.

By making really lazy TLB mode work again on modern kernels, sending
a shootdown IPI only when page table pages are being unmapped, we get
back some performance.
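
In rough outline, the approach the patches below take looks like this
(a sketch of the idea, not literal code; enter_lazy_tlb ends up looking
like this in patch 5, and the flush side filtering is in patches 3 and 4):

    /* Switching to a kernel thread or the idle task: keep the old mm loaded. */
    void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
    {
            this_cpu_write(cpu_tlbstate.is_lazy, true);
            /* no mm_cpumask() update, no CR3 write */
    }

    /*
     * mprotect/madvise/munmap TLB flushes: skip the IPI for CPUs with
     * is_lazy set; they notice the advanced .tlb_gen when they switch
     * back to a real mm.
     *
     * Page table freeing: IPI the CPUs that are still lazy and make
     * them switch to init_mm, so they cannot speculatively walk page
     * tables that are about to be freed.
     */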

v2 of the series implements things in the way suggested by Andy
Lutomirski, which is a nice simplification from before.  If patch 3
looks larger, it is because some of the existing code changed
indentation, so it can easily be used by both sides of the if/else
test.


Song's netperf performance results:

w/o patchset:

Throughput: 1.71264e+06
0.95%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off
0.77%  netserver        [kernel.vmlinux]          [k] switch_mm_irqs_off

w/ patchset:

Throughput: 1.74075e+06
0.87%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off

With these patches applied, netserver no longer shows up in the
switch_mm_irqs_off profile, and the CPU use of enter_lazy_tlb stayed
below the 0.05% threshold of the statistics gathered by Song's scripts.


With a memcache-style workload, performance does not change measurably,
but the amount of CPU time used by switch_mm_irqs_off and other parts
of the context switch code does appear to go down in profiles.


I am still working on a patch to also get rid of the continuous
pounding on mm->count during lazy TLB entry and exit, when the same
mm_struct is being used all the time. I do not have that working yet.

Until then, these patches provide a nice performance boost, as well
as a little memory savings by shrinking the size of mm_struct.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/6] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  2018-06-26 17:31 ` [PATCH 2/6] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and just have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
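
In compressed form, the layout and sizing scheme described above looks
roughly like this (a sketch only; the real code is in the diff below):

    struct mm_struct {
            struct {
                    /* ... all other fields, randomizable as a group ... */
            } __randomize_layout;

            /* Dynamically sized; must stay at the end of the structure. */
            unsigned long cpu_bitmap[];
    };

    /* proc_caches_init(): fixed part plus one bit per possible CPU. */
    mm_size = sizeof(struct mm_struct) + cpumask_size();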

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm_types.h | 237 ++++++++++++++++++++++++-----------------------
 kernel/fork.c            |  15 +--
 mm/init-mm.c             |  11 +++
 3 files changed, 140 insertions(+), 123 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..e06de7e492d0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,179 @@ struct core_state {
 
 struct kioctx_table;
 struct mm_struct {
-	struct vm_area_struct *mmap;		/* list of VMAs */
-	struct rb_root mm_rb;
-	u32 vmacache_seqnum;                   /* per-thread vmacache */
+	struct {
+		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct rb_root mm_rb;
+		u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
+		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
 #endif
-	unsigned long mmap_base;		/* base of mmap area */
-	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
+		unsigned long mmap_base;	/* base of mmap area */
+		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-	/* Base adresses for compatible mmap() */
-	unsigned long mmap_compat_base;
-	unsigned long mmap_compat_legacy_base;
+		/* Base adresses for compatible mmap() */
+		unsigned long mmap_compat_base;
+		unsigned long mmap_compat_legacy_base;
 #endif
-	unsigned long task_size;		/* size of task vm space */
-	unsigned long highest_vm_end;		/* highest vma end address */
-	pgd_t * pgd;
-
-	/**
-	 * @mm_users: The number of users including userspace.
-	 *
-	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
-	 * to 0 (i.e. when the task exits and there are no other temporary
-	 * reference holders), we also release a reference on @mm_count
-	 * (which may then free the &struct mm_struct if @mm_count also
-	 * drops to 0).
-	 */
-	atomic_t mm_users;
-
-	/**
-	 * @mm_count: The number of references to &struct mm_struct
-	 * (@mm_users count as 1).
-	 *
-	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-	 * &struct mm_struct is freed.
-	 */
-	atomic_t mm_count;
+		unsigned long task_size;	/* size of task vm space */
+		unsigned long highest_vm_end;	/* highest vma end address */
+		pgd_t * pgd;
+
+		/**
+		 * @mm_users: The number of users including userspace.
+		 *
+		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+		 * drops to 0 (i.e. when the task exits and there are no other
+		 * temporary reference holders), we also release a reference on
+		 * @mm_count (which may then free the &struct mm_struct if
+		 * @mm_count also drops to 0).
+		 */
+		atomic_t mm_users;
+
+		/**
+		 * @mm_count: The number of references to &struct mm_struct
+		 * (@mm_users count as 1).
+		 *
+		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+		 * &struct mm_struct is freed.
+		 */
+		atomic_t mm_count;
 
 #ifdef CONFIG_MMU
-	atomic_long_t pgtables_bytes;		/* PTE page table pages */
+		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
-	int map_count;				/* number of VMAs */
+		int map_count;			/* number of VMAs */
 
-	spinlock_t page_table_lock;		/* Protects page tables and some counters */
-	struct rw_semaphore mmap_sem;
+		spinlock_t page_table_lock; /* Protects page tables and some
+					     * counters
+					     */
+		struct rw_semaphore mmap_sem;
 
-	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
-						 * together off init_mm.mmlist, and are protected
-						 * by mmlist_lock
-						 */
+		struct list_head mmlist; /* List of maybe swapped mm's.	These
+					  * are globally strung together off
+					  * init_mm.mmlist, and are protected
+					  * by mmlist_lock
+					  */
 
 
-	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
-	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+		unsigned long hiwater_rss; /* High-watermark of RSS usage */
+		unsigned long hiwater_vm;  /* High-water virtual memory usage */
 
-	unsigned long total_vm;		/* Total pages mapped */
-	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */
-	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
-	unsigned long stack_vm;		/* VM_STACK */
-	unsigned long def_flags;
+		unsigned long total_vm;	   /* Total pages mapped */
+		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
+		unsigned long pinned_vm;   /* Refcount permanently increased */
+		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+		unsigned long stack_vm;	   /* VM_STACK */
+		unsigned long def_flags;
 
-	spinlock_t arg_lock; /* protect the below fields */
-	unsigned long start_code, end_code, start_data, end_data;
-	unsigned long start_brk, brk, start_stack;
-	unsigned long arg_start, arg_end, env_start, env_end;
+		spinlock_t arg_lock; /* protect the below fields */
+		unsigned long start_code, end_code, start_data, end_data;
+		unsigned long start_brk, brk, start_stack;
+		unsigned long arg_start, arg_end, env_start, env_end;
 
-	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	/*
-	 * Special counters, in some configurations protected by the
-	 * page_table_lock, in other configurations by being atomic.
-	 */
-	struct mm_rss_stat rss_stat;
-
-	struct linux_binfmt *binfmt;
+		/*
+		 * Special counters, in some configurations protected by the
+		 * page_table_lock, in other configurations by being atomic.
+		 */
+		struct mm_rss_stat rss_stat;
 
-	cpumask_var_t cpu_vm_mask_var;
+		struct linux_binfmt *binfmt;
 
-	/* Architecture-specific MM context */
-	mm_context_t context;
+		/* Architecture-specific MM context */
+		mm_context_t context;
 
-	unsigned long flags; /* Must use atomic bitops to access the bits */
+		unsigned long flags; /* Must use atomic bitops to access */
 
-	struct core_state *core_state; /* coredumping support */
+		struct core_state *core_state; /* coredumping support */
 #ifdef CONFIG_MEMBARRIER
-	atomic_t membarrier_state;
+		atomic_t membarrier_state;
 #endif
 #ifdef CONFIG_AIO
-	spinlock_t			ioctx_lock;
-	struct kioctx_table __rcu	*ioctx_table;
+		spinlock_t			ioctx_lock;
+		struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+		/*
+		 * "owner" points to a task that is regarded as the canonical
+		 * user/owner of this mm. All of the following must be true in
+		 * order for it to be changed:
+		 *
+		 * current == mm->owner
+		 * current->mm != mm
+		 * new_owner->mm == mm
+		 * new_owner->alloc_lock is held
+		 */
+		struct task_struct __rcu *owner;
 #endif
-	struct user_namespace *user_ns;
+		struct user_namespace *user_ns;
 
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file __rcu *exe_file;
+		/* store ref to file /proc/<pid>/exe symlink points to */
+		struct file __rcu *exe_file;
 #ifdef CONFIG_MMU_NOTIFIER
-	struct mmu_notifier_mm *mmu_notifier_mm;
+		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
+		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * numa_next_scan is the next time that the PTEs will be marked
-	 * pte_numa. NUMA hinting faults will gather statistics and migrate
-	 * pages to new nodes if necessary.
-	 */
-	unsigned long numa_next_scan;
+		/*
+		 * numa_next_scan is the next time that the PTEs will be marked
+		 * pte_numa. NUMA hinting faults will gather statistics and
+		 * migrate pages to new nodes if necessary.
+		 */
+		unsigned long numa_next_scan;
 
-	/* Restart point for scanning and setting pte_numa */
-	unsigned long numa_scan_offset;
+		/* Restart point for scanning and setting pte_numa */
+		unsigned long numa_scan_offset;
 
-	/* numa_scan_seq prevents two threads setting pte_numa */
-	int numa_scan_seq;
+		/* numa_scan_seq prevents two threads setting pte_numa */
+		int numa_scan_seq;
 #endif
-	/*
-	 * An operation with batched TLB flushing is going on. Anything that
-	 * can move process memory needs to flush the TLB when moving a
-	 * PROT_NONE or PROT_NUMA mapped page.
-	 */
-	atomic_t tlb_flush_pending;
+		/*
+		 * An operation with batched TLB flushing is going on. Anything
+		 * that can move process memory needs to flush the TLB when
+		 * moving a PROT_NONE or PROT_NUMA mapped page.
+		 */
+		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/* See flush_tlb_batched_pending() */
-	bool tlb_flush_batched;
+		/* See flush_tlb_batched_pending() */
+		bool tlb_flush_batched;
 #endif
-	struct uprobes_state uprobes_state;
+		struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
-	atomic_long_t hugetlb_usage;
+		atomic_long_t hugetlb_usage;
 #endif
-	struct work_struct async_put_work;
+		struct work_struct async_put_work;
 
 #if IS_ENABLED(CONFIG_HMM)
-	/* HMM needs to track a few things per mm */
-	struct hmm *hmm;
+		/* HMM needs to track a few things per mm */
+		struct hmm *hmm;
 #endif
-} __randomize_layout;
+	} __randomize_layout;
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
+};
 
 extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on the maximum CPU number this system
+	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/6] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
  2018-06-26 17:31 ` [PATCH 1/6] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  2018-06-27  6:03   ` kbuild test robot
  2018-06-26 17:31 ` [PATCH 3/6] x86,tlb: make lazy TLB mode lazier Rik van Riel
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and now point into la-la land.

This problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operations like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h |  5 +++++
 arch/x86/mm/tlb.c               | 27 +++++++++++++++++++++++++++
 include/asm-generic/tlb.h       | 10 ++++++++++
 mm/memory.c                     | 22 ++++++++++++++--------
 4 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..3aa3204b5dc0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 	native_flush_tlb_others(mask, info)
 #endif
 
+extern void tlb_flush_remove_tables(struct mm_struct *mm);
+extern void tlb_flush_remove_tables_local(void *arg);
+
+#define HAVE_TLB_FLUSH_REMOVE_TABLES
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6eb1f34c3c85..9a893673c56b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -646,6 +646,33 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	put_cpu();
 }
 
+void tlb_flush_remove_tables_local(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+			this_cpu_read(cpu_tlbstate.is_lazy)) {
+		/*
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
+		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
+	}
+}
+
+void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+	int cpu = get_cpu();
+	/*
+	 * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+
+	put_cpu();
+}
 
 static void do_flush_tlb_all(void *info)
 {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..f0b462ddfa15 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,4 +295,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
+/*
+ * Used to flush the TLB when page tables are removed, when lazy
+ * TLB mode may cause a CPU to retain intermediate translations
+ * pointing to about-to-be-freed page table memory.
+ */
+#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
+#define tlb_flush_remove_tables(mm) do {} while (0)
+#define tlb_flush_remove_tables_local(mm) do {} while (0)
+#endif
+
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..c2e83c1d124c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -326,16 +326,20 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 
-/*
- * See the comment near struct mmu_table_batch.
- */
-
 static void tlb_remove_table_smp_sync(void *arg)
 {
-	/* Simply deliver the interrupt */
+	struct mm_struct *mm = arg;
+	/*
+	 * On most architectures this does nothing. Simply delivering the
+	 * interrupt is enough to prevent races with software page table
+	 * walking like that done in get_user_pages_fast.
+	 *
+	 * See the comment near struct mmu_table_batch.
+	 */
+	tlb_flush_remove_tables_local(mm);
 }
 
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
 {
 	/*
 	 * This isn't an RCU grace period and hence the page-tables cannot be
@@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
 	 * It is however sufficient for software page-table walkers that rely on
 	 * IRQ disabling. See the comment near struct mmu_table_batch.
 	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
 	__tlb_remove_table(table);
 }
 
@@ -365,6 +369,8 @@ void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
+	tlb_flush_remove_tables(tlb->mm);
+
 	if (*batch) {
 		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
@@ -387,7 +393,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
-			tlb_remove_table_one(table);
+			tlb_remove_table_one(table, tlb);
 			return;
 		}
 		(*batch)->nr = 0;
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/6] x86,tlb: make lazy TLB mode lazier
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
  2018-06-26 17:31 ` [PATCH 1/6] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
  2018-06-26 17:31 ` [PATCH 2/6] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  2018-06-27 18:10   ` Andy Lutomirski
  2018-06-26 17:31 ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
which checks whether .tlb_gen changed, and the TLB invalidation code, which
increments .tlb_gen whenever page table entries get invalidated.

The atomic increment in inc_mm_tlb_gen is its own barrier; the context
switch code adds an explicit barrier between reading tlbstate.is_lazy and
next->context.tlb_gen.

Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
are not removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.
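
Condensed, the two sides of that ordering look like this (a sketch; see
the diff below and the review discussion further down the thread, where
the exact barrier is debated):

    /* TLB invalidation side: */
    inc_mm_tlb_gen(mm);         /* atomic increment, acts as a full barrier */
    /* ... IPIs may then be skipped for CPUs with cpu_tlbstate.is_lazy set */

    /* Context switch side, on a CPU that had been lazy: */
    smp_rmb();                  /* order against the .tlb_gen increment */
    next_tlb_gen = atomic64_read(&next->context.tlb_gen);
    /* flush locally if this CPU's cached tlb_gen is behind */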

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/uv/uv.h  |   6 +-
 arch/x86/mm/tlb.c             | 131 ++++++++++++++++++++++++++++--------------
 arch/x86/platform/uv/tlb_uv.c |   2 +-
 3 files changed, 92 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index a80c0673798f..d801afb5fe90 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -17,7 +17,7 @@ extern int is_uv_hubless(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 const struct flush_tlb_info *info);
 
 #else	/* X86_UV */
@@ -27,8 +27,8 @@ static inline int is_uv_system(void)	{ return 0; }
 static inline int is_uv_hubless(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask,
 		    const struct flush_tlb_info *info)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9a893673c56b..137a2c62c75b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/gfp.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -185,8 +186,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
+	bool need_flush;
+	u16 new_asid;
 
 	/*
 	 * NB: The scheduler will call us with prev == next when switching
@@ -250,10 +254,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
-		return;
+		/*
+		 * Switching straight from one thread in a process to another
+		 * thread in the same process requires no TLB flush at all.
+		 */
+		if (!was_lazy)
+			return;
+
+		/*
+		 * The code below checks whether there was a TLB flush while
+		 * this CPU was in lazy TLB mode. The barrier ensures ordering
+		 * with the TLB invalidation code advancing .tlb_gen.
+		 */
+		smp_rmb();
 	} else {
-		u16 new_asid;
-		bool need_flush;
 		u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id);
 
 		/*
@@ -290,48 +304,47 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				real_prev != &init_mm);
 		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-		/*
-		 * Start remote flushes and then read tlb_gen.
-		 */
+		/* Start remote flushes. */
 		cpumask_set_cpu(cpu, mm_cpumask(next));
-		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
-
-		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
+	}
 
-		if (need_flush) {
-			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
-			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			load_new_mm_cr3(next->pgd, new_asid, true);
+	/* Read the tlb_gen to check whether a flush is needed. */
+	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+	choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
 
-			/*
-			 * NB: This gets called via leave_mm() in the idle path
-			 * where RCU functions differently.  Tracing normally
-			 * uses RCU, so we need to use the _rcuidle variant.
-			 *
-			 * (There is no good reason for this.  The idle code should
-			 *  be rearranged to call this before rcu_idle_enter().)
-			 */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
-		} else {
-			/* The new ASID is already up to date. */
-			load_new_mm_cr3(next->pgd, new_asid, false);
-
-			/* See above wrt _rcuidle. */
-			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
-		}
+	if (need_flush) {
+		this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
+		this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
+		load_new_mm_cr3(next->pgd, new_asid, true);
 
 		/*
-		 * Record last user mm's context id, so we can avoid
-		 * flushing branch buffer with IBPB if we switch back
-		 * to the same user.
+		 * NB: This gets called via leave_mm() in the idle path
+		 * where RCU functions differently.  Tracing normally
+		 * uses RCU, so we need to use the _rcuidle variant.
+		 *
+		 * (There is no good reason for this.  The idle code should
+		 *  be rearranged to call this before rcu_idle_enter().)
 		 */
-		if (next != &init_mm)
-			this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
+	} else {
+		/* The new ASID is already up to date. */
+		load_new_mm_cr3(next->pgd, new_asid, false);
 
-		this_cpu_write(cpu_tlbstate.loaded_mm, next);
-		this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+		/* See above wrt _rcuidle. */
+		trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
 	}
 
+	/*
+	 * Record last user mm's context id, so we can avoid
+	 * flushing branch buffer with IBPB if we switch back
+	 * to the same user.
+	 */
+	if (next != &init_mm)
+		this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id);
+
+	this_cpu_write(cpu_tlbstate.loaded_mm, next);
+	this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid);
+
 	load_mm_cr4(next);
 	switch_ldt(real_prev, next);
 }
@@ -454,6 +467,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * This should be rare, with native_flush_tlb_others skipping
+		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
@@ -560,6 +576,22 @@ static void flush_tlb_func_remote(void *info)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	cpumask_t *mask = (struct cpumask *)cpumask;
+	cpumask_var_t varmask;
+	bool can_lazy_flush = false;
+	unsigned int cpu;
+
+	/*
+	 * A temporary cpumask allows the kernel to skip sending IPIs
+	 * to CPUs in lazy TLB state, without also removing them from
+	 * mm_cpumask(mm).
+	 */
+	if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) {
+		cpumask_copy(varmask, cpumask);
+		mask = varmask;
+		can_lazy_flush = true;
+	}
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -583,17 +615,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
-		cpumask = uv_flush_tlb_others(cpumask, info);
-		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func_remote,
+		mask = uv_flush_tlb_others(mask, info);
+		if (mask)
+			smp_call_function_many(mask, flush_tlb_func_remote,
 					       (void *)info, 1);
-		return;
+		goto out;
+	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch, or at page table free time.
+	 */
+	if (can_lazy_flush) {
+		for_each_cpu(cpu, mask) {
+			if (per_cpu(cpu_tlbstate.is_lazy, cpu))
+				cpumask_clear_cpu(cpu, mask);
+		}
 	}
-	smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+	smp_call_function_many(mask, flush_tlb_func_remote,
 			       (void *)info, 1);
+ out:
+	free_cpumask_var(varmask);
 }
 
 /*
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index ca446da48fd2..84a4c6679da6 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 					  const struct flush_tlb_info *info)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (2 preceding siblings ...)
  2018-06-26 17:31 ` [PATCH 3/6] x86,tlb: make lazy TLB mode lazier Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  2018-06-26 20:16   ` [RFC PATCH] x86,tlb: mm_fill_lazy_tlb_cpu_mask() can be static kbuild test robot
  2018-06-26 20:16   ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs kbuild test robot
  2018-06-26 17:31 ` [PATCH 5/6] x86,mm: always use lazy TLB mode Rik van Riel
  2018-06-26 17:31 ` [PATCH 6/6] x86,switch_mm: skip atomic operations for init_mm Rik van Riel
  5 siblings, 2 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

CPUs in !is_lazy have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded %CR3 during that stage of the munmap.

Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, whose TLB contents might not be up to date yet.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/tlb.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 137a2c62c75b..03512772395f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -707,14 +707,46 @@ void tlb_flush_remove_tables_local(void *arg)
 	}
 }
 
+void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm, struct cpumask* lazy_cpus)
+{
+	int cpu;
+
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
+			cpumask_set_cpu(cpu, lazy_cpus);
+	}
+}
+
 void tlb_flush_remove_tables(struct mm_struct *mm)
 {
 	int cpu = get_cpu();
+	cpumask_var_t lazy_cpus;
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
+		return;
+
+	if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
+		/*
+		 * If the cpumask allocation fails, do a brute force flush
+		 * on all the CPUs that have this mm loaded.
+		 */
+		smp_call_function_many(mm_cpumask(mm),
+				tlb_flush_remove_tables_local, (void *)mm, 1);
+		return;
+	}
+
 	/*
-	 * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+	 * CPUs with !is_lazy either received a TLB flush IPI while the user
+	 * pages in this address range were unmapped, or have context switched
+	 * and reloaded %CR3 since then.
+	 *
+	 * Shootdown IPIs at page table freeing time only need to be sent to
+	 * CPUs that may have out of date TLB contents.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+	mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus);
+	smp_call_function_many(lazy_cpus,
+				tlb_flush_remove_tables_local, (void *)mm, 1);
+	free_cpumask_var(lazy_cpus);
 
 	put_cpu();
 }
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 5/6] x86,mm: always use lazy TLB mode
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (3 preceding siblings ...)
  2018-06-26 17:31 ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  2018-06-26 17:31 ` [PATCH 6/6] x86,switch_mm: skip atomic operations for init_mm Rik van Riel
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h | 16 ----------------
 arch/x86/mm/tlb.c               | 15 +--------------
 2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3aa3204b5dc0..511bf5fae8b8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
 #endif
 
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast.  If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely.  The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic.  The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03512772395f..96ab4eacda95 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -367,20 +367,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	if (tlb_defer_switch_to_init_mm()) {
-		/*
-		 * There's a significant optimization that may be possible
-		 * here.  We have accurate enough TLB flush tracking that we
-		 * don't need to maintain coherence of TLB per se when we're
-		 * lazy.  We do, however, need to maintain coherence of
-		 * paging-structure caches.  We could, in principle, leave our
-		 * old mm loaded and only switch to init_mm when
-		 * tlb_remove_page() happens.
-		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
-	} else {
-		switch_mm(NULL, &init_mm, NULL);
-	}
+	this_cpu_write(cpu_tlbstate.is_lazy, true);
 }
 
 /*
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 6/6] x86,switch_mm: skip atomic operations for init_mm
  2018-06-26 17:31 [PATCH v2 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (4 preceding siblings ...)
  2018-06-26 17:31 ` [PATCH 5/6] x86,mm: always use lazy TLB mode Rik van Riel
@ 2018-06-26 17:31 ` Rik van Riel
  5 siblings, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-26 17:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, Rik van Riel

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 1.8% of a 48 CPU system during a netperf to localhost run.
Digging into the profile, we noticed that cpumask_clear_cpu and
cpumask_set_cpu together take about half of the CPU time taken by
switch_mm_irqs_off.

However, the CPUs running netperf end up switching back and forth
between netperf and the idle task, which would not require changes
to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
the most heavily contended one in the system.

Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
(mostly the idle task) reduced CPU use of switch_mm_irqs_off
from 1.8% of the CPU to 0.9% of the CPU, with the following
netperf commandline:

./super_netperf 300 -P 0 -t TCP_RR -p 8888 -H <host> -l 30 \
     -- -r 300,300 -o -s 1M,1M -S 1M,1M

w/o patchset:

Throughput: 1.71264e+06

perf profile:

0.95%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off
0.77%  netserver        [kernel.vmlinux]          [k] switch_mm_irqs_off

w/ patchset:

Throughput: 1.74075e+06
0.87%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off

CPU use by enter_lazy_tlb is negligible. The bulk of the
savings from switch_mm_irqs_off seems to go towards higher
netperf throughput.

Signed-off-by: Rik van Riel <riel@surriel.com>
Reported-and-tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/tlb.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 96ab4eacda95..ab3992d82c40 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -300,12 +300,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		}
 
 		/* Stop remote flushes for the previous mm */
-		VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) &&
-				real_prev != &init_mm);
-		cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+		if (real_prev != &init_mm) {
+			VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu,
+					mm_cpumask(real_prev)));
+			cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
+		}
 
 		/* Start remote flushes. */
-		cpumask_set_cpu(cpu, mm_cpumask(next));
+		if (next != &init_mm)
+			cpumask_set_cpu(cpu, mm_cpumask(next));
 	}
 
 	/* Read the tlb_gen to check whether a flush is needed. */
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [RFC PATCH] x86,tlb: mm_fill_lazy_tlb_cpu_mask() can be static
  2018-06-26 17:31 ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
@ 2018-06-26 20:16   ` kbuild test robot
  2018-06-26 20:16   ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs kbuild test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2018-06-26 20:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, x86, luto, dave.hansen, mingo,
	kernel-team, tglx, efault, songliubraving, Rik van Riel


Fixes: 4d7614ce489c ("x86,tlb: only send page table free TLB flush to lazy TLB CPUs")
Signed-off-by: kbuild test robot <fengguang.wu@intel.com>
---
 tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03512772..e6d5b01 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -707,7 +707,7 @@ void tlb_flush_remove_tables_local(void *arg)
 	}
 }
 
-void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm, struct cpumask* lazy_cpus)
+static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm, struct cpumask* lazy_cpus)
 {
 	int cpu;
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs
  2018-06-26 17:31 ` [PATCH 4/6] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
  2018-06-26 20:16   ` [RFC PATCH] x86,tlb: mm_fill_lazy_tlb_cpu_mask() can be static kbuild test robot
@ 2018-06-26 20:16   ` kbuild test robot
  1 sibling, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2018-06-26 20:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, x86, luto, dave.hansen, mingo,
	kernel-team, tglx, efault, songliubraving, Rik van Riel

Hi Rik,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.18-rc2 next-20180626]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/mm-allocate-mm_cpumask-dynamically-based-on-nr_cpu_ids/20180627-021116
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   arch/x86/mm/tlb.c:39:6: sparse: symbol 'clear_asid_other' was not declared. Should it be static?
>> arch/x86/mm/tlb.c:710:6: sparse: symbol 'mm_fill_lazy_tlb_cpu_mask' was not declared. Should it be static?
   arch/x86/mm/tlb.c:833:15: sparse: expression using sizeof(void)

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/6] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-26 17:31 ` [PATCH 2/6] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
@ 2018-06-27  6:03   ` kbuild test robot
  0 siblings, 0 replies; 13+ messages in thread
From: kbuild test robot @ 2018-06-27  6:03 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, x86, luto, dave.hansen, mingo,
	kernel-team, tglx, efault, songliubraving, Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 3316 bytes --]

Hi Rik,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc2 next-20180626]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/mm-allocate-mm_cpumask-dynamically-based-on-nr_cpu_ids/20180627-021116
config: arm-keystone_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   mm/memory.c: In function 'tlb_remove_table_smp_sync':
>> mm/memory.c:339:2: error: implicit declaration of function 'tlb_flush_remove_tables_local'; did you mean 'tlb_remove_table'? [-Werror=implicit-function-declaration]
     tlb_flush_remove_tables_local(mm);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     tlb_remove_table
   mm/memory.c: In function 'tlb_table_flush':
>> mm/memory.c:372:2: error: implicit declaration of function 'tlb_flush_remove_tables'; did you mean 'tlb_remove_table'? [-Werror=implicit-function-declaration]
     tlb_flush_remove_tables(tlb->mm);
     ^~~~~~~~~~~~~~~~~~~~~~~
     tlb_remove_table
   cc1: some warnings being treated as errors

vim +339 mm/memory.c

   328	
   329	static void tlb_remove_table_smp_sync(void *arg)
   330	{
   331		struct mm_struct *mm = arg;
   332		/*
   333		 * On most architectures this does nothing. Simply delivering the
   334		 * interrupt is enough to prevent races with software page table
   335		 * walking like that done in get_user_pages_fast.
   336		 *
   337		 * See the comment near struct mmu_table_batch.
   338		 */
 > 339		tlb_flush_remove_tables_local(mm);
   340	}
   341	
   342	static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
   343	{
   344		/*
   345		 * This isn't an RCU grace period and hence the page-tables cannot be
   346		 * assumed to be actually RCU-freed.
   347		 *
   348		 * It is however sufficient for software page-table walkers that rely on
   349		 * IRQ disabling. See the comment near struct mmu_table_batch.
   350		 */
   351		smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
   352		__tlb_remove_table(table);
   353	}
   354	
   355	static void tlb_remove_table_rcu(struct rcu_head *head)
   356	{
   357		struct mmu_table_batch *batch;
   358		int i;
   359	
   360		batch = container_of(head, struct mmu_table_batch, rcu);
   361	
   362		for (i = 0; i < batch->nr; i++)
   363			__tlb_remove_table(batch->tables[i]);
   364	
   365		free_page((unsigned long)batch);
   366	}
   367	
   368	void tlb_table_flush(struct mmu_gather *tlb)
   369	{
   370		struct mmu_table_batch **batch = &tlb->batch;
   371	
 > 372		tlb_flush_remove_tables(tlb->mm);
   373	
   374		if (*batch) {
   375			call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
   376			*batch = NULL;
   377		}
   378	}
   379	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21469 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/6] x86,tlb: make lazy TLB mode lazier
  2018-06-26 17:31 ` [PATCH 3/6] x86,tlb: make lazy TLB mode lazier Rik van Riel
@ 2018-06-27 18:10   ` Andy Lutomirski
  2018-06-27 18:17     ` Rik van Riel
  2018-06-28 20:05     ` Rik van Riel
  0 siblings, 2 replies; 13+ messages in thread
From: Andy Lutomirski @ 2018-06-27 18:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, X86 ML, Andrew Lutomirski, Dave Hansen, Ingo Molnar,
	kernel-team, Thomas Gleixner, Mike Galbraith, songliubraving

On Tue, Jun 26, 2018 at 10:31 AM Rik van Riel <riel@surriel.com> wrote:
>
> Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
> when all it really needs to do is reload %CR3 at the next context switch,
> assuming no page table pages got freed.
>
> Memory ordering is used to prevent race conditions between switch_mm_irqs_off,
> which checks whether .tlb_gen changed, and the TLB invalidation code, which
> increments .tlb_gen whenever page table entries get invalidated.
>
> The atomic increment in inc_mm_tlb_gen is its own barrier; the context
> switch code adds an explicit barrier between reading tlbstate.is_lazy and
> next->context.tlb_gen.
>
> Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set
> are not removed from the mm_cpumask(mm), since that would prevent the TLB
> flush IPIs at page table free time from being sent to all the CPUs
> that need them.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/include/asm/uv/uv.h  |   6 +-
>  arch/x86/mm/tlb.c             | 131 ++++++++++++++++++++++++++++--------------
>  arch/x86/platform/uv/tlb_uv.c |   2 +-
>  3 files changed, 92 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
> index a80c0673798f..d801afb5fe90 100644
> --- a/arch/x86/include/asm/uv/uv.h
> +++ b/arch/x86/include/asm/uv/uv.h
> @@ -17,7 +17,7 @@ extern int is_uv_hubless(void);
>  extern void uv_cpu_init(void);
>  extern void uv_nmi_init(void);
>  extern void uv_system_init(void);
> -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
> +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
>                                                  const struct flush_tlb_info *info);
>
>  #else  /* X86_UV */
> @@ -27,8 +27,8 @@ static inline int is_uv_system(void)  { return 0; }
>  static inline int is_uv_hubless(void)  { return 0; }
>  static inline void uv_cpu_init(void)   { }
>  static inline void uv_system_init(void)        { }
> -static inline const struct cpumask *
> -uv_flush_tlb_others(const struct cpumask *cpumask,
> +static inline struct cpumask *
> +uv_flush_tlb_others(struct cpumask *cpumask,
>                     const struct flush_tlb_info *info)
>  { return cpumask; }
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 9a893673c56b..137a2c62c75b 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/cpu.h>
>  #include <linux/debugfs.h>
> +#include <linux/gfp.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -185,8 +186,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  {
>         struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>         u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> +       bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
>         unsigned cpu = smp_processor_id();
>         u64 next_tlb_gen;
> +       bool need_flush;
> +       u16 new_asid;
>
>         /*
>          * NB: The scheduler will call us with prev == next when switching
> @@ -250,10 +254,20 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                                  !cpumask_test_cpu(cpu, mm_cpumask(next))))
>                         cpumask_set_cpu(cpu, mm_cpumask(next));
>
> -               return;

You left this comment:

                /*
                 * We don't currently support having a real mm loaded without
                 * our cpu set in mm_cpumask().  We have all the bookkeeping
                 * in place to figure out whether we would need to flush
                 * if our cpu were cleared in mm_cpumask(), but we don't
                 * currently use it.
                 */

Presumably you should either clear the cpu from mm_cpumask when lazy
or you shoudl update the comment.

> +               /*
> +                * Switching straight from one thread in a process to another
> +                * thread in the same process requires no TLB flush at all.
> +                */
> +               if (!was_lazy)
> +                       return;

Comment doesn't match code.  Maybe add "... if we weren't lazy"?

> +
> +               /*
> +                * The code below checks whether there was a TLB flush while
> +                * this CPU was in lazy TLB mode. The barrier ensures ordering
> +                * with the TLB invalidation code advancing .tlb_gen.
> +                */
> +               smp_rmb();

I think it may need to be smp_mb().  You're trying to order
this_cpu_write() against subsequent reads.

In general, the changes to this function are very hard to review
because you're mixing semantic changes and restructuring the function.
Is there any way you could avoid that?  Or maybe just open-code a
tlb_gen check in the unlazying path?


> +       /*
> +        * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +        * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +        * at the next context switch, or at page table free time.
> +        */

Stale comment?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/6] x86,tlb: make lazy TLB mode lazier
  2018-06-27 18:10   ` Andy Lutomirski
@ 2018-06-27 18:17     ` Rik van Riel
  2018-06-28 20:05     ` Rik van Riel
  1 sibling, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-27 18:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Dave Hansen, Ingo Molnar, kernel-team,
	Thomas Gleixner, Mike Galbraith, songliubraving

[-- Attachment #1: Type: text/plain, Size: 633 bytes --]

On Wed, 2018-06-27 at 11:10 -0700, Andy Lutomirski wrote:
> On Tue, Jun 26, 2018 at 10:31 AM Rik van Riel <riel@surriel.com>
> wrote:

> In general, the changes to this function are very hard to review
> because you're mixing semantic changes and restructuring the
> function.
> Is there any way you could avoid that?  Or maybe just open-code a
> tlb_gen check in the unlazying path?

Let me re-do v2 of the patch series in a way that
splits out the restructuring from the semantic
changes.

I already have code here that open-codes the tlb_gen
check in the unlazying path (not in v2 yet).

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/6] x86,tlb: make lazy TLB mode lazier
  2018-06-27 18:10   ` Andy Lutomirski
  2018-06-27 18:17     ` Rik van Riel
@ 2018-06-28 20:05     ` Rik van Riel
  1 sibling, 0 replies; 13+ messages in thread
From: Rik van Riel @ 2018-06-28 20:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Dave Hansen, Ingo Molnar, kernel-team,
	Thomas Gleixner, Mike Galbraith, songliubraving

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

On Wed, 2018-06-27 at 11:10 -0700, Andy Lutomirski wrote:
> 
> You left this comment:
> 
>                 /*
>                  * We don't currently support having a real mm loaded
> without
>                  * our cpu set in mm_cpumask().  We have all the
> bookkeeping
>                  * in place to figure out whether we would need to
> flush
>                  * if our cpu were cleared in mm_cpumask(), but we
> don't
>                  * currently use it.
>                  */
> 
> Presumably you should either clear the cpu from mm_cpumask when lazy
> or you shoudl update the comment.

The lazy TLB mode leaves the mm loaded, AND the
cpu set in mm_cpumask(). However, I guess while
the comment is technically accurate, it is no longer
relevant, so I will update it :)

> > +               /*
> > +                * Switching straight from one thread in a process
> > to another
> > +                * thread in the same process requires no TLB flush
> > at all.
> > +                */
> > +               if (!was_lazy)
> > +                       return;
> 
> Comment doesn't match code.  Maybe add "... if we weren't lazy"?

Done.

> > +
> > +               /*
> > +                * The code below checks whether there was a TLB
> > flush while
> > +                * this CPU was in lazy TLB mode. The barrier
> > ensures ordering
> > +                * with the TLB invalidation code advancing
> > .tlb_gen.
> > +                */
> > +               smp_rmb();
> 
> I think it may need to be smp_mb().  You're trying to order
> this_cpu_write() against subsequent reads.

I have updated the barrier to an smp_mb().

> In general, the changes to this function are very hard to review
> because you're mixing semantic changes and restructuring the
> function.
> Is there any way you could avoid that?  Or maybe just open-code a
> tlb_gen check in the unlazying path?
> 
> 
> > +       /*
> > +        * Instead of sending IPIs to CPUs in lazy TLB mode, move
> > that
> > +        * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be
> > flushed
> > +        * at the next context switch, or at page table free time.
> > +        */
> 
> Stale comment?

Will fix.

I am running some last tests now, and will send
v3 soon.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

