* [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier
@ 2018-06-20 19:56 Rik van Riel
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
                   ` (6 more replies)
  0 siblings, 7 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving, kernel-team

Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
kernels, using 1.9% of a 48 CPU system during a netperf run. Digging
into the profile, the atomic operations in cpumask_clear_cpu and
cpumask_set_cpu are responsible for about half of that CPU use.

However, the CPUs running netperf are simply switching back and
forth between netperf and the idle task, which would not require any
changes to the mm_cpumask if lazy TLB mode were used.

Additionally, the init_mm cpumask ends up being the most heavily
contended one in the system, for no reason at all.

Making really lazy TLB mode work again on modern kernels, by sending
a shootdown IPI only when page table pages are being unmapped, gets
us back some of that performance.
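
To make that concrete, here is a minimal user-space sketch of the
switching pattern these patches aim for (this is not kernel code; the
struct and function names are invented for illustration): going idle
only marks the CPU lazy, and coming back to the same mm needs neither
a CR3 reload nor an atomic update of a shared mm_cpumask.

#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
	int loaded_mm;		/* id of the mm whose page tables are loaded */
	bool lazy;		/* true while running the idle task */
};

static void switch_to_idle(struct cpu_state *cpu)
{
	/* No cpumask_clear_cpu(), no CR3 write: just mark the CPU lazy. */
	cpu->lazy = true;
}

static void switch_to_task(struct cpu_state *cpu, int mm)
{
	if (cpu->lazy && cpu->loaded_mm == mm) {
		/* Same mm as before going idle: nothing to reload or flush. */
		cpu->lazy = false;
		return;
	}
	/* Different mm: this is where a real CR3 switch would happen. */
	cpu->loaded_mm = mm;
	cpu->lazy = false;
}

int main(void)
{
	struct cpu_state cpu = { .loaded_mm = 1, .lazy = false };

	switch_to_idle(&cpu);		/* netperf -> idle */
	switch_to_task(&cpu, 1);	/* idle -> netperf, no mask update */
	printf("loaded_mm=%d lazy=%d\n", cpu.loaded_mm, cpu.lazy);
	return 0;
}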


Using a memcache style workload on Broadwell systems, these patches
result in about a 0.5% reduction of CPU use on the system. Numbers
on Haswell are inconclusive so far.


Song's netperf performance results:

w/o patchset:

Throughput: 1.74716e+06
perf profile:
+    0.95%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off
+    0.82%  netserver        [kernel.vmlinux]          [k] switch_mm_irqs_off

w/ patchset:

Throughput: 1.76911e+06
perf profile:
+    0.81%  swapper          [kernel.vmlinux]          [k] switch_mm_irqs_off

With these patches, netserver no longer calls switch_mm_irqs_off,
and the CPU use of enter_lazy_tlb was below the 0.05% threshold of
statistics gathered by Song's scripts.


I am still working on a patch to also get rid of the continuous
pounding on mm->count during lazy TLB entry and exit, when the same
mm_struct is being used all the time. I do not have that working yet.

Until then, these patches provide a nice performance boost, as well
as a little memory savings by shrinking the size of mm_struct.




* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-20 21:32   ` kbuild test robot
                     ` (2 more replies)
  2018-06-20 19:56 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, by having one bitmask at the end of the
mm_struct for the mm_cpumask.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
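
As a rough user-space illustration of that sizing scheme (not the
kernel code itself; all names here are made up), a structure with a
trailing, runtime-sized bitmap can be laid out and allocated like this:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG		(CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct fake_mm {
	long other_fields;		/* stand-in for the rest of mm_struct */
	unsigned long cpu_bitmap[];	/* runtime-sized, must stay last */
};

int main(void)
{
	unsigned int nr_cpu_ids = 48;	/* detected at boot in the kernel */
	size_t mask_bytes = BITS_TO_LONGS(nr_cpu_ids) * sizeof(unsigned long);
	size_t mm_size = sizeof(struct fake_mm) + mask_bytes;
	struct fake_mm *mm;

	mm = malloc(mm_size);
	if (!mm)
		return 1;
	memset(mm->cpu_bitmap, 0, mask_bytes);	/* like mm_init_cpumask() */

	printf("allocated %zu bytes instead of sizing the mask for NR_CPUS\n",
	       mm_size);
	free(mm);
	return 0;
}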

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm_types.h | 18 ++++++++----------
 kernel/fork.c            | 14 ++++++++------
 mm/init-mm.c             | 10 ++++++++++
 3 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..8e91632958f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -427,8 +427,6 @@ struct mm_struct {
 
 	struct linux_binfmt *binfmt;
 
-	cpumask_var_t cpu_vm_mask_var;
-
 	/* Architecture-specific MM context */
 	mm_context_t context;
 
@@ -465,9 +463,6 @@ struct mm_struct {
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
-#endif
 #ifdef CONFIG_NUMA_BALANCING
 	/*
 	 * numa_next_scan is the next time that the PTEs will be marked
@@ -502,22 +497,25 @@ struct mm_struct {
 	/* HMM needs to track a few things per mm */
 	struct hmm *hmm;
 #endif
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
 } __randomize_layout;
 
 extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..c6a20bc78102 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2242,6 +2242,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2258,15 +2260,15 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on nr_cpu_ids.
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d15ebc0..20fe222fe4c0 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,15 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on nr_cpu_ids. That
+ * way we allocate only as much memory for mm_cpumask() as needed for the
+ * hundreds, or thousands of processes that a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -24,5 +33,6 @@ struct mm_struct init_mm = {
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4



* [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-21  0:23   ` Rik van Riel
  2018-06-22 14:58   ` Andy Lutomirski
  2018-06-20 19:56 ` [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state Rik van Riel
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.

The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.

The second should result in fewer IPIs, since operations like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.
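
As a rough model of that difference (user-space C with invented names,
not the actual kernel paths): frequent flushes that leave the page
tables in place never have to interrupt a lazy CPU, while the
comparatively rare page table freeing does.

#include <stdbool.h>
#include <stdio.h>

/* One remote CPU sitting in lazy TLB mode with this mm still loaded. */
struct lazy_cpu {
	bool has_mm_loaded;
	int ipis_received;
};

/* mprotect/madvise style flush: user pages change, page tables stay. */
static void flush_user_pages(struct lazy_cpu *cpu)
{
	/*
	 * The lazy CPU may keep stale user TLB entries; it will flush or
	 * reload CR3 before running user code again, so no IPI is needed.
	 */
	(void)cpu;
}

/* munmap style teardown: page table pages are about to be freed. */
static void free_page_tables(struct lazy_cpu *cpu)
{
	if (cpu->has_mm_loaded) {
		/*
		 * Paging-structure caches could speculatively walk the
		 * about-to-be-freed pages, so this case must send an IPI.
		 */
		cpu->ipis_received++;
		cpu->has_mm_loaded = false;	/* the CPU moves to init_mm */
	}
}

int main(void)
{
	struct lazy_cpu cpu = { .has_mm_loaded = true, .ipis_received = 0 };
	int i;

	for (i = 0; i < 1000; i++)
		flush_user_pages(&cpu);		/* frequent, zero IPIs */
	free_page_tables(&cpu);			/* rare, one IPI */

	printf("IPIs received: %d\n", cpu.ipis_received);
	return 0;
}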

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h |  5 +++++
 arch/x86/mm/tlb.c               | 24 ++++++++++++++++++++++++
 include/asm-generic/tlb.h       | 10 ++++++++++
 mm/memory.c                     | 22 ++++++++++++++--------
 4 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6690cd3fc8b1..3aa3204b5dc0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 	native_flush_tlb_others(mask, info)
 #endif
 
+extern void tlb_flush_remove_tables(struct mm_struct *mm);
+extern void tlb_flush_remove_tables_local(void *arg);
+
+#define HAVE_TLB_FLUSH_REMOVE_TABLES
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e055d1a06699..61773b07ed54 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -646,6 +646,30 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	put_cpu();
 }
 
+void tlb_flush_remove_tables_local(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
+			this_cpu_read(cpu_tlbstate.is_lazy))
+		/*
+		 * We're in lazy mode.  We need to at least flush our
+		 * paging-structure cache to avoid speculatively reading
+		 * garbage into our TLB.  Since switching to init_mm is barely
+		 * slower than a minimal flush, just switch to init_mm.
+		 */
+		switch_mm_irqs_off(NULL, &init_mm, NULL);
+}
+
+void tlb_flush_remove_tables(struct mm_struct *mm)
+{
+	int cpu = get_cpu();
+	/*
+	 * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+	 */
+	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
+		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+}
 
 static void do_flush_tlb_all(void *info)
 {
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index faddde44de8c..f0b462ddfa15 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -295,4 +295,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 
 #define tlb_migrate_finish(mm) do {} while (0)
 
+/*
+ * Used to flush the TLB when page tables are removed, when lazy
+ * TLB mode may cause a CPU to retain intermediate translations
+ * pointing to about-to-be-freed page table memory.
+ */
+#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES
+#define tlb_flush_remove_tables(mm) do {} while (0)
+#define tlb_flush_remove_tables_local(mm) do {} while (0)
+#endif
+
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464e0fd2..77a908f88cb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -326,16 +326,20 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
 
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 
-/*
- * See the comment near struct mmu_table_batch.
- */
-
 static void tlb_remove_table_smp_sync(void *arg)
 {
-	/* Simply deliver the interrupt */
+	struct mm_struct *mm = arg;
+	/*
+	 * On most architectures this does nothing. Simply delivering the
+	 * interrupt is enough to prevent races with software page table
+	 * walking like that done in get_user_pages_fast.
+	 *
+	 * See the comment near struct mmu_table_batch.
+	 */
+	tlb_flush_remove_tables_local(mm);
 }
 
-static void tlb_remove_table_one(void *table)
+static void tlb_remove_table_one(void *table, struct mmu_gather *tlb)
 {
 	/*
 	 * This isn't an RCU grace period and hence the page-tables cannot be
@@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table)
 	 * It is however sufficient for software page-table walkers that rely on
 	 * IRQ disabling. See the comment near struct mmu_table_batch.
 	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1);
 	__tlb_remove_table(table);
 }
 
@@ -365,6 +369,8 @@ void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
+	tlb_flush_remove_tables(tlb->mm);
+
 	if (*batch) {
 		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
 		*batch = NULL;
@@ -387,7 +393,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 	if (*batch == NULL) {
 		*batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
 		if (*batch == NULL) {
-			tlb_remove_table_one(table);
+			tlb_remove_table_one(table, tlb);
 			return;
 		}
 		(*batch)->nr = 0;
-- 
2.14.4



* [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
  2018-06-20 19:56 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-22 17:01   ` Dave Hansen
  2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

Replace the tlbstate.is_lazy boolean with a tlbstate.state int,
so more states can be expressed. This is a preparation for the
next patch.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h | 11 +++++++----
 arch/x86/mm/tlb.c               | 10 +++++-----
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3aa3204b5dc0..88a4d6b87ff7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -169,6 +169,9 @@ struct tlb_context {
 	u64 tlb_gen;
 };
 
+#define TLBSTATE_OK	0
+#define TLBSTATE_LAZY	1
+
 struct tlb_state {
 	/*
 	 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -186,18 +189,18 @@ struct tlb_state {
 	 * We can be in one of several states:
 	 *
 	 *  - Actively using an mm.  Our CPU's bit will be set in
-	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *    mm_cpumask(loaded_mm) and state == TLBSTATE_OK
 	 *
 	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *    will not be set in mm_cpumask(&init_mm) and state == TLBSTATE_OK
 	 *
 	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    is set in mm_cpumask(loaded_mm), but state == TLBSTATE_LAZY.
 	 *    We're heuristically guessing that the CR3 load we
 	 *    skipped more than makes up for the overhead added by
 	 *    lazy mode.
 	 */
-	bool is_lazy;
+	int state;
 
 	/*
 	 * If set we changed the page tables in such a way that we
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 61773b07ed54..e063e623e52c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -136,7 +136,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+	WARN_ON((this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -227,7 +227,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.is_lazy, false);
+	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
@@ -364,7 +364,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		 * old mm loaded and only switch to init_mm when
 		 * tlb_remove_page() happens.
 		 */
-		this_cpu_write(cpu_tlbstate.is_lazy, true);
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 	} else {
 		switch_mm(NULL, &init_mm, NULL);
 	}
@@ -448,7 +448,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+	if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
@@ -651,7 +651,7 @@ void tlb_flush_remove_tables_local(void *arg)
 	struct mm_struct *mm = arg;
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
-			this_cpu_read(cpu_tlbstate.is_lazy))
+			this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK)
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
-- 
2.14.4



* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (2 preceding siblings ...)
  2018-06-20 19:56 ` [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-22 15:04   ` Andy Lutomirski
  2018-06-22 17:05   ` Dave Hansen
  2018-06-20 19:56 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
when all it really needs to do is reload %CR3 at the next context switch,
assuming no page table pages got freed.

This patch deals with that issue by introducing a third TLB state,
TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context
switch.

Atomic compare and exchange is used to close races between the TLB
shootdown code and the context switch code. Keying off just the
tlb_gen is likely to not be enough, since that would not give
lazy_tlb_can_skip_flush() information on when it is facing a race
and has to send the IPI to a CPU in the middle of a LAZY -> OK switch.

Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are not
removed from the mm_cpumask(mm), since that would prevent the TLB
flush IPIs at page table free time from being sent to all the CPUs
that need them.
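
Below is a user-space sketch of the two sides of that race, using C11
atomics. The names mirror the patch, but this is only a model under the
assumption that the transitions run one after the other on a single
thread; in the kernel the two sides run on different CPUs.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { TLBSTATE_OK, TLBSTATE_LAZY, TLBSTATE_FLUSH };

static _Atomic int tlbstate = TLBSTATE_LAZY;

/* Flusher side: may the IPI to this lazy CPU be skipped? */
static bool lazy_tlb_can_skip_flush(void)
{
	int old = TLBSTATE_LAZY;

	/*
	 * Try LAZY -> FLUSH.  If the CPU already went back to TLBSTATE_OK
	 * we lost the race and the IPI must be sent after all.
	 */
	atomic_compare_exchange_strong(&tlbstate, &old, TLBSTATE_FLUSH);
	return old != TLBSTATE_OK;
}

/* Context switch side: lazy CPU switching back to the same mm. */
static void wake_from_lazy(void)
{
	int old = TLBSTATE_LAZY;

	/* Try LAZY -> OK; seeing FLUSH here means a flush was deferred. */
	atomic_compare_exchange_strong(&tlbstate, &old, TLBSTATE_OK);
	if (old == TLBSTATE_FLUSH) {
		printf("deferred flush: reload CR3 before running the task\n");
		atomic_store(&tlbstate, TLBSTATE_OK);
	}
}

int main(void)
{
	printf("flusher may skip IPI: %d\n", lazy_tlb_can_skip_flush());
	wake_from_lazy();
	return 0;
}

The single compare and exchange is what makes "skip the IPI" on one
side and "reuse the old TLB contents without a flush" on the other
side mutually exclusive.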

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h |   5 ++
 arch/x86/include/asm/uv/uv.h    |   6 +--
 arch/x86/mm/tlb.c               | 108 ++++++++++++++++++++++++++++++++++++----
 arch/x86/platform/uv/tlb_uv.c   |   2 +-
 4 files changed, 107 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 88a4d6b87ff7..0a8bdd6084de 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -171,6 +171,7 @@ struct tlb_context {
 
 #define TLBSTATE_OK	0
 #define TLBSTATE_LAZY	1
+#define TLBSTATE_FLUSH	2
 
 struct tlb_state {
 	/*
@@ -199,6 +200,10 @@ struct tlb_state {
 	 *    We're heuristically guessing that the CR3 load we
 	 *    skipped more than makes up for the overhead added by
 	 *    lazy mode.
+	 *
+	 *  - Lazily using a real mm, which has seen a TLB invalidation on
+	 *    other CPUs. TLB needs to be flushed at context switch time,
+	 *    state == TLBSTATE_FLUSH.
 	 */
 	int state;
 
diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
index a80c0673798f..d801afb5fe90 100644
--- a/arch/x86/include/asm/uv/uv.h
+++ b/arch/x86/include/asm/uv/uv.h
@@ -17,7 +17,7 @@ extern int is_uv_hubless(void);
 extern void uv_cpu_init(void);
 extern void uv_nmi_init(void);
 extern void uv_system_init(void);
-extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 						 const struct flush_tlb_info *info);
 
 #else	/* X86_UV */
@@ -27,8 +27,8 @@ static inline int is_uv_system(void)	{ return 0; }
 static inline int is_uv_hubless(void)	{ return 0; }
 static inline void uv_cpu_init(void)	{ }
 static inline void uv_system_init(void)	{ }
-static inline const struct cpumask *
-uv_flush_tlb_others(const struct cpumask *cpumask,
+static inline struct cpumask *
+uv_flush_tlb_others(struct cpumask *cpumask,
 		    const struct flush_tlb_info *info)
 { return cpumask; }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e063e623e52c..4b27d8469848 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -7,6 +7,7 @@
 #include <linux/export.h>
 #include <linux/cpu.h>
 #include <linux/debugfs.h>
+#include <linux/gfp.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -227,8 +228,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
-
 	/*
 	 * The membarrier system call requires a full memory barrier and
 	 * core serialization before returning to user-space, after
@@ -236,8 +235,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * memory barrier and core serializing instruction.
 	 */
 	if (real_prev == next) {
+		int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
+		int oldstate = *tlbstate;
+
 		VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
-			   next->context.ctx_id);
+				next->context.ctx_id);
 
 		/*
 		 * We don't currently support having a real mm loaded without
@@ -250,6 +252,26 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 				 !cpumask_test_cpu(cpu, mm_cpumask(next))))
 			cpumask_set_cpu(cpu, mm_cpumask(next));
 
+		/*
+		 * In lazy TLB mode. The atomic exchange makes sure there was
+		 * no TLB shootdown all the way to this context switch.
+		 */
+		if (oldstate == TLBSTATE_LAZY)
+			oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
+
+		/*
+		 * There was a TLB invalidation while this CPU was idle.
+		 * The code in choose_new_asid will figure out what kind
+		 * of flush is needed.
+		 */
+		if (oldstate == TLBSTATE_FLUSH)
+			goto reload_cr3;
+
+		/*
+		 * Nothing to do. Either we are switching between two
+		 * threads in the same process, or we were in TLBSTATE_LAZY
+		 * without any TLB invalidations happening.
+		 */
 		return;
 	} else {
 		u16 new_asid;
@@ -294,6 +316,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * Start remote flushes and then read tlb_gen.
 		 */
 		cpumask_set_cpu(cpu, mm_cpumask(next));
+ reload_cr3:
+		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
 		next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 
 		choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
@@ -454,6 +478,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * paging-structure cache to avoid speculatively reading
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
+		 *
+		 * This should be rare, with native_flush_tlb_others skipping
+		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
 		return;
@@ -557,9 +584,57 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
+/*
+ * Determine whether a CPU's TLB needs to be flushed now, or whether the
+ * flush can be delayed until the next context switch, by changing the
+ * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
+ */
+static bool lazy_tlb_can_skip_flush(int cpu)
+{
+	int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
+	int old;
+
+	switch (*tlbstate) {
+	case TLBSTATE_FLUSH:
+		/* The TLB will be flushed on the next context switch. */
+		return true;
+	case TLBSTATE_LAZY:
+		/*
+		 * The CPU is in TLBSTATE_LAZY, which could context switch back
+		 * to TLBSTATE_OK, re-using the old TLB state without a flush.
+		 * If that happened, send a TLB flush IPI.
+		 *
+		 * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
+		 * be flushed at the next context switch. Skip the IPI.
+		 */
+		old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
+		return old != TLBSTATE_OK;
+	case TLBSTATE_OK:
+	default:
+		/* A task on the CPU is actively using the mm. Flush the TLB. */
+		return false;
+	}
+}
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info)
 {
+	cpumask_t *mask = (struct cpumask *)cpumask;
+	cpumask_var_t varmask;
+	bool can_lazy_flush = false;
+	unsigned int cpu;
+
+	/*
+	 * A temporary cpumask allows the kernel to skip sending IPIs
+	 * to CPUs in lazy TLB state, without also removing them from
+	 * mm_cpumask(mm).
+	 */
+	if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) {
+		cpumask_copy(varmask, cpumask);
+		mask = varmask;
+		can_lazy_flush = true;
+	}
+
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -583,17 +658,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
-		unsigned int cpu;
-
 		cpu = smp_processor_id();
-		cpumask = uv_flush_tlb_others(cpumask, info);
-		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func_remote,
+		mask = uv_flush_tlb_others(mask, info);
+		if (mask)
+			smp_call_function_many(mask, flush_tlb_func_remote,
 					       (void *)info, 1);
-		return;
+		goto out;
+	}
+
+	/*
+	 * Instead of sending IPIs to CPUs in lazy TLB mode, move that
+	 * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
+	 * at the next context switch, or at page table free time.
+	 */
+	if (can_lazy_flush) {
+		for_each_cpu(cpu, mask) {
+			if (lazy_tlb_can_skip_flush(cpu))
+				cpumask_clear_cpu(cpu, mask);
+		}
 	}
-	smp_call_function_many(cpumask, flush_tlb_func_remote,
+
+	smp_call_function_many(mask, flush_tlb_func_remote,
 			       (void *)info, 1);
+ out:
+	free_cpumask_var(varmask);
 }
 
 /*
diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
index b36caae0fb2f..42730acec8b7 100644
--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
  * Returns pointer to cpumask if some remote flushing remains to be
  * done.  The returned pointer is valid till preemption is re-enabled.
  */
-const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
+struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
 					  const struct flush_tlb_info *info)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.14.4



* [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (3 preceding siblings ...)
  2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-22 17:23   ` Dave Hansen
  2018-06-20 19:56 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel
  2018-06-20 19:56 ` [PATCH 7/7] x86,idle: do not leave mm in idle state Rik van Riel
  6 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

CPUs in TLBSTATE_OK have either received TLB flush IPIs earlier on during
the munmap (when the user memory was unmapped), or have context switched
and reloaded %CR3 during that stage of the munmap.

Page table free TLB flushes only need to be sent to CPUs in lazy TLB
mode, whose TLB contents might not be up to date yet.
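
A rough user-space sketch of that filtering (invented data and array
names, not the kernel cpumask API): starting from the CPUs that have
this mm in mm_cpumask(), only the ones not in TLBSTATE_OK remain as
IPI targets.

#include <stdio.h>

enum { TLBSTATE_OK, TLBSTATE_LAZY, TLBSTATE_FLUSH };

#define NR_FAKE_CPUS 8

int main(void)
{
	/* Which CPUs have this mm set in mm_cpumask(), and their TLB state. */
	int in_mm_cpumask[NR_FAKE_CPUS] = { 1, 1, 1, 0, 1, 0, 1, 0 };
	int tlbstate[NR_FAKE_CPUS] = {
		TLBSTATE_OK, TLBSTATE_LAZY, TLBSTATE_OK, TLBSTATE_OK,
		TLBSTATE_FLUSH, TLBSTATE_OK, TLBSTATE_LAZY, TLBSTATE_OK,
	};
	int cpu;

	printf("page table free IPI targets:");
	for (cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		/*
		 * TLBSTATE_OK CPUs already flushed during the unmap, or will
		 * reload CR3 before touching user memory again; only the
		 * lazy ones still need the page table free IPI.
		 */
		if (in_mm_cpumask[cpu] && tlbstate[cpu] != TLBSTATE_OK)
			printf(" %d", cpu);
	}
	printf("\n");
	return 0;
}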

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/tlb.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4b27d8469848..40b00055c883 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -752,11 +752,31 @@ void tlb_flush_remove_tables_local(void *arg)
 void tlb_flush_remove_tables(struct mm_struct *mm)
 {
 	int cpu = get_cpu();
+	cpumask_var_t varmask;
+
+	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
+		return;
+
+	if (!zalloc_cpumask_var(&varmask, GFP_ATOMIC)) {
+		/* Flush the TLB on all CPUs that have this mm loaded. */
+		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+	}
+
 	/*
-	 * XXX: this really only needs to be called for CPUs in lazy TLB mode.
+	 * CPUs in TLBSTATE_OK either received a TLB flush IPI while the user
+	 * pages in this address range were unmapped, or have context switched
+	 * and reloaded %CR3 since then.
+	 *
+	 * Shootdown IPIs at page table freeing time only need to be sent to
+	 * CPUs that may have out of date TLB contents.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
+	for_each_cpu(cpu, mm_cpumask(mm)) {
+		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK)
+			cpumask_set_cpu(cpu, varmask);
+	}
+
+	smp_call_function_many(varmask, tlb_flush_remove_tables_local, (void *)mm, 1);
+	free_cpumask_var(varmask);
 }
 
 static void do_flush_tlb_all(void *info)
-- 
2.14.4



* [PATCH 6/7] x86,mm: always use lazy TLB mode
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (4 preceding siblings ...)
  2018-06-20 19:56 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-20 19:56 ` [PATCH 7/7] x86,idle: do not leave mm in idle state Rik van Riel
  6 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/tlbflush.h | 16 ----------------
 arch/x86/mm/tlb.c               | 15 +--------------
 2 files changed, 1 insertion(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0a8bdd6084de..d59c247afc14 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
 #endif
 
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast.  If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely.  The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic.  The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 40b00055c883..26b3aeef6266 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -378,20 +378,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	if (tlb_defer_switch_to_init_mm()) {
-		/*
-		 * There's a significant optimization that may be possible
-		 * here.  We have accurate enough TLB flush tracking that we
-		 * don't need to maintain coherence of TLB per se when we're
-		 * lazy.  We do, however, need to maintain coherence of
-		 * paging-structure caches.  We could, in principle, leave our
-		 * old mm loaded and only switch to init_mm when
-		 * tlb_remove_page() happens.
-		 */
-		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
-	} else {
-		switch_mm(NULL, &init_mm, NULL);
-	}
+	this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 }
 
 /*
-- 
2.14.4



* [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
                   ` (5 preceding siblings ...)
  2018-06-20 19:56 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel
@ 2018-06-20 19:56 ` Rik van Riel
  2018-06-20 22:20   ` kbuild test robot
  2018-06-22 15:36   ` Andy Lutomirski
  6 siblings, 2 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving,
	kernel-team, Rik van Riel

Do not call leave_mm when going into a cstate. Now that mprotect and
madvise no longer send IPIs for TLB shootdowns to idle CPUs, there is
no real reason to disable lazy TLB mode in idle states.

This seems to help performance on Broadwell systems. Haswell performance
numbers are inconclusive.

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/include/asm/mmu.h |  2 --
 arch/x86/mm/tlb.c          | 26 +-------------------------
 drivers/idle/intel_idle.c  |  7 -------
 3 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5ff3e8af2c20..096ee9340685 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -61,6 +61,4 @@ typedef struct {
 		.ctx_id = 1,						\
 	}
 
-void leave_mm(int cpu);
-
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 26b3aeef6266..b861084e59d1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -121,28 +121,6 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
 	write_cr3(new_mm_cr3);
 }
 
-void leave_mm(int cpu)
-{
-	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
-
-	/*
-	 * It's plausible that we're in lazy TLB mode while our mm is init_mm.
-	 * If so, our callers still expect us to flush the TLB, but there
-	 * aren't any user TLB entries in init_mm to worry about.
-	 *
-	 * This needs to happen before any other sanity checks due to
-	 * intel_idle's shenanigans.
-	 */
-	if (loaded_mm == &init_mm)
-		return;
-
-	/* Warn if we're not lazy. */
-	WARN_ON((this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK));
-
-	switch_mm(NULL, &init_mm, NULL);
-}
-EXPORT_SYMBOL_GPL(leave_mm);
-
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 	       struct task_struct *tsk)
 {
@@ -194,8 +172,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * from lazy TLB mode to normal mode if active_mm isn't changing.
 	 * When this happens, we don't assume that CR3 (and hence
 	 * cpu_tlbstate.loaded_mm) matches next.
-	 *
-	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 	 */
 
 	/* We don't want flush_tlb_func_* to run concurrently with us. */
@@ -205,7 +181,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	/*
 	 * Verify that CR3 is what we think it is.  This will catch
 	 * hypothetical buggy code that directly switches to swapper_pg_dir
-	 * without going through leave_mm() / switch_mm_irqs_off() or that
+	 * without going through switch_mm_irqs_off() or that
 	 * does something like write_cr3(read_cr3_pa()).
 	 *
 	 * Only do this check if CONFIG_DEBUG_VM=y because __read_cr3()
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index b2ccce5fb071..d3727aa34836 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -916,13 +916,6 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 	bool uninitialized_var(tick);
 	int cpu = smp_processor_id();
 
-	/*
-	 * leave_mm() to avoid costly and often unnecessary wakeups
-	 * for flushing the user TLB's associated with the active mm.
-	 */
-	if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-		leave_mm(cpu);
-
 	if (!static_cpu_has(X86_FEATURE_ARAT)) {
 		cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) &
 				MWAIT_CSTATE_MASK) + 1;
-- 
2.14.4



* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-06-20 21:32   ` kbuild test robot
  2018-06-21 20:18     ` Rik van Riel
  2018-06-21  0:24   ` kbuild test robot
  2018-06-22 15:10   ` Dave Hansen
  2 siblings, 1 reply; 47+ messages in thread
From: kbuild test robot @ 2018-06-20 21:32 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, 86, luto, mingo, tglx, dave.hansen,
	efault, songliubraving, kernel-team, Rik van Riel


Hi Rik,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.17]
[also build test ERROR on next-20180620]
[cannot apply to tip/x86/core linus/master mmotm/master v4.18-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-even-lazier/20180621-045620
config: x86_64-randconfig-x016-201824 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: the linux-review/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-even-lazier/20180621-045620 HEAD 7f2e7d915758c367dd0515efc17af977592fa141 builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> mm/init-mm.c:38:1: error: Only string constants are supported as initializers for randomized structures with flexible arrays
    };
    ^

vim +38 mm/init-mm.c

bb1f17b0 Alexey Dobriyan   2009-06-16  17  
c59b389d Rik van Riel      2018-06-20  18  /*
c59b389d Rik van Riel      2018-06-20  19   * For dynamically allocated mm_structs, there is a dynamically sized cpumask
c59b389d Rik van Riel      2018-06-20  20   * at the end of the structure, the size of which depends on nr_cpu_ids. That
c59b389d Rik van Riel      2018-06-20  21   * way we allocate only as much memory for mm_cpumask() as needed for the
c59b389d Rik van Riel      2018-06-20  22   * hundreds, or thousands of processes that a system typically runs.
c59b389d Rik van Riel      2018-06-20  23   *
c59b389d Rik van Riel      2018-06-20  24   * Since there is only one init_mm in the entire system, keep it simple
c59b389d Rik van Riel      2018-06-20  25   * and size this cpu_bitmask to NR_CPUS.
c59b389d Rik van Riel      2018-06-20  26   */
bb1f17b0 Alexey Dobriyan   2009-06-16  27  struct mm_struct init_mm = {
bb1f17b0 Alexey Dobriyan   2009-06-16  28  	.mm_rb		= RB_ROOT,
bb1f17b0 Alexey Dobriyan   2009-06-16  29  	.pgd		= swapper_pg_dir,
bb1f17b0 Alexey Dobriyan   2009-06-16  30  	.mm_users	= ATOMIC_INIT(2),
bb1f17b0 Alexey Dobriyan   2009-06-16  31  	.mm_count	= ATOMIC_INIT(1),
bb1f17b0 Alexey Dobriyan   2009-06-16  32  	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
bb1f17b0 Alexey Dobriyan   2009-06-16  33  	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
bb1f17b0 Alexey Dobriyan   2009-06-16  34  	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
bfedb589 Eric W. Biederman 2016-10-13  35  	.user_ns	= &init_user_ns,
c59b389d Rik van Riel      2018-06-20  36  	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
a1b200e2 Heiko Carstens    2010-08-09  37  	INIT_MM_CONTEXT(init_mm)
bb1f17b0 Alexey Dobriyan   2009-06-16 @38  };

:::::: The code at line 38 was first introduced by commit
:::::: bb1f17b0372de93758653ca3454bc0df18dc2e5c mm: consolidate init_mm definition

:::::: TO: Alexey Dobriyan <adobriyan@gmail.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-20 19:56 ` [PATCH 7/7] x86,idle: do not leave mm in idle state Rik van Riel
@ 2018-06-20 22:20   ` kbuild test robot
  2018-06-21  0:25     ` Rik van Riel
  2018-06-22 15:36   ` Andy Lutomirski
  1 sibling, 1 reply; 47+ messages in thread
From: kbuild test robot @ 2018-06-20 22:20 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, 86, luto, mingo, tglx, dave.hansen,
	efault, songliubraving, kernel-team, Rik van Riel


Hi Rik,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on v4.17]
[also build test ERROR on next-20180620]
[cannot apply to tip/x86/core linus/master mmotm/master v4.18-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-even-lazier/20180621-045620
config: x86_64-rhel (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All error/warnings (new ones prefixed by >>):

   arch/x86/xen/mmu_pv.c: In function 'drop_mm_ref_this_cpu':
>> arch/x86/xen/mmu_pv.c:987:3: error: implicit declaration of function 'leave_mm' [-Werror=implicit-function-declaration]
      leave_mm(smp_processor_id());
      ^~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from arch/x86/include/asm/fixmap.h:19:0,
                    from arch/x86/include/asm/apic.h:10,
                    from arch/x86/include/asm/smp.h:13,
                    from arch/x86/include/asm/mmzone_64.h:11,
                    from arch/x86/include/asm/mmzone.h:5,
                    from include/linux/mmzone.h:911,
                    from include/linux/gfp.h:6,
                    from include/linux/umh.h:4,
                    from include/linux/kmod.h:22,
                    from include/linux/module.h:13,
                    from drivers/acpi/processor_idle.c:28:
   drivers/acpi/processor_idle.c: In function 'acpi_idle_enter_bm':
>> arch/x86/include/asm/acpi.h:164:28: error: implicit declaration of function 'leave_mm' [-Werror=implicit-function-declaration]
    #define acpi_unlazy_tlb(x) leave_mm(x)
                               ^
>> drivers/acpi/processor_idle.c:717:2: note: in expansion of macro 'acpi_unlazy_tlb'
     acpi_unlazy_tlb(smp_processor_id());
     ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   In file included from arch/x86/include/asm/fixmap.h:19:0,
                    from arch/x86/include/asm/apic.h:10,
                    from arch/x86/include/asm/smp.h:13,
                    from arch/x86/include/asm/mmzone_64.h:11,
                    from arch/x86/include/asm/mmzone.h:5,
                    from include/linux/mmzone.h:911,
                    from include/linux/gfp.h:6,
                    from include/linux/umh.h:4,
                    from include/linux/kmod.h:22,
                    from include/linux/module.h:13,
                    from drivers//acpi/processor_idle.c:28:
   drivers//acpi/processor_idle.c: In function 'acpi_idle_enter_bm':
>> arch/x86/include/asm/acpi.h:164:28: error: implicit declaration of function 'leave_mm' [-Werror=implicit-function-declaration]
    #define acpi_unlazy_tlb(x) leave_mm(x)
                               ^
   drivers//acpi/processor_idle.c:717:2: note: in expansion of macro 'acpi_unlazy_tlb'
     acpi_unlazy_tlb(smp_processor_id());
     ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/leave_mm +987 arch/x86/xen/mmu_pv.c

7e0563de Vitaly Kuznetsov 2017-04-04  981  
3d28ebce Andy Lutomirski  2017-05-28  982  static void drop_mm_ref_this_cpu(void *info)
7e0563de Vitaly Kuznetsov 2017-04-04  983  {
7e0563de Vitaly Kuznetsov 2017-04-04  984  	struct mm_struct *mm = info;
7e0563de Vitaly Kuznetsov 2017-04-04  985  
3d28ebce Andy Lutomirski  2017-05-28  986  	if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm)
7e0563de Vitaly Kuznetsov 2017-04-04 @987  		leave_mm(smp_processor_id());
7e0563de Vitaly Kuznetsov 2017-04-04  988  
3d28ebce Andy Lutomirski  2017-05-28  989  	/*
3d28ebce Andy Lutomirski  2017-05-28  990  	 * If this cpu still has a stale cr3 reference, then make sure
3d28ebce Andy Lutomirski  2017-05-28  991  	 * it has been flushed.
3d28ebce Andy Lutomirski  2017-05-28  992  	 */
7e0563de Vitaly Kuznetsov 2017-04-04  993  	if (this_cpu_read(xen_current_cr3) == __pa(mm->pgd))
3d28ebce Andy Lutomirski  2017-05-28  994  		xen_mc_flush();
7e0563de Vitaly Kuznetsov 2017-04-04  995  }
7e0563de Vitaly Kuznetsov 2017-04-04  996  

:::::: The code at line 987 was first introduced by commit
:::::: 7e0563dea9c4e62297846359186c597e7231912d x86/xen: split off mmu_pv.c

:::::: TO: Vitaly Kuznetsov <vkuznets@redhat.com>
:::::: CC: Juergen Gross <jgross@suse.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-20 19:56 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
@ 2018-06-21  0:23   ` Rik van Riel
  2018-06-22 14:58   ` Andy Lutomirski
  1 sibling, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-21  0:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, mingo, tglx, dave.hansen, efault, songliubraving, kernel-team


On Wed, 2018-06-20 at 15:56 -0400, Rik van Riel wrote:
> 
> +void tlb_flush_remove_tables(struct mm_struct *mm)
> +{
> +	int cpu = get_cpu();
> +	/*
> +	 * XXX: this really only needs to be called for CPUs in lazy
> TLB mode.
> +	 */
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> +		smp_call_function_many(mm_cpumask(mm),
> tlb_flush_remove_tables_local, (void *)mm, 1);
> +}
>  

Jens helpfully pointed out I missed a put_cpu().

I have added that in my tree for a v2 of the series.

-- 
All Rights Reversed.



* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
  2018-06-20 21:32   ` kbuild test robot
@ 2018-06-21  0:24   ` kbuild test robot
  2018-06-22 15:10   ` Dave Hansen
  2 siblings, 0 replies; 47+ messages in thread
From: kbuild test robot @ 2018-06-21  0:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, 86, luto, mingo, tglx, dave.hansen,
	efault, songliubraving, kernel-team, Rik van Riel


Hi Rik,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v4.17]
[also build test WARNING on next-20180620]
[cannot apply to tip/x86/core linus/master mmotm/master v4.18-rc1]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-even-lazier/20180621-045620
config: x86_64-allyesdebian (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/cpumask.h:12:0,
                    from arch/x86/include/asm/cpumask.h:5,
                    from arch/x86/include/asm/msr.h:11,
                    from arch/x86/include/asm/processor.h:21,
                    from arch/x86/include/asm/cpufeature.h:5,
                    from arch/x86/include/asm/thread_info.h:53,
                    from include/linux/thread_info.h:38,
                    from arch/x86/include/asm/preempt.h:7,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:6,
                    from include/linux/mm.h:10,
                    from arch/x86/platform/efi/efi_64.c:23:
   In function 'bitmap_zero.constprop',
       inlined from 'cpumask_clear.constprop' at include/linux/cpumask.h:378:2,
       inlined from 'efi_alloc_page_tables' at include/linux/mm_types.h:512:2:
>> include/linux/bitmap.h:208:3: warning: 'memset' writing 64 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
      memset(dst, 0, len);
      ^~~~~~~~~~~~~~~~~~~

vim +/memset +208 include/linux/bitmap.h

^1da177e Linus Torvalds   2005-04-16  198  
4b0bc0bc Rusty Russell    2008-12-30  199  #define small_const_nbits(nbits) \
4b0bc0bc Rusty Russell    2008-12-30  200  	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
4b0bc0bc Rusty Russell    2008-12-30  201  
8b4daad5 Rasmus Villemoes 2015-02-12  202  static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
^1da177e Linus Torvalds   2005-04-16  203  {
4b0bc0bc Rusty Russell    2008-12-30  204  	if (small_const_nbits(nbits))
^1da177e Linus Torvalds   2005-04-16  205  		*dst = 0UL;
^1da177e Linus Torvalds   2005-04-16  206  	else {
8b4daad5 Rasmus Villemoes 2015-02-12  207  		unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
^1da177e Linus Torvalds   2005-04-16 @208  		memset(dst, 0, len);
^1da177e Linus Torvalds   2005-04-16  209  	}
^1da177e Linus Torvalds   2005-04-16  210  }
^1da177e Linus Torvalds   2005-04-16  211  

:::::: The code at line 208 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2

:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-20 22:20   ` kbuild test robot
@ 2018-06-21  0:25     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-21  0:25 UTC (permalink / raw)
  To: kbuild test robot
  Cc: linux-kernel, x86, luto, mingo, tglx, dave.hansen, efault,
	songliubraving, kernel-team


On Thu, 2018-06-21 at 06:20 +0800, kbuild test robot wrote:

> All error/warnings (new ones prefixed by >>):
> 
>    arch/x86/xen/mmu_pv.c: In function 'drop_mm_ref_this_cpu':
> > > arch/x86/xen/mmu_pv.c:987:3: error: implicit declaration of
> > > function 'leave_mm' [-Werror=implicit-function-declaration]
> 
>       leave_mm(smp_processor_id());
>       ^~~~~~~~
>    cc1: some warnings being treated as errors

This was the result of a last minute cleanup,
erroneously thinking I had removed the last
user of leave_mm.  Of course there's Xen :)

Fixed in my tree for v2.

-- 
All Rights Reversed.



* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-20 21:32   ` kbuild test robot
@ 2018-06-21 20:18     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-21 20:18 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, luto, mingo, tglx, dave.hansen, efault,
	songliubraving, kernel-team


On Thu, 2018-06-21 at 05:32 +0800, kbuild test robot wrote:
> Hi Rik,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on v4.17]
> [also build test ERROR on next-20180620]
> [cannot apply to tip/x86/core linus/master mmotm/master v4.18-rc1]
> [if your patch is applied to the wrong git tree, please drop us a
> note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/x86-tlb
> -mm-make-lazy-TLB-mode-even-lazier/20180621-045620
> config: x86_64-randconfig-x016-201824 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> Note: the linux-review/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-
> even-lazier/20180621-045620 HEAD
> 7f2e7d915758c367dd0515efc17af977592fa141 builds fine.
>       It only hurts bisectibility.
> 
> All errors (new ones prefixed by >>):
> 
> > > mm/init-mm.c:38:1: error: Only string constants are supported as
> > > initializers for randomized structures with flexible arrays
> 
>     };

Fixed in my tree for v2, by moving all of the randomizable
bits of mm_struct into an anonymous sub-structure, and
making sure the bitmap is always at the end.

-- 
All Rights Reversed.



* Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-20 19:56 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
  2018-06-21  0:23   ` Rik van Riel
@ 2018-06-22 14:58   ` Andy Lutomirski
  2018-06-22 15:17     ` Rik van Riel
  1 sibling, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 14:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, 86, Andrew Lutomirski, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> wrote:
>
> Andy discovered that speculative memory accesses while in lazy
> TLB mode can crash a system, when a CPU tries to dereference a
> speculative access using memory contents that used to be valid
> page table memory, but have since been reused for something else
> and point into la-la land.
>
> The latter problem can be prevented in two ways. The first is to
> always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
> the second one is to only send the TLB shootdown at page table
> freeing time.
>
> The second should result in fewer IPIs, since operations like
> mprotect and madvise are very common with some workloads, but
> do not involve page table freeing. Also, on munmap, batching
> of page table freeing covers much larger ranges of virtual
> memory than the batching of unmapped user pages.
>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/include/asm/tlbflush.h |  5 +++++
>  arch/x86/mm/tlb.c               | 24 ++++++++++++++++++++++++
>  include/asm-generic/tlb.h       | 10 ++++++++++
>  mm/memory.c                     | 22 ++++++++++++++--------
>  4 files changed, 53 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 6690cd3fc8b1..3aa3204b5dc0 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>         native_flush_tlb_others(mask, info)
>  #endif
>
> +extern void tlb_flush_remove_tables(struct mm_struct *mm);
> +extern void tlb_flush_remove_tables_local(void *arg);
> +
> +#define HAVE_TLB_FLUSH_REMOVE_TABLES
> +
>  #endif /* _ASM_X86_TLBFLUSH_H */
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e055d1a06699..61773b07ed54 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -646,6 +646,30 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>         put_cpu();
>  }
>
> +void tlb_flush_remove_tables_local(void *arg)
> +{
> +       struct mm_struct *mm = arg;
> +
> +       if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
> +                       this_cpu_read(cpu_tlbstate.is_lazy))
> +               /*
> +                * We're in lazy mode.  We need to at least flush our
> +                * paging-structure cache to avoid speculatively reading
> +                * garbage into our TLB.  Since switching to init_mm is barely
> +                * slower than a minimal flush, just switch to init_mm.
> +                */
> +               switch_mm_irqs_off(NULL, &init_mm, NULL);

Can you add braces?

> +}
> +
> +void tlb_flush_remove_tables(struct mm_struct *mm)
> +{
> +       int cpu = get_cpu();
> +       /*
> +        * XXX: this really only needs to be called for CPUs in lazy TLB mode.
> +        */
> +       if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> +               smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);

I suspect that most of the gain will come from fixing this limitation :)

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
@ 2018-06-22 15:04   ` Andy Lutomirski
  2018-06-22 15:15     ` Rik van Riel
  2018-06-22 17:05   ` Dave Hansen
  1 sibling, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 15:04 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, 86, Andrew Lutomirski, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> wrote:
>
> Lazy TLB mode can result in an idle CPU being woken up by a TLB flush,
> when all it really needs to do is reload %CR3 at the next context switch,
> assuming no page table pages got freed.
>
> This patch deals with that issue by introducing a third TLB state,
> TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context
> switch.
>
> Atomic compare and exchange is used to close races between the TLB
> shootdown code and the context switch code. Keying off just the
> tlb_gen is likely to not be enough, since that would not give
> lazy_tlb_can_skip_flush() information on when it is facing a race
> and has to send the IPI to a CPU in the middle of a LAZY -> OK switch.
>
> Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are not
> removed from the mm_cpumask(mm), since that would prevent the TLB
> flush IPIs at page table free time from being sent to all the CPUs
> that need them.

Eek, this is so complicated.  In the 2016 version of the patches, you
needed all this.  But I rewrote the whole subsystem to make it easier
now :)  I think that you can get rid of all of this and instead just
revert the relevant parts of:

b956575bed91ecfb136a8300742ecbbf451471ab

All the bookkeeping is already in place -- no need for new state.


>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/include/asm/tlbflush.h |   5 ++
>  arch/x86/include/asm/uv/uv.h    |   6 +--
>  arch/x86/mm/tlb.c               | 108 ++++++++++++++++++++++++++++++++++++----
>  arch/x86/platform/uv/tlb_uv.c   |   2 +-
>  4 files changed, 107 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 88a4d6b87ff7..0a8bdd6084de 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -171,6 +171,7 @@ struct tlb_context {
>
>  #define TLBSTATE_OK    0
>  #define TLBSTATE_LAZY  1
> +#define TLBSTATE_FLUSH 2
>
>  struct tlb_state {
>         /*
> @@ -199,6 +200,10 @@ struct tlb_state {
>          *    We're heuristically guessing that the CR3 load we
>          *    skipped more than makes up for the overhead added by
>          *    lazy mode.
> +        *
> +        *  - Lazily using a real mm, which has seen a TLB invalidation on
> +        *    other CPUs. TLB needs to be flushed at context switch time,
> +        *    state == TLBSTATE_FLUSH.
>          */
>         int state;
>
> diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h
> index a80c0673798f..d801afb5fe90 100644
> --- a/arch/x86/include/asm/uv/uv.h
> +++ b/arch/x86/include/asm/uv/uv.h
> @@ -17,7 +17,7 @@ extern int is_uv_hubless(void);
>  extern void uv_cpu_init(void);
>  extern void uv_nmi_init(void);
>  extern void uv_system_init(void);
> -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
> +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
>                                                  const struct flush_tlb_info *info);
>
>  #else  /* X86_UV */
> @@ -27,8 +27,8 @@ static inline int is_uv_system(void)  { return 0; }
>  static inline int is_uv_hubless(void)  { return 0; }
>  static inline void uv_cpu_init(void)   { }
>  static inline void uv_system_init(void)        { }
> -static inline const struct cpumask *
> -uv_flush_tlb_others(const struct cpumask *cpumask,
> +static inline struct cpumask *
> +uv_flush_tlb_others(struct cpumask *cpumask,
>                     const struct flush_tlb_info *info)
>  { return cpumask; }
>
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e063e623e52c..4b27d8469848 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -7,6 +7,7 @@
>  #include <linux/export.h>
>  #include <linux/cpu.h>
>  #include <linux/debugfs.h>
> +#include <linux/gfp.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/mmu_context.h>
> @@ -227,8 +228,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                 __flush_tlb_all();
>         }
>  #endif
> -       this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> -
>         /*
>          * The membarrier system call requires a full memory barrier and
>          * core serialization before returning to user-space, after
> @@ -236,8 +235,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>          * memory barrier and core serializing instruction.
>          */
>         if (real_prev == next) {
> +               int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state);
> +               int oldstate = *tlbstate;
> +
>                 VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
> -                          next->context.ctx_id);
> +                               next->context.ctx_id);
>
>                 /*
>                  * We don't currently support having a real mm loaded without
> @@ -250,6 +252,26 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                                  !cpumask_test_cpu(cpu, mm_cpumask(next))))
>                         cpumask_set_cpu(cpu, mm_cpumask(next));
>
> +               /*
> +                * In lazy TLB mode. The atomic exchange makes sure there was
> +                * no TLB shootdown all the way to this context switch.
> +                */
> +               if (oldstate == TLBSTATE_LAZY)
> +                       oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK);
> +
> +               /*
> +                * There was a TLB invalidation while this CPU was idle.
> +                * The code in choose_new_asid will figure out what kind
> +                * of flush is needed.
> +                */
> +               if (oldstate == TLBSTATE_FLUSH)
> +                       goto reload_cr3;
> +
> +               /*
> +                * Nothing to do. Either we are switching between two
> +                * threads in the same process, or we were in TLBSTATE_LAZY
> +                * without any TLB invalidations happening.
> +                */
>                 return;
>         } else {
>                 u16 new_asid;
> @@ -294,6 +316,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>                  * Start remote flushes and then read tlb_gen.
>                  */
>                 cpumask_set_cpu(cpu, mm_cpumask(next));
> + reload_cr3:
> +               this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>                 next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>
>                 choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);
> @@ -454,6 +478,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
>                  * paging-structure cache to avoid speculatively reading
>                  * garbage into our TLB.  Since switching to init_mm is barely
>                  * slower than a minimal flush, just switch to init_mm.
> +                *
> +                * This should be rare, with native_flush_tlb_others skipping
> +                * IPIs to lazy TLB mode CPUs.
>                  */
>                 switch_mm_irqs_off(NULL, &init_mm, NULL);
>                 return;
> @@ -557,9 +584,57 @@ static void flush_tlb_func_remote(void *info)
>         flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
>  }
>
> +/*
> + * Determine whether a CPU's TLB needs to be flushed now, or whether the
> + * flush can be delayed until the next context switch, by changing the
> + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH.
> + */
> +static bool lazy_tlb_can_skip_flush(int cpu)
> +{
> +       int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu);
> +       int old;
> +
> +       switch (*tlbstate) {
> +       case TLBSTATE_FLUSH:
> +               /* The TLB will be flushed on the next context switch. */
> +               return true;
> +       case TLBSTATE_LAZY:
> +               /*
> +                * The CPU is in TLBSTATE_LAZY, which could context switch back
> +                * to TLBSTATE_OK, re-using the old TLB state without a flush.
> +                * if that happened, send a TLB flush IPI.
> +                *
> +                * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will
> +                * be flushed at the next context switch. Skip the IPI.
> +                */
> +               old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH);
> +               return old != TLBSTATE_OK;
> +       case TLBSTATE_OK:
> +       default:
> +               /* A task on the CPU is actively using the mm. Flush the TLB. */
> +               return false;
> +       }
> +}
> +
>  void native_flush_tlb_others(const struct cpumask *cpumask,
>                              const struct flush_tlb_info *info)
>  {
> +       cpumask_t *mask = (struct cpumask *)cpumask;
> +       cpumask_var_t varmask;
> +       bool can_lazy_flush = false;
> +       unsigned int cpu;
> +
> +       /*
> +        * A temporary cpumask allows the kernel to skip sending IPIs
> +        * to CPUs in lazy TLB state, without also removing them from
> +        * mm_cpumask(mm).
> +        */
> +       if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) {
> +               cpumask_copy(varmask, cpumask);
> +               mask = varmask;
> +               can_lazy_flush = true;
> +       }
> +
>         count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
>         if (info->end == TLB_FLUSH_ALL)
>                 trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
> @@ -583,17 +658,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
>                  * that UV should be updated so that smp_call_function_many(),
>                  * etc, are optimal on UV.
>                  */
> -               unsigned int cpu;
> -
>                 cpu = smp_processor_id();
> -               cpumask = uv_flush_tlb_others(cpumask, info);
> -               if (cpumask)
> -                       smp_call_function_many(cpumask, flush_tlb_func_remote,
> +               mask = uv_flush_tlb_others(mask, info);
> +               if (mask)
> +                       smp_call_function_many(mask, flush_tlb_func_remote,
>                                                (void *)info, 1);
> -               return;
> +               goto out;
> +       }
> +
> +       /*
> +        * Instead of sending IPIs to CPUs in lazy TLB mode, move that
> +        * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed
> +        * at the next context switch, or at page table free time.
> +        */
> +       if (can_lazy_flush) {
> +               for_each_cpu(cpu, mask) {
> +                       if (lazy_tlb_can_skip_flush(cpu))
> +                               cpumask_clear_cpu(cpu, mask);
> +               }
>         }
> -       smp_call_function_many(cpumask, flush_tlb_func_remote,
> +
> +       smp_call_function_many(mask, flush_tlb_func_remote,
>                                (void *)info, 1);
> + out:
> +       free_cpumask_var(varmask);
>  }
>
>  /*
> diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c
> index b36caae0fb2f..42730acec8b7 100644
> --- a/arch/x86/platform/uv/tlb_uv.c
> +++ b/arch/x86/platform/uv/tlb_uv.c
> @@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp,
>   * Returns pointer to cpumask if some remote flushing remains to be
>   * done.  The returned pointer is valid till preemption is re-enabled.
>   */
> -const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask,
> +struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask,
>                                           const struct flush_tlb_info *info)
>  {
>         unsigned int cpu = smp_processor_id();
> --
> 2.14.4
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
  2018-06-20 21:32   ` kbuild test robot
  2018-06-21  0:24   ` kbuild test robot
@ 2018-06-22 15:10   ` Dave Hansen
  2018-06-22 17:45     ` Rik van Riel
  2 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2018-06-22 15:10 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: luto, mingo, tglx, efault, songliubraving, kernel-team

On 06/20/2018 12:56 PM, Rik van Riel wrote:
>  	/*
> -	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
> -	 * whole struct cpumask for the OFFSTACK case. We could change
> -	 * this to *only* allocate as much of it as required by the
> -	 * maximum number of CPU's we can ever have.  The cpumask_allocation
> -	 * is at the end of the structure, exactly for that reason.
> +	 * The mm_cpumask is located at the end of mm_struct, and is
> +	 * dynamically sized based on nr_cpu_ids.
>  	 */
> +	mm_size = sizeof(struct mm_struct) + cpumask_size();
> +
>  	mm_cachep = kmem_cache_create_usercopy("mm_struct",
> -			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
> +			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,

Could you add a bit to that comment, like "dynamically sized based on
nr_cpu_ids" ... which is sized based on the number of possible CPUs.

I found myself wondering how that interacts with hotplug.

> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index f94d5d15ebc0..20fe222fe4c0 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -15,6 +15,15 @@
>  #define INIT_MM_CONTEXT(name)
>  #endif
>  
> +/*
> + * For dynamically allocated mm_structs, there is a dynamically sized cpumask
> + * at the end of the structure, the size of which depends on nr_cpu_ids. That...

Similar nit.  Instead of calling out the variable alone, could we just
say what it means logically and then reference the variable?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-22 15:04   ` Andy Lutomirski
@ 2018-06-22 15:15     ` Rik van Riel
  2018-06-22 15:34       ` Andy Lutomirski
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 15:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Ingo Molnar, Thomas Gleixner, Dave Hansen,
	Mike Galbraith, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2107 bytes --]

On Fri, 2018-06-22 at 08:04 -0700, Andy Lutomirski wrote:
> On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > Lazy TLB mode can result in an idle CPU being woken up by a TLB
> > flush,
> > when all it really needs to do is reload %CR3 at the next context
> > switch,
> > assuming no page table pages got freed.
> > 
> > This patch deals with that issue by introducing a third TLB state,
> > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next
> > context
> > switch.
> > 
> > Atomic compare and exchange is used to close races between the TLB
> > shootdown code and the context switch code. Keying off just the
> > tlb_gen is likely to not be enough, since that would not give
> > lazy_tlb_can_skip_flush() information on when it is facing a race
> > and has to send the IPI to a CPU in the middle of a LAZY -> OK
> > switch.
> > 
> > Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are
> > not
> > removed from the mm_cpumask(mm), since that would prevent the TLB
> > flush IPIs at page table free time from being sent to all the CPUs
> > that need them.
> 
> Eek, this is so complicated.  In the 2016 version of the patches, you
> needed all this.  But I rewrote the whole subsystem to make it easier
> now :)  I think that you can get rid of all of this and instead just
> revert the relevant parts of:
> 
> b956575bed91ecfb136a8300742ecbbf451471ab
> 
> All the bookkeeping is already in place -- no need for new state.

I looked at using your .tlb_gen stuff, but we need a
way to do that race free.  I suppose setting the
tlbstate to !lazy before checking .tlb_gen might do
the trick, if we get the ordering right at the tlb
invalidation site, too?

Something like this:

context switch                tlb invalidation

			advance mm->context.tlb_gen
			send IPI to cpus with !is_lazy tlb


tlbstate.is_lazy = FALSE
*need_flush = .tlb_gen < next_tlb_gen

Do you see any holes in that?

I will gladly simplify this code and get rid
of the atomic operations :)



-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time
  2018-06-22 14:58   ` Andy Lutomirski
@ 2018-06-22 15:17     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Ingo Molnar, Thomas Gleixner, Dave Hansen,
	Mike Galbraith, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1716 bytes --]

On Fri, 2018-06-22 at 07:58 -0700, Andy Lutomirski wrote:
> On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > +++ b/arch/x86/mm/tlb.c
> > @@ -646,6 +646,30 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> >         put_cpu();
> >  }
> > 
> > +void tlb_flush_remove_tables_local(void *arg)
> > +{
> > +       struct mm_struct *mm = arg;
> > +
> > +       if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm &&
> > +                       this_cpu_read(cpu_tlbstate.is_lazy))
> > +               /*
> > +                * We're in lazy mode.  We need to at least flush our
> > +                * paging-structure cache to avoid speculatively reading
> > +                * garbage into our TLB.  Since switching to init_mm is barely
> > +                * slower than a minimal flush, just switch to init_mm.
> > +                */
> > +               switch_mm_irqs_off(NULL, &init_mm, NULL);
> 
> Can you add braces?

Will do.

> > +}
> > +
> > +void tlb_flush_remove_tables(struct mm_struct *mm)
> > +{
> > +       int cpu = get_cpu();
> > +       /*
> > +        * XXX: this really only needs to be called for CPUs in lazy TLB mode.
> > +        */
> > +       if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> > +               smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
> 
> I suspect that most of the gain will come from fixing this limitation :)

Patch 5 does that.

However, I suspect most of the gain comes from
not having mprotect and madvise send IPIs to
lazy TLB mode CPUs at all any more :)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-22 15:15     ` Rik van Riel
@ 2018-06-22 15:34       ` Andy Lutomirski
  0 siblings, 0 replies; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 15:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Lutomirski, LKML, X86 ML, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

On Fri, Jun 22, 2018 at 8:15 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-22 at 08:04 -0700, Andy Lutomirski wrote:
> > On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > Lazy TLB mode can result in an idle CPU being woken up by a TLB
> > > flush,
> > > when all it really needs to do is reload %CR3 at the next context
> > > switch,
> > > assuming no page table pages got freed.
> > >
> > > This patch deals with that issue by introducing a third TLB state,
> > > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next
> > > context
> > > switch.
> > >
> > > Atomic compare and exchange is used to close races between the TLB
> > > shootdown code and the context switch code. Keying off just the
> > > tlb_gen is likely to not be enough, since that would not give
> > > lazy_tlb_can_skip_flush() information on when it is facing a race
> > > and has to send the IPI to a CPU in the middle of a LAZY -> OK
> > > switch.
> > >
> > > Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are
> > > not
> > > removed from the mm_cpumask(mm), since that would prevent the TLB
> > > flush IPIs at page table free time from being sent to all the CPUs
> > > that need them.
> >
> > Eek, this is so complicated.  In the 2016 version of the patches, you
> > needed all this.  But I rewrote the whole subsystem to make it easier
> > now :)  I think that you can get rid of all of this and instead just
> > revert the relevant parts of:
> >
> > b956575bed91ecfb136a8300742ecbbf451471ab
> >
> > All the bookkeeping is already in place -- no need for new state.
>
> I looked at using your .tlb_gen stuff, but we need a
> way to do that race free.  I suppose setting the
> tlbstate to !lazy before checking .tlb_gen might do
> the trick, if we get the ordering right at the tlb
> invalidation site, too?

Oh, right.

>
> Something like this:
>
> context switch                tlb invalidation
>
>                         advance mm->context.tlb_gen
>                         send IPI to cpus with !is_lazy tlb
>
>
> tlbstate.is_lazy = FALSE
> *need_flush = .tlb_gen < next_tlb_gen
>
> Do you see any holes in that?

Logically, is_lazy is (with your patches) just like mm_cpumask in
terms of ordering.  So I think your idea above is fine.  But I think
you need to make sure there's a full barrier between is_lazy = false
and reading .tlb_gen.
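
(To make that concrete, a minimal sketch of the two sides under
discussion, assuming the is_lazy and per-context tlb_gen fields from
this series; the explicit smp_mb() only marks where the ordering is
needed, and next, prev_asid, ipi_mask and info are placeholders
borrowed from the existing tlb.c code -- this is not the eventual
patch:)

	/*
	 * Context switch side: a lazy CPU waking back up into the same mm.
	 */
	this_cpu_write(cpu_tlbstate.is_lazy, false);
	smp_mb();	/* order the is_lazy store against the tlb_gen reads below */
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	need_flush = this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
		     next_tlb_gen;

	/*
	 * TLB invalidation side; ipi_mask is a local copy of mm_cpumask(mm).
	 */
	atomic64_inc_return(&mm->context.tlb_gen);	/* inc_mm_tlb_gen() */
	smp_mb();	/* pairs with the barrier on the context switch side */
	for_each_cpu(cpu, ipi_mask)
		if (per_cpu(cpu_tlbstate.is_lazy, cpu))
			cpumask_clear_cpu(cpu, ipi_mask);	/* lazy CPU catches up at switch time */
	smp_call_function_many(ipi_mask, flush_tlb_func_remote, (void *)info, 1);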

--Andy

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-20 19:56 ` [PATCH 7/7] x86,idle: do not leave mm in idle state Rik van Riel
  2018-06-20 22:20   ` kbuild test robot
@ 2018-06-22 15:36   ` Andy Lutomirski
  2018-06-22 15:53     ` Rik van Riel
  1 sibling, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 15:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: LKML, 86, Andrew Lutomirski, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> wrote:
>
> Do not call leave_mm when going into a cstate. Now that mprotect and
> madvise no longer send IPIs for TLB shootdowns to idle CPUs, there is
> no real reason to disable lazy TLB mode in idle states.
>
> This seems to help performance on Broadwell systems. Haswell performance
> numbers are inconclusive.

I'm skeptical.  The code you're removing is more about power
consumption than about performance.  If a task migrates from one cpu
to another, runs for awhile, and exits, we don't want to IPI the old
CPU if the old CPU is deeply asleep.  The logic you're removing is a
bit awkwardly written, but the intent is to only do the leave_mm() in
deep sleep.  We should arguably change the condition to check the
expected sleep duration instead of the microarchitectural properties
of the target state, though.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-22 15:36   ` Andy Lutomirski
@ 2018-06-22 15:53     ` Rik van Riel
  2018-06-22 16:01       ` Andy Lutomirski
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 15:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Ingo Molnar, Thomas Gleixner, Dave Hansen,
	Mike Galbraith, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1242 bytes --]

On Fri, 2018-06-22 at 08:36 -0700, Andy Lutomirski wrote:
> On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com>
> wrote:
> > 
> > Do not call leave_mm when going into a cstate. Now that mprotect
> > and
> > madvise no longer send IPIs for TLB shootdowns to idle CPUs, there
> > is
> > no real reason to disable lazy TLB mode in idle states.
> > 
> > This seems to help performance on Broadwell systems. Haswell
> > performance
> > numbers are inconclusive.
> 
> I'm skeptical.  The code you're removing is more about power
> consumption than about performance.  If a task migrates from one cpu
> to another, runs for awhile, and exits, we don't want to IPI the old
> CPU if the old CPU is deeply asleep.  The logic you're removing is a
> bit awkwardly written, but the intent is to only do the leave_mm() in
> deep sleep.  We should arguably change the condition to check the
> expected sleep duration instead of the microarchitectural properties
> of the target state, though.

That is fair.

A quick hack could be to just remove the TLB flush
from the C3 state (which is entered a lot), but keep
it for C6 and deeper states (which are mostly entered
when a CPU is really idle).
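
(As an illustration, that hack could key off the state's target
residency inside intel_idle() instead of editing each C-state table
entry; LAZY_TLB_RESIDENCY_US is a made-up threshold, not something in
the tree, and this is only a sketch of the idea:)

	/* In intel_idle(): keep the existing flag check, but also require
	 * a long enough target residency before giving up the lazy mm. */
	struct cpuidle_state *state = &drv->states[index];

	if ((state->flags & CPUIDLE_FLAG_TLB_FLUSHED) &&
	    state->target_residency >= LAZY_TLB_RESIDENCY_US)
		leave_mm(cpu);	/* only for C6-class and deeper states */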

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-22 15:53     ` Rik van Riel
@ 2018-06-22 16:01       ` Andy Lutomirski
  2018-06-22 20:18         ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 16:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Lutomirski, LKML, X86 ML, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

Hmm, fair enough.  I think a better heuristic would be if the
estimated idle duration is more than, say, 10ms.  I *think* the code
has been cleaned up enough that this is easy now.  (Using time instead
of C6 will make it a lot less dependent on which idle driver is in
use.)
On Fri, Jun 22, 2018 at 8:53 AM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-22 at 08:36 -0700, Andy Lutomirski wrote:
> > On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com>
> > wrote:
> > >
> > > Do not call leave_mm when going into a cstate. Now that mprotect
> > > and
> > > madvise no longer send IPIs for TLB shootdowns to idle CPUs, there
> > > is
> > > no real reason to disable lazy TLB mode in idle states.
> > >
> > > This seems to help performance on Broadwell systems. Haswell
> > > performance
> > > numbers are inconclusive.
> >
> > I'm skeptical.  The code you're removing is more about power
> > consumption than about performance.  If a task migrates from one cpu
> > to another, runs for awhile, and exits, we don't want to IPI the old
> > CPU if the old CPU is deeply asleep.  The logic you're removing is a
> > bit awkwardly written, but the intent is to only do the leave_mm() in
> > deep sleep.  We should arguably change the condition to check the
> > expected sleep duration instead of the microarchitectural properties
> > of the target state, though.
>
> That is fair.
>
> A quick hack could be to just remove the tlb flush
> from C3 state (which gets called a lot), but keep
> it for C6 state and deeper (which get called more
> when a CPU is really idle).
>
> --
> All Rights Reversed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state
  2018-06-20 19:56 ` [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state Rik van Riel
@ 2018-06-22 17:01   ` Dave Hansen
  2018-06-22 17:08     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2018-06-22 17:01 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: 86, luto, mingo, tglx, efault, songliubraving, kernel-team

On 06/20/2018 12:56 PM, Rik van Riel wrote:
> +#define TLBSTATE_OK	0
> +#define TLBSTATE_LAZY	1

Could we spell out a bit more about what "OK" means in comments?  It
obviously means "not lazy", but anything else?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
  2018-06-22 15:04   ` Andy Lutomirski
@ 2018-06-22 17:05   ` Dave Hansen
  2018-06-22 17:16     ` Rik van Riel
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Hansen @ 2018-06-22 17:05 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: 86, luto, mingo, tglx, efault, songliubraving, kernel-team

On 06/20/2018 12:56 PM, Rik van Riel wrote:
> This patch deals with that issue by introducing a third TLB state,
> TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context
> switch.

With PCIDs, do we need to be a bit more explicit about what kind of %CR3
reload we have?  Because, with PCIDs, we do have non-TLB-flushing %CR3
writes.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state
  2018-06-22 17:01   ` Dave Hansen
@ 2018-06-22 17:08     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 17:08 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, luto, mingo, tglx, efault, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 635 bytes --]

On Fri, 2018-06-22 at 10:01 -0700, Dave Hansen wrote:
> On 06/20/2018 12:56 PM, Rik van Riel wrote:
> > +#define TLBSTATE_OK	0
> > +#define TLBSTATE_LAZY	1
> 
> Could we spell out a bit more about what "OK" means in comments?  It
> obviously means "not lazy", but anything else?

After discussion with Andy, I plan to simply drop
this patch, and use memory ordering between updating
tlbstate.is_lazy and *.tlb_gen to take care of getting
the TLB reloaded at context switch time if needed.

We should not need the extra TLBSTATE_FLUSH state,
so a boolean is fine and this patch is unnecessary.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
  2018-06-22 17:05   ` Dave Hansen
@ 2018-06-22 17:16     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 17:16 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: x86, luto, mingo, tglx, efault, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 943 bytes --]

On Fri, 2018-06-22 at 10:05 -0700, Dave Hansen wrote:
> On 06/20/2018 12:56 PM, Rik van Riel wrote:
> > This patch deals with that issue by introducing a third TLB state,
> > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next
> > context
> > switch.
> 
> With PCIDs, do we need to be a bit more explicit about what kind of
> %CR3
> reload we have?  Because, with PCIDs, we do have non-TLB-flushing
> %CR3
> writes.

Probably not in this patch series, because switch_mm_irqs_off
calls choose_new_asid, which figures all of this out for us
already.

Specifically, it has this bit of code to figure out whether
we need a flush:

                *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) <
                               next_tlb_gen);

As long as we get the memory ordering right, we should be
able to rely on that to switch back from lazy to non-lazy
TLB mode.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs
  2018-06-20 19:56 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
@ 2018-06-22 17:23   ` Dave Hansen
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Hansen @ 2018-06-22 17:23 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: 86, luto, mingo, tglx, efault, songliubraving, kernel-team

On 06/20/2018 12:56 PM, Rik van Riel wrote:
> CPUs in TLBSTATE_OK have either received TLB flush IPIs earlier on during
> the munmap (when the user memory was unmapped), or have context switched
> and reloaded during that stage of the munmap.
> 
> Page table free TLB flushes only need to be sent to CPUs in lazy TLB
> mode, whose TLB contents might not be up to date yet.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Tested-by: Song Liu <songliubraving@fb.com>
> ---
>  arch/x86/mm/tlb.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 4b27d8469848..40b00055c883 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -752,11 +752,31 @@ void tlb_flush_remove_tables_local(void *arg)
>  void tlb_flush_remove_tables(struct mm_struct *mm)
>  {
>  	int cpu = get_cpu();
> +	cpumask_var_t varmask;
> +
> +	if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
> +		return;

Oh, man, we need a wrapper for that little nugget.  I was really
scratching my head about what this was doing until I saw the
cpumask_any_but() comment:

	 * Returns >= nr_cpu_ids if no cpus set.


> +	if (!zalloc_cpumask_var(&varmask, GFP_ATOMIC)) {
> +		/* Flush the TLB on all CPUs that have this mm loaded. */
> +		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
> +	}

Maybe:

	/*
	 * If the cpumask allocation fails, do a brute-force
	 * flush on all CPUs that have this mm loaded.
	 */

Also, don't you need a return inside the if(){} block here?

>  	/*
> -	 * XXX: this really only needs to be called for CPUs in lazy TLB mode.
> +	 * CPUs in TLBSTATE_OK either received a TLB flush IPI while the user
> +	 * pages in this address range were unmapped, or have context switched
> +	 * and reloaded %CR3 since then.
> +	 *
> +	 * Shootdown IPIs at page table freeing time only need to be sent to
> +	 * CPUs that may have out of date TLB contents.
>  	 */
> -	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
> -		smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1);
> +	for_each_cpu(cpu, mm_cpumask(mm)) {
> +		if (per_cpu(cpu_tlbstate.state, cpu) != TLBSTATE_OK)
> +			cpumask_set_cpu(cpu, varmask);
> +	}
> +
> +	smp_call_function_many(varmask, tlb_flush_remove_tables_local, (void *)mm, 1);
> +	free_cpumask_var(varmask);
>  }

Would this look any better if you wrapped the mask manipulation in a:

void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
			       cpumask_var_t *varmask)

helper, including the explanation comments?

	cpumask_var_t lazy_cpus;
	
	... cpumask allocation/fallback here

	mm_fill_lazy_tlb_cpu_mask(mm, &lazy_cpus);
	smp_call_function_many(lazy_cpus, tlb_flush_remove_...


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-22 15:10   ` Dave Hansen
@ 2018-06-22 17:45     ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 17:45 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: luto, mingo, tglx, efault, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1139 bytes --]

On Fri, 2018-06-22 at 08:10 -0700, Dave Hansen wrote:
> On 06/20/2018 12:56 PM, Rik van Riel wrote:
> >  	/*
> > -	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
> > -	 * whole struct cpumask for the OFFSTACK case. We could change
> > -	 * this to *only* allocate as much of it as required by the
> > -	 * maximum number of CPU's we can ever have.  The cpumask_allocation
> > -	 * is at the end of the structure, exactly for that reason.
> > +	 * The mm_cpumask is located at the end of mm_struct, and is
> > +	 * dynamically sized based on nr_cpu_ids.
> >  	 */
> > +	mm_size = sizeof(struct mm_struct) + cpumask_size();
> > +
> >  	mm_cachep = kmem_cache_create_usercopy("mm_struct",
> > -			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
> > +			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
> 
> Could you add a bit to that comment, like "dynamically sized based on
> nr_cpu_ids" ... which is sized based on the number of possible CPUs.
> 
> I found myself wondering how that interacts with hotplug.

Improved in my tree for v2.  Thank you.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-22 16:01       ` Andy Lutomirski
@ 2018-06-22 20:18         ` Rik van Riel
  2018-06-22 22:05           ` Andy Lutomirski
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-22 20:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Ingo Molnar, Thomas Gleixner, Dave Hansen,
	Mike Galbraith, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 989 bytes --]

On Fri, 2018-06-22 at 09:01 -0700, Andy Lutomirski wrote:
> Hmm, fair enough.  I think a better heuristic would be if the
> estimated idle duration is more than, say, 10ms.  I *think* the code
> has been cleaned up enough that this is easy now.  (Using time
> instead
> of C6 will make it a lot less dependent on which idle driver is in
> use.)

This particular bit of code is only in intel_idle
though, and not every cpuidle governor estimates
an idle duration, nor does it get passed up the
stack (presumably because it does not always exist).

I will just drop this patch for now, and see if
adding back the patch that skips manipulation of
mm_cpumask(&init_mm) helps, since that might make
leave_mm() a little cheaper.

We would still have excess manipulation of the
bitmask when re-entering the task from what should
have been lazy TLB mode, but total cache line
contention would likely still be down from where
it is before that patch.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-22 20:18         ` Rik van Riel
@ 2018-06-22 22:05           ` Andy Lutomirski
  2018-06-23  0:55             ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Andy Lutomirski @ 2018-06-22 22:05 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Lutomirski, LKML, X86 ML, Ingo Molnar, Thomas Gleixner,
	Dave Hansen, Mike Galbraith, songliubraving, kernel-team

On Fri, Jun 22, 2018 at 1:19 PM Rik van Riel <riel@surriel.com> wrote:
>
> On Fri, 2018-06-22 at 09:01 -0700, Andy Lutomirski wrote:
> > Hmm, fair enough.  I think a better heuristic would be if the
> > estimated idle duration is more than, say, 10ms.  I *think* the code
> > has been cleaned up enough that this is easy now.  (Using time
> > instead
> > of C6 will make it a lot less dependent on which idle driver is in
> > use.)
>
> This particular bit of code is only in intel_idle
> though, and not every cpuidle governor estimates
> an idle duration, nor does it get passed up the
> stack (presumably because it not always exists).
>
> I will just drop this patch for now, and see if
> adding back in the patch that skips manipulation
> of the mm_cpumask(&init_mm), since that might make
> leave_mm() a little cheaper.
>
> We would still have excess manipulation of the
> bitmask when re-entering the task from what should
> have been lazy TLB mode, but total cache line
> contention would likely still be down from where
> it is before that patch.
>

Agreed.

I think the right solution if you want that last little bit of
performance is to get rid of the code in intel_idle and to add it in
the core idle code.  We have fancy scheduler code to estimate the idle
time, and we should use it here IMO.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 7/7] x86,idle: do not leave mm in idle state
  2018-06-22 22:05           ` Andy Lutomirski
@ 2018-06-23  0:55             ` Rik van Riel
  0 siblings, 0 replies; 47+ messages in thread
From: Rik van Riel @ 2018-06-23  0:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Ingo Molnar, Thomas Gleixner, Dave Hansen,
	Mike Galbraith, songliubraving, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1480 bytes --]

On Fri, 2018-06-22 at 15:05 -0700, Andy Lutomirski wrote:

> I think the right solution if you want that last little bit of
> performance is to get rid of the code in intel_idle and to add it in
> the core idle code.  We have fancy scheduler code to estimate the
> idle
> time, and we should use it here IMO.

Good point.

However, I suspect we have some larger, lower-hanging
fruit to tackle first.

Every time we go into lazy TLB mode, we take
a refcount on the mm. Every time we leave lazy
TLB mode, we drop the refcount.

Every time we switch from the idle task to a
kernel thread, the kernel thread takes a
refcount, and the idle task drops it. Every
time we switch back, we do the same dance in
reverse.

I am working on a patch to grab the refcount
once, and hang onto it while a particular mm
is that CPU's lazy_mm.

We can release it when we switch to a task
with a different mm.
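
(A very rough sketch of that idea, holding one reference per CPU for
whatever mm is currently that CPU's lazy mm; the lazy_mm per-cpu
variable and helper names are made up for illustration, and this is
not the patch I am working on:)

	static DEFINE_PER_CPU(struct mm_struct *, lazy_mm);	/* hypothetical */

	/* Entering lazy TLB mode: pin mm_count only on the first entry. */
	static void lazy_tlb_grab_mm(struct mm_struct *mm)
	{
		if (this_cpu_read(lazy_mm) != mm) {
			mmgrab(mm);
			this_cpu_write(lazy_mm, mm);
		}
	}

	/* Switching to a task with a different mm: drop the cached reference. */
	static void lazy_tlb_drop_mm(struct mm_struct *next_mm)
	{
		struct mm_struct *mm = this_cpu_read(lazy_mm);

		if (mm && mm != next_mm) {
			this_cpu_write(lazy_mm, NULL);
			mmdrop(mm);
		}
	}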

The patches we have so far get rid of a lot of
the pounding on mm_cpumask(mm).

That patch should help us also get rid of tasks
pounding on mm->count.

After that, the idle state thing is probably
of pretty small impact, though I suspect it will
still be worth tackling :)


As an aside, isn't the fancy CPU power management
stuff in the scheduler cpufreq, not cpuidle? The
cpuidle stuff in kernel/sched/idle.c looks like it
will just call down into the menu governor (and
maybe the ladder governor on some systems??)

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-16 19:03 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-08-04 22:28   ` Guenter Roeck
  0 siblings, 0 replies; 47+ messages in thread
From: Guenter Roeck @ 2018-08-04 22:28 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, luto, efault, kernel-team, mingo, dave.hansen,
	Guan Xuetao

On Mon, Jul 16, 2018 at 03:03:31PM -0400, Rik van Riel wrote:
> The mm_struct always contains a cpumask bitmap, regardless of
> CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> simplify things, and simply have one bitmask at the end of the
> mm_struct for the mm_cpumask.
> 
> This does necessitate moving everything else in mm_struct into
> an anonymous sub-structure, which can be randomized when struct
> randomization is enabled.
> 
> The second step is to determine the correct size for the
> mm_struct slab object from the size of the mm_struct
> (excluding the cpu bitmap) and the size of the cpumask.
> 
> For init_mm we can simply allocate the maximum size this
> kernel is compiled for, since we only have one init_mm
> in the system, anyway.
> 
> Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
> getting confused by the dynamically sized array.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>
> Tested-by: Song Liu <songliubraving@fb.com>

Hi,

this patch causes unicore32 build failures.

In file included from include/linux/mm.h:17,
                 from arch/unicore32/kernel/asm-offsets.c:17:
include/linux/mm_types.h:497: error:
	flexible array member in otherwise empty struct

Build reference: next-20180803
gcc version: unicore32-linux-gcc (UC4_1.0.5_20100917) 4.4.2

I understand this is an old compiler, but it is the only available
version as far as I know.
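
(For reference, the shape the old compiler is rejecting boils down to
roughly the following reduced example -- an anonymous struct holding
the real fields, followed by a flexible array member -- which gcc
4.4.2 reports as a "flexible array member in otherwise empty struct";
kernel macros such as __randomize_layout are omitted here:)

	struct mm_demo {
		struct {			/* anonymous struct with the "real" fields */
			int map_count;
			/* ... */
		};
		unsigned long cpu_bitmap[];	/* flexible array member at the end */
	};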

Guenter

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-16 19:03 [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-07-16 19:03 ` Rik van Riel
  2018-08-04 22:28   ` Guenter Roeck
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-16 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, efault, kernel-team, mingo, dave.hansen, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 drivers/firmware/efi/efi.c |   1 +
 include/linux/mm_types.h   | 241 +++++++++++++++++++++++----------------------
 kernel/fork.c              |  15 +--
 mm/init-mm.c               |  11 +++
 4 files changed, 145 insertions(+), 123 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..7f0b19410a95 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
 	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
+	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 };
 
 static bool disable_runtime;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..efdc24dd9e97 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,183 @@ struct core_state {
 
 struct kioctx_table;
 struct mm_struct {
-	struct vm_area_struct *mmap;		/* list of VMAs */
-	struct rb_root mm_rb;
-	u32 vmacache_seqnum;                   /* per-thread vmacache */
+	struct {
+		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct rb_root mm_rb;
+		u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
+		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
 #endif
-	unsigned long mmap_base;		/* base of mmap area */
-	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
+		unsigned long mmap_base;	/* base of mmap area */
+		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-	/* Base adresses for compatible mmap() */
-	unsigned long mmap_compat_base;
-	unsigned long mmap_compat_legacy_base;
+		/* Base adresses for compatible mmap() */
+		unsigned long mmap_compat_base;
+		unsigned long mmap_compat_legacy_base;
 #endif
-	unsigned long task_size;		/* size of task vm space */
-	unsigned long highest_vm_end;		/* highest vma end address */
-	pgd_t * pgd;
-
-	/**
-	 * @mm_users: The number of users including userspace.
-	 *
-	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
-	 * to 0 (i.e. when the task exits and there are no other temporary
-	 * reference holders), we also release a reference on @mm_count
-	 * (which may then free the &struct mm_struct if @mm_count also
-	 * drops to 0).
-	 */
-	atomic_t mm_users;
-
-	/**
-	 * @mm_count: The number of references to &struct mm_struct
-	 * (@mm_users count as 1).
-	 *
-	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-	 * &struct mm_struct is freed.
-	 */
-	atomic_t mm_count;
+		unsigned long task_size;	/* size of task vm space */
+		unsigned long highest_vm_end;	/* highest vma end address */
+		pgd_t * pgd;
+
+		/**
+		 * @mm_users: The number of users including userspace.
+		 *
+		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+		 * drops to 0 (i.e. when the task exits and there are no other
+		 * temporary reference holders), we also release a reference on
+		 * @mm_count (which may then free the &struct mm_struct if
+		 * @mm_count also drops to 0).
+		 */
+		atomic_t mm_users;
+
+		/**
+		 * @mm_count: The number of references to &struct mm_struct
+		 * (@mm_users count as 1).
+		 *
+		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+		 * &struct mm_struct is freed.
+		 */
+		atomic_t mm_count;
 
 #ifdef CONFIG_MMU
-	atomic_long_t pgtables_bytes;		/* PTE page table pages */
+		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
-	int map_count;				/* number of VMAs */
+		int map_count;			/* number of VMAs */
 
-	spinlock_t page_table_lock;		/* Protects page tables and some counters */
-	struct rw_semaphore mmap_sem;
+		spinlock_t page_table_lock; /* Protects page tables and some
+					     * counters
+					     */
+		struct rw_semaphore mmap_sem;
 
-	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
-						 * together off init_mm.mmlist, and are protected
-						 * by mmlist_lock
-						 */
+		struct list_head mmlist; /* List of maybe swapped mm's.	These
+					  * are globally strung together off
+					  * init_mm.mmlist, and are protected
+					  * by mmlist_lock
+					  */
 
 
-	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
-	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+		unsigned long hiwater_rss; /* High-watermark of RSS usage */
+		unsigned long hiwater_vm;  /* High-water virtual memory usage */
 
-	unsigned long total_vm;		/* Total pages mapped */
-	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */
-	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
-	unsigned long stack_vm;		/* VM_STACK */
-	unsigned long def_flags;
+		unsigned long total_vm;	   /* Total pages mapped */
+		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
+		unsigned long pinned_vm;   /* Refcount permanently increased */
+		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+		unsigned long stack_vm;	   /* VM_STACK */
+		unsigned long def_flags;
 
-	spinlock_t arg_lock; /* protect the below fields */
-	unsigned long start_code, end_code, start_data, end_data;
-	unsigned long start_brk, brk, start_stack;
-	unsigned long arg_start, arg_end, env_start, env_end;
+		spinlock_t arg_lock; /* protect the below fields */
+		unsigned long start_code, end_code, start_data, end_data;
+		unsigned long start_brk, brk, start_stack;
+		unsigned long arg_start, arg_end, env_start, env_end;
 
-	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	/*
-	 * Special counters, in some configurations protected by the
-	 * page_table_lock, in other configurations by being atomic.
-	 */
-	struct mm_rss_stat rss_stat;
-
-	struct linux_binfmt *binfmt;
+		/*
+		 * Special counters, in some configurations protected by the
+		 * page_table_lock, in other configurations by being atomic.
+		 */
+		struct mm_rss_stat rss_stat;
 
-	cpumask_var_t cpu_vm_mask_var;
+		struct linux_binfmt *binfmt;
 
-	/* Architecture-specific MM context */
-	mm_context_t context;
+		/* Architecture-specific MM context */
+		mm_context_t context;
 
-	unsigned long flags; /* Must use atomic bitops to access the bits */
+		unsigned long flags; /* Must use atomic bitops to access */
 
-	struct core_state *core_state; /* coredumping support */
+		struct core_state *core_state; /* coredumping support */
 #ifdef CONFIG_MEMBARRIER
-	atomic_t membarrier_state;
+		atomic_t membarrier_state;
 #endif
 #ifdef CONFIG_AIO
-	spinlock_t			ioctx_lock;
-	struct kioctx_table __rcu	*ioctx_table;
+		spinlock_t			ioctx_lock;
+		struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+		/*
+		 * "owner" points to a task that is regarded as the canonical
+		 * user/owner of this mm. All of the following must be true in
+		 * order for it to be changed:
+		 *
+		 * current == mm->owner
+		 * current->mm != mm
+		 * new_owner->mm == mm
+		 * new_owner->alloc_lock is held
+		 */
+		struct task_struct __rcu *owner;
 #endif
-	struct user_namespace *user_ns;
+		struct user_namespace *user_ns;
 
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file __rcu *exe_file;
+		/* store ref to file /proc/<pid>/exe symlink points to */
+		struct file __rcu *exe_file;
 #ifdef CONFIG_MMU_NOTIFIER
-	struct mmu_notifier_mm *mmu_notifier_mm;
+		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
+		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * numa_next_scan is the next time that the PTEs will be marked
-	 * pte_numa. NUMA hinting faults will gather statistics and migrate
-	 * pages to new nodes if necessary.
-	 */
-	unsigned long numa_next_scan;
+		/*
+		 * numa_next_scan is the next time that the PTEs will be marked
+		 * pte_numa. NUMA hinting faults will gather statistics and
+		 * migrate pages to new nodes if necessary.
+		 */
+		unsigned long numa_next_scan;
 
-	/* Restart point for scanning and setting pte_numa */
-	unsigned long numa_scan_offset;
+		/* Restart point for scanning and setting pte_numa */
+		unsigned long numa_scan_offset;
 
-	/* numa_scan_seq prevents two threads setting pte_numa */
-	int numa_scan_seq;
+		/* numa_scan_seq prevents two threads setting pte_numa */
+		int numa_scan_seq;
 #endif
-	/*
-	 * An operation with batched TLB flushing is going on. Anything that
-	 * can move process memory needs to flush the TLB when moving a
-	 * PROT_NONE or PROT_NUMA mapped page.
-	 */
-	atomic_t tlb_flush_pending;
+		/*
+		 * An operation with batched TLB flushing is going on. Anything
+		 * that can move process memory needs to flush the TLB when
+		 * moving a PROT_NONE or PROT_NUMA mapped page.
+		 */
+		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/* See flush_tlb_batched_pending() */
-	bool tlb_flush_batched;
+		/* See flush_tlb_batched_pending() */
+		bool tlb_flush_batched;
 #endif
-	struct uprobes_state uprobes_state;
+		struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
-	atomic_long_t hugetlb_usage;
+		atomic_long_t hugetlb_usage;
 #endif
-	struct work_struct async_put_work;
+		struct work_struct async_put_work;
 
 #if IS_ENABLED(CONFIG_HMM)
-	/* HMM needs to track a few things per mm */
-	struct hmm *hmm;
+		/* HMM needs to track a few things per mm */
+		struct hmm *hmm;
 #endif
-} __randomize_layout;
+	} __randomize_layout;
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
+};
 
 extern struct mm_struct init_mm;
 
+/* Pointer magic because the dynamic array size confuses some compilers. */
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	unsigned long cpu_bitmap = (unsigned long)mm;
+
+	cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	cpumask_clear((struct cpumask *)cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on the maximum CPU number this system
+	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-15 23:50     ` Rik van Riel
@ 2018-07-16  1:07       ` Ingo Molnar
  0 siblings, 0 replies; 47+ messages in thread
From: Ingo Molnar @ 2018-07-16  1:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx,
	songliubraving, hpa


* Rik van Riel <riel@surriel.com> wrote:

> On Mon, 2018-07-16 at 00:59 +0200, Ingo Molnar wrote:
> > * Rik van Riel <riel@surriel.com> wrote:
> > 
> > > The mm_struct always contains a cpumask bitmap, regardless of
> > > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> > > simplify things, and simply have one bitmask at the end of the
> > > mm_struct for the mm_cpumask.
> > > 
> > > This does necessitate moving everything else in mm_struct into
> > > an anonymous sub-structure, which can be randomized when struct
> > > randomization is enabled.
> > > 
> > > The second step is to determine the correct size for the
> > > mm_struct slab object from the size of the mm_struct
> > > (excluding the cpu bitmap) and the size of the cpumask.
> > > 
> > > For init_mm we can simply allocate the maximum size this
> > > kernel is compiled for, since we only have one init_mm
> > > in the system, anyway.
> > > 
> > > Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
> > > getting confused by the dynamically sized array.
> > > 
> > > Signed-off-by: Rik van Riel <riel@surriel.com>
> > > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > 
> > Is this an Acked-by in disguise, or did this patch route via Mike?
> 
> Mike found an issue with the patch and sent a
> fix, so I added his S-o-b to this patch as
> well.

Makes sense - I'd suggest such a SoB chain:

  Signed-off-by: Rik van Riel <riel@surriel.com>
  [ Fixed crash. ]
  Signed-off-by: Mike Galbraith <efault@gmx.de>
  Signed-off-by: Rik van Riel <riel@surriel.com>

... it's a bit non-standard but we've used it in similar cases and it makes the 
routing and evolution of the patch pretty clear.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-15 22:59   ` Ingo Molnar
@ 2018-07-15 23:50     ` Rik van Riel
  2018-07-16  1:07       ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-15 23:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx,
	songliubraving, hpa


On Mon, 2018-07-16 at 00:59 +0200, Ingo Molnar wrote:
> * Rik van Riel <riel@surriel.com> wrote:
> 
> > The mm_struct always contains a cpumask bitmap, regardless of
> > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> > simplify things, and simply have one bitmask at the end of the
> > mm_struct for the mm_cpumask.
> > 
> > This does necessitate moving everything else in mm_struct into
> > an anonymous sub-structure, which can be randomized when struct
> > randomization is enabled.
> > 
> > The second step is to determine the correct size for the
> > mm_struct slab object from the size of the mm_struct
> > (excluding the cpu bitmap) and the size of the cpumask.
> > 
> > For init_mm we can simply allocate the maximum size this
> > kernel is compiled for, since we only have one init_mm
> > in the system, anyway.
> > 
> > Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
> > getting confused by the dynamically sized array.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> 
> Is this an Acked-by in disguise, or did this patch route via Mike?

Mike found an issue with the patch and sent a
fix, so I added his S-o-b to this patch as
well.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-07-15 22:59   ` Ingo Molnar
  2018-07-15 23:50     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Ingo Molnar @ 2018-07-15 22:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx,
	songliubraving, hpa


* Rik van Riel <riel@surriel.com> wrote:

> The mm_struct always contains a cpumask bitmap, regardless of
> CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> simplify things, and simply have one bitmask at the end of the
> mm_struct for the mm_cpumask.
> 
> This does necessitate moving everything else in mm_struct into
> an anonymous sub-structure, which can be randomized when struct
> randomization is enabled.
> 
> The second step is to determine the correct size for the
> mm_struct slab object from the size of the mm_struct
> (excluding the cpu bitmap) and the size of the cpumask.
> 
> For init_mm we can simply allocate the maximum size this
> kernel is compiled for, since we only have one init_mm
> in the system, anyway.
> 
> Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
> getting confused by the dynamically sized array.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Mike Galbraith <efault@gmx.de>

Is this an Acked-by in disguise, or did this patch route via Mike?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-07-10 14:28 ` Rik van Riel
  2018-07-15 22:59   ` Ingo Molnar
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx,
	songliubraving, hpa, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Pointer magic by Mike Galbraith, to evade -Wstringop-overflow
getting confused by the dynamically sized array.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 drivers/firmware/efi/efi.c |   1 +
 include/linux/mm_types.h   | 241 +++++++++++++++++++++++----------------------
 kernel/fork.c              |  15 +--
 mm/init-mm.c               |  11 +++
 4 files changed, 145 insertions(+), 123 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 232f4915223b..7f0b19410a95 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
 	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
+	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 };
 
 static bool disable_runtime;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..efdc24dd9e97 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,183 @@ struct core_state {
 
 struct kioctx_table;
 struct mm_struct {
-	struct vm_area_struct *mmap;		/* list of VMAs */
-	struct rb_root mm_rb;
-	u32 vmacache_seqnum;                   /* per-thread vmacache */
+	struct {
+		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct rb_root mm_rb;
+		u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
+		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
 #endif
-	unsigned long mmap_base;		/* base of mmap area */
-	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
+		unsigned long mmap_base;	/* base of mmap area */
+		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-	/* Base adresses for compatible mmap() */
-	unsigned long mmap_compat_base;
-	unsigned long mmap_compat_legacy_base;
+		/* Base adresses for compatible mmap() */
+		unsigned long mmap_compat_base;
+		unsigned long mmap_compat_legacy_base;
 #endif
-	unsigned long task_size;		/* size of task vm space */
-	unsigned long highest_vm_end;		/* highest vma end address */
-	pgd_t * pgd;
-
-	/**
-	 * @mm_users: The number of users including userspace.
-	 *
-	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
-	 * to 0 (i.e. when the task exits and there are no other temporary
-	 * reference holders), we also release a reference on @mm_count
-	 * (which may then free the &struct mm_struct if @mm_count also
-	 * drops to 0).
-	 */
-	atomic_t mm_users;
-
-	/**
-	 * @mm_count: The number of references to &struct mm_struct
-	 * (@mm_users count as 1).
-	 *
-	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-	 * &struct mm_struct is freed.
-	 */
-	atomic_t mm_count;
+		unsigned long task_size;	/* size of task vm space */
+		unsigned long highest_vm_end;	/* highest vma end address */
+		pgd_t * pgd;
+
+		/**
+		 * @mm_users: The number of users including userspace.
+		 *
+		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+		 * drops to 0 (i.e. when the task exits and there are no other
+		 * temporary reference holders), we also release a reference on
+		 * @mm_count (which may then free the &struct mm_struct if
+		 * @mm_count also drops to 0).
+		 */
+		atomic_t mm_users;
+
+		/**
+		 * @mm_count: The number of references to &struct mm_struct
+		 * (@mm_users count as 1).
+		 *
+		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+		 * &struct mm_struct is freed.
+		 */
+		atomic_t mm_count;
 
 #ifdef CONFIG_MMU
-	atomic_long_t pgtables_bytes;		/* PTE page table pages */
+		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
-	int map_count;				/* number of VMAs */
+		int map_count;			/* number of VMAs */
 
-	spinlock_t page_table_lock;		/* Protects page tables and some counters */
-	struct rw_semaphore mmap_sem;
+		spinlock_t page_table_lock; /* Protects page tables and some
+					     * counters
+					     */
+		struct rw_semaphore mmap_sem;
 
-	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
-						 * together off init_mm.mmlist, and are protected
-						 * by mmlist_lock
-						 */
+		struct list_head mmlist; /* List of maybe swapped mm's.	These
+					  * are globally strung together off
+					  * init_mm.mmlist, and are protected
+					  * by mmlist_lock
+					  */
 
 
-	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
-	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+		unsigned long hiwater_rss; /* High-watermark of RSS usage */
+		unsigned long hiwater_vm;  /* High-water virtual memory usage */
 
-	unsigned long total_vm;		/* Total pages mapped */
-	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */
-	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
-	unsigned long stack_vm;		/* VM_STACK */
-	unsigned long def_flags;
+		unsigned long total_vm;	   /* Total pages mapped */
+		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
+		unsigned long pinned_vm;   /* Refcount permanently increased */
+		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+		unsigned long stack_vm;	   /* VM_STACK */
+		unsigned long def_flags;
 
-	spinlock_t arg_lock; /* protect the below fields */
-	unsigned long start_code, end_code, start_data, end_data;
-	unsigned long start_brk, brk, start_stack;
-	unsigned long arg_start, arg_end, env_start, env_end;
+		spinlock_t arg_lock; /* protect the below fields */
+		unsigned long start_code, end_code, start_data, end_data;
+		unsigned long start_brk, brk, start_stack;
+		unsigned long arg_start, arg_end, env_start, env_end;
 
-	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	/*
-	 * Special counters, in some configurations protected by the
-	 * page_table_lock, in other configurations by being atomic.
-	 */
-	struct mm_rss_stat rss_stat;
-
-	struct linux_binfmt *binfmt;
+		/*
+		 * Special counters, in some configurations protected by the
+		 * page_table_lock, in other configurations by being atomic.
+		 */
+		struct mm_rss_stat rss_stat;
 
-	cpumask_var_t cpu_vm_mask_var;
+		struct linux_binfmt *binfmt;
 
-	/* Architecture-specific MM context */
-	mm_context_t context;
+		/* Architecture-specific MM context */
+		mm_context_t context;
 
-	unsigned long flags; /* Must use atomic bitops to access the bits */
+		unsigned long flags; /* Must use atomic bitops to access */
 
-	struct core_state *core_state; /* coredumping support */
+		struct core_state *core_state; /* coredumping support */
 #ifdef CONFIG_MEMBARRIER
-	atomic_t membarrier_state;
+		atomic_t membarrier_state;
 #endif
 #ifdef CONFIG_AIO
-	spinlock_t			ioctx_lock;
-	struct kioctx_table __rcu	*ioctx_table;
+		spinlock_t			ioctx_lock;
+		struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+		/*
+		 * "owner" points to a task that is regarded as the canonical
+		 * user/owner of this mm. All of the following must be true in
+		 * order for it to be changed:
+		 *
+		 * current == mm->owner
+		 * current->mm != mm
+		 * new_owner->mm == mm
+		 * new_owner->alloc_lock is held
+		 */
+		struct task_struct __rcu *owner;
 #endif
-	struct user_namespace *user_ns;
+		struct user_namespace *user_ns;
 
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file __rcu *exe_file;
+		/* store ref to file /proc/<pid>/exe symlink points to */
+		struct file __rcu *exe_file;
 #ifdef CONFIG_MMU_NOTIFIER
-	struct mmu_notifier_mm *mmu_notifier_mm;
+		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
+		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * numa_next_scan is the next time that the PTEs will be marked
-	 * pte_numa. NUMA hinting faults will gather statistics and migrate
-	 * pages to new nodes if necessary.
-	 */
-	unsigned long numa_next_scan;
+		/*
+		 * numa_next_scan is the next time that the PTEs will be marked
+		 * pte_numa. NUMA hinting faults will gather statistics and
+		 * migrate pages to new nodes if necessary.
+		 */
+		unsigned long numa_next_scan;
 
-	/* Restart point for scanning and setting pte_numa */
-	unsigned long numa_scan_offset;
+		/* Restart point for scanning and setting pte_numa */
+		unsigned long numa_scan_offset;
 
-	/* numa_scan_seq prevents two threads setting pte_numa */
-	int numa_scan_seq;
+		/* numa_scan_seq prevents two threads setting pte_numa */
+		int numa_scan_seq;
 #endif
-	/*
-	 * An operation with batched TLB flushing is going on. Anything that
-	 * can move process memory needs to flush the TLB when moving a
-	 * PROT_NONE or PROT_NUMA mapped page.
-	 */
-	atomic_t tlb_flush_pending;
+		/*
+		 * An operation with batched TLB flushing is going on. Anything
+		 * that can move process memory needs to flush the TLB when
+		 * moving a PROT_NONE or PROT_NUMA mapped page.
+		 */
+		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/* See flush_tlb_batched_pending() */
-	bool tlb_flush_batched;
+		/* See flush_tlb_batched_pending() */
+		bool tlb_flush_batched;
 #endif
-	struct uprobes_state uprobes_state;
+		struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
-	atomic_long_t hugetlb_usage;
+		atomic_long_t hugetlb_usage;
 #endif
-	struct work_struct async_put_work;
+		struct work_struct async_put_work;
 
 #if IS_ENABLED(CONFIG_HMM)
-	/* HMM needs to track a few things per mm */
-	struct hmm *hmm;
+		/* HMM needs to track a few things per mm */
+		struct hmm *hmm;
 #endif
-} __randomize_layout;
+	} __randomize_layout;
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
+};
 
 extern struct mm_struct init_mm;
 
+/* Pointer magic because the dynamic array size confuses some compilers. */
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	unsigned long cpu_bitmap = (unsigned long)mm;
+
+	cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	cpumask_clear((struct cpumask *)cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on the maximum CPU number this system
+	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-09 21:38         ` Rik van Riel
@ 2018-07-10  3:28           ` Mike Galbraith
  0 siblings, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2018-07-10  3:28 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa

On Mon, 2018-07-09 at 17:38 -0400, Rik van Riel wrote:
> 
> I added your code, and Signed-off-By in patch
> 1 for version 5 of the series.

No objection, but no need (like taking credit for fixing a typo:).

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-08 14:13       ` Mike Galbraith
  2018-07-08 14:44         ` Mike Galbraith
@ 2018-07-09 21:38         ` Rik van Riel
  2018-07-10  3:28           ` Mike Galbraith
  1 sibling, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-09 21:38 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa


On Sun, 2018-07-08 at 16:13 +0200, Mike Galbraith wrote:
> On Sat, 2018-07-07 at 17:25 -0400, Rik van Riel wrote:
> > 
> > > ./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64
> > > bytes
> > > into a region of size 0 overflows the destination [-Wstringop-
> > > overflow=]
> > >    memset(dst, 0, len);
> > >    ^~~~~~~~~~~~~~~~~~~
> > 
> > I don't understand this one.
> > 
> > Inside init_mm we have this line:
> >         .cpu_bitmap     = { [BITS_TO_LONGS(NR_CPUS)] = 0},
> > 
> > which is the way the documentation suggests statically
> > allocated variable size arrays should be allocated 
> > and initialized.
> > 
> > How does that result in a memset of the same size,
> > on the same array, to throw an error like above?
> 
> Compiler knows that ->cpu_bitmap is 64 bits of storage, and with
> !CPUMASK_OFFSTACK, nr_cpumask_bits = NR_CPUS.  With NR_CPUS > 64,
> compiler gripes, with NR_CPUS <= 64 it's a happy camper.
> 
> > What am I doing wrong?
> 
> Below is what I did to get box to both STHU, and to boot with the
> openSUSE master branch config I sent.  Without the efi_mm hunk, boot
> hangs early with or without the other hunk.
> 
> I build and boot tested the openSUSE config, a NOPREEMPT+MAXSMP
> config,
> my local config w. NR_CPUS=8, and master-rt w. NR_CPUS=256, which is
> the only one that got any real exercise (building the others).
> 

Thank you for tracking that down.

I added your code, and Signed-off-By in patch
1 for version 5 of the series.

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-08 14:13       ` Mike Galbraith
@ 2018-07-08 14:44         ` Mike Galbraith
  2018-07-09 21:38         ` Rik van Riel
  1 sibling, 0 replies; 47+ messages in thread
From: Mike Galbraith @ 2018-07-08 14:44 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa

BTW, a second gripe ala the first, but wrt mm_init_cpumask(&efi_mm):

In function ‘bitmap_zero’,
    inlined from ‘cpumask_clear’ at ./include/linux/cpumask.h:378:2,
    inlined from ‘mm_init_cpumask’ at ./include/linux/mm_types.h:504:2,
    inlined from ‘efi_alloc_page_tables’ at arch/x86/platform/efi/efi_64.c:235:2:
./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
   memset(dst, 0, len);
   ^~~~~~~~~~~~~~~~~~~


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-07 21:25     ` Rik van Riel
@ 2018-07-08 14:13       ` Mike Galbraith
  2018-07-08 14:44         ` Mike Galbraith
  2018-07-09 21:38         ` Rik van Riel
  0 siblings, 2 replies; 47+ messages in thread
From: Mike Galbraith @ 2018-07-08 14:13 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa

On Sat, 2018-07-07 at 17:25 -0400, Rik van Riel wrote:
> 
> > ./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes
> > into a region of size 0 overflows the destination [-Wstringop-
> > overflow=]
> >    memset(dst, 0, len);
> >    ^~~~~~~~~~~~~~~~~~~
> 
> I don't understand this one.
> 
> Inside init_mm we have this line:
>         .cpu_bitmap     = { [BITS_TO_LONGS(NR_CPUS)] = 0},
> 
> which is the way the documentation suggests statically
> allocated variable size arrays should be allocated 
> and initialized.
> 
> How does that result in a memset of the same size,
> on the same array, to throw an error like above?

Compiler knows that ->cpu_bitmap is 64 bits of storage, and with
!CPUMASK_OFFSTACK, nr_cpumask_bits = NR_CPUS.  With NR_CPUS > 64,
compiler gripes, with NR_CPUS <= 64 it's a happy camper.
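
A bare-bones stand-alone sketch of that situation (made-up names, not
kernel code; whether gcc actually emits the warning depends on version,
options and inlining, but the shape matches cpumask_clear() inlined
against init_mm):

#include <stddef.h>
#include <string.h>

#define DEMO_NR_CPUS	256	/* > 64, like a distro MAXSMP-ish config */

struct demo_mm {
	long placeholder;		/* stands in for the other mm fields */
	unsigned long cpu_bitmap[];	/* flexible array member, "size 0" to gcc */
};

/* statically sized instance, analogous to init_mm */
static struct demo_mm demo_init_mm = {
	.cpu_bitmap = { [DEMO_NR_CPUS / (8 * sizeof(long))] = 0 },
};

static void demo_clear(struct demo_mm *mm)
{
	/*
	 * Mirrors cpumask_clear((struct cpumask *)&mm->cpu_bitmap):
	 * writes DEMO_NR_CPUS/8 bytes through a pointer to a member
	 * whose declared size is zero, which is what the warning
	 * complains about.
	 */
	memset(&mm->cpu_bitmap, 0, DEMO_NR_CPUS / 8);
}

static void demo_clear_laundered(struct demo_mm *mm)
{
	/*
	 * The workaround in the hunk below: go through an integer so
	 * the compiler no longer ties the pointer to the zero-sized
	 * member.
	 */
	unsigned long addr = (unsigned long)mm;

	addr += offsetof(struct demo_mm, cpu_bitmap);
	memset((void *)addr, 0, DEMO_NR_CPUS / 8);
}

int main(void)
{
	demo_clear(&demo_init_mm);
	demo_clear_laundered(&demo_init_mm);
	return 0;
}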

> What am I doing wrong?

Below is what I did to get box to both STHU, and to boot with the
openSUSE master branch config I sent.  Without the efi_mm hunk, boot
hangs early with or without the other hunk.

I build and boot tested the openSUSE config, a NOPREEMPT+MAXSMP config,
my local config w. NR_CPUS=8, and master-rt w. NR_CPUS=256, which is
the only one that got any real exercise (building the others).

---
 drivers/firmware/efi/efi.c |    1 +
 include/linux/mm_types.h   |    5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
 	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
+	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 };
 
 static bool disable_runtime;
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -501,7 +501,10 @@ extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-	cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
+	unsigned long cpu_bitmap = (unsigned long)mm;
+
+	cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+	cpumask_clear((struct cpumask *)cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-07  8:23   ` Mike Galbraith
@ 2018-07-07 21:25     ` Rik van Riel
  2018-07-08 14:13       ` Mike Galbraith
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-07 21:25 UTC (permalink / raw)
  To: Mike Galbraith, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa


On Sat, 2018-07-07 at 10:23 +0200, Mike Galbraith wrote:
> On Fri, 2018-07-06 at 17:56 -0400, Rik van Riel wrote:
> > The mm_struct always contains a cpumask bitmap, regardless of
> > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> > simplify things, and simply have one bitmask at the end of the
> > mm_struct for the mm_cpumask.
> 
> Otherwise virgin master.today grumbles.
> 
>   CC      kernel/bounds.s
>   UPD     include/generated/timeconst.h
>   UPD     include/generated/bounds.h
>   CC      arch/x86/kernel/asm-offsets.s
>   UPD     include/generated/asm-offsets.h
>   CALL    scripts/checksyscalls.sh
>   CHK     include/generated/compile.h
>   HOSTCC  usr/gen_init_cpio
>   UPD     include/generated/compile.h
>   CC      init/main.o
> In file included from ./include/linux/cpumask.h:12:0,
>                  from ./arch/x86/include/asm/cpumask.h:5,
>                  from ./arch/x86/include/asm/msr.h:11,
>                  from ./arch/x86/include/asm/processor.h:21,
>                  from ./arch/x86/include/asm/cpufeature.h:5,
>                  from ./arch/x86/include/asm/thread_info.h:53,
>                  from ./include/linux/thread_info.h:38,
>                  from ./arch/x86/include/asm/preempt.h:7,
>                  from ./include/linux/preempt.h:81,
>                  from ./include/linux/spinlock.h:51,
>                  from ./include/linux/seqlock.h:36,
>                  from ./include/linux/time.h:6,
>                  from ./include/linux/stat.h:19,
>                  from ./include/linux/module.h:10,
>                  from init/main.c:16:
> In function ‘bitmap_zero’,
>     inlined from ‘cpumask_clear’ at ./include/linux/cpumask.h:378:2,
>     inlined from ‘mm_init_cpumask’ at
> ./include/linux/mm_types.h:504:2,
>     inlined from ‘start_kernel’ at init/main.c:560:2:
> ./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes
> into a region of size 0 overflows the destination [-Wstringop-
> overflow=]
>    memset(dst, 0, len);
>    ^~~~~~~~~~~~~~~~~~~

I don't understand this one.

Inside init_mm we have this line:
        .cpu_bitmap     = { [BITS_TO_LONGS(NR_CPUS)] = 0},

which is the way the documentation suggests statically
allocated variable size arrays should be allocated 
and initialized.
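
For reference, a bare-bones version of that convention (demo names,
relying on gcc's extension that lets the flexible array member of an
object with static storage duration get its storage from the
initializer):

#define DEMO_NR_CPUS		256
#define DEMO_BITS_TO_LONGS(n)	(((n) + 8 * sizeof(long) - 1) / (8 * sizeof(long)))

struct demo_mm {
	long placeholder;		/* the rest of the struct */
	unsigned long cpu_bitmap[];	/* flexible array member */
};

/* the designated index makes gcc reserve real storage for the bitmap */
static struct demo_mm demo_init_mm = {
	.cpu_bitmap = { [DEMO_BITS_TO_LONGS(DEMO_NR_CPUS)] = 0 },
};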

How does that result in a memset of the same size,
on the same array, to throw an error like above?

What am I doing wrong?

-- 
All Rights Reversed.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-06 21:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-07-07  8:23   ` Mike Galbraith
  2018-07-07 21:25     ` Rik van Riel
  0 siblings, 1 reply; 47+ messages in thread
From: Mike Galbraith @ 2018-07-07  8:23 UTC (permalink / raw)
  To: Rik van Riel, linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, songliubraving, hpa


On Fri, 2018-07-06 at 17:56 -0400, Rik van Riel wrote:
> The mm_struct always contains a cpumask bitmap, regardless of
> CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> simplify things, and simply have one bitmask at the end of the
> mm_struct for the mm_cpumask.

Otherwise virgin master.today grumbles.

  CC      kernel/bounds.s
  UPD     include/generated/timeconst.h
  UPD     include/generated/bounds.h
  CC      arch/x86/kernel/asm-offsets.s
  UPD     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  HOSTCC  usr/gen_init_cpio
  UPD     include/generated/compile.h
  CC      init/main.o
In file included from ./include/linux/cpumask.h:12:0,
                 from ./arch/x86/include/asm/cpumask.h:5,
                 from ./arch/x86/include/asm/msr.h:11,
                 from ./arch/x86/include/asm/processor.h:21,
                 from ./arch/x86/include/asm/cpufeature.h:5,
                 from ./arch/x86/include/asm/thread_info.h:53,
                 from ./include/linux/thread_info.h:38,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:81,
                 from ./include/linux/spinlock.h:51,
                 from ./include/linux/seqlock.h:36,
                 from ./include/linux/time.h:6,
                 from ./include/linux/stat.h:19,
                 from ./include/linux/module.h:10,
                 from init/main.c:16:
In function ‘bitmap_zero’,
    inlined from ‘cpumask_clear’ at ./include/linux/cpumask.h:378:2,
    inlined from ‘mm_init_cpumask’ at ./include/linux/mm_types.h:504:2,
    inlined from ‘start_kernel’ at init/main.c:560:2:
./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
   memset(dst, 0, len);
   ^~~~~~~~~~~~~~~~~~~

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 50558 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-07-06 21:56 [PATCH v4 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-07-06 21:56 ` Rik van Riel
  2018-07-07  8:23   ` Mike Galbraith
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-07-06 21:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, hpa, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.

Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm_types.h | 237 ++++++++++++++++++++++++-----------------------
 kernel/fork.c            |  15 +--
 mm/init-mm.c             |  11 +++
 3 files changed, 140 insertions(+), 123 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..e06de7e492d0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,179 @@ struct core_state {
 
 struct kioctx_table;
 struct mm_struct {
-	struct vm_area_struct *mmap;		/* list of VMAs */
-	struct rb_root mm_rb;
-	u32 vmacache_seqnum;                   /* per-thread vmacache */
+	struct {
+		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct rb_root mm_rb;
+		u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
+		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
 #endif
-	unsigned long mmap_base;		/* base of mmap area */
-	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
+		unsigned long mmap_base;	/* base of mmap area */
+		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-	/* Base adresses for compatible mmap() */
-	unsigned long mmap_compat_base;
-	unsigned long mmap_compat_legacy_base;
+		/* Base adresses for compatible mmap() */
+		unsigned long mmap_compat_base;
+		unsigned long mmap_compat_legacy_base;
 #endif
-	unsigned long task_size;		/* size of task vm space */
-	unsigned long highest_vm_end;		/* highest vma end address */
-	pgd_t * pgd;
-
-	/**
-	 * @mm_users: The number of users including userspace.
-	 *
-	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
-	 * to 0 (i.e. when the task exits and there are no other temporary
-	 * reference holders), we also release a reference on @mm_count
-	 * (which may then free the &struct mm_struct if @mm_count also
-	 * drops to 0).
-	 */
-	atomic_t mm_users;
-
-	/**
-	 * @mm_count: The number of references to &struct mm_struct
-	 * (@mm_users count as 1).
-	 *
-	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-	 * &struct mm_struct is freed.
-	 */
-	atomic_t mm_count;
+		unsigned long task_size;	/* size of task vm space */
+		unsigned long highest_vm_end;	/* highest vma end address */
+		pgd_t * pgd;
+
+		/**
+		 * @mm_users: The number of users including userspace.
+		 *
+		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+		 * drops to 0 (i.e. when the task exits and there are no other
+		 * temporary reference holders), we also release a reference on
+		 * @mm_count (which may then free the &struct mm_struct if
+		 * @mm_count also drops to 0).
+		 */
+		atomic_t mm_users;
+
+		/**
+		 * @mm_count: The number of references to &struct mm_struct
+		 * (@mm_users count as 1).
+		 *
+		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+		 * &struct mm_struct is freed.
+		 */
+		atomic_t mm_count;
 
 #ifdef CONFIG_MMU
-	atomic_long_t pgtables_bytes;		/* PTE page table pages */
+		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
-	int map_count;				/* number of VMAs */
+		int map_count;			/* number of VMAs */
 
-	spinlock_t page_table_lock;		/* Protects page tables and some counters */
-	struct rw_semaphore mmap_sem;
+		spinlock_t page_table_lock; /* Protects page tables and some
+					     * counters
+					     */
+		struct rw_semaphore mmap_sem;
 
-	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
-						 * together off init_mm.mmlist, and are protected
-						 * by mmlist_lock
-						 */
+		struct list_head mmlist; /* List of maybe swapped mm's.	These
+					  * are globally strung together off
+					  * init_mm.mmlist, and are protected
+					  * by mmlist_lock
+					  */
 
 
-	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
-	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+		unsigned long hiwater_rss; /* High-watermark of RSS usage */
+		unsigned long hiwater_vm;  /* High-water virtual memory usage */
 
-	unsigned long total_vm;		/* Total pages mapped */
-	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */
-	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
-	unsigned long stack_vm;		/* VM_STACK */
-	unsigned long def_flags;
+		unsigned long total_vm;	   /* Total pages mapped */
+		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
+		unsigned long pinned_vm;   /* Refcount permanently increased */
+		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+		unsigned long stack_vm;	   /* VM_STACK */
+		unsigned long def_flags;
 
-	spinlock_t arg_lock; /* protect the below fields */
-	unsigned long start_code, end_code, start_data, end_data;
-	unsigned long start_brk, brk, start_stack;
-	unsigned long arg_start, arg_end, env_start, env_end;
+		spinlock_t arg_lock; /* protect the below fields */
+		unsigned long start_code, end_code, start_data, end_data;
+		unsigned long start_brk, brk, start_stack;
+		unsigned long arg_start, arg_end, env_start, env_end;
 
-	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	/*
-	 * Special counters, in some configurations protected by the
-	 * page_table_lock, in other configurations by being atomic.
-	 */
-	struct mm_rss_stat rss_stat;
-
-	struct linux_binfmt *binfmt;
+		/*
+		 * Special counters, in some configurations protected by the
+		 * page_table_lock, in other configurations by being atomic.
+		 */
+		struct mm_rss_stat rss_stat;
 
-	cpumask_var_t cpu_vm_mask_var;
+		struct linux_binfmt *binfmt;
 
-	/* Architecture-specific MM context */
-	mm_context_t context;
+		/* Architecture-specific MM context */
+		mm_context_t context;
 
-	unsigned long flags; /* Must use atomic bitops to access the bits */
+		unsigned long flags; /* Must use atomic bitops to access */
 
-	struct core_state *core_state; /* coredumping support */
+		struct core_state *core_state; /* coredumping support */
 #ifdef CONFIG_MEMBARRIER
-	atomic_t membarrier_state;
+		atomic_t membarrier_state;
 #endif
 #ifdef CONFIG_AIO
-	spinlock_t			ioctx_lock;
-	struct kioctx_table __rcu	*ioctx_table;
+		spinlock_t			ioctx_lock;
+		struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+		/*
+		 * "owner" points to a task that is regarded as the canonical
+		 * user/owner of this mm. All of the following must be true in
+		 * order for it to be changed:
+		 *
+		 * current == mm->owner
+		 * current->mm != mm
+		 * new_owner->mm == mm
+		 * new_owner->alloc_lock is held
+		 */
+		struct task_struct __rcu *owner;
 #endif
-	struct user_namespace *user_ns;
+		struct user_namespace *user_ns;
 
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file __rcu *exe_file;
+		/* store ref to file /proc/<pid>/exe symlink points to */
+		struct file __rcu *exe_file;
 #ifdef CONFIG_MMU_NOTIFIER
-	struct mmu_notifier_mm *mmu_notifier_mm;
+		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
+		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * numa_next_scan is the next time that the PTEs will be marked
-	 * pte_numa. NUMA hinting faults will gather statistics and migrate
-	 * pages to new nodes if necessary.
-	 */
-	unsigned long numa_next_scan;
+		/*
+		 * numa_next_scan is the next time that the PTEs will be marked
+		 * pte_numa. NUMA hinting faults will gather statistics and
+		 * migrate pages to new nodes if necessary.
+		 */
+		unsigned long numa_next_scan;
 
-	/* Restart point for scanning and setting pte_numa */
-	unsigned long numa_scan_offset;
+		/* Restart point for scanning and setting pte_numa */
+		unsigned long numa_scan_offset;
 
-	/* numa_scan_seq prevents two threads setting pte_numa */
-	int numa_scan_seq;
+		/* numa_scan_seq prevents two threads setting pte_numa */
+		int numa_scan_seq;
 #endif
-	/*
-	 * An operation with batched TLB flushing is going on. Anything that
-	 * can move process memory needs to flush the TLB when moving a
-	 * PROT_NONE or PROT_NUMA mapped page.
-	 */
-	atomic_t tlb_flush_pending;
+		/*
+		 * An operation with batched TLB flushing is going on. Anything
+		 * that can move process memory needs to flush the TLB when
+		 * moving a PROT_NONE or PROT_NUMA mapped page.
+		 */
+		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/* See flush_tlb_batched_pending() */
-	bool tlb_flush_batched;
+		/* See flush_tlb_batched_pending() */
+		bool tlb_flush_batched;
 #endif
-	struct uprobes_state uprobes_state;
+		struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
-	atomic_long_t hugetlb_usage;
+		atomic_long_t hugetlb_usage;
 #endif
-	struct work_struct async_put_work;
+		struct work_struct async_put_work;
 
 #if IS_ENABLED(CONFIG_HMM)
-	/* HMM needs to track a few things per mm */
-	struct hmm *hmm;
+		/* HMM needs to track a few things per mm */
+		struct hmm *hmm;
 #endif
-} __randomize_layout;
+	} __randomize_layout;
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
+};
 
 extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on the maximum CPU number this system
+	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-29 14:29 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
@ 2018-06-30  4:30   ` kbuild test robot
  0 siblings, 0 replies; 47+ messages in thread
From: kbuild test robot @ 2018-06-30  4:30 UTC (permalink / raw)
  To: Rik van Riel
  Cc: kbuild-all, linux-kernel, x86, luto, dave.hansen, mingo,
	kernel-team, tglx, efault, songliubraving, hpa, Rik van Riel


Hi Rik,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18-rc2 next-20180629]
[cannot apply to tip/x86/core]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Rik-van-Riel/x86-tlb-mm-make-lazy-TLB-mode-even-lazier/20180629-232822
config: s390-debug_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=s390 

All errors (new ones prefixed by >>):

   In file included from arch/s390/include/asm/fpu/internal.h:12:0,
                    from arch/s390/include/asm/processor.h:47,
                    from arch/s390/include/asm/thread_info.h:25,
                    from include/linux/thread_info.h:38,
                    from arch/s390/include/asm/preempt.h:6,
                    from include/linux/preempt.h:81,
                    from include/linux/spinlock.h:51,
                    from include/linux/seqlock.h:36,
                    from include/linux/time.h:6,
                    from include/linux/stat.h:19,
                    from include/linux/module.h:10,
                    from init/main.c:16:
   In function 'memset',
       inlined from 'start_kernel' at include/linux/bitmap.h:208:3:
>> include/linux/string.h:327:3: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object passed as 1st parameter
      __write_overflow();
      ^~~~~~~~~~~~~~~~~~

vim +/__write_overflow +327 include/linux/string.h

6974f0c4 Daniel Micay 2017-07-12  322  
6974f0c4 Daniel Micay 2017-07-12  323  __FORTIFY_INLINE void *memset(void *p, int c, __kernel_size_t size)
6974f0c4 Daniel Micay 2017-07-12  324  {
6974f0c4 Daniel Micay 2017-07-12  325  	size_t p_size = __builtin_object_size(p, 0);
6974f0c4 Daniel Micay 2017-07-12  326  	if (__builtin_constant_p(size) && p_size < size)
6974f0c4 Daniel Micay 2017-07-12 @327  		__write_overflow();
6974f0c4 Daniel Micay 2017-07-12  328  	if (p_size < size)
6974f0c4 Daniel Micay 2017-07-12  329  		fortify_panic(__func__);
6974f0c4 Daniel Micay 2017-07-12  330  	return __builtin_memset(p, c, size);
6974f0c4 Daniel Micay 2017-07-12  331  }
6974f0c4 Daniel Micay 2017-07-12  332  

:::::: The code at line 327 was first introduced by commit
:::::: 6974f0c4555e285ab217cee58b6e874f776ff409 include/linux/string.h: add the option of fortified string.h functions

:::::: TO: Daniel Micay <danielmicay@gmail.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 19056 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids
  2018-06-29 14:29 [PATCH v3 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
@ 2018-06-29 14:29 ` Rik van Riel
  2018-06-30  4:30   ` kbuild test robot
  0 siblings, 1 reply; 47+ messages in thread
From: Rik van Riel @ 2018-06-29 14:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault,
	songliubraving, hpa, Rik van Riel

The mm_struct always contains a cpumask bitmap, regardless of
CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
simplify things, and simply have one bitmask at the end of the
mm_struct for the mm_cpumask.

This does necessitate moving everything else in mm_struct into
an anonymous sub-structure, which can be randomized when struct
randomization is enabled.

The second step is to determine the correct size for the
mm_struct slab object from the size of the mm_struct
(excluding the cpu bitmap) and the size of the cpumask.

For init_mm we can simply allocate the maximum size this
kernel is compiled for, since we only have one init_mm
in the system, anyway.
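
A minimal user-space sketch of that sizing scheme (illustrative only;
fake_mm, alloc_fake_mm and the 64-bit-long assumption are inventions of
this sketch, not kernel interfaces): the structure ends in a flexible
array member, and each allocation adds only as many mask bytes as the
number of possible CPUs actually requires.

  #include <stdlib.h>
  #include <string.h>

  struct fake_mm {
  	long other_fields;		/* everything else in mm_struct */
  	unsigned long cpu_bitmap[];	/* dynamically sized, must stay last */
  };

  /* Rough equivalent of sizeof(struct mm_struct) + cpumask_size(),
   * assuming 64-bit longs. */
  static size_t fake_mm_size(unsigned int nr_cpu_ids)
  {
  	return sizeof(struct fake_mm) +
  	       ((nr_cpu_ids + 63) / 64) * sizeof(unsigned long);
  }

  static struct fake_mm *alloc_fake_mm(unsigned int nr_cpu_ids)
  {
  	struct fake_mm *mm = malloc(fake_mm_size(nr_cpu_ids));

  	if (mm)		/* what mm_init_cpumask() does for the real mm */
  		memset(mm->cpu_bitmap, 0,
  		       fake_mm_size(nr_cpu_ids) - sizeof(struct fake_mm));
  	return mm;
  }

With nr_cpu_ids at 48, for example, that is a single extra long per
mm_struct instead of a mask sized for the kernel's compile-time maximum.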

Signed-off-by: Rik van Riel <riel@surriel.com>
Tested-by: Song Liu <songliubraving@fb.com>
---
 include/linux/mm_types.h | 237 ++++++++++++++++++++++++-----------------------
 kernel/fork.c            |  15 +--
 mm/init-mm.c             |  11 +++
 3 files changed, 140 insertions(+), 123 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..e06de7e492d0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,176 +335,179 @@ struct core_state {
 
 struct kioctx_table;
 struct mm_struct {
-	struct vm_area_struct *mmap;		/* list of VMAs */
-	struct rb_root mm_rb;
-	u32 vmacache_seqnum;                   /* per-thread vmacache */
+	struct {
+		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct rb_root mm_rb;
+		u32 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
-	unsigned long (*get_unmapped_area) (struct file *filp,
+		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
 				unsigned long pgoff, unsigned long flags);
 #endif
-	unsigned long mmap_base;		/* base of mmap area */
-	unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
+		unsigned long mmap_base;	/* base of mmap area */
+		unsigned long mmap_legacy_base;	/* base of mmap area in bottom-up allocations */
 #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
-	/* Base adresses for compatible mmap() */
-	unsigned long mmap_compat_base;
-	unsigned long mmap_compat_legacy_base;
+		/* Base adresses for compatible mmap() */
+		unsigned long mmap_compat_base;
+		unsigned long mmap_compat_legacy_base;
 #endif
-	unsigned long task_size;		/* size of task vm space */
-	unsigned long highest_vm_end;		/* highest vma end address */
-	pgd_t * pgd;
-
-	/**
-	 * @mm_users: The number of users including userspace.
-	 *
-	 * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops
-	 * to 0 (i.e. when the task exits and there are no other temporary
-	 * reference holders), we also release a reference on @mm_count
-	 * (which may then free the &struct mm_struct if @mm_count also
-	 * drops to 0).
-	 */
-	atomic_t mm_users;
-
-	/**
-	 * @mm_count: The number of references to &struct mm_struct
-	 * (@mm_users count as 1).
-	 *
-	 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
-	 * &struct mm_struct is freed.
-	 */
-	atomic_t mm_count;
+		unsigned long task_size;	/* size of task vm space */
+		unsigned long highest_vm_end;	/* highest vma end address */
+		pgd_t * pgd;
+
+		/**
+		 * @mm_users: The number of users including userspace.
+		 *
+		 * Use mmget()/mmget_not_zero()/mmput() to modify. When this
+		 * drops to 0 (i.e. when the task exits and there are no other
+		 * temporary reference holders), we also release a reference on
+		 * @mm_count (which may then free the &struct mm_struct if
+		 * @mm_count also drops to 0).
+		 */
+		atomic_t mm_users;
+
+		/**
+		 * @mm_count: The number of references to &struct mm_struct
+		 * (@mm_users count as 1).
+		 *
+		 * Use mmgrab()/mmdrop() to modify. When this drops to 0, the
+		 * &struct mm_struct is freed.
+		 */
+		atomic_t mm_count;
 
 #ifdef CONFIG_MMU
-	atomic_long_t pgtables_bytes;		/* PTE page table pages */
+		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
-	int map_count;				/* number of VMAs */
+		int map_count;			/* number of VMAs */
 
-	spinlock_t page_table_lock;		/* Protects page tables and some counters */
-	struct rw_semaphore mmap_sem;
+		spinlock_t page_table_lock; /* Protects page tables and some
+					     * counters
+					     */
+		struct rw_semaphore mmap_sem;
 
-	struct list_head mmlist;		/* List of maybe swapped mm's.	These are globally strung
-						 * together off init_mm.mmlist, and are protected
-						 * by mmlist_lock
-						 */
+		struct list_head mmlist; /* List of maybe swapped mm's.	These
+					  * are globally strung together off
+					  * init_mm.mmlist, and are protected
+					  * by mmlist_lock
+					  */
 
 
-	unsigned long hiwater_rss;	/* High-watermark of RSS usage */
-	unsigned long hiwater_vm;	/* High-water virtual memory usage */
+		unsigned long hiwater_rss; /* High-watermark of RSS usage */
+		unsigned long hiwater_vm;  /* High-water virtual memory usage */
 
-	unsigned long total_vm;		/* Total pages mapped */
-	unsigned long locked_vm;	/* Pages that have PG_mlocked set */
-	unsigned long pinned_vm;	/* Refcount permanently increased */
-	unsigned long data_vm;		/* VM_WRITE & ~VM_SHARED & ~VM_STACK */
-	unsigned long exec_vm;		/* VM_EXEC & ~VM_WRITE & ~VM_STACK */
-	unsigned long stack_vm;		/* VM_STACK */
-	unsigned long def_flags;
+		unsigned long total_vm;	   /* Total pages mapped */
+		unsigned long locked_vm;   /* Pages that have PG_mlocked set */
+		unsigned long pinned_vm;   /* Refcount permanently increased */
+		unsigned long data_vm;	   /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
+		unsigned long exec_vm;	   /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
+		unsigned long stack_vm;	   /* VM_STACK */
+		unsigned long def_flags;
 
-	spinlock_t arg_lock; /* protect the below fields */
-	unsigned long start_code, end_code, start_data, end_data;
-	unsigned long start_brk, brk, start_stack;
-	unsigned long arg_start, arg_end, env_start, env_end;
+		spinlock_t arg_lock; /* protect the below fields */
+		unsigned long start_code, end_code, start_data, end_data;
+		unsigned long start_brk, brk, start_stack;
+		unsigned long arg_start, arg_end, env_start, env_end;
 
-	unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
+		unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
 
-	/*
-	 * Special counters, in some configurations protected by the
-	 * page_table_lock, in other configurations by being atomic.
-	 */
-	struct mm_rss_stat rss_stat;
-
-	struct linux_binfmt *binfmt;
+		/*
+		 * Special counters, in some configurations protected by the
+		 * page_table_lock, in other configurations by being atomic.
+		 */
+		struct mm_rss_stat rss_stat;
 
-	cpumask_var_t cpu_vm_mask_var;
+		struct linux_binfmt *binfmt;
 
-	/* Architecture-specific MM context */
-	mm_context_t context;
+		/* Architecture-specific MM context */
+		mm_context_t context;
 
-	unsigned long flags; /* Must use atomic bitops to access the bits */
+		unsigned long flags; /* Must use atomic bitops to access */
 
-	struct core_state *core_state; /* coredumping support */
+		struct core_state *core_state; /* coredumping support */
 #ifdef CONFIG_MEMBARRIER
-	atomic_t membarrier_state;
+		atomic_t membarrier_state;
 #endif
 #ifdef CONFIG_AIO
-	spinlock_t			ioctx_lock;
-	struct kioctx_table __rcu	*ioctx_table;
+		spinlock_t			ioctx_lock;
+		struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+		/*
+		 * "owner" points to a task that is regarded as the canonical
+		 * user/owner of this mm. All of the following must be true in
+		 * order for it to be changed:
+		 *
+		 * current == mm->owner
+		 * current->mm != mm
+		 * new_owner->mm == mm
+		 * new_owner->alloc_lock is held
+		 */
+		struct task_struct __rcu *owner;
 #endif
-	struct user_namespace *user_ns;
+		struct user_namespace *user_ns;
 
-	/* store ref to file /proc/<pid>/exe symlink points to */
-	struct file __rcu *exe_file;
+		/* store ref to file /proc/<pid>/exe symlink points to */
+		struct file __rcu *exe_file;
 #ifdef CONFIG_MMU_NOTIFIER
-	struct mmu_notifier_mm *mmu_notifier_mm;
+		struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
-	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
-#endif
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	struct cpumask cpumask_allocation;
+		pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/*
-	 * numa_next_scan is the next time that the PTEs will be marked
-	 * pte_numa. NUMA hinting faults will gather statistics and migrate
-	 * pages to new nodes if necessary.
-	 */
-	unsigned long numa_next_scan;
+		/*
+		 * numa_next_scan is the next time that the PTEs will be marked
+		 * pte_numa. NUMA hinting faults will gather statistics and
+		 * migrate pages to new nodes if necessary.
+		 */
+		unsigned long numa_next_scan;
 
-	/* Restart point for scanning and setting pte_numa */
-	unsigned long numa_scan_offset;
+		/* Restart point for scanning and setting pte_numa */
+		unsigned long numa_scan_offset;
 
-	/* numa_scan_seq prevents two threads setting pte_numa */
-	int numa_scan_seq;
+		/* numa_scan_seq prevents two threads setting pte_numa */
+		int numa_scan_seq;
 #endif
-	/*
-	 * An operation with batched TLB flushing is going on. Anything that
-	 * can move process memory needs to flush the TLB when moving a
-	 * PROT_NONE or PROT_NUMA mapped page.
-	 */
-	atomic_t tlb_flush_pending;
+		/*
+		 * An operation with batched TLB flushing is going on. Anything
+		 * that can move process memory needs to flush the TLB when
+		 * moving a PROT_NONE or PROT_NUMA mapped page.
+		 */
+		atomic_t tlb_flush_pending;
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
-	/* See flush_tlb_batched_pending() */
-	bool tlb_flush_batched;
+		/* See flush_tlb_batched_pending() */
+		bool tlb_flush_batched;
 #endif
-	struct uprobes_state uprobes_state;
+		struct uprobes_state uprobes_state;
 #ifdef CONFIG_HUGETLB_PAGE
-	atomic_long_t hugetlb_usage;
+		atomic_long_t hugetlb_usage;
 #endif
-	struct work_struct async_put_work;
+		struct work_struct async_put_work;
 
 #if IS_ENABLED(CONFIG_HMM)
-	/* HMM needs to track a few things per mm */
-	struct hmm *hmm;
+		/* HMM needs to track a few things per mm */
+		struct hmm *hmm;
 #endif
-} __randomize_layout;
+	} __randomize_layout;
+
+	/*
+	 * The mm_cpumask needs to be at the end of mm_struct, because it
+	 * is dynamically sized based on nr_cpu_ids.
+	 */
+	unsigned long cpu_bitmap[];
+};
 
 extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-#ifdef CONFIG_CPUMASK_OFFSTACK
-	mm->cpu_vm_mask_var = &mm->cpumask_allocation;
-#endif
-	cpumask_clear(mm->cpu_vm_mask_var);
+	cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
 static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 {
-	return mm->cpu_vm_mask_var;
+	return (struct cpumask *)&mm->cpu_bitmap;
 }
 
 struct mmu_gather;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61b925c..5b64c1b8461e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data)
 
 void __init proc_caches_init(void)
 {
+	unsigned int mm_size;
+
 	sighand_cachep = kmem_cache_create("sighand_cache",
 			sizeof(struct sighand_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU|
@@ -2269,15 +2271,16 @@ void __init proc_caches_init(void)
 			sizeof(struct fs_struct), 0,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			NULL);
+
 	/*
-	 * FIXME! The "sizeof(struct mm_struct)" currently includes the
-	 * whole struct cpumask for the OFFSTACK case. We could change
-	 * this to *only* allocate as much of it as required by the
-	 * maximum number of CPU's we can ever have.  The cpumask_allocation
-	 * is at the end of the structure, exactly for that reason.
+	 * The mm_cpumask is located at the end of mm_struct, and is
+	 * dynamically sized based on the maximum CPU number this system
+	 * can have, taking hotplug into account (nr_cpu_ids).
 	 */
+	mm_size = sizeof(struct mm_struct) + cpumask_size();
+
 	mm_cachep = kmem_cache_create_usercopy("mm_struct",
-			sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
+			mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
 			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
 			offsetof(struct mm_struct, saved_auxv),
 			sizeof_field(struct mm_struct, saved_auxv),
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..a787a319211e 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -15,6 +15,16 @@
 #define INIT_MM_CONTEXT(name)
 #endif
 
+/*
+ * For dynamically allocated mm_structs, there is a dynamically sized cpumask
+ * at the end of the structure, the size of which depends on the maximum CPU
+ * number the system can see. That way we allocate only as much memory for
+ * mm_cpumask() as needed for the hundreds, or thousands of processes that
+ * a system typically runs.
+ *
+ * Since there is only one init_mm in the entire system, keep it simple
+ * and size this cpu_bitmask to NR_CPUS.
+ */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
 	.pgd		= swapper_pg_dir,
@@ -25,5 +35,6 @@ struct mm_struct init_mm = {
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
+	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
 	INIT_MM_CONTEXT(init_mm)
 };
-- 
2.14.4
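
An aside on the init-mm.c hunk above (editorial sketch, not part of the
patch; NR_CPUS below is a made-up value): a designated initializer at
index BITS_TO_LONGS(NR_CPUS) gives the flexible array
BITS_TO_LONGS(NR_CPUS) + 1 longs of storage, one more than strictly
needed for NR_CPUS bits, so the single statically allocated init_mm is
always large enough no matter what nr_cpu_ids turns out to be at boot.

  #include <stdio.h>

  #define NR_CPUS		256	/* made-up build-time maximum */
  #define BITS_PER_LONG		(8 * (int)sizeof(unsigned long))
  #define BITS_TO_LONGS(n)	(((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

  struct fake_mm {
  	long other_fields;
  	unsigned long cpu_bitmap[];
  };

  /* Same shape as the init_mm hunk: designating index BITS_TO_LONGS(NR_CPUS)
   * forces the array to provide BITS_TO_LONGS(NR_CPUS) + 1 longs of storage
   * (static initialization of a flexible array member is a GCC extension,
   * which the kernel relies on here). */
  static struct fake_mm fake_init_mm = {
  	.cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0 },
  };

  int main(void)
  {
  	printf("%d CPUs need %d longs; this initializer reserves %d\n",
  	       NR_CPUS, BITS_TO_LONGS(NR_CPUS), BITS_TO_LONGS(NR_CPUS) + 1);
  	(void)fake_init_mm;
  	return 0;
  }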


^ permalink raw reply related	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2018-08-04 22:31 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-06-20 19:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-06-20 21:32   ` kbuild test robot
2018-06-21 20:18     ` Rik van Riel
2018-06-21  0:24   ` kbuild test robot
2018-06-22 15:10   ` Dave Hansen
2018-06-22 17:45     ` Rik van Riel
2018-06-20 19:56 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel
2018-06-21  0:23   ` Rik van Riel
2018-06-22 14:58   ` Andy Lutomirski
2018-06-22 15:17     ` Rik van Riel
2018-06-20 19:56 ` [PATCH 3/7] x86,tlb: change tlbstate.is_lazy to tlbstate.state Rik van Riel
2018-06-22 17:01   ` Dave Hansen
2018-06-22 17:08     ` Rik van Riel
2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel
2018-06-22 15:04   ` Andy Lutomirski
2018-06-22 15:15     ` Rik van Riel
2018-06-22 15:34       ` Andy Lutomirski
2018-06-22 17:05   ` Dave Hansen
2018-06-22 17:16     ` Rik van Riel
2018-06-20 19:56 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel
2018-06-22 17:23   ` Dave Hansen
2018-06-20 19:56 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel
2018-06-20 19:56 ` [PATCH 7/7] x86,idle: do not leave mm in idle state Rik van Riel
2018-06-20 22:20   ` kbuild test robot
2018-06-21  0:25     ` Rik van Riel
2018-06-22 15:36   ` Andy Lutomirski
2018-06-22 15:53     ` Rik van Riel
2018-06-22 16:01       ` Andy Lutomirski
2018-06-22 20:18         ` Rik van Riel
2018-06-22 22:05           ` Andy Lutomirski
2018-06-23  0:55             ` Rik van Riel
2018-06-29 14:29 [PATCH v3 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-06-29 14:29 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-06-30  4:30   ` kbuild test robot
2018-07-06 21:56 [PATCH v4 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-07-06 21:56 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-07-07  8:23   ` Mike Galbraith
2018-07-07 21:25     ` Rik van Riel
2018-07-08 14:13       ` Mike Galbraith
2018-07-08 14:44         ` Mike Galbraith
2018-07-09 21:38         ` Rik van Riel
2018-07-10  3:28           ` Mike Galbraith
2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-07-15 22:59   ` Ingo Molnar
2018-07-15 23:50     ` Rik van Riel
2018-07-16  1:07       ` Ingo Molnar
2018-07-16 19:03 [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel
2018-07-16 19:03 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel
2018-08-04 22:28   ` Guenter Roeck
