* [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier @ 2018-07-10 14:28 Rik van Riel 2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel ` (6 more replies) 0 siblings, 7 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels, using 1.9% of a 48-CPU system during a netperf run. Digging into the profile, the atomic operations in cpumask_clear_cpu and cpumask_set_cpu are responsible for about half of that CPU use. However, the CPUs running netperf are simply switching back and forth between netperf and the idle task, which would not require any changes to the mm_cpumask if lazy TLB mode were used. Additionally, the init_mm cpumask ends up being the most heavily contended one in the system, for no reason at all. By making really lazy TLB mode work again on modern kernels, sending a shootdown IPI only when page table pages are being unmapped, we get back some performance. v5 of the series fixes the preempt bug and string overflow compiler warnings pointed out by Mike Galbraith. On memcache workloads on 2-socket systems, this patch series seems to reduce total system CPU use by 1-2%. On Song's netbench tests, CPU use in the context switch path is roughly cut in half. These patches also save a little memory by shrinking the size of mm_struct, especially on distro kernels compiled with a gigantically large NR_CPUS. ^ permalink raw reply [flat|nested] 46+ messages in thread
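For reference, the atomic operations the profile points at are the two mm_cpumask updates done on every context switch in switch_mm_irqs_off. A simplified sketch of that hot path before this series (not a verbatim quote of the kernel source; the actual lines appear in patch 7/7 below):

	/* executed on every mm switch; both are LOCK-prefixed bitops on x86 */
	cpumask_clear_cpu(cpu, mm_cpumask(real_prev));	/* atomic clear_bit() */
	cpumask_set_cpu(cpu, mm_cpumask(next));		/* atomic set_bit()   */

Every CPU running a thread of the same mm touches the same bitmap cache line, which is why the bitmap of a widely shared mm such as init_mm becomes the most contended line in the system.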
* [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-15 22:59 ` Ingo Molnar 2018-07-10 14:28 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel ` (5 subsequent siblings) 6 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel The mm_struct always contains a cpumask bitmap, regardless of CONFIG_CPUMASK_OFFSTACK. That means the first step can be to simplify things, and simply have one bitmask at the end of the mm_struct for the mm_cpumask. This does necessitate moving everything else in mm_struct into an anonymous sub-structure, which can be randomized when struct randomization is enabled. The second step is to determine the correct size for the mm_struct slab object from the size of the mm_struct (excluding the cpu bitmap) and the size the cpumask. For init_mm we can simply allocate the maximum size this kernel is compiled for, since we only have one init_mm in the system, anyway. Pointer magic by Mike Galbraith, to evade -Wstringop-overflow getting confused by the dynamically sized array. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- drivers/firmware/efi/efi.c | 1 + include/linux/mm_types.h | 241 +++++++++++++++++++++++---------------------- kernel/fork.c | 15 +-- mm/init-mm.c | 11 +++ 4 files changed, 145 insertions(+), 123 deletions(-) diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c index 232f4915223b..7f0b19410a95 100644 --- a/drivers/firmware/efi/efi.c +++ b/drivers/firmware/efi/efi.c @@ -82,6 +82,7 @@ struct mm_struct efi_mm = { .mmap_sem = __RWSEM_INITIALIZER(efi_mm.mmap_sem), .page_table_lock = __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock), .mmlist = LIST_HEAD_INIT(efi_mm.mmlist), + .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0}, }; static bool disable_runtime; diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 99ce070e7dcb..efdc24dd9e97 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -335,176 +335,183 @@ struct core_state { struct kioctx_table; struct mm_struct { - struct vm_area_struct *mmap; /* list of VMAs */ - struct rb_root mm_rb; - u32 vmacache_seqnum; /* per-thread vmacache */ + struct { + struct vm_area_struct *mmap; /* list of VMAs */ + struct rb_root mm_rb; + u32 vmacache_seqnum; /* per-thread vmacache */ #ifdef CONFIG_MMU - unsigned long (*get_unmapped_area) (struct file *filp, + unsigned long (*get_unmapped_area) (struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags); #endif - unsigned long mmap_base; /* base of mmap area */ - unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */ + unsigned long mmap_base; /* base of mmap area */ + unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */ #ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES - /* Base adresses for compatible mmap() */ - unsigned long mmap_compat_base; - unsigned long mmap_compat_legacy_base; + /* Base adresses for compatible mmap() */ + unsigned long mmap_compat_base; + unsigned long mmap_compat_legacy_base; #endif - unsigned long task_size; /* size of task vm space */ - unsigned long 
highest_vm_end; /* highest vma end address */ - pgd_t * pgd; - - /** - * @mm_users: The number of users including userspace. - * - * Use mmget()/mmget_not_zero()/mmput() to modify. When this drops - * to 0 (i.e. when the task exits and there are no other temporary - * reference holders), we also release a reference on @mm_count - * (which may then free the &struct mm_struct if @mm_count also - * drops to 0). - */ - atomic_t mm_users; - - /** - * @mm_count: The number of references to &struct mm_struct - * (@mm_users count as 1). - * - * Use mmgrab()/mmdrop() to modify. When this drops to 0, the - * &struct mm_struct is freed. - */ - atomic_t mm_count; + unsigned long task_size; /* size of task vm space */ + unsigned long highest_vm_end; /* highest vma end address */ + pgd_t * pgd; + + /** + * @mm_users: The number of users including userspace. + * + * Use mmget()/mmget_not_zero()/mmput() to modify. When this + * drops to 0 (i.e. when the task exits and there are no other + * temporary reference holders), we also release a reference on + * @mm_count (which may then free the &struct mm_struct if + * @mm_count also drops to 0). + */ + atomic_t mm_users; + + /** + * @mm_count: The number of references to &struct mm_struct + * (@mm_users count as 1). + * + * Use mmgrab()/mmdrop() to modify. When this drops to 0, the + * &struct mm_struct is freed. + */ + atomic_t mm_count; #ifdef CONFIG_MMU - atomic_long_t pgtables_bytes; /* PTE page table pages */ + atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif - int map_count; /* number of VMAs */ + int map_count; /* number of VMAs */ - spinlock_t page_table_lock; /* Protects page tables and some counters */ - struct rw_semaphore mmap_sem; + spinlock_t page_table_lock; /* Protects page tables and some + * counters + */ + struct rw_semaphore mmap_sem; - struct list_head mmlist; /* List of maybe swapped mm's. These are globally strung - * together off init_mm.mmlist, and are protected - * by mmlist_lock - */ + struct list_head mmlist; /* List of maybe swapped mm's. 
These + * are globally strung together off + * init_mm.mmlist, and are protected + * by mmlist_lock + */ - unsigned long hiwater_rss; /* High-watermark of RSS usage */ - unsigned long hiwater_vm; /* High-water virtual memory usage */ + unsigned long hiwater_rss; /* High-watermark of RSS usage */ + unsigned long hiwater_vm; /* High-water virtual memory usage */ - unsigned long total_vm; /* Total pages mapped */ - unsigned long locked_vm; /* Pages that have PG_mlocked set */ - unsigned long pinned_vm; /* Refcount permanently increased */ - unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ - unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ - unsigned long stack_vm; /* VM_STACK */ - unsigned long def_flags; + unsigned long total_vm; /* Total pages mapped */ + unsigned long locked_vm; /* Pages that have PG_mlocked set */ + unsigned long pinned_vm; /* Refcount permanently increased */ + unsigned long data_vm; /* VM_WRITE & ~VM_SHARED & ~VM_STACK */ + unsigned long exec_vm; /* VM_EXEC & ~VM_WRITE & ~VM_STACK */ + unsigned long stack_vm; /* VM_STACK */ + unsigned long def_flags; - spinlock_t arg_lock; /* protect the below fields */ - unsigned long start_code, end_code, start_data, end_data; - unsigned long start_brk, brk, start_stack; - unsigned long arg_start, arg_end, env_start, env_end; + spinlock_t arg_lock; /* protect the below fields */ + unsigned long start_code, end_code, start_data, end_data; + unsigned long start_brk, brk, start_stack; + unsigned long arg_start, arg_end, env_start, env_end; - unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ + unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ - /* - * Special counters, in some configurations protected by the - * page_table_lock, in other configurations by being atomic. - */ - struct mm_rss_stat rss_stat; - - struct linux_binfmt *binfmt; + /* + * Special counters, in some configurations protected by the + * page_table_lock, in other configurations by being atomic. + */ + struct mm_rss_stat rss_stat; - cpumask_var_t cpu_vm_mask_var; + struct linux_binfmt *binfmt; - /* Architecture-specific MM context */ - mm_context_t context; + /* Architecture-specific MM context */ + mm_context_t context; - unsigned long flags; /* Must use atomic bitops to access the bits */ + unsigned long flags; /* Must use atomic bitops to access */ - struct core_state *core_state; /* coredumping support */ + struct core_state *core_state; /* coredumping support */ #ifdef CONFIG_MEMBARRIER - atomic_t membarrier_state; + atomic_t membarrier_state; #endif #ifdef CONFIG_AIO - spinlock_t ioctx_lock; - struct kioctx_table __rcu *ioctx_table; + spinlock_t ioctx_lock; + struct kioctx_table __rcu *ioctx_table; #endif #ifdef CONFIG_MEMCG - /* - * "owner" points to a task that is regarded as the canonical - * user/owner of this mm. All of the following must be true in - * order for it to be changed: - * - * current == mm->owner - * current->mm != mm - * new_owner->mm == mm - * new_owner->alloc_lock is held - */ - struct task_struct __rcu *owner; + /* + * "owner" points to a task that is regarded as the canonical + * user/owner of this mm. 
All of the following must be true in + * order for it to be changed: + * + * current == mm->owner + * current->mm != mm + * new_owner->mm == mm + * new_owner->alloc_lock is held + */ + struct task_struct __rcu *owner; #endif - struct user_namespace *user_ns; + struct user_namespace *user_ns; - /* store ref to file /proc/<pid>/exe symlink points to */ - struct file __rcu *exe_file; + /* store ref to file /proc/<pid>/exe symlink points to */ + struct file __rcu *exe_file; #ifdef CONFIG_MMU_NOTIFIER - struct mmu_notifier_mm *mmu_notifier_mm; + struct mmu_notifier_mm *mmu_notifier_mm; #endif #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS - pgtable_t pmd_huge_pte; /* protected by page_table_lock */ -#endif -#ifdef CONFIG_CPUMASK_OFFSTACK - struct cpumask cpumask_allocation; + pgtable_t pmd_huge_pte; /* protected by page_table_lock */ #endif #ifdef CONFIG_NUMA_BALANCING - /* - * numa_next_scan is the next time that the PTEs will be marked - * pte_numa. NUMA hinting faults will gather statistics and migrate - * pages to new nodes if necessary. - */ - unsigned long numa_next_scan; + /* + * numa_next_scan is the next time that the PTEs will be marked + * pte_numa. NUMA hinting faults will gather statistics and + * migrate pages to new nodes if necessary. + */ + unsigned long numa_next_scan; - /* Restart point for scanning and setting pte_numa */ - unsigned long numa_scan_offset; + /* Restart point for scanning and setting pte_numa */ + unsigned long numa_scan_offset; - /* numa_scan_seq prevents two threads setting pte_numa */ - int numa_scan_seq; + /* numa_scan_seq prevents two threads setting pte_numa */ + int numa_scan_seq; #endif - /* - * An operation with batched TLB flushing is going on. Anything that - * can move process memory needs to flush the TLB when moving a - * PROT_NONE or PROT_NUMA mapped page. - */ - atomic_t tlb_flush_pending; + /* + * An operation with batched TLB flushing is going on. Anything + * that can move process memory needs to flush the TLB when + * moving a PROT_NONE or PROT_NUMA mapped page. + */ + atomic_t tlb_flush_pending; #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH - /* See flush_tlb_batched_pending() */ - bool tlb_flush_batched; + /* See flush_tlb_batched_pending() */ + bool tlb_flush_batched; #endif - struct uprobes_state uprobes_state; + struct uprobes_state uprobes_state; #ifdef CONFIG_HUGETLB_PAGE - atomic_long_t hugetlb_usage; + atomic_long_t hugetlb_usage; #endif - struct work_struct async_put_work; + struct work_struct async_put_work; #if IS_ENABLED(CONFIG_HMM) - /* HMM needs to track a few things per mm */ - struct hmm *hmm; + /* HMM needs to track a few things per mm */ + struct hmm *hmm; #endif -} __randomize_layout; + } __randomize_layout; + + /* + * The mm_cpumask needs to be at the end of mm_struct, because it + * is dynamically sized based on nr_cpu_ids. + */ + unsigned long cpu_bitmap[]; +}; extern struct mm_struct init_mm; +/* Pointer magic because the dynamic array size confuses some compilers. */ static inline void mm_init_cpumask(struct mm_struct *mm) { -#ifdef CONFIG_CPUMASK_OFFSTACK - mm->cpu_vm_mask_var = &mm->cpumask_allocation; -#endif - cpumask_clear(mm->cpu_vm_mask_var); + unsigned long cpu_bitmap = (unsigned long)mm; + + cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap); + cpumask_clear((struct cpumask *)cpu_bitmap); } /* Future-safe accessor for struct mm_struct's cpu_vm_mask. 
*/ static inline cpumask_t *mm_cpumask(struct mm_struct *mm) { - return mm->cpu_vm_mask_var; + return (struct cpumask *)&mm->cpu_bitmap; } struct mmu_gather; diff --git a/kernel/fork.c b/kernel/fork.c index 9440d61b925c..5b64c1b8461e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2253,6 +2253,8 @@ static void sighand_ctor(void *data) void __init proc_caches_init(void) { + unsigned int mm_size; + sighand_cachep = kmem_cache_create("sighand_cache", sizeof(struct sighand_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_TYPESAFE_BY_RCU| @@ -2269,15 +2271,16 @@ void __init proc_caches_init(void) sizeof(struct fs_struct), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL); + /* - * FIXME! The "sizeof(struct mm_struct)" currently includes the - * whole struct cpumask for the OFFSTACK case. We could change - * this to *only* allocate as much of it as required by the - * maximum number of CPU's we can ever have. The cpumask_allocation - * is at the end of the structure, exactly for that reason. + * The mm_cpumask is located at the end of mm_struct, and is + * dynamically sized based on the maximum CPU number this system + * can have, taking hotplug into account (nr_cpu_ids). */ + mm_size = sizeof(struct mm_struct) + cpumask_size(); + mm_cachep = kmem_cache_create_usercopy("mm_struct", - sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN, + mm_size, ARCH_MIN_MMSTRUCT_ALIGN, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, offsetof(struct mm_struct, saved_auxv), sizeof_field(struct mm_struct, saved_auxv), diff --git a/mm/init-mm.c b/mm/init-mm.c index f0179c9c04c2..a787a319211e 100644 --- a/mm/init-mm.c +++ b/mm/init-mm.c @@ -15,6 +15,16 @@ #define INIT_MM_CONTEXT(name) #endif +/* + * For dynamically allocated mm_structs, there is a dynamically sized cpumask + * at the end of the structure, the size of which depends on the maximum CPU + * number the system can see. That way we allocate only as much memory for + * mm_cpumask() as needed for the hundreds, or thousands of processes that + * a system typically runs. + * + * Since there is only one init_mm in the entire system, keep it simple + * and size this cpu_bitmask to NR_CPUS. + */ struct mm_struct init_mm = { .mm_rb = RB_ROOT, .pgd = swapper_pg_dir, @@ -25,5 +35,6 @@ struct mm_struct init_mm = { .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), .mmlist = LIST_HEAD_INIT(init_mm.mmlist), .user_ns = &init_user_ns, + .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0}, INIT_MM_CONTEXT(init_mm) }; -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
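The slab sizing above generalizes to any structure that ends in a flexible array member. A minimal sketch of the pattern, with an invented struct and function name purely for illustration (cpumask_size() and kzalloc() are the real kernel helpers; the sizing mirrors the proc_caches_init() change in this patch):

	struct foo {
		spinlock_t lock;
		/* must stay the last member: its size is decided at allocation time */
		unsigned long cpu_bitmap[];
	};

	static struct foo *foo_alloc(void)
	{
		/* fixed part of the structure plus one bit per possible CPU */
		return kzalloc(sizeof(struct foo) + cpumask_size(), GFP_KERNEL);
	}

Statically defined instances such as init_mm and efi_mm cannot be sized at run time, which is why they simply reserve room for NR_CPUS bits with the "[BITS_TO_LONGS(NR_CPUS)] = 0" initializer seen above.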
* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids 2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel @ 2018-07-15 22:59 ` Ingo Molnar 2018-07-15 23:50 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Ingo Molnar @ 2018-07-15 22:59 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa * Rik van Riel <riel@surriel.com> wrote: > The mm_struct always contains a cpumask bitmap, regardless of > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to > simplify things, and simply have one bitmask at the end of the > mm_struct for the mm_cpumask. > > This does necessitate moving everything else in mm_struct into > an anonymous sub-structure, which can be randomized when struct > randomization is enabled. > > The second step is to determine the correct size for the > mm_struct slab object from the size of the mm_struct > (excluding the cpu bitmap) and the size the cpumask. > > For init_mm we can simply allocate the maximum size this > kernel is compiled for, since we only have one init_mm > in the system, anyway. > > Pointer magic by Mike Galbraith, to evade -Wstringop-overflow > getting confused by the dynamically sized array. > > Signed-off-by: Rik van Riel <riel@surriel.com> > Signed-off-by: Mike Galbraith <efault@gmx.de> Is this an Acked-by in disguise, or did this patch route via Mike? Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids 2018-07-15 22:59 ` Ingo Molnar @ 2018-07-15 23:50 ` Rik van Riel 2018-07-16 1:07 ` Ingo Molnar 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-15 23:50 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa [-- Attachment #1: Type: text/plain, Size: 1290 bytes --] On Mon, 2018-07-16 at 00:59 +0200, Ingo Molnar wrote: > * Rik van Riel <riel@surriel.com> wrote: > > > The mm_struct always contains a cpumask bitmap, regardless of > > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to > > simplify things, and simply have one bitmask at the end of the > > mm_struct for the mm_cpumask. > > > > This does necessitate moving everything else in mm_struct into > > an anonymous sub-structure, which can be randomized when struct > > randomization is enabled. > > > > The second step is to determine the correct size for the > > mm_struct slab object from the size of the mm_struct > > (excluding the cpu bitmap) and the size the cpumask. > > > > For init_mm we can simply allocate the maximum size this > > kernel is compiled for, since we only have one init_mm > > in the system, anyway. > > > > Pointer magic by Mike Galbraith, to evade -Wstringop-overflow > > getting confused by the dynamically sized array. > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > Signed-off-by: Mike Galbraith <efault@gmx.de> > > Is this an Acked-by in disguise, or did this patch route via Mike? Mike found an issue with the patch and sent a fix, so I added his S-o-b to this patch as well. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids 2018-07-15 23:50 ` Rik van Riel @ 2018-07-16 1:07 ` Ingo Molnar 0 siblings, 0 replies; 46+ messages in thread From: Ingo Molnar @ 2018-07-16 1:07 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa * Rik van Riel <riel@surriel.com> wrote: > On Mon, 2018-07-16 at 00:59 +0200, Ingo Molnar wrote: > > * Rik van Riel <riel@surriel.com> wrote: > > > > > The mm_struct always contains a cpumask bitmap, regardless of > > > CONFIG_CPUMASK_OFFSTACK. That means the first step can be to > > > simplify things, and simply have one bitmask at the end of the > > > mm_struct for the mm_cpumask. > > > > > > This does necessitate moving everything else in mm_struct into > > > an anonymous sub-structure, which can be randomized when struct > > > randomization is enabled. > > > > > > The second step is to determine the correct size for the > > > mm_struct slab object from the size of the mm_struct > > > (excluding the cpu bitmap) and the size the cpumask. > > > > > > For init_mm we can simply allocate the maximum size this > > > kernel is compiled for, since we only have one init_mm > > > in the system, anyway. > > > > > > Pointer magic by Mike Galbraith, to evade -Wstringop-overflow > > > getting confused by the dynamically sized array. > > > > > > Signed-off-by: Rik van Riel <riel@surriel.com> > > > Signed-off-by: Mike Galbraith <efault@gmx.de> > > > > Is this an Acked-by in disguise, or did this patch route via Mike? > > Mike found an issue with the patch and sent a > fix, so I added his S-o-b to this patch as > well. Makes sense - I'd suggest such a SoB chain: Signed-off-by: Rik van Riel <riel@surriel.com> [ Fixed crash. ] Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Rik van Riel <riel@surriel.com> ... it's a bit non-standard but we've used it in similar cases and it makes the routing and evolution of the patch pretty clear. Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel 2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-10 14:28 ` [PATCH 3/7] x86,mm: restructure switch_mm_irqs_off Rik van Riel ` (4 subsequent siblings) 6 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel Andy discovered that speculative memory accesses while in lazy TLB mode can crash a system, when a CPU tries to dereference a speculative access using memory contents that used to be valid page table memory, but have since been reused for something else and point into la-la land. The latter problem can be prevented in two ways. The first is to always send a TLB shootdown IPI to CPUs in lazy TLB mode, while the second one is to only send the TLB shootdown at page table freeing time. The second should result in fewer IPIs, since operationgs like mprotect and madvise are very common with some workloads, but do not involve page table freeing. Also, on munmap, batching of page table freeing covers much larger ranges of virtual memory than the batching of unmapped user pages. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/include/asm/tlbflush.h | 5 +++++ arch/x86/mm/tlb.c | 27 +++++++++++++++++++++++++++ include/asm-generic/tlb.h | 10 ++++++++++ mm/memory.c | 22 ++++++++++++++-------- 4 files changed, 56 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 6690cd3fc8b1..3aa3204b5dc0 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -554,4 +554,9 @@ extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); native_flush_tlb_others(mask, info) #endif +extern void tlb_flush_remove_tables(struct mm_struct *mm); +extern void tlb_flush_remove_tables_local(void *arg); + +#define HAVE_TLB_FLUSH_REMOVE_TABLES + #endif /* _ASM_X86_TLBFLUSH_H */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 6eb1f34c3c85..9a893673c56b 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -646,6 +646,33 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, put_cpu(); } +void tlb_flush_remove_tables_local(void *arg) +{ + struct mm_struct *mm = arg; + + if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm && + this_cpu_read(cpu_tlbstate.is_lazy)) { + /* + * We're in lazy mode. We need to at least flush our + * paging-structure cache to avoid speculatively reading + * garbage into our TLB. Since switching to init_mm is barely + * slower than a minimal flush, just switch to init_mm. + */ + switch_mm_irqs_off(NULL, &init_mm, NULL); + } +} + +void tlb_flush_remove_tables(struct mm_struct *mm) +{ + int cpu = get_cpu(); + /* + * XXX: this really only needs to be called for CPUs in lazy TLB mode. 
+ */ + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) + smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1); + + put_cpu(); +} static void do_flush_tlb_all(void *info) { diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index faddde44de8c..f0b462ddfa15 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -295,4 +295,14 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb, #define tlb_migrate_finish(mm) do {} while (0) +/* + * Used to flush the TLB when page tables are removed, when lazy + * TLB mode may cause a CPU to retain intermediate translations + * pointing to about-to-be-freed page table memory. + */ +#ifndef HAVE_TLB_FLUSH_REMOVE_TABLES +#define tlb_flush_remove_tables(mm) do {} while (0) +#define tlb_flush_remove_tables_local(mm) do {} while (0) +#endif + #endif /* _ASM_GENERIC__TLB_H */ diff --git a/mm/memory.c b/mm/memory.c index 9d472e00fc2d..b4117272dc7f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -326,16 +326,20 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_ #ifdef CONFIG_HAVE_RCU_TABLE_FREE -/* - * See the comment near struct mmu_table_batch. - */ - static void tlb_remove_table_smp_sync(void *arg) { - /* Simply deliver the interrupt */ + struct mm_struct __maybe_unused *mm = arg; + /* + * On most architectures this does nothing. Simply delivering the + * interrupt is enough to prevent races with software page table + * walking like that done in get_user_pages_fast. + * + * See the comment near struct mmu_table_batch. + */ + tlb_flush_remove_tables_local(mm); } -static void tlb_remove_table_one(void *table) +static void tlb_remove_table_one(void *table, struct mmu_gather *tlb) { /* * This isn't an RCU grace period and hence the page-tables cannot be @@ -344,7 +348,7 @@ static void tlb_remove_table_one(void *table) * It is however sufficient for software page-table walkers that rely on * IRQ disabling. See the comment near struct mmu_table_batch. */ - smp_call_function(tlb_remove_table_smp_sync, NULL, 1); + smp_call_function(tlb_remove_table_smp_sync, tlb->mm, 1); __tlb_remove_table(table); } @@ -365,6 +369,8 @@ void tlb_table_flush(struct mmu_gather *tlb) { struct mmu_table_batch **batch = &tlb->batch; + tlb_flush_remove_tables(tlb->mm); + if (*batch) { call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu); *batch = NULL; @@ -387,7 +393,7 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table) if (*batch == NULL) { *batch = (struct mmu_table_batch *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN); if (*batch == NULL) { - tlb_remove_table_one(table); + tlb_remove_table_one(table, tlb); return; } (*batch)->nr = 0; -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
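The HAVE_TLB_FLUSH_REMOVE_TABLES define is the opt-in switch: architectures that do not define it keep the do-nothing fallbacks added to asm-generic/tlb.h. A rough sketch of what another architecture would add to opt in (illustrative only; in this series only x86 provides real implementations):

	/* in that architecture's asm/tlbflush.h */
	#define HAVE_TLB_FLUSH_REMOVE_TABLES

	extern void tlb_flush_remove_tables(struct mm_struct *mm);
	extern void tlb_flush_remove_tables_local(void *arg);

	/*
	 * The implementations must guarantee that, by the time
	 * tlb_flush_remove_tables() returns, no CPU can still be caching or
	 * speculatively walking the page table pages about to be freed,
	 * e.g. by IPIing every CPU that still has the mm loaded.
	 */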
* [PATCH 3/7] x86,mm: restructure switch_mm_irqs_off 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel 2018-07-10 14:28 ` [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids Rik van Riel 2018-07-10 14:28 ` [PATCH 2/7] x86,tlb: leave lazy TLB mode at page table free time Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-10 14:28 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel ` (3 subsequent siblings) 6 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel Move some code that will be needed for the lazy -> !lazy state transition when a lazy TLB CPU has gotten out of date. No functional changes, since the if (real_prev == next) branch always returns. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Suggested-by: Andy Lutomirski <luto@kernel.org> --- arch/x86/mm/tlb.c | 60 +++++++++++++++++++++++++++---------------------------- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9a893673c56b..4b73fe835c95 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -187,6 +187,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; + bool need_flush; + u16 new_asid; /* * NB: The scheduler will call us with prev == next when switching @@ -252,8 +254,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, return; } else { - u16 new_asid; - bool need_flush; u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id); /* @@ -297,41 +297,41 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); + } - if (need_flush) { - this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id); - this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen); - load_new_mm_cr3(next->pgd, new_asid, true); - - /* - * NB: This gets called via leave_mm() in the idle path - * where RCU functions differently. Tracing normally - * uses RCU, so we need to use the _rcuidle variant. - * - * (There is no good reason for this. The idle code should - * be rearranged to call this before rcu_idle_enter().) - */ - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); - } else { - /* The new ASID is already up to date. */ - load_new_mm_cr3(next->pgd, new_asid, false); - - /* See above wrt _rcuidle. */ - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0); - } + if (need_flush) { + this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id); + this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen); + load_new_mm_cr3(next->pgd, new_asid, true); /* - * Record last user mm's context id, so we can avoid - * flushing branch buffer with IBPB if we switch back - * to the same user. + * NB: This gets called via leave_mm() in the idle path + * where RCU functions differently. Tracing normally + * uses RCU, so we need to use the _rcuidle variant. + * + * (There is no good reason for this. The idle code should + * be rearranged to call this before rcu_idle_enter().) 
*/ - if (next != &init_mm) - this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id); + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); + } else { + /* The new ASID is already up to date. */ + load_new_mm_cr3(next->pgd, new_asid, false); - this_cpu_write(cpu_tlbstate.loaded_mm, next); - this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid); + /* See above wrt _rcuidle. */ + trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0); } + /* + * Record last user mm's context id, so we can avoid + * flushing branch buffer with IBPB if we switch back + * to the same user. + */ + if (next != &init_mm) + this_cpu_write(cpu_tlbstate.last_ctx_id, next->context.ctx_id); + + this_cpu_write(cpu_tlbstate.loaded_mm, next); + this_cpu_write(cpu_tlbstate.loaded_mm_asid, new_asid); + load_mm_cr4(next); switch_ldt(real_prev, next); } -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
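After this restructuring, switch_mm_irqs_off roughly follows the outline below (a simplified skeleton, not the literal code). The point is that both branches now fall through into a shared tail, which the next patch relies on when a lazy CPU turns out to need a flush:

	if (real_prev == next) {
		/* same mm: still returns early here; patch 4 lets this path
		 * set need_flush/new_asid and fall through instead */
	} else {
		/* real mm switch: pick new_asid and need_flush as before */
	}

	if (need_flush)
		load_new_mm_cr3(next->pgd, new_asid, true);
	else
		load_new_mm_cr3(next->pgd, new_asid, false);

	/* common bookkeeping: last_ctx_id, loaded_mm, cr4, LDT */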
* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel ` (2 preceding siblings ...) 2018-07-10 14:28 ` [PATCH 3/7] x86,mm: restructure switch_mm_irqs_off Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-10 14:28 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel ` (2 subsequent siblings) 6 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 59 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4b73fe835c95..26542cc17043 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -7,6 +7,7 @@ #include <linux/export.h> #include <linux/cpu.h> #include <linux/debugfs.h> +#include <linux/gfp.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, next->context.ctx_id); /* - * We don't currently support having a real mm loaded without - * our cpu set in mm_cpumask(). We have all the bookkeeping - * in place to figure out whether we would need to flush - * if our cpu were cleared in mm_cpumask(), but we don't - * currently use it. + * Even in lazy TLB mode, the CPU should stay set in the + * mm_cpumask. The TLB shootdown code can figure out from + * from cpu_tlbstate.is_lazy whether or not to send an IPI. */ if (WARN_ON_ONCE(real_prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); - return; + /* + * If the CPU is not in lazy TLB mode, we are just switching + * from one thread in a process to another thread in the same + * process. No TLB flush required. + */ + if (!was_lazy) + return; + + /* + * Read the tlb_gen to check whether a flush is needed. + * If the TLB is up to date, just use it. 
+ * The barrier synchronizes with the tlb_gen increment in + * the TLB shootdown code. + */ + smp_mb(); + next_tlb_gen = atomic64_read(&next->context.tlb_gen); + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == + next_tlb_gen) + return; + + /* + * TLB contents went out of date while we were in lazy + * mode. Fall through to the TLB switching code below. + */ + new_asid = prev_asid; + need_flush = true; } else { u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id); @@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * paging-structure cache to avoid speculatively reading * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. + * + * This should be rare, with native_flush_tlb_others skipping + * IPIs to lazy TLB mode CPUs. */ switch_mm_irqs_off(NULL, &init_mm, NULL); return; @@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { + cpumask_var_t lazymask; + unsigned int cpu; + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ - unsigned int cpu; - cpu = smp_processor_id(); cpumask = uv_flush_tlb_others(cpumask, info); if (cpumask) @@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask, (void *)info, 1); return; } - smp_call_function_many(cpumask, flush_tlb_func_remote, + + /* + * A temporary cpumask is used in order to skip sending IPIs + * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm). + * If the allocation fails, simply IPI every CPU in mm_cpumask. + */ + if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) { + smp_call_function_many(cpumask, flush_tlb_func_remote, + (void *)info, 1); + return; + } + + cpumask_copy(lazymask, cpumask); + + for_each_cpu(cpu, lazymask) { + if (per_cpu(cpu_tlbstate.is_lazy, cpu)) + cpumask_clear_cpu(cpu, lazymask); + } + + smp_call_function_many(lazymask, flush_tlb_func_remote, (void *)info, 1); + + free_cpumask_var(lazymask); } /* -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
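The ordering described in the changelog boils down to a simple pairing; a condensed sketch of the two sides (simplified from the diff above and from the flush path, not literal source):

	/* TLB invalidation side */
	inc_mm_tlb_gen(mm);		/* atomic increment, acts as the barrier */
	/* ... then decide which CPUs to IPI, skipping lazy TLB CPUs ... */

	/* context switch side, prev == next and the CPU was lazy */
	was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
	smp_mb();			/* order the is_lazy read vs. the tlb_gen read */
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == next_tlb_gen)
		return;			/* TLB still up to date: keep using it */
	/* otherwise fall through and flush before running with this mm */

The intent is that a CPU which was skipped by the IPI because it was lazy notices the newer tlb_gen here and flushes before it runs with the mm again.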
* [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel ` (3 preceding siblings ...) 2018-07-10 14:28 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-10 14:28 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel 2018-07-10 14:28 ` [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm Rik van Riel 6 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel CPUs in !is_lazy have either received TLB flush IPIs earlier on during the munmap (when the user memory was unmapped), or have context switched and reloaded during that stage of the munmap. Page table free TLB flushes only need to be sent to CPUs in lazy TLB mode, which TLB contents might not yet be up to date yet. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/mm/tlb.c | 43 +++++++++++++++++++++++++++++++++++++++---- 1 file changed, 39 insertions(+), 4 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 26542cc17043..e4156e37aa71 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -712,15 +712,50 @@ void tlb_flush_remove_tables_local(void *arg) } } +static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm, + struct cpumask *lazy_cpus) +{ + int cpu; + + for_each_cpu(cpu, mm_cpumask(mm)) { + if (!per_cpu(cpu_tlbstate.is_lazy, cpu)) + cpumask_set_cpu(cpu, lazy_cpus); + } +} + void tlb_flush_remove_tables(struct mm_struct *mm) { int cpu = get_cpu(); + cpumask_var_t lazy_cpus; + + if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids) { + put_cpu(); + return; + } + + if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) { + /* + * If the cpumask allocation fails, do a brute force flush + * on all the CPUs that have this mm loaded. + */ + smp_call_function_many(mm_cpumask(mm), + tlb_flush_remove_tables_local, (void *)mm, 1); + put_cpu(); + return; + } + /* - * XXX: this really only needs to be called for CPUs in lazy TLB mode. + * CPUs with !is_lazy either received a TLB flush IPI while the user + * pages in this address range were unmapped, or have context switched + * and reloaded %CR3 since then. + * + * Shootdown IPIs at page table freeing time only need to be sent to + * CPUs that may have out of date TLB contents. */ - if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) - smp_call_function_many(mm_cpumask(mm), tlb_flush_remove_tables_local, (void *)mm, 1); - + mm_fill_lazy_tlb_cpu_mask(mm, lazy_cpus); + smp_call_function_many(lazy_cpus, + tlb_flush_remove_tables_local, (void *)mm, 1); + free_cpumask_var(lazy_cpus); put_cpu(); } -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 6/7] x86,mm: always use lazy TLB mode 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel ` (4 preceding siblings ...) 2018-07-10 14:28 ` [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-10 14:28 ` [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm Rik van Riel 6 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except at page table freeing time, and idle CPUs will no longer get shootdown IPIs for things like mprotect and madvise, we can always use lazy TLB mode. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/include/asm/tlbflush.h | 16 ---------------- arch/x86/mm/tlb.c | 15 +-------------- 2 files changed, 1 insertion(+), 30 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 3aa3204b5dc0..511bf5fae8b8 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid) #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr) #endif -static inline bool tlb_defer_switch_to_init_mm(void) -{ - /* - * If we have PCID, then switching to init_mm is reasonably - * fast. If we don't have PCID, then switching to init_mm is - * quite slow, so we try to defer it in the hopes that we can - * avoid it entirely. The latter approach runs the risk of - * receiving otherwise unnecessary IPIs. - * - * This choice is just a heuristic. The tlb code can handle this - * function returning true or false regardless of whether we have - * PCID. - */ - return !static_cpu_has(X86_FEATURE_PCID); -} - struct tlb_context { u64 ctx_id; u64 tlb_gen; diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index e4156e37aa71..493559cae2d5 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -379,20 +379,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm) return; - if (tlb_defer_switch_to_init_mm()) { - /* - * There's a significant optimization that may be possible - * here. We have accurate enough TLB flush tracking that we - * don't need to maintain coherence of TLB per se when we're - * lazy. We do, however, need to maintain coherence of - * paging-structure caches. We could, in principle, leave our - * old mm loaded and only switch to init_mm when - * tlb_remove_page() happens. - */ - this_cpu_write(cpu_tlbstate.is_lazy, true); - } else { - switch_mm(NULL, &init_mm, NULL); - } + this_cpu_write(cpu_tlbstate.is_lazy, true); } /* -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm 2018-07-10 14:28 [PATCH v5 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel ` (5 preceding siblings ...) 2018-07-10 14:28 ` [PATCH 6/7] x86,mm: always use lazy TLB mode Rik van Riel @ 2018-07-10 14:28 ` Rik van Riel 2018-07-15 23:04 ` Ingo Molnar 6 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-10 14:28 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, efault, tglx, songliubraving, hpa, Rik van Riel Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels,using 1.8% of a 48 CPU system during a netperf to localhost run. Digging into the profile, we noticed that cpumask_clear_cpu and cpumask_set_cpu together take about half of the CPU time taken by switch_mm_irqs_off. However, the CPUs running netperf end up switching back and forth between netperf and the idle task, which does not require changes to the mm_cpumask. Furthermore, the init_mm cpumask ends up being the most heavily contended one in the system. Simply skipping changes to mm_cpumask(&init_mm) reduces overhead. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Reported-and-tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/mm/tlb.c | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 493559cae2d5..ac86c5010472 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -310,15 +310,22 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, sync_current_stack_to_mm(next); } - /* Stop remote flushes for the previous mm */ - VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, mm_cpumask(real_prev)) && - real_prev != &init_mm); - cpumask_clear_cpu(cpu, mm_cpumask(real_prev)); + /* + * Stop remote flushes for the previous mm. + * Skip the idle task; we never send init_mm TLB flushing IPIs, + * but the bitmap manipulation can cause cache line contention. + */ + if (real_prev != &init_mm) { + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, + mm_cpumask(real_prev))); + cpumask_clear_cpu(cpu, mm_cpumask(real_prev)); + } /* * Start remote flushes and then read tlb_gen. */ - cpumask_set_cpu(cpu, mm_cpumask(next)); + if (next != &init_mm) + cpumask_set_cpu(cpu, mm_cpumask(next)); next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm 2018-07-10 14:28 ` [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm Rik van Riel @ 2018-07-15 23:04 ` Ingo Molnar 2018-07-15 23:49 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Ingo Molnar @ 2018-07-15 23:04 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa * Rik van Riel <riel@surriel.com> wrote: > + /* > + * Stop remote flushes for the previous mm. > + * Skip the idle task; we never send init_mm TLB flushing IPIs, > + * but the bitmap manipulation can cause cache line contention. > + */ > + if (real_prev != &init_mm) { > + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, > + mm_cpumask(real_prev))); > + cpumask_clear_cpu(cpu, mm_cpumask(real_prev)); BTW., could this optimization be (safely) extended to all (or most) !task->mm kernel threads? In particular softirq and threaded irq handlers could benefit greatly I suspect in certain networking intense workloads that happen to active them. Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm 2018-07-15 23:04 ` Ingo Molnar @ 2018-07-15 23:49 ` Rik van Riel 2018-07-16 1:04 ` Ingo Molnar 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-15 23:49 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa [-- Attachment #1: Type: text/plain, Size: 910 bytes --] On Mon, 2018-07-16 at 01:04 +0200, Ingo Molnar wrote: > * Rik van Riel <riel@surriel.com> wrote: > > > + /* > > + * Stop remote flushes for the previous mm. > > + * Skip the idle task; we never send init_mm TLB > > flushing IPIs, > > + * but the bitmap manipulation can cause cache > > line contention. > > + */ > > + if (real_prev != &init_mm) { > > + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, > > + mm_cpumask(real_pr > > ev))); > > + cpumask_clear_cpu(cpu, > > mm_cpumask(real_prev)); > > BTW., could this optimization be (safely) extended to all (or most) > !task->mm > kernel threads? > > In particular softirq and threaded irq handlers could benefit greatly > I suspect in > certain networking intense workloads that happen to active them. Yes, it could. Are there kernel threads that use something other than init_mm today? -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm 2018-07-15 23:49 ` Rik van Riel @ 2018-07-16 1:04 ` Ingo Molnar 2018-07-16 18:37 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Ingo Molnar @ 2018-07-16 1:04 UTC (permalink / raw) To: Rik van Riel Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa * Rik van Riel <riel@surriel.com> wrote: > On Mon, 2018-07-16 at 01:04 +0200, Ingo Molnar wrote: > > * Rik van Riel <riel@surriel.com> wrote: > > > > > + /* > > > + * Stop remote flushes for the previous mm. > > > + * Skip the idle task; we never send init_mm TLB > > > flushing IPIs, > > > + * but the bitmap manipulation can cause cache > > > line contention. > > > + */ > > > + if (real_prev != &init_mm) { > > > + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, > > > + mm_cpumask(real_pr > > > ev))); > > > + cpumask_clear_cpu(cpu, > > > mm_cpumask(real_prev)); > > > > BTW., could this optimization be (safely) extended to all (or most) > > !task->mm > > kernel threads? > > > > In particular softirq and threaded irq handlers could benefit greatly > > I suspect in > > certain networking intense workloads that happen to active them. > > Yes, it could. > > Are there kernel threads that use something other than > init_mm today? Yeah, I think that's the typical case - so at minimum the comment should be fixed: > > > + * Skip the idle task; we never send init_mm TLB flushing IPIs, and it should say 'skip kernel threads', right? Thanks, Ingo ^ permalink raw reply [flat|nested] 46+ messages in thread
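For context on that exchange: kernel threads run with current->mm == NULL, and current->active_mm points at whatever mm the CPU was borrowing when the kthread was scheduled (the lazy TLB mm), which is frequently a user mm rather than init_mm. A few kernel threads also adopt a user mm outright; a rough sketch of that pattern as it exists in these kernels (declared in linux/mmu_context.h and used by, for example, the vhost worker threads), where user_mm stands for some user process's mm_struct:

	use_mm(user_mm);	/* the kthread now runs with user_mm loaded */
	/* ... access that process's user mappings on its behalf ... */
	unuse_mm(user_mm);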
* Re: [PATCH 7/7] x86,switch_mm: skip atomic operations for init_mm 2018-07-16 1:04 ` Ingo Molnar @ 2018-07-16 18:37 ` Rik van Riel 0 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-16 18:37 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, x86, luto, dave.hansen, kernel-team, efault, tglx, songliubraving, hpa [-- Attachment #1: Type: text/plain, Size: 1572 bytes --] On Mon, 2018-07-16 at 03:04 +0200, Ingo Molnar wrote: > * Rik van Riel <riel@surriel.com> wrote: > > > On Mon, 2018-07-16 at 01:04 +0200, Ingo Molnar wrote: > > > * Rik van Riel <riel@surriel.com> wrote: > > > > > > > + /* > > > > + * Stop remote flushes for the previous mm. > > > > + * Skip the idle task; we never send init_mm > > > > TLB > > > > flushing IPIs, > > > > + * but the bitmap manipulation can cause cache > > > > line contention. > > > > + */ > > > > + if (real_prev != &init_mm) { > > > > + VM_WARN_ON_ONCE(!cpumask_test_cpu(cpu, > > > > + mm_cpumask(rea > > > > l_pr > > > > ev))); > > > > + cpumask_clear_cpu(cpu, > > > > mm_cpumask(real_prev)); > > > > > > BTW., could this optimization be (safely) extended to all (or > > > most) > > > !task->mm > > > kernel threads? > > > > > > In particular softirq and threaded irq handlers could benefit > > > greatly > > > I suspect in > > > certain networking intense workloads that happen to active them. > > > > Yes, it could. > > > > Are there kernel threads that use something other than > > init_mm today? > > Yeah, I think that's the typical case - so at minimum the comment > should be fixed: > > > > > + * Skip the idle task; we never send init_mm TLB > flushing IPIs, > > and it should say 'skip kernel threads', right? I will send a v6 that improves this comment, and has the S-o-b thing you suggested for patch 1/7. I think that addresses all the comments people had on this patch series. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier @ 2018-07-16 19:03 Rik van Riel 2018-07-16 19:03 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-16 19:03 UTC (permalink / raw) To: linux-kernel; +Cc: x86, luto, efault, kernel-team, mingo, dave.hansen Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels, using 1.9% of a 48-CPU system during a netperf run. Digging into the profile, the atomic operations in cpumask_clear_cpu and cpumask_set_cpu are responsible for about half of that CPU use. However, the CPUs running netperf are simply switching back and forth between netperf and the idle task, which would not require any changes to the mm_cpumask if lazy TLB mode were used. Additionally, the init_mm cpumask ends up being the most heavily contended one in the system, for no reason at all. By making really lazy TLB mode work again on modern kernels, sending a shootdown IPI only when page table pages are being unmapped, we get back some performance. v6 addresses the comment and Signed-off-by issues pointed out by Ingo. On memcache workloads on 2-socket systems, this patch series seems to reduce total system CPU use by 1-2%. On Song's netbench tests, CPU use in the context switch path is roughly cut in half. These patches also save a little memory by shrinking the size of mm_struct, especially on distro kernels compiled with a gigantically large NR_CPUS. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-16 19:03 [PATCH v6 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel @ 2018-07-16 19:03 ` Rik van Riel 2018-07-17 20:04 ` Andy Lutomirski 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-16 19:03 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, efault, kernel-team, mingo, dave.hansen, Rik van Riel Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 59 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4b73fe835c95..26542cc17043 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -7,6 +7,7 @@ #include <linux/export.h> #include <linux/cpu.h> #include <linux/debugfs.h> +#include <linux/gfp.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, next->context.ctx_id); /* - * We don't currently support having a real mm loaded without - * our cpu set in mm_cpumask(). We have all the bookkeeping - * in place to figure out whether we would need to flush - * if our cpu were cleared in mm_cpumask(), but we don't - * currently use it. + * Even in lazy TLB mode, the CPU should stay set in the + * mm_cpumask. The TLB shootdown code can figure out from + * from cpu_tlbstate.is_lazy whether or not to send an IPI. */ if (WARN_ON_ONCE(real_prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); - return; + /* + * If the CPU is not in lazy TLB mode, we are just switching + * from one thread in a process to another thread in the same + * process. No TLB flush required. + */ + if (!was_lazy) + return; + + /* + * Read the tlb_gen to check whether a flush is needed. + * If the TLB is up to date, just use it. + * The barrier synchronizes with the tlb_gen increment in + * the TLB shootdown code. 
+ */ + smp_mb(); + next_tlb_gen = atomic64_read(&next->context.tlb_gen); + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == + next_tlb_gen) + return; + + /* + * TLB contents went out of date while we were in lazy + * mode. Fall through to the TLB switching code below. + */ + new_asid = prev_asid; + need_flush = true; } else { u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id); @@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * paging-structure cache to avoid speculatively reading * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. + * + * This should be rare, with native_flush_tlb_others skipping + * IPIs to lazy TLB mode CPUs. */ switch_mm_irqs_off(NULL, &init_mm, NULL); return; @@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { + cpumask_var_t lazymask; + unsigned int cpu; + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ - unsigned int cpu; - cpu = smp_processor_id(); cpumask = uv_flush_tlb_others(cpumask, info); if (cpumask) @@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask, (void *)info, 1); return; } - smp_call_function_many(cpumask, flush_tlb_func_remote, + + /* + * A temporary cpumask is used in order to skip sending IPIs + * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm). + * If the allocation fails, simply IPI every CPU in mm_cpumask. + */ + if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) { + smp_call_function_many(cpumask, flush_tlb_func_remote, + (void *)info, 1); + return; + } + + cpumask_copy(lazymask, cpumask); + + for_each_cpu(cpu, lazymask) { + if (per_cpu(cpu_tlbstate.is_lazy, cpu)) + cpumask_clear_cpu(cpu, lazymask); + } + + smp_call_function_many(lazymask, flush_tlb_func_remote, (void *)info, 1); + + free_cpumask_var(lazymask); } /* -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
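For reference, the other half of the ordering described in the changelog, the increment that the new smp_mb() pairs with, is inc_mm_tlb_gen() in arch/x86/include/asm/tlbflush.h. Quoted roughly from memory, so treat it as a sketch:

/*
 * Bump the generation count.  atomic64_inc_return() is a full barrier,
 * so the page table updates made before calling this are ordered before
 * the new tlb_gen value becomes visible to other CPUs.
 */
static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
{
	return atomic64_inc_return(&mm->context.tlb_gen);
}

The intent is that if the flush side skipped this CPU because it looked lazy, the smp_mb() in the hunk above guarantees the following atomic64_read() of next->context.tlb_gen observes the bumped generation, so the stale TLB is caught and flushed at the next context switch instead of the invalidation being lost.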
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-16 19:03 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel @ 2018-07-17 20:04 ` Andy Lutomirski [not found] ` <FF977B78-140F-4787-AA57-0EA934017D85@surriel.com> 2018-07-18 20:58 ` Rik van Riel 0 siblings, 2 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-07-17 20:04 UTC (permalink / raw) To: Rik van Riel Cc: LKML, X86 ML, Andrew Lutomirski, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Mon, Jul 16, 2018 at 12:03 PM, Rik van Riel <riel@surriel.com> wrote: > Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, > when all it really needs to do is reload %CR3 at the next context switch, > assuming no page table pages got freed. > > Memory ordering is used to prevent race conditions between switch_mm_irqs_off, > which checks whether .tlb_gen changed, and the TLB invalidation code, which > increments .tlb_gen whenever page table entries get invalidated. > > The atomic increment in inc_mm_tlb_gen is its own barrier; the context > switch code adds an explicit barrier between reading tlbstate.is_lazy and > next->context.tlb_gen. > > Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set > are not removed from the mm_cpumask(mm), since that would prevent the TLB > flush IPIs at page table free time from being sent to all the CPUs > that need them. > > This patch reduces total CPU use in the system by about 1-2% for a > memcache workload on two socket systems, and by about 1% for a heavily > multi-process netperf between two systems. > I'm not 100% certain I'm replying to the right email, and I haven't gotten the tip-bot notification at all, but: I think you've introduced a minor-ish performance regression due to changing the old (admittedly terribly documented) control flow a bit. Before, if real_prev == next, we would skip: load_mm_cr4(next); switch_ldt(real_prev, next); Now we don't any more. I think you should reinstate that optimization. It's probably as simple as wrapping them in an if (real_priv != next) with a comment like /* Remote changes that would require a cr4 or ldt reload will unconditionally send an IPI even to lazy CPUs. So, if we aren't changing our mm, we don't need to refresh cr4 or the ldt */ Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to fix that part or should I? --Andy ^ permalink raw reply [flat|nested] 46+ messages in thread
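Concretely, the reinstated optimization would presumably look something like this at the tail of switch_mm_irqs_off(), where load_mm_cr4() and switch_ldt() are currently called unconditionally -- a sketch of the suggestion above, not a tested patch:

	/*
	 * Remote changes that would require a CR4 or LDT reload will
	 * unconditionally send an IPI even to lazy CPUs.  So, if we are
	 * not changing our mm, we don't need to refresh CR4 or the LDT.
	 */
	if (real_prev != next) {
		load_mm_cr4(next);
		switch_ldt(real_prev, next);
	}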
[parent not found: <FF977B78-140F-4787-AA57-0EA934017D85@surriel.com>]
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier [not found] ` <FF977B78-140F-4787-AA57-0EA934017D85@surriel.com> @ 2018-07-17 21:29 ` Andy Lutomirski 2018-07-17 22:05 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Andy Lutomirski @ 2018-07-17 21:29 UTC (permalink / raw) To: Rik van Riel Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <riel@surriel.com> wrote: > Can I skip both the cr4 and let switches when the TLB contents > are no longer valid and got reloaded? > > If the TLB contents are still valid, either because we never went > into lazy TLB mode, or because no invalidates happened while > we were lazy, we immediately return. > > The cr4 and ldt reloads only happen if the TLB was invalidated > while we were in lazy TLB mode. Yes, since the only events that would change the LDT or the required CR4 value will unconditionally broadcast to every CPU in mm_cpumask regardless of whether they're lazy. The interesting case is that you go lazy, you miss an invalidation IPI because you were lazy, then you go unlazy, notice the tlb_gen change, and flush. If this happens, you know that you only missed a page table update and not an LDT update or a CR4 update, because the latter would have sent the IPI even though you were lazy. So you should skip the CR4 and LDT updates. I suppose a different approach would be to fix the issue below and to try to track when the LDT actually needs reloading. But that latter part seems a bit complicated for minimal gain. (Do you believe me? If not, please argue back!) >> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to >> fix that part or should I? > > I would be happy to send in a patch for this, and one for > the above optimization you pointed out. > Yes please! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-17 21:29 ` Andy Lutomirski @ 2018-07-17 22:05 ` Rik van Riel 2018-07-17 22:27 ` Andy Lutomirski 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-17 22:05 UTC (permalink / raw) To: Andy Lutomirski Cc: LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen > On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <luto@kernel.org> wrote: > > On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <riel@surriel.com> wrote: >> Can I skip both the cr4 and let switches when the TLB contents >> are no longer valid and got reloaded? >> >> If the TLB contents are still valid, either because we never went >> into lazy TLB mode, or because no invalidates happened while >> we were lazy, we immediately return. >> >> The cr4 and ldt reloads only happen if the TLB was invalidated >> while we were in lazy TLB mode. > > Yes, since the only events that would change the LDT or the required > CR4 value will unconditionally broadcast to every CPU in mm_cpumask > regardless of whether they're lazy. The interesting case is that you > go lazy, you miss an invalidation IPI because you were lazy, then you > go unlazy, notice the tlb_gen change, and flush. If this happens, you > know that you only missed a page table update and not an LDT update or > a CR4 update, because the latter would have sent the IPI even though > you were lazy. So you should skip the CR4 and LDT updates. > > I suppose a different approach would be to fix the issue below and to > try to track when the LDT actually needs reloading. But that latter > part seems a bit complicated for minimal gain. > > (Do you believe me? If not, please argue back!) > I believe you :) >>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to >>> fix that part or should I? >> >> I would be happy to send in a patch for this, and one for >> the above optimization you pointed out. >> > > Yes please! > There is a third optimization left to do. Currently every time we switch into lazy tlb mode, we take a refcount on the mm, even when switching from one kernel thread to another, or when repeatedly switching between the same mm and kernel threads. We could keep that refcount (on a per cpu basis) from the time we first switch to that mm in lazy tlb mode, to when we switch the CPU to a different mm. That would allow us to not bounce the cache line with the mm_struct reference count on every lazy TLB context switch. Does that seem like a reasonable optimization? Am I overlooking anything? I'll try to get all three optimizations working, and will run them through some testing here before posting upstream. ^ permalink raw reply [flat|nested] 46+ messages in thread
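One way the per-CPU reference idea could look -- the field and helper names below are invented for illustration and are not from this thread or from mainline -- is to remember which mm this CPU already holds a lazy reference on, and only touch mm_count when that changes:

/* Hypothetical sketch; lazy_ref_mm and both helpers are made-up names. */
static DEFINE_PER_CPU(struct mm_struct *, lazy_ref_mm);

/* Called with IRQs off when this CPU first goes lazy on @mm. */
static void lazy_tlb_grab(struct mm_struct *mm)
{
	if (this_cpu_read(lazy_ref_mm) == mm)
		return;			/* already hold a reference */

	mmgrab(mm);			/* one atomic, not one per switch */
	this_cpu_write(lazy_ref_mm, mm);
}

/* Called with IRQs off when this CPU switches to a different mm. */
static void lazy_tlb_release(struct mm_struct *next_mm)
{
	struct mm_struct *mm = this_cpu_read(lazy_ref_mm);

	if (mm && mm != next_mm) {
		this_cpu_write(lazy_ref_mm, NULL);
		mmdrop(mm);
	}
}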
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-17 22:05 ` Rik van Riel @ 2018-07-17 22:27 ` Andy Lutomirski 0 siblings, 0 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-07-17 22:27 UTC (permalink / raw) To: Rik van Riel Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen > On Jul 17, 2018, at 12:05 PM, Rik van Riel <riel@surriel.com> wrote: > > > >> On Jul 17, 2018, at 5:29 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> On Tue, Jul 17, 2018 at 1:16 PM, Rik van Riel <riel@surriel.com> wrote: >>> Can I skip both the cr4 and let switches when the TLB contents >>> are no longer valid and got reloaded? >>> >>> If the TLB contents are still valid, either because we never went >>> into lazy TLB mode, or because no invalidates happened while >>> we were lazy, we immediately return. >>> >>> The cr4 and ldt reloads only happen if the TLB was invalidated >>> while we were in lazy TLB mode. >> >> Yes, since the only events that would change the LDT or the required >> CR4 value will unconditionally broadcast to every CPU in mm_cpumask >> regardless of whether they're lazy. The interesting case is that you >> go lazy, you miss an invalidation IPI because you were lazy, then you >> go unlazy, notice the tlb_gen change, and flush. If this happens, you >> know that you only missed a page table update and not an LDT update or >> a CR4 update, because the latter would have sent the IPI even though >> you were lazy. So you should skip the CR4 and LDT updates. >> >> I suppose a different approach would be to fix the issue below and to >> try to track when the LDT actually needs reloading. But that latter >> part seems a bit complicated for minimal gain. >> >> (Do you believe me? If not, please argue back!) >> > I believe you :) > >>>> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to >>>> fix that part or should I? >>> >>> I would be happy to send in a patch for this, and one for >>> the above optimization you pointed out. >>> >> >> Yes please! >> > There is a third optimization left to do. Currently every time > we switch into lazy tlb mode, we take a refcount on the mm, > even when switching from one kernel thread to another, or > when repeatedly switching between the same mm and kernel > threads. > > We could keep that refcount (on a per cpu basis) from the time > we first switch to that mm in lazy tlb mode, to when we switch > the CPU to a different mm. > > That would allow us to not bounce the cache line with the > mm_struct reference count on every lazy TLB context switch. > > Does that seem like a reasonable optimization? Are you referring to the core sched code that deals with mm_count and active_mm? If so, last time I looked at it, I convinced myself that it was totally useless, at least on x86. I think the my reasoning was that, when mm_users went to zero, we already waited for RCU before tearing down page tables. Things may have changed, but I strongly suspect that it should be possibly for at least x86 to opt out of mm_count and maybe even active_mm entirely. If nothing else, you’re shooting the mm out of CR3 on all CPUs whenever the pagetables get freed, and more or less the same logic should be sufficient so that, whenever mm_users hits zero, we can synchronously or via RCU callback kill the mm entirely. Want to take a look at that? > > Am I overlooking anything? > > I'll try to get all three optimizations working, and will run them > through some testing here before posting upstream. 
> ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-17 20:04 ` Andy Lutomirski [not found] ` <FF977B78-140F-4787-AA57-0EA934017D85@surriel.com> @ 2018-07-18 20:58 ` Rik van Riel 2018-07-18 23:13 ` Andy Lutomirski 1 sibling, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-18 20:58 UTC (permalink / raw) To: Andy Lutomirski Cc: LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen > On Jul 17, 2018, at 4:04 PM, Andy Lutomirski <luto@kernel.org> wrote: > > > I think you've introduced a minor-ish performance regression due to > changing the old (admittedly terribly documented) control flow a bit. > Before, if real_prev == next, we would skip: > > load_mm_cr4(next); > switch_ldt(real_prev, next); > > Now we don't any more. I think you should reinstate that > optimization. It's probably as simple as wrapping them in an if > (real_priv != next) with a comment like /* Remote changes that would > require a cr4 or ldt reload will unconditionally send an IPI even to > lazy CPUs. So, if we aren't changing our mm, we don't need to refresh > cr4 or the ldt */ Looks like switch_ldt already skips reloading the LDT when prev equals next, or when they simply have the same LDT values: if (unlikely((unsigned long)prev->context.ldt | (unsigned long)next->context.ldt)) load_mm_ldt(next); It appears that the cr4 bits have a similar optimization: static inline void cr4_set_bits(unsigned long mask) { unsigned long cr4, flags; local_irq_save(flags); cr4 = this_cpu_read(cpu_tlbstate.cr4); if ((cr4 | mask) != cr4) __cr4_set(cr4 | mask); local_irq_restore(flags); } > > Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to > fix that part or should I? > Looks like there might not be anything to do here, after all. On to the lazy TLB mm_struct refcounting stuff :) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-18 20:58 ` Rik van Riel @ 2018-07-18 23:13 ` Andy Lutomirski [not found] ` <B976CC13-D014-433A-83DE-F8DF9AB4F421@surriel.com> 0 siblings, 1 reply; 46+ messages in thread From: Andy Lutomirski @ 2018-07-18 23:13 UTC (permalink / raw) To: Rik van Riel Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen > On Jul 18, 2018, at 10:58 AM, Rik van Riel <riel@surriel.com> wrote: > > > >> On Jul 17, 2018, at 4:04 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> >> I think you've introduced a minor-ish performance regression due to >> changing the old (admittedly terribly documented) control flow a bit. >> Before, if real_prev == next, we would skip: >> >> load_mm_cr4(next); >> switch_ldt(real_prev, next); >> >> Now we don't any more. I think you should reinstate that >> optimization. It's probably as simple as wrapping them in an if >> (real_priv != next) with a comment like /* Remote changes that would >> require a cr4 or ldt reload will unconditionally send an IPI even to >> lazy CPUs. So, if we aren't changing our mm, we don't need to refresh >> cr4 or the ldt */ > > Looks like switch_ldt already skips reloading the LDT when prev equals > next, or when they simply have the same LDT values: > > if (unlikely((unsigned long)prev->context.ldt | > (unsigned long)next->context.ldt)) > load_mm_ldt(next); > Read that again? It will reload if there’s an LDT, even if it’s the same one. > It appears that the cr4 bits have a similar optimization: > > static inline void cr4_set_bits(unsigned long mask) > { > unsigned long cr4, flags; > > local_irq_save(flags); > cr4 = this_cpu_read(cpu_tlbstate.cr4); > if ((cr4 | mask) != cr4) > __cr4_set(cr4 | mask); > local_irq_restore(flags); > } >> >> Hmm. load_mm_cr4() should bypass itself when mm == &init_mm. Want to >> fix that part or should I? >> > Looks like there might not be anything to do here, after all. But if init_mm and the thread that just went idle have different selected cr4 values, we’ll still write it. With your lazy TLB work, it’s less of a big deal, but still. I’m happy to fix this myself, though. > > On to the lazy TLB mm_struct refcounting stuff :) > Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about mm_count. My suggestion is to get rid of mm_count instead of trying to optimize it. ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <B976CC13-D014-433A-83DE-F8DF9AB4F421@surriel.com>]
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier [not found] ` <B976CC13-D014-433A-83DE-F8DF9AB4F421@surriel.com> @ 2018-07-19 16:45 ` Andy Lutomirski 2018-07-19 17:04 ` Andy Lutomirski ` (3 more replies) 0 siblings, 4 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-07-19 16:45 UTC (permalink / raw) To: Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen [I added PeterZ and Vitaly -- can you see any way in which this would break something obscure? I don't.] On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote: > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals > next? Yes, AFAICS. > > On to the lazy TLB mm_struct refcounting stuff :) > >> >> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about >> mm_count. My suggestion is to get rid of mm_count instead of trying to >> optimize it. > > > Do you have any suggestions on how? :) > > The TLB shootdown sent at __exit_mm time does not get rid of the > kernelthread->active_mm > pointer pointing at the mm that is exiting. > Ah, but that's conceptually very easy to fix. Add a #define like ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that #define is set. After some grepping, there are very few users. The only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that are involved in the rather complicated dance of refcounting active_mm. If that field goes away, it doesn't need to be refcounted. Instead, I think the refcounting can get replaced with something like: /* * Release any arch-internal references to mm. Only called when mm_users is zero * and all tasks using mm have either been switch_mm()'d away or have had * enter_lazy_tlb() called. */ extern void arch_shoot_down_dead_mm(struct mm_struct *mm); which the kernel calls in __mmput() after tearing down all the page tables. The body can be something like: if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { /* send an IPI. Maybe just call tlb_flush_remove_tables() */ } (You'll also have to fix up the highly questionable users in arch/x86/platform/efi/efi_64.c, but that's easy.) Does all that make sense? Basically, as I understand it, the expensive atomic ops you're seeing are all pointless because they're enabling an optimization that hasn't actually worked for a long time, if ever. ^ permalink raw reply [flat|nested] 46+ messages in thread
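Purely as a hypothetical fleshing-out of the hook sketched above -- the argument to cpumask_any_but() and the choice of tlb_flush_remove_tables() as the heavy hammer are assumptions, not code from this thread or from mainline:

/* Hypothetical body; see the caveats in the paragraph above. */
void arch_shoot_down_dead_mm(struct mm_struct *mm)
{
	/*
	 * By the time the page tables are gone, no CPU other than
	 * (possibly) the current one should still have this mm loaded.
	 */
	if (WARN_ON_ONCE(cpumask_any_but(mm_cpumask(mm),
					 smp_processor_id()) < nr_cpu_ids)) {
		/* Stragglers exist: force them off the dying mm. */
		tlb_flush_remove_tables(mm);
	}
}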
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 16:45 ` Andy Lutomirski @ 2018-07-19 17:04 ` Andy Lutomirski [not found] ` <CF849A07-B7CE-4DE9-8246-53AC5A53A705@surriel.com> ` (2 subsequent siblings) 3 siblings, 0 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-07-19 17:04 UTC (permalink / raw) To: Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, linux-arch, Will Deacon, Catalin Marinas, linux-s390, Benjamin Herrenschmidt, linuxppc-dev Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote: > [I added PeterZ and Vitaly -- can you see any way in which this would > break something obscure? I don't.] > > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote: >> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals >> next? > > Yes, AFAICS. > >> >> On to the lazy TLB mm_struct refcounting stuff :) >> >>> >>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about >>> mm_count. My suggestion is to get rid of mm_count instead of trying to >>> optimize it. >> >> >> Do you have any suggestions on how? :) >> >> The TLB shootdown sent at __exit_mm time does not get rid of the >> kernelthread->active_mm >> pointer pointing at the mm that is exiting. >> > > Ah, but that's conceptually very easy to fix. Add a #define like > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that > #define is set. After some grepping, there are very few users. The > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > are involved in the rather complicated dance of refcounting active_mm. > If that field goes away, it doesn't need to be refcounted. Instead, I > think the refcounting can get replaced with something like: > > /* > * Release any arch-internal references to mm. Only called when > mm_users is zero > * and all tasks using mm have either been switch_mm()'d away or have had > * enter_lazy_tlb() called. > */ > extern void arch_shoot_down_dead_mm(struct mm_struct *mm); > > which the kernel calls in __mmput() after tearing down all the page > tables. The body can be something like: > > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { > /* send an IPI. Maybe just call tlb_flush_remove_tables() */ > } > > (You'll also have to fix up the highly questionable users in > arch/x86/platform/efi/efi_64.c, but that's easy.) > > Does all that make sense? Basically, as I understand it, the > expensive atomic ops you're seeing are all pointless because they're > enabling an optimization that hasn't actually worked for a long time, > if ever. Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from arch_exit_mmap(), I think. It's a heavier weight version of more or less the same thing that arch_shoot_down_dead_mm() would be, except that it happens before exit_mmap(). But maybe Xen actually has the right idea. In other words, rather doing the big pagetable free in exit_mmap() while there may still be other CPUs pointing at the page tables, the other order might make more sense. So maybe, if ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible for getting rid of all secret arch references to the mm. Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name. I added some more arch maintainers. The idea here is that, on x86 at least, task->active_mm and all its refcounting is pure overhead. 
When a process exits, __mmput() gets called, but the core kernel has a longstanding "optimization" in which other tasks (kernel threads and idle tasks) may have ->active_mm pointing at this mm. This is nasty, complicated, and hurts performance on large systems, since it requires extra atomic operations whenever a CPU switches between real user threads and idle/kernel threads. It's also almost completely worthless on x86 at least, since __mmput() frees pagetables, and that operation *already* forces a remote TLB flush, so we might as well zap all the active_mm references at the same time. But arm64 has real HW remote flushes. Does arm64 actually benefit from the active_mm optimization? What happens on arm64 when a process exits? How about s390? I suspect that s390 has rather larger systems than arm64, where the cost of the reference counting can be much higher. (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that interact with your previous patch set?) ^ permalink raw reply [flat|nested] 46+ messages in thread
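The "extra atomic operations" referred to here are the mmgrab()/mmdrop() pair on active_mm in the scheduler's context_switch()/finish_task_switch(); simplified, the pattern being criticized is roughly the following (the same lines Peter's sketch later in the thread reworks):

	if (!mm) {			/* switching to a kernel thread */
		next->active_mm = oldmm;
		mmgrab(oldmm);		/* atomic increment of mm_count */
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm_irqs_off(oldmm, mm, next);

	if (!prev->mm) {		/* switching away from a kernel thread */
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;	/* mmdrop()ped in finish_task_switch() */
	}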
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 17:04 ` Andy Lutomirski (?) @ 2018-07-20 4:57 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 46+ messages in thread From: Benjamin Herrenschmidt @ 2018-07-20 4:57 UTC (permalink / raw) To: Andy Lutomirski, Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, linux-arch, Will Deacon, Catalin Marinas, linux-s390, linuxppc-dev Cc: LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen, Nick Piggin, Aneesh Kumar K.V On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote: > On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote: > > [I added PeterZ and Vitaly -- can you see any way in which this would > > break something obscure? I don't.] Added Nick and Aneesh. We do have HW remote flushes on powerpc. > > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote: > > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals > > > next? > > > > Yes, AFAICS. > > > > > > > > On to the lazy TLB mm_struct refcounting stuff :) > > > > > > > > > > > Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about > > > > mm_count. My suggestion is to get rid of mm_count instead of trying to > > > > optimize it. > > > > > > > > > Do you have any suggestions on how? :) > > > > > > The TLB shootdown sent at __exit_mm time does not get rid of the > > > kernelthread->active_mm > > > pointer pointing at the mm that is exiting. > > > > > > > Ah, but that's conceptually very easy to fix. Add a #define like > > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that > > #define is set. After some grepping, there are very few users. The > > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > > are involved in the rather complicated dance of refcounting active_mm. > > If that field goes away, it doesn't need to be refcounted. Instead, I > > think the refcounting can get replaced with something like: > > > > /* > > * Release any arch-internal references to mm. Only called when > > mm_users is zero > > * and all tasks using mm have either been switch_mm()'d away or have had > > * enter_lazy_tlb() called. > > */ > > extern void arch_shoot_down_dead_mm(struct mm_struct *mm); > > > > which the kernel calls in __mmput() after tearing down all the page > > tables. The body can be something like: > > > > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { > > /* send an IPI. Maybe just call tlb_flush_remove_tables() */ > > } > > > > (You'll also have to fix up the highly questionable users in > > arch/x86/platform/efi/efi_64.c, but that's easy.) > > > > Does all that make sense? Basically, as I understand it, the > > expensive atomic ops you're seeing are all pointless because they're > > enabling an optimization that hasn't actually worked for a long time, > > if ever. > > Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from > arch_exit_mmap(), I think. It's a heavier weight version of more or > less the same thing that arch_shoot_down_dead_mm() would be, except > that it happens before exit_mmap(). But maybe Xen actually has the > right idea. In other words, rather doing the big pagetable free in > exit_mmap() while there may still be other CPUs pointing at the page > tables, the other order might make more sense. So maybe, if > ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible > for getting rid of all secret arch references to the mm. > > Hmm. 
ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name. > > I added some more arch maintainers. The idea here is that, on x86 at > least, task->active_mm and all its refcounting is pure overhead. When > a process exits, __mmput() gets called, but the core kernel has a > longstanding "optimization" in which other tasks (kernel threads and > idle tasks) may have ->active_mm pointing at this mm. This is nasty, > complicated, and hurts performance on large systems, since it requires > extra atomic operations whenever a CPU switches between real users > threads and idle/kernel threads. > > It's also almost completely worthless on x86 at least, since __mmput() > frees pagetables, and that operation *already* forces a remote TLB > flush, so we might as well zap all the active_mm references at the > same time. > > But arm64 has real HW remote flushes. Does arm64 actually benefit > from the active_mm optimization? What happens on arm64 when a process > exits? How about s390? I suspect that x390 has rather larger systems > than arm64, where the cost of the reference counting can be much > higher. > > (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that > interact with your previous patch set?) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 17:04 ` Andy Lutomirski (?) (?) @ 2018-07-20 8:30 ` Peter Zijlstra 2018-07-23 12:26 ` Rik van Riel -1 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2018-07-20 8:30 UTC (permalink / raw) To: Andy Lutomirski Cc: Rik van Riel, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, linux-arch, Will Deacon, Catalin Marinas, linux-s390, Benjamin Herrenschmidt, linuxppc-dev, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote: > I added some more arch maintainers. The idea here is that, on x86 at > least, task->active_mm and all its refcounting is pure overhead. When > a process exits, __mmput() gets called, but the core kernel has a > longstanding "optimization" in which other tasks (kernel threads and > idle tasks) may have ->active_mm pointing at this mm. This is nasty, > complicated, and hurts performance on large systems, since it requires > extra atomic operations whenever a CPU switches between real users > threads and idle/kernel threads. > > It's also almost completely worthless on x86 at least, since __mmput() > frees pagetables, and that operation *already* forces a remote TLB > flush, so we might as well zap all the active_mm references at the > same time. So I disagree that active_mm is complicated (the code is less than ideal but that is actually fixable). And aside from the process exit case, it does avoid CR3 writes when switching between user and kernel threads (which can be far more often than exit if you have longer running tasks). Now agreed, recent x86 work has made that less important. And I of course also agree that not doing those refcount atomics is better. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-20 8:30 ` Peter Zijlstra @ 2018-07-23 12:26 ` Rik van Riel 0 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-23 12:26 UTC (permalink / raw) To: Peter Zijlstra, Andy Lutomirski Cc: Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, linux-arch, Will Deacon, Catalin Marinas, linux-s390, Benjamin Herrenschmidt, linuxppc-dev, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen [-- Attachment #1: Type: text/plain, Size: 1749 bytes --] On Fri, 2018-07-20 at 10:30 +0200, Peter Zijlstra wrote: > On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote: > > I added some more arch maintainers. The idea here is that, on x86 > > at > > least, task->active_mm and all its refcounting is pure > > overhead. When > > a process exits, __mmput() gets called, but the core kernel has a > > longstanding "optimization" in which other tasks (kernel threads > > and > > idle tasks) may have ->active_mm pointing at this mm. This is > > nasty, > > complicated, and hurts performance on large systems, since it > > requires > > extra atomic operations whenever a CPU switches between real users > > threads and idle/kernel threads. > > > > It's also almost completely worthless on x86 at least, since > > __mmput() > > frees pagetables, and that operation *already* forces a remote TLB > > flush, so we might as well zap all the active_mm references at the > > same time. > > So I disagree that active_mm is complicated (the code is less than > ideal > but that is actually fixable). And aside from the process exit case, > it > does avoid CR3 writes when switching between user and kernel threads > (which can be far more often than exit if you have longer running > tasks). > > Now agreed, recent x86 work has made that less important. > > And I of course also agree that not doing those refcount atomics is > better. It might be cleaner to keep the ->active_mm pointer in place for now (at least in the first patch), even on architectures where we end up dropping the refcounting. That way the code is more similar everywhere, and we just get rid of the expensive instructions. Let me try coding this up... -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 17:04 ` Andy Lutomirski ` (2 preceding siblings ...) (?) @ 2018-07-24 16:33 ` Will Deacon -1 siblings, 0 replies; 46+ messages in thread From: Will Deacon @ 2018-07-24 16:33 UTC (permalink / raw) To: Andy Lutomirski Cc: Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov, Juergen Gross, Boris Ostrovsky, linux-arch, Catalin Marinas, linux-s390, Benjamin Herrenschmidt, linuxppc-dev, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen Hi Andy, Sorry, I missed the arm64 question at the end of this... On Thu, Jul 19, 2018 at 10:04:09AM -0700, Andy Lutomirski wrote: > On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote: > > [I added PeterZ and Vitaly -- can you see any way in which this would > > break something obscure? I don't.] > > > > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote: > >> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals > >> next? > > > > Yes, AFAICS. > > > >> > >> On to the lazy TLB mm_struct refcounting stuff :) > >> > >>> > >>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about > >>> mm_count. My suggestion is to get rid of mm_count instead of trying to > >>> optimize it. > >> > >> > >> Do you have any suggestions on how? :) > >> > >> The TLB shootdown sent at __exit_mm time does not get rid of the > >> kernelthread->active_mm > >> pointer pointing at the mm that is exiting. > >> > > > > Ah, but that's conceptually very easy to fix. Add a #define like > > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that > > #define is set. After some grepping, there are very few users. The > > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > > are involved in the rather complicated dance of refcounting active_mm. > > If that field goes away, it doesn't need to be refcounted. Instead, I > > think the refcounting can get replaced with something like: > > > > /* > > * Release any arch-internal references to mm. Only called when > > mm_users is zero > > * and all tasks using mm have either been switch_mm()'d away or have had > > * enter_lazy_tlb() called. > > */ > > extern void arch_shoot_down_dead_mm(struct mm_struct *mm); > > > > which the kernel calls in __mmput() after tearing down all the page > > tables. The body can be something like: > > > > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { > > /* send an IPI. Maybe just call tlb_flush_remove_tables() */ > > } > > > > (You'll also have to fix up the highly questionable users in > > arch/x86/platform/efi/efi_64.c, but that's easy.) > > > > Does all that make sense? Basically, as I understand it, the > > expensive atomic ops you're seeing are all pointless because they're > > enabling an optimization that hasn't actually worked for a long time, > > if ever. > > Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from > arch_exit_mmap(), I think. It's a heavier weight version of more or > less the same thing that arch_shoot_down_dead_mm() would be, except > that it happens before exit_mmap(). But maybe Xen actually has the > right idea. In other words, rather doing the big pagetable free in > exit_mmap() while there may still be other CPUs pointing at the page > tables, the other order might make more sense. So maybe, if > ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible > for getting rid of all secret arch references to the mm. > > Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name. 
> > I added some more arch maintainers. The idea here is that, on x86 at > least, task->active_mm and all its refcounting is pure overhead. When > a process exits, __mmput() gets called, but the core kernel has a > longstanding "optimization" in which other tasks (kernel threads and > idle tasks) may have ->active_mm pointing at this mm. This is nasty, > complicated, and hurts performance on large systems, since it requires > extra atomic operations whenever a CPU switches between real users > threads and idle/kernel threads. > > It's also almost completely worthless on x86 at least, since __mmput() > frees pagetables, and that operation *already* forces a remote TLB > flush, so we might as well zap all the active_mm references at the > same time. > > But arm64 has real HW remote flushes. Does arm64 actually benefit > from the active_mm optimization? What happens on arm64 when a process > exits? How about s390? I suspect that x390 has rather larger systems > than arm64, where the cost of the reference counting can be much > higher. IIRC, the TLB invalidation on task exit has the fullmm field set in the mmu_gather structure, so we don't actually do any TLB invalidation at all. Instead, we just don't re-allocate the ASID and invalidate the whole TLB when we run out of ASIDs (they're 16-bit on most Armv8 CPUs). Does that answer your question? Will ^ permalink raw reply [flat|nested] 46+ messages in thread
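For readers who do not know the arm64 side: the fullmm behaviour Will describes comes from the arch tlb_flush() hook, which at the time looked roughly like the sketch below (simplified from memory -- treat it as an illustration, not a quote of arch/arm64/include/asm/tlb.h):

static inline void tlb_flush(struct mmu_gather *tlb)
{
	struct vm_area_struct vma = { .vm_mm = tlb->mm, };

	/*
	 * An exiting mm tears everything down with tlb->fullmm set.
	 * The ASID allocator will either invalidate the old ASID or
	 * mark it as used, so no TLB invalidation is issued here at all.
	 */
	if (tlb->fullmm)
		return;

	flush_tlb_range(&vma, tlb->start, tlb->end);
}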
[parent not found: <CF849A07-B7CE-4DE9-8246-53AC5A53A705@surriel.com>]
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier [not found] ` <CF849A07-B7CE-4DE9-8246-53AC5A53A705@surriel.com> @ 2018-07-19 17:18 ` Andy Lutomirski 0 siblings, 0 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-07-19 17:18 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, Vitaly Kuznetsov, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Thu, Jul 19, 2018 at 10:15 AM, Rik van Riel <riel@surriel.com> wrote: > > > Given that CPUs in lazy TLB mode stay part of the mm_cpumask, > that WARN_ON seems misplaced. You are right though, that the > mm_cpumask alone should provide enough information for us to > avoid a need for both tsk->active_mm and the refcounting. > If you do this extra shootdown after freeing pagetables, it would be odd if mm_cpumask() wasn't empty. But you're right, the warn is probably silly. And if you move it into arch_exit_mmap(), the warn is definitely wrong. > > Does all that make sense? Basically, as I understand it, the > expensive atomic ops you're seeing are all pointless because they're > enabling an optimization that hasn't actually worked for a long time, > if ever. > > > Our benchmark results suggest that lazy TLB mode works, and makes > a measurable performance difference. Getting rid of the atomic ops > should make it a little better, though :) > I'm not saying lazy mode is useless. I'm saying that active_mm isn't needed for x86's lazy mode :) ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 16:45 ` Andy Lutomirski 2018-07-19 17:04 ` Andy Lutomirski [not found] ` <CF849A07-B7CE-4DE9-8246-53AC5A53A705@surriel.com> @ 2018-07-20 8:02 ` Vitaly Kuznetsov 2018-07-20 9:49 ` Peter Zijlstra 2018-07-20 9:32 ` Peter Zijlstra 3 siblings, 1 reply; 46+ messages in thread From: Vitaly Kuznetsov @ 2018-07-20 8:02 UTC (permalink / raw) To: Andy Lutomirski Cc: Rik van Riel, Peter Zijlstra, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen Andy Lutomirski <luto@amacapital.net> writes: > [I added PeterZ and Vitaly -- can you see any way in which this would > break something obscure? I don't.] Thanks for CCing me, I don't see how this can break things either. At first glance, however, I'm afraid we can add performance penalty to virtualized guests which don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading CR3 without a need as we don't look at lazy mode in PV tlb flush functions. We can either check to switch_mm_irqs_off() that native_flush_tlb_others() is in use or teach PV tlb flush functions to look at lazy mode too. (Sorry if I'm missing something important here or if this was already discussed. I just became aware of this work) [...] -- Vitaly ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-20 8:02 ` Vitaly Kuznetsov @ 2018-07-20 9:49 ` Peter Zijlstra 2018-07-20 10:18 ` Vitaly Kuznetsov 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2018-07-20 9:49 UTC (permalink / raw) To: Vitaly Kuznetsov Cc: Andy Lutomirski, Rik van Riel, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Fri, Jul 20, 2018 at 10:02:10AM +0200, Vitaly Kuznetsov wrote: > Andy Lutomirski <luto@amacapital.net> writes: > > > [I added PeterZ and Vitaly -- can you see any way in which this would > > break something obscure? I don't.] > > Thanks for CCing me, > > I don't see how this can break things either. At first glance, however, > I'm afraid we can add performance penalty to virtualized guests which > don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading > CR3 without a need as we don't look at lazy mode in PV tlb flush > functions. > > We can either check to switch_mm_irqs_off() that > native_flush_tlb_others() is in use or teach PV tlb flush functions to > look at lazy mode too. As Rik noted elsewhere in the thread, kvm_flush_tlb_others() ends up calling native_tlb_flush_others() for all running vcpu threads. The Hyper-V thing is magical, we can't really do anything about it there. Let them worry about it. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-20 9:49 ` Peter Zijlstra @ 2018-07-20 10:18 ` Vitaly Kuznetsov 0 siblings, 0 replies; 46+ messages in thread From: Vitaly Kuznetsov @ 2018-07-20 10:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Andy Lutomirski, Rik van Riel, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen Peter Zijlstra <peterz@infradead.org> writes: > On Fri, Jul 20, 2018 at 10:02:10AM +0200, Vitaly Kuznetsov wrote: >> Andy Lutomirski <luto@amacapital.net> writes: >> >> > [I added PeterZ and Vitaly -- can you see any way in which this would >> > break something obscure? I don't.] >> >> Thanks for CCing me, >> >> I don't see how this can break things either. At first glance, however, >> I'm afraid we can add performance penalty to virtualized guests which >> don't use native_flush_tlb_others() (Hyper-V, KVM): we will be reloading >> CR3 without a need as we don't look at lazy mode in PV tlb flush >> functions. >> >> We can either check to switch_mm_irqs_off() that >> native_flush_tlb_others() is in use or teach PV tlb flush functions to >> look at lazy mode too. > > As Rik noted elsewhere in the thread, kvm_flush_tlb_others() ends up > calling native_tlb_flush_others() for all running vcpu threads. Ah, right! > > The Hyper-V thing is magical, we can't really do anything about it > there. Let them worry about it. Well, we kinda know how this magic works: we just ask the hypervisor to flush TLB for us (if the particular vCPU is running) :-) Anyway, nothing stops us from duplicating the logic regarding lazy mode from native_flush_tlb_others() to hyperv_flush_tlb_others(): if TLB state is lazy omit TLB flush. -- Vitaly ^ permalink raw reply [flat|nested] 46+ messages in thread
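If hyperv_flush_tlb_others() (or any other paravirt flush_tlb_others() implementation) were taught the same trick, the filtering would presumably mirror what native_flush_tlb_others() grows in this series -- shown here only as an illustration, with flush_mask standing in for a local copy of the cpumask being flushed:

	/* Hypothetical: drop lazy CPUs before asking the hypervisor. */
	for_each_cpu(cpu, flush_mask) {
		if (per_cpu(cpu_tlbstate.is_lazy, cpu))
			cpumask_clear_cpu(cpu, flush_mask);
	}
	/* ...then hand flush_mask to the hypervisor-assisted flush. */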
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-19 16:45 ` Andy Lutomirski ` (2 preceding siblings ...) 2018-07-20 8:02 ` Vitaly Kuznetsov @ 2018-07-20 9:32 ` Peter Zijlstra 2018-07-20 11:04 ` Peter Zijlstra 3 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2018-07-20 9:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Rik van Riel, Vitaly Kuznetsov, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Thu, Jul 19, 2018 at 09:45:47AM -0700, Andy Lutomirski wrote: > After some grepping, there are very few users. The > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > are involved in the rather complicated dance of refcounting active_mm. Something like so should work I suppose. It keeps the field, but ensure that for all tasks tsk->active_mm == tsk->mm. The implication is that switch_mm(.prev == NULL) should work (it does for x86, we don't care about @prev). diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d84eafcf5e29..7bfd850c6bf7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2780,12 +2780,8 @@ static __always_inline struct rq * context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next, struct rq_flags *rf) { - struct mm_struct *mm, *oldmm; - prepare_task_switch(rq, prev, next); - mm = next->mm; - oldmm = prev->active_mm; /* * For paravirt, this is coupled with an exit in switch_to to * combine the page table reload and the switch backend into @@ -2800,16 +2796,32 @@ context_switch(struct rq *rq, struct task_struct *prev, * membarrier after storing to rq->curr, before returning to * user-space. */ - if (!mm) { - next->active_mm = oldmm; - mmgrab(oldmm); - enter_lazy_tlb(oldmm, next); - } else - switch_mm_irqs_off(oldmm, mm, next); - - if (!prev->mm) { - prev->active_mm = NULL; - rq->prev_mm = oldmm; + + /* + * kernel -> kernel lazy + transfer active + * user -> kernel lazy + mmgrab() active + * + * kernel -> user switch + mmdrop() active + * user -> user switch + */ + if (!next->mm) { // to kernel + enter_lazy_tlb(prev->active_mm, next); + +#ifdef ARCH_NO_ACTIVE_MM + next->active_mm = prev->active_mm; + if (prev->mm) // from user + mmgrab(prev->active_mm); +#endif + } else { // to user + switch_mm_irqs_off(prev->active_mm, next->mm, next); + +#ifdef ARCH_NO_ACTIVE_MM + if (!prev->mm) { // from kernel + /* will mmdrop() in finish_task_switch(). 
*/ + rq->prev_mm = prev->active_mm; + prev->active_mm = NULL; + } +#endif } rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); diff --git a/mm/mmu_context.c b/mm/mmu_context.c index 3e612ae748e9..770bb615d115 100644 --- a/mm/mmu_context.c +++ b/mm/mmu_context.c @@ -12,6 +12,10 @@ #include <asm/mmu_context.h> +#ifndef finish_arch_post_lock_switch +# define finish_arch_post_lock_switch() do { } while (0) +#endif + /* * use_mm * Makes the calling kernel thread take on the specified @@ -33,12 +37,14 @@ void use_mm(struct mm_struct *mm) tsk->mm = mm; switch_mm(active_mm, mm, tsk); task_unlock(tsk); -#ifdef finish_arch_post_lock_switch finish_arch_post_lock_switch(); -#endif +#ifdef ARCH_NO_ACTIVE_MM + WARN_ON_ONCE(active_mm != NULL); +#else if (active_mm != mm) mmdrop(active_mm); +#endif } EXPORT_SYMBOL_GPL(use_mm); @@ -57,8 +63,22 @@ void unuse_mm(struct mm_struct *mm) task_lock(tsk); sync_mm_rss(mm); tsk->mm = NULL; + + WARN_ON_ONCE(tsk->active_mm != mm); + +#ifdef ARCH_NO_ACTIVE_MM + switch_mm(tsk->active_mm, &init_mm, tsk); + mmdrop(tsk->active_mm); + tsk->active_mm = NULL; +#else /* active_mm is still 'mm' */ enter_lazy_tlb(mm, tsk); +#endif + task_unlock(tsk); + +#ifdef ARCH_NO_ACTIVE_MM + finish_arch_post_lock_switch(); +#endif } EXPORT_SYMBOL_GPL(unuse_mm); ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-20 9:32 ` Peter Zijlstra @ 2018-07-20 11:04 ` Peter Zijlstra 0 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2018-07-20 11:04 UTC (permalink / raw) To: Andy Lutomirski Cc: Rik van Riel, Vitaly Kuznetsov, Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar, Dave Hansen On Fri, Jul 20, 2018 at 11:32:39AM +0200, Peter Zijlstra wrote: > + if (!next->mm) { // to kernel > + enter_lazy_tlb(prev->active_mm, next); > + > +#ifdef ARCH_NO_ACTIVE_MM > + next->active_mm = prev->active_mm; > + if (prev->mm) // from user > + mmgrab(prev->active_mm); else prev->active_mm = NULL; > +#endif > + } else { // to user > + switch_mm_irqs_off(prev->active_mm, next->mm, next); > + > +#ifdef ARCH_NO_ACTIVE_MM > + if (!prev->mm) { // from kernel > + /* will mmdrop() in finish_task_switch(). */ > + rq->prev_mm = prev->active_mm; > + prev->active_mm = NULL; > + } > +#endif > } ^ permalink raw reply [flat|nested] 46+ messages in thread
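Folding that follow-up fix into the first hunk of the earlier sketch, the to-kernel case becomes (still just a sketch under the hypothetical ARCH_NO_ACTIVE_MM switch):

	if (!next->mm) {				// to kernel
		enter_lazy_tlb(prev->active_mm, next);

#ifdef ARCH_NO_ACTIVE_MM
		next->active_mm = prev->active_mm;
		if (prev->mm)				// from user
			mmgrab(prev->active_mm);
		else					// from kernel: hand the reference over
			prev->active_mm = NULL;
#endif
	}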
* [PATCH v4 0/7] x86,tlb,mm: make lazy TLB mode even lazier @ 2018-07-06 21:56 Rik van Riel 2018-07-06 21:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-07-06 21:56 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault, songliubraving, hpa Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels, using 1.9% of a 48 CPU system during a netperf run. Digging into the profile, the atomic operations in cpumask_clear_cpu and cpumask_set_cpu are responsible for about half of that CPU use. However, the CPUs running netperf are simply switching back and forth between netperf and the idle task, which does not require any changes to the mm_cpumask if lazy TLB mode were used. Additionally, the init_mm cpumask ends up being the most heavily contended one in the system, for no reason at all. Making really lazy TLB mode work again on modern kernels, by sending a shootdown IPI only when page table pages are being unmapped, we get back some performance. v4 of the series has a few minor cleanups; no functional changes since v3 On memcache workloads on 2 socket systems, this patch series seems to reduce total system CPU use by 1-2%. On Song's netbench tests, CPU use in the context switch time is about cut in half. These patches also provide a little memory savings by shrinking the size of mm_struct, especially on distro kernels compiled with a gigantically large NR_CPUS. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-07-06 21:56 [PATCH v4 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel @ 2018-07-06 21:56 ` Rik van Riel 0 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-07-06 21:56 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault, songliubraving, hpa, Rik van Riel Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/mm/tlb.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 59 insertions(+), 9 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4b73fe835c95..26542cc17043 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -7,6 +7,7 @@ #include <linux/export.h> #include <linux/cpu.h> #include <linux/debugfs.h> +#include <linux/gfp.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, next->context.ctx_id); /* - * We don't currently support having a real mm loaded without - * our cpu set in mm_cpumask(). We have all the bookkeeping - * in place to figure out whether we would need to flush - * if our cpu were cleared in mm_cpumask(), but we don't - * currently use it. + * Even in lazy TLB mode, the CPU should stay set in the + * mm_cpumask. The TLB shootdown code can figure out from + * from cpu_tlbstate.is_lazy whether or not to send an IPI. */ if (WARN_ON_ONCE(real_prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); - return; + /* + * If the CPU is not in lazy TLB mode, we are just switching + * from one thread in a process to another thread in the same + * process. No TLB flush required. + */ + if (!was_lazy) + return; + + /* + * Read the tlb_gen to check whether a flush is needed. + * If the TLB is up to date, just use it. + * The barrier synchronizes with the tlb_gen increment in + * the TLB shootdown code. 
+ */ + smp_mb(); + next_tlb_gen = atomic64_read(&next->context.tlb_gen); + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == + next_tlb_gen) + return; + + /* + * TLB contents went out of date while we were in lazy + * mode. Fall through to the TLB switching code below. + */ + new_asid = prev_asid; + need_flush = true; } else { u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id); @@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * paging-structure cache to avoid speculatively reading * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. + * + * This should be rare, with native_flush_tlb_others skipping + * IPIs to lazy TLB mode CPUs. */ switch_mm_irqs_off(NULL, &init_mm, NULL); return; @@ -560,6 +588,9 @@ static void flush_tlb_func_remote(void *info) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { + cpumask_var_t lazymask; + unsigned int cpu; + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -583,8 +614,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ - unsigned int cpu; - cpu = smp_processor_id(); cpumask = uv_flush_tlb_others(cpumask, info); if (cpumask) @@ -592,8 +621,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask, (void *)info, 1); return; } - smp_call_function_many(cpumask, flush_tlb_func_remote, + + /* + * A temporary cpumask is used in order to skip sending IPIs + * to CPUs in lazy TLB state, while keeping them in mm_cpumask(mm). + * If the allocation fails, simply IPI every CPU in mm_cpumask. + */ + if (!alloc_cpumask_var(&lazymask, GFP_ATOMIC)) { + smp_call_function_many(cpumask, flush_tlb_func_remote, + (void *)info, 1); + return; + } + + cpumask_copy(lazymask, cpumask); + + for_each_cpu(cpu, lazymask) { + if (per_cpu(cpu_tlbstate.is_lazy, cpu)) + cpumask_clear_cpu(cpu, lazymask); + } + + smp_call_function_many(lazymask, flush_tlb_func_remote, (void *)info, 1); + + free_cpumask_var(lazymask); } /* -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* [PATCH v3 0/7] x86,tlb,mm: make lazy TLB mode even lazier @ 2018-06-29 14:29 Rik van Riel 2018-06-29 14:29 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-06-29 14:29 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault, songliubraving, hpa Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels, using 1.9% of a 48 CPU system during a netperf run. Digging into the profile, the atomic operations in cpumask_clear_cpu and cpumask_set_cpu are responsible for about half of that CPU use. However, the CPUs running netperf are simply switching back and forth between netperf and the idle task, which does not require any changes to the mm_cpumask if lazy TLB mode were used. Additionally, the init_mm cpumask ends up being the most heavily contended one in the system, for no reason at all. Making really lazy TLB mode work again on modern kernels, by sending a shootdown IPI only when page table pages are being unmapped, we get back some performance. v3 of the series addresses Andy Lutomirski's latest comments, and seems to work well. On memcache workloads on 2 socket systems, this patch series seems to reduce total system CPU use by 1-2%. On Song's netbench tests, CPU use in the context switch time is about cut in half. These patches also provide a little memory savings by shrinking the size of mm_struct, especially on distro kernels compiled with a gigantically large NR_CPUS. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-29 14:29 [PATCH v3 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel @ 2018-06-29 14:29 ` Rik van Riel 2018-06-29 17:05 ` Dave Hansen 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-06-29 14:29 UTC (permalink / raw) To: linux-kernel Cc: x86, luto, dave.hansen, mingo, kernel-team, tglx, efault, songliubraving, hpa, Rik van Riel Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. Memory ordering is used to prevent race conditions between switch_mm_irqs_off, which checks whether .tlb_gen changed, and the TLB invalidation code, which increments .tlb_gen whenever page table entries get invalidated. The atomic increment in inc_mm_tlb_gen is its own barrier; the context switch code adds an explicit barrier between reading tlbstate.is_lazy and next->context.tlb_gen. Unlike the 2016 version of this patch, CPUs with cpu_tlbstate.is_lazy set are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. This patch reduces total CPU use in the system by about 1-2% for a memcache workload on two socket systems, and by about 1% for a heavily multi-process netperf between two systems. Signed-off-by: Rik van Riel <riel@surriel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/include/asm/uv/uv.h | 6 ++-- arch/x86/mm/tlb.c | 82 ++++++++++++++++++++++++++++++++++++------- arch/x86/platform/uv/tlb_uv.c | 2 +- 3 files changed, 73 insertions(+), 17 deletions(-) diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h index a80c0673798f..d801afb5fe90 100644 --- a/arch/x86/include/asm/uv/uv.h +++ b/arch/x86/include/asm/uv/uv.h @@ -17,7 +17,7 @@ extern int is_uv_hubless(void); extern void uv_cpu_init(void); extern void uv_nmi_init(void); extern void uv_system_init(void); -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info); #else /* X86_UV */ @@ -27,8 +27,8 @@ static inline int is_uv_system(void) { return 0; } static inline int is_uv_hubless(void) { return 0; } static inline void uv_cpu_init(void) { } static inline void uv_system_init(void) { } -static inline const struct cpumask * -uv_flush_tlb_others(const struct cpumask *cpumask, +static inline struct cpumask * +uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info) { return cpumask; } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4b73fe835c95..e59214ec52b1 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -7,6 +7,7 @@ #include <linux/export.h> #include <linux/cpu.h> #include <linux/debugfs.h> +#include <linux/gfp.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -185,6 +186,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -242,17 +244,40 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, next->context.ctx_id); /* - * We don't currently support having a real mm loaded without - * our cpu set in mm_cpumask(). 
We have all the bookkeeping - * in place to figure out whether we would need to flush - * if our cpu were cleared in mm_cpumask(), but we don't - * currently use it. + * Even in lazy TLB mode, the CPU should stay set in the + * mm_cpumask. The TLB shootdown code can figure out from + * from cpu_tlbstate.is_lazy whether or not to send an IPI. */ if (WARN_ON_ONCE(real_prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); - return; + /* + * If the CPU is not in lazy TLB mode, we are just switching + * from one thread in a process to another thread in the same + * process. No TLB flush required. + */ + if (!was_lazy) + return; + + /* + * Read the tlb_gen to check whether a flush is needed. + * If the TLB is up to date, just use it. + * The barrier synchronizes with the tlb_gen increment in + * the TLB shootdown code. + */ + smp_mb(); + next_tlb_gen = atomic64_read(&next->context.tlb_gen); + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == + next_tlb_gen) + return; + + /* + * TLB contents went out of date while we were in lazy + * mode. Fall through to the TLB switching code below. + */ + new_asid = prev_asid; + need_flush = true; } else { u64 last_ctx_id = this_cpu_read(cpu_tlbstate.last_ctx_id); @@ -454,6 +479,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * paging-structure cache to avoid speculatively reading * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. + * + * This should be rare, with native_flush_tlb_others skipping + * IPIs to lazy TLB mode CPUs. */ switch_mm_irqs_off(NULL, &init_mm, NULL); return; @@ -560,6 +588,22 @@ static void flush_tlb_func_remote(void *info) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { + cpumask_t *mask = (struct cpumask *)cpumask; + cpumask_var_t varmask; + bool can_lazy_flush = false; + unsigned int cpu; + + /* + * A temporary cpumask allows the kernel to skip sending IPIs + * to CPUs in lazy TLB state, without also removing them from + * mm_cpumask(mm). + */ + if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) { + cpumask_copy(varmask, cpumask); + mask = varmask; + can_lazy_flush = true; + } + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -583,17 +627,29 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ - unsigned int cpu; - cpu = smp_processor_id(); - cpumask = uv_flush_tlb_others(cpumask, info); - if (cpumask) - smp_call_function_many(cpumask, flush_tlb_func_remote, + mask = uv_flush_tlb_others(mask, info); + if (mask) + smp_call_function_many(mask, flush_tlb_func_remote, (void *)info, 1); - return; + goto out; } - smp_call_function_many(cpumask, flush_tlb_func_remote, + + /* + * There is no need to send IPIs to CPUs in lazy TLB mode. They + * can simply flush the TLB at the next context switch. 
+ */ + if (can_lazy_flush) { + for_each_cpu(cpu, mask) { + if (per_cpu(cpu_tlbstate.is_lazy, cpu)) + cpumask_clear_cpu(cpu, mask); + } + } + + smp_call_function_many(mask, flush_tlb_func_remote, (void *)info, 1); + out: + free_cpumask_var(varmask); } /* diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c index ca446da48fd2..84a4c6679da6 100644 --- a/arch/x86/platform/uv/tlb_uv.c +++ b/arch/x86/platform/uv/tlb_uv.c @@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp, * Returns pointer to cpumask if some remote flushing remains to be * done. The returned pointer is valid till preemption is re-enabled. */ -const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, +struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info) { unsigned int cpu = smp_processor_id(); -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-29 14:29 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel @ 2018-06-29 17:05 ` Dave Hansen 2018-06-29 17:29 ` Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Dave Hansen @ 2018-06-29 17:05 UTC (permalink / raw) To: Rik van Riel, linux-kernel Cc: x86, luto, mingo, kernel-team, tglx, efault, songliubraving, hpa On 06/29/2018 07:29 AM, Rik van Riel wrote: > + /* > + * If the CPU is not in lazy TLB mode, we are just switching > + * from one thread in a process to another thread in the same > + * process. No TLB flush required. > + */ > + if (!was_lazy) > + return; > + > + /* > + * Read the tlb_gen to check whether a flush is needed. > + * If the TLB is up to date, just use it. > + * The barrier synchronizes with the tlb_gen increment in > + * the TLB shootdown code. > + */ > + smp_mb(); > + next_tlb_gen = atomic64_read(&next->context.tlb_gen); > + if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == > + next_tlb_gen) > + return; Nit: it might be nice to have this hunk in a helper like cpu_tlbstate_needs_flush(next) to keep this nice and readable. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-29 17:05 ` Dave Hansen @ 2018-06-29 17:29 ` Rik van Riel 0 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-06-29 17:29 UTC (permalink / raw) To: Dave Hansen, linux-kernel Cc: x86, luto, mingo, kernel-team, tglx, efault, songliubraving, hpa [-- Attachment #1: Type: text/plain, Size: 1245 bytes --] On Fri, 2018-06-29 at 10:05 -0700, Dave Hansen wrote: > On 06/29/2018 07:29 AM, Rik van Riel wrote: > > + /* > > + * If the CPU is not in lazy TLB mode, we are just > > switching > > + * from one thread in a process to another thread > > in the same > > + * process. No TLB flush required. > > + */ > > + if (!was_lazy) > > + return; > > + > > + /* > > + * Read the tlb_gen to check whether a flush is > > needed. > > + * If the TLB is up to date, just use it. > > + * The barrier synchronizes with the tlb_gen > > increment in > > + * the TLB shootdown code. > > + */ > > + smp_mb(); > > + next_tlb_gen = atomic64_read(&next- > > >context.tlb_gen); > > + if > > (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == > > + next_tlb_gen) > > + return; > > Nit: it might be nice to have this hunk in a helper like > cpu_tlbstate_needs_flush(next) to keep this nice and readable. That helper might not end up as pretty as you would like, because when the TLB is not up to date, we use the value of next_tlb_gen in the "if (need_flush)" branch below. I would be happy to whip up a "tlb_up_to_date(mm, &next_tlb_gen)" helper as patch 8/7, though :) -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
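A minimal sketch of the helper being discussed above, using the tlb_up_to_date() name Rik proposes and returning next_tlb_gen through a pointer so the later need_flush path can still reuse it; this is only an illustration of the idea, with an assumed signature, not code from the series:

	/*
	 * Illustrative only. Returns true when this (lazy) CPU's TLB is
	 * still current for @next, and passes next_tlb_gen back so the
	 * caller can reuse it if a flush turns out to be needed.
	 */
	static bool tlb_up_to_date(struct mm_struct *next, u16 prev_asid,
				   u64 *next_tlb_gen)
	{
		/*
		 * The barrier pairs with the tlb_gen increment in the
		 * TLB shootdown code, as in the patch above.
		 */
		smp_mb();
		*next_tlb_gen = atomic64_read(&next->context.tlb_gen);

		/* Up to date: keep using the current TLB contents. */
		return this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
		       *next_tlb_gen;
	}

The caller in switch_mm_irqs_off() could return early when this helper says the TLB is current, and otherwise fall through with the already-read next_tlb_gen, which is exactly the complication Rik points out.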
* [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier @ 2018-06-20 19:56 Rik van Riel 2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw) To: linux-kernel Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving, kernel-team Song noticed switch_mm_irqs_off taking a lot of CPU time in recent kernels, using 1.9% of a 48 CPU system during a netperf run. Digging into the profile, the atomic operations in cpumask_clear_cpu and cpumask_set_cpu are responsible for about half of that CPU use. However, the CPUs running netperf are simply switching back and forth between netperf and the idle task, which does not require any changes to the mm_cpumask if lazy TLB mode were used. Additionally, the init_mm cpumask ends up being the most heavily contended one in the system, for no reason at all. Making really lazy TLB mode work again on modern kernels, by sending a shootdown IPI only when page table pages are being unmapped, we get back some performance. Using a memcache style workload on Broadwell systems, these patches result in about a 0.5% reduction of CPU use on the system. Numbers on Haswell are inconclusive so far. Song's netperf performance results: w/o patchset: Throughput: 1.74716e+06 perf profile: + 0.95% swapper [kernel.vmlinux] [k] switch_mm_irqs_off + 0.82% netserver [kernel.vmlinux] [k] switch_mm_irqs_off w/ patchset: Throughput: 1.76911e+06 perf profile: + 0.81% swapper [kernel.vmlinux] [k] switch_mm_irqs_off With these patches, netserver no longer calls switch_mm_irqs_off, and the CPU use of enter_lazy_tlb was below the 0.05% threshold of statistics gathered by Song's scripts. I am still working on a patch to also get rid of the continuous pounding on mm->count during lazy TLB entry and exit, when the same mm_struct is being used all the time. I do not have that working yet. Until then, these patches provide a nice performance boost, as well as a little memory savings by shrinking the size of mm_struct. ^ permalink raw reply [flat|nested] 46+ messages in thread
* [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-20 19:56 [PATCH 0/7] x86,tlb,mm: make lazy TLB mode even lazier Rik van Riel @ 2018-06-20 19:56 ` Rik van Riel 2018-06-22 15:04 ` Andy Lutomirski 2018-06-22 17:05 ` Dave Hansen 0 siblings, 2 replies; 46+ messages in thread From: Rik van Riel @ 2018-06-20 19:56 UTC (permalink / raw) To: linux-kernel Cc: 86, luto, mingo, tglx, dave.hansen, efault, songliubraving, kernel-team, Rik van Riel Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, when all it really needs to do is reload %CR3 at the next context switch, assuming no page table pages got freed. This patch deals with that issue by introducing a third TLB state, TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context switch. Atomic compare and exchange is used to close races between the TLB shootdown code and the context switch code. Keying off just the tlb_gen is likely to not be enough, since that would not give lazy_clb_can_skip_flush() information on when it is facing a race and has to send the IPI to a CPU in the middle of a LAZY -> OK switch. Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are not removed from the mm_cpumask(mm), since that would prevent the TLB flush IPIs at page table free time from being sent to all the CPUs that need them. Signed-off-by: Rik van Riel <riel@surriel.com> Tested-by: Song Liu <songliubraving@fb.com> --- arch/x86/include/asm/tlbflush.h | 5 ++ arch/x86/include/asm/uv/uv.h | 6 +-- arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++++++++---- arch/x86/platform/uv/tlb_uv.c | 2 +- 4 files changed, 107 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 88a4d6b87ff7..0a8bdd6084de 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -171,6 +171,7 @@ struct tlb_context { #define TLBSTATE_OK 0 #define TLBSTATE_LAZY 1 +#define TLBSTATE_FLUSH 2 struct tlb_state { /* @@ -199,6 +200,10 @@ struct tlb_state { * We're heuristically guessing that the CR3 load we * skipped more than makes up for the overhead added by * lazy mode. + * + * - Lazily using a real mm, which has seen a TLB invalidation on + * other CPUs. TLB needs to be flushed at context switch time, + * state == TLBSTATE_FLUSH. 
*/ int state; diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h index a80c0673798f..d801afb5fe90 100644 --- a/arch/x86/include/asm/uv/uv.h +++ b/arch/x86/include/asm/uv/uv.h @@ -17,7 +17,7 @@ extern int is_uv_hubless(void); extern void uv_cpu_init(void); extern void uv_nmi_init(void); extern void uv_system_init(void); -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info); #else /* X86_UV */ @@ -27,8 +27,8 @@ static inline int is_uv_system(void) { return 0; } static inline int is_uv_hubless(void) { return 0; } static inline void uv_cpu_init(void) { } static inline void uv_system_init(void) { } -static inline const struct cpumask * -uv_flush_tlb_others(const struct cpumask *cpumask, +static inline struct cpumask * +uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info) { return cpumask; } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index e063e623e52c..4b27d8469848 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -7,6 +7,7 @@ #include <linux/export.h> #include <linux/cpu.h> #include <linux/debugfs.h> +#include <linux/gfp.h> #include <asm/tlbflush.h> #include <asm/mmu_context.h> @@ -227,8 +228,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, __flush_tlb_all(); } #endif - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); - /* * The membarrier system call requires a full memory barrier and * core serialization before returning to user-space, after @@ -236,8 +235,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * memory barrier and core serializing instruction. */ if (real_prev == next) { + int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state); + int oldstate = *tlbstate; + VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != - next->context.ctx_id); + next->context.ctx_id); /* * We don't currently support having a real mm loaded without @@ -250,6 +252,26 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next)); + /* + * In lazy TLB mode. The atomic exchange makes sure there was + * no TLB shootdown all the way to this context switch. + */ + if (oldstate == TLBSTATE_LAZY) + oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK); + + /* + * There was a TLB invalidation while this CPU was idle. + * The code in choose_new_asid will figure out what kind + * of flush is needed. + */ + if (oldstate == TLBSTATE_FLUSH) + goto reload_cr3; + + /* + * Nothing to do. Either we are switching between two + * threads in the same process, or we were in TLBSTATE_LAZY + * without any TLB invalidations happening. + */ return; } else { u16 new_asid; @@ -294,6 +316,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * Start remote flushes and then read tlb_gen. */ cpumask_set_cpu(cpu, mm_cpumask(next)); + reload_cr3: + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); next_tlb_gen = atomic64_read(&next->context.tlb_gen); choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); @@ -454,6 +478,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * paging-structure cache to avoid speculatively reading * garbage into our TLB. Since switching to init_mm is barely * slower than a minimal flush, just switch to init_mm. + * + * This should be rare, with native_flush_tlb_others skipping + * IPIs to lazy TLB mode CPUs. 
*/ switch_mm_irqs_off(NULL, &init_mm, NULL); return; @@ -557,9 +584,57 @@ static void flush_tlb_func_remote(void *info) flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN); } +/* + * Determine whether a CPU's TLB needs to be flushed now, or whether the + * flush can be delayed until the next context switch, by changing the + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH. + */ +static bool lazy_tlb_can_skip_flush(int cpu) +{ + int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu); + int old; + + switch (*tlbstate) { + case TLBSTATE_FLUSH: + /* The TLB will be flushed on the next context switch. */ + return true; + case TLBSTATE_LAZY: + /* + * The CPU is in TLBSTATE_LAZY, which could context switch back + * to TLBSTATE_OK, re-using the old TLB state without a flush. + * if that happened, send a TLB flush IPI. + * + * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will + * be flushed at the next context switch. Skip the IPI. + */ + old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH); + return old != TLBSTATE_OK; + case TLBSTATE_OK: + default: + /* A task on the CPU is actively using the mm. Flush the TLB. */ + return false; + } +} + void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { + cpumask_t *mask = (struct cpumask *)cpumask; + cpumask_var_t varmask; + bool can_lazy_flush = false; + unsigned int cpu; + + /* + * A temporary cpumask allows the kernel to skip sending IPIs + * to CPUs in lazy TLB state, without also removing them from + * mm_cpumask(mm). + */ + if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) { + cpumask_copy(varmask, cpumask); + mask = varmask; + can_lazy_flush = true; + } + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); if (info->end == TLB_FLUSH_ALL) trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); @@ -583,17 +658,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * that UV should be updated so that smp_call_function_many(), * etc, are optimal on UV. */ - unsigned int cpu; - cpu = smp_processor_id(); - cpumask = uv_flush_tlb_others(cpumask, info); - if (cpumask) - smp_call_function_many(cpumask, flush_tlb_func_remote, + mask = uv_flush_tlb_others(mask, info); + if (mask) + smp_call_function_many(mask, flush_tlb_func_remote, (void *)info, 1); - return; + goto out; + } + + /* + * Instead of sending IPIs to CPUs in lazy TLB mode, move that + * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed + * at the next context switch, or at page table free time. + */ + if (can_lazy_flush) { + for_each_cpu(cpu, mask) { + if (lazy_tlb_can_skip_flush(cpu)) + cpumask_clear_cpu(cpu, mask); + } } - smp_call_function_many(cpumask, flush_tlb_func_remote, + + smp_call_function_many(mask, flush_tlb_func_remote, (void *)info, 1); + out: + free_cpumask_var(varmask); } /* diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c index b36caae0fb2f..42730acec8b7 100644 --- a/arch/x86/platform/uv/tlb_uv.c +++ b/arch/x86/platform/uv/tlb_uv.c @@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp, * Returns pointer to cpumask if some remote flushing remains to be * done. The returned pointer is valid till preemption is re-enabled. */ -const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, +struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, const struct flush_tlb_info *info) { unsigned int cpu = smp_processor_id(); -- 2.14.4 ^ permalink raw reply related [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel @ 2018-06-22 15:04 ` Andy Lutomirski 2018-06-22 15:15 ` Rik van Riel 2018-06-22 17:05 ` Dave Hansen 1 sibling, 1 reply; 46+ messages in thread From: Andy Lutomirski @ 2018-06-22 15:04 UTC (permalink / raw) To: Rik van Riel Cc: LKML, 86, Andrew Lutomirski, Ingo Molnar, Thomas Gleixner, Dave Hansen, Mike Galbraith, songliubraving, kernel-team On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> wrote: > > Lazy TLB mode can result in an idle CPU being woken up by a TLB flush, > when all it really needs to do is reload %CR3 at the next context switch, > assuming no page table pages got freed. > > This patch deals with that issue by introducing a third TLB state, > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context > switch. > > Atomic compare and exchange is used to close races between the TLB > shootdown code and the context switch code. Keying off just the > tlb_gen is likely to not be enough, since that would not give > lazy_clb_can_skip_flush() information on when it is facing a race > and has to send the IPI to a CPU in the middle of a LAZY -> OK switch. > > Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are not > removed from the mm_cpumask(mm), since that would prevent the TLB > flush IPIs at page table free time from being sent to all the CPUs > that need them. Eek, this is so complicated. In the 2016 version of the patches, you needed all this. But I rewrote the whole subsystem to make it easier now :) I think that you can get rid of all of this and instead just revert the relevant parts of: b956575bed91ecfb136a8300742ecbbf451471ab All the bookkeeping is already in place -- no need for new state. > > Signed-off-by: Rik van Riel <riel@surriel.com> > Tested-by: Song Liu <songliubraving@fb.com> > --- > arch/x86/include/asm/tlbflush.h | 5 ++ > arch/x86/include/asm/uv/uv.h | 6 +-- > arch/x86/mm/tlb.c | 108 ++++++++++++++++++++++++++++++++++++---- > arch/x86/platform/uv/tlb_uv.c | 2 +- > 4 files changed, 107 insertions(+), 14 deletions(-) > > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h > index 88a4d6b87ff7..0a8bdd6084de 100644 > --- a/arch/x86/include/asm/tlbflush.h > +++ b/arch/x86/include/asm/tlbflush.h > @@ -171,6 +171,7 @@ struct tlb_context { > > #define TLBSTATE_OK 0 > #define TLBSTATE_LAZY 1 > +#define TLBSTATE_FLUSH 2 > > struct tlb_state { > /* > @@ -199,6 +200,10 @@ struct tlb_state { > * We're heuristically guessing that the CR3 load we > * skipped more than makes up for the overhead added by > * lazy mode. > + * > + * - Lazily using a real mm, which has seen a TLB invalidation on > + * other CPUs. TLB needs to be flushed at context switch time, > + * state == TLBSTATE_FLUSH. 
> */ > int state; > > diff --git a/arch/x86/include/asm/uv/uv.h b/arch/x86/include/asm/uv/uv.h > index a80c0673798f..d801afb5fe90 100644 > --- a/arch/x86/include/asm/uv/uv.h > +++ b/arch/x86/include/asm/uv/uv.h > @@ -17,7 +17,7 @@ extern int is_uv_hubless(void); > extern void uv_cpu_init(void); > extern void uv_nmi_init(void); > extern void uv_system_init(void); > -extern const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, > +extern struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, > const struct flush_tlb_info *info); > > #else /* X86_UV */ > @@ -27,8 +27,8 @@ static inline int is_uv_system(void) { return 0; } > static inline int is_uv_hubless(void) { return 0; } > static inline void uv_cpu_init(void) { } > static inline void uv_system_init(void) { } > -static inline const struct cpumask * > -uv_flush_tlb_others(const struct cpumask *cpumask, > +static inline struct cpumask * > +uv_flush_tlb_others(struct cpumask *cpumask, > const struct flush_tlb_info *info) > { return cpumask; } > > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index e063e623e52c..4b27d8469848 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -7,6 +7,7 @@ > #include <linux/export.h> > #include <linux/cpu.h> > #include <linux/debugfs.h> > +#include <linux/gfp.h> > > #include <asm/tlbflush.h> > #include <asm/mmu_context.h> > @@ -227,8 +228,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, > __flush_tlb_all(); > } > #endif > - this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > - > /* > * The membarrier system call requires a full memory barrier and > * core serialization before returning to user-space, after > @@ -236,8 +235,11 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, > * memory barrier and core serializing instruction. > */ > if (real_prev == next) { > + int *tlbstate = this_cpu_ptr(&cpu_tlbstate.state); > + int oldstate = *tlbstate; > + > VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != > - next->context.ctx_id); > + next->context.ctx_id); > > /* > * We don't currently support having a real mm loaded without > @@ -250,6 +252,26 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, > !cpumask_test_cpu(cpu, mm_cpumask(next)))) > cpumask_set_cpu(cpu, mm_cpumask(next)); > > + /* > + * In lazy TLB mode. The atomic exchange makes sure there was > + * no TLB shootdown all the way to this context switch. > + */ > + if (oldstate == TLBSTATE_LAZY) > + oldstate = cmpxchg(tlbstate, oldstate, TLBSTATE_OK); > + > + /* > + * There was a TLB invalidation while this CPU was idle. > + * The code in choose_new_asid will figure out what kind > + * of flush is needed. > + */ > + if (oldstate == TLBSTATE_FLUSH) > + goto reload_cr3; > + > + /* > + * Nothing to do. Either we are switching between two > + * threads in the same process, or we were in TLBSTATE_LAZY > + * without any TLB invalidations happening. > + */ > return; > } else { > u16 new_asid; > @@ -294,6 +316,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, > * Start remote flushes and then read tlb_gen. > */ > cpumask_set_cpu(cpu, mm_cpumask(next)); > + reload_cr3: > + this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK); > next_tlb_gen = atomic64_read(&next->context.tlb_gen); > > choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush); > @@ -454,6 +478,9 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, > * paging-structure cache to avoid speculatively reading > * garbage into our TLB. 
Since switching to init_mm is barely > * slower than a minimal flush, just switch to init_mm. > + * > + * This should be rare, with native_flush_tlb_others skipping > + * IPIs to lazy TLB mode CPUs. > */ > switch_mm_irqs_off(NULL, &init_mm, NULL); > return; > @@ -557,9 +584,57 @@ static void flush_tlb_func_remote(void *info) > flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN); > } > > +/* > + * Determine whether a CPU's TLB needs to be flushed now, or whether the > + * flush can be delayed until the next context switch, by changing the > + * tlbstate from TLBSTATE_LAZY to TLBSTATE_FLUSH. > + */ > +static bool lazy_tlb_can_skip_flush(int cpu) > +{ > + int *tlbstate = &per_cpu(cpu_tlbstate.state, cpu); > + int old; > + > + switch (*tlbstate) { > + case TLBSTATE_FLUSH: > + /* The TLB will be flushed on the next context switch. */ > + return true; > + case TLBSTATE_LAZY: > + /* > + * The CPU is in TLBSTATE_LAZY, which could context switch back > + * to TLBSTATE_OK, re-using the old TLB state without a flush. > + * if that happened, send a TLB flush IPI. > + * > + * Otherwise, the state is now TLBSTATE_FLUSH, and TLB will > + * be flushed at the next context switch. Skip the IPI. > + */ > + old = cmpxchg(tlbstate, TLBSTATE_LAZY, TLBSTATE_FLUSH); > + return old != TLBSTATE_OK; > + case TLBSTATE_OK: > + default: > + /* A task on the CPU is actively using the mm. Flush the TLB. */ > + return false; > + } > +} > + > void native_flush_tlb_others(const struct cpumask *cpumask, > const struct flush_tlb_info *info) > { > + cpumask_t *mask = (struct cpumask *)cpumask; > + cpumask_var_t varmask; > + bool can_lazy_flush = false; > + unsigned int cpu; > + > + /* > + * A temporary cpumask allows the kernel to skip sending IPIs > + * to CPUs in lazy TLB state, without also removing them from > + * mm_cpumask(mm). > + */ > + if (alloc_cpumask_var(&varmask, GFP_ATOMIC)) { > + cpumask_copy(varmask, cpumask); > + mask = varmask; > + can_lazy_flush = true; > + } > + > count_vm_tlb_event(NR_TLB_REMOTE_FLUSH); > if (info->end == TLB_FLUSH_ALL) > trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL); > @@ -583,17 +658,30 @@ void native_flush_tlb_others(const struct cpumask *cpumask, > * that UV should be updated so that smp_call_function_many(), > * etc, are optimal on UV. > */ > - unsigned int cpu; > - > cpu = smp_processor_id(); > - cpumask = uv_flush_tlb_others(cpumask, info); > - if (cpumask) > - smp_call_function_many(cpumask, flush_tlb_func_remote, > + mask = uv_flush_tlb_others(mask, info); > + if (mask) > + smp_call_function_many(mask, flush_tlb_func_remote, > (void *)info, 1); > - return; > + goto out; > + } > + > + /* > + * Instead of sending IPIs to CPUs in lazy TLB mode, move that > + * CPU's TLB state to TLBSTATE_FLUSH, causing the TLB to be flushed > + * at the next context switch, or at page table free time. 
> + */ > + if (can_lazy_flush) { > + for_each_cpu(cpu, mask) { > + if (lazy_tlb_can_skip_flush(cpu)) > + cpumask_clear_cpu(cpu, mask); > + } > } > - smp_call_function_many(cpumask, flush_tlb_func_remote, > + > + smp_call_function_many(mask, flush_tlb_func_remote, > (void *)info, 1); > + out: > + free_cpumask_var(varmask); > } > > /* > diff --git a/arch/x86/platform/uv/tlb_uv.c b/arch/x86/platform/uv/tlb_uv.c > index b36caae0fb2f..42730acec8b7 100644 > --- a/arch/x86/platform/uv/tlb_uv.c > +++ b/arch/x86/platform/uv/tlb_uv.c > @@ -1102,7 +1102,7 @@ static int set_distrib_bits(struct cpumask *flush_mask, struct bau_control *bcp, > * Returns pointer to cpumask if some remote flushing remains to be > * done. The returned pointer is valid till preemption is re-enabled. > */ > -const struct cpumask *uv_flush_tlb_others(const struct cpumask *cpumask, > +struct cpumask *uv_flush_tlb_others(struct cpumask *cpumask, > const struct flush_tlb_info *info) > { > unsigned int cpu = smp_processor_id(); > -- > 2.14.4 > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-22 15:04 ` Andy Lutomirski @ 2018-06-22 15:15 ` Rik van Riel 2018-06-22 15:34 ` Andy Lutomirski 0 siblings, 1 reply; 46+ messages in thread From: Rik van Riel @ 2018-06-22 15:15 UTC (permalink / raw) To: Andy Lutomirski Cc: LKML, x86, Ingo Molnar, Thomas Gleixner, Dave Hansen, Mike Galbraith, songliubraving, kernel-team [-- Attachment #1: Type: text/plain, Size: 2107 bytes --] On Fri, 2018-06-22 at 08:04 -0700, Andy Lutomirski wrote: > On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> > wrote: > > > > Lazy TLB mode can result in an idle CPU being woken up by a TLB > > flush, > > when all it really needs to do is reload %CR3 at the next context > > switch, > > assuming no page table pages got freed. > > > > This patch deals with that issue by introducing a third TLB state, > > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next > > context > > switch. > > > > Atomic compare and exchange is used to close races between the TLB > > shootdown code and the context switch code. Keying off just the > > tlb_gen is likely to not be enough, since that would not give > > lazy_clb_can_skip_flush() information on when it is facing a race > > and has to send the IPI to a CPU in the middle of a LAZY -> OK > > switch. > > > > Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are > > not > > removed from the mm_cpumask(mm), since that would prevent the TLB > > flush IPIs at page table free time from being sent to all the CPUs > > that need them. > > Eek, this is so complicated. In the 2016 version of the patches, you > needed all this. But I rewrote the whole subsystem to make it easier > now :) I think that you can get rid of all of this and instead just > revert the relevant parts of: > > b956575bed91ecfb136a8300742ecbbf451471ab > > All the bookkeeping is already in place -- no need for new state. I looked at using your .tlb_gen stuff, but we need a way to do that race free. I suppose setting the tlbstate to !lazy before checking .tlb_gen might do the trick, if we get the ordering right at the tlb invalidation site, too? Something like this: context switch tlb invalidation advance mm->context.tlb_gen send IPI to cpus with !is_lazy tlb tlbstate.is_lazy = FALSE *need_flush = .tlb_gen < next_tlb_gen Do you see any holes in that? I will gladly simplify this code and get rid of the atomic operations :) -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-22 15:15 ` Rik van Riel @ 2018-06-22 15:34 ` Andy Lutomirski 0 siblings, 0 replies; 46+ messages in thread From: Andy Lutomirski @ 2018-06-22 15:34 UTC (permalink / raw) To: Rik van Riel Cc: Andrew Lutomirski, LKML, X86 ML, Ingo Molnar, Thomas Gleixner, Dave Hansen, Mike Galbraith, songliubraving, kernel-team On Fri, Jun 22, 2018 at 8:15 AM Rik van Riel <riel@surriel.com> wrote: > > On Fri, 2018-06-22 at 08:04 -0700, Andy Lutomirski wrote: > > On Wed, Jun 20, 2018 at 12:57 PM Rik van Riel <riel@surriel.com> > > wrote: > > > > > > Lazy TLB mode can result in an idle CPU being woken up by a TLB > > > flush, > > > when all it really needs to do is reload %CR3 at the next context > > > switch, > > > assuming no page table pages got freed. > > > > > > This patch deals with that issue by introducing a third TLB state, > > > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next > > > context > > > switch. > > > > > > Atomic compare and exchange is used to close races between the TLB > > > shootdown code and the context switch code. Keying off just the > > > tlb_gen is likely to not be enough, since that would not give > > > lazy_clb_can_skip_flush() information on when it is facing a race > > > and has to send the IPI to a CPU in the middle of a LAZY -> OK > > > switch. > > > > > > Unlike the 2016 version of this patch, CPUs in TLBSTATE_LAZY are > > > not > > > removed from the mm_cpumask(mm), since that would prevent the TLB > > > flush IPIs at page table free time from being sent to all the CPUs > > > that need them. > > > > Eek, this is so complicated. In the 2016 version of the patches, you > > needed all this. But I rewrote the whole subsystem to make it easier > > now :) I think that you can get rid of all of this and instead just > > revert the relevant parts of: > > > > b956575bed91ecfb136a8300742ecbbf451471ab > > > > All the bookkeeping is already in place -- no need for new state. > > I looked at using your .tlb_gen stuff, but we need a > way to do that race free. I suppose setting the > tlbstate to !lazy before checking .tlb_gen might do > the trick, if we get the ordering right at the tlb > invalidation site, too? Oh, right. > > Something like this: > > context switch tlb invalidation > > advance mm->context.tlb_gen > send IPI to cpus with !is_lazy tlb > > > tlbstate.is_lazy = FALSE > *need_flush = .tlb_gen < next_tlb_gen > > Do you see any holes in that? Logically, is_lazy is (with your patches) just like mm_cpumask in terms of ordering. So I think your idea above is fine. But I think you need to make sure there's a full barrier between is_lazy = false and reading .tlb_gen. --Andy ^ permalink raw reply [flat|nested] 46+ messages in thread
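Sketched out with the names used in the patches above, the ordering Rik and Andy describe might look like the following; this is an illustration of the barrier pairing only, not the code that was eventually merged:

	/* Context switch side, leaving lazy TLB mode: */
	this_cpu_write(cpu_tlbstate.is_lazy, false);
	/*
	 * Full barrier: make the is_lazy store visible before sampling
	 * tlb_gen, so a concurrent flusher either still sees this CPU as
	 * non-lazy (and sends the IPI), or this CPU sees the new tlb_gen
	 * and flushes at the context switch.
	 */
	smp_mb();
	next_tlb_gen = atomic64_read(&next->context.tlb_gen);
	need_flush = this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) <
		     next_tlb_gen;

	/* TLB invalidation side: */
	inc_mm_tlb_gen(mm);	/* atomic increment, its own full barrier */
	/* Only after that, decide which lazy CPUs may skip the IPI. */
	if (per_cpu(cpu_tlbstate.is_lazy, cpu))
		cpumask_clear_cpu(cpu, lazymask);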
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-20 19:56 ` [PATCH 4/7] x86,tlb: make lazy TLB mode lazier Rik van Riel 2018-06-22 15:04 ` Andy Lutomirski @ 2018-06-22 17:05 ` Dave Hansen 2018-06-22 17:16 ` Rik van Riel 1 sibling, 1 reply; 46+ messages in thread From: Dave Hansen @ 2018-06-22 17:05 UTC (permalink / raw) To: Rik van Riel, linux-kernel Cc: 86, luto, mingo, tglx, efault, songliubraving, kernel-team On 06/20/2018 12:56 PM, Rik van Riel wrote: > This patch deals with that issue by introducing a third TLB state, > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next context > switch. With PCIDs, do we need to be a bit more explicit about what kind of %CR3 reload we have? Because, with PCIDs, we do have non-TLB-flushing %CR3 writes. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier 2018-06-22 17:05 ` Dave Hansen @ 2018-06-22 17:16 ` Rik van Riel 0 siblings, 0 replies; 46+ messages in thread From: Rik van Riel @ 2018-06-22 17:16 UTC (permalink / raw) To: Dave Hansen, linux-kernel Cc: x86, luto, mingo, tglx, efault, songliubraving, kernel-team [-- Attachment #1: Type: text/plain, Size: 943 bytes --] On Fri, 2018-06-22 at 10:05 -0700, Dave Hansen wrote: > On 06/20/2018 12:56 PM, Rik van Riel wrote: > > This patch deals with that issue by introducing a third TLB state, > > TLBSTATE_FLUSH, which causes %CR3 to be reloaded at the next > > context > > switch. > > With PCIDs, do we need to be a bit more explicit about what kind of > %CR3 > reload we have? Because, with PCIDs, we do have non-TLB-flushing > %CR3 > writes. Probably not in this patch series, because switch_mm_irqs_off calls choose_new_asid, which figures all of this out for us already. Specifically, it has this bit of code to figure out whether we need a flush: *need_flush = (this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen) < next_tlb_gen); As long as we get the memory ordering right, we should be able to rely on that to switch back from lazy to non-lazy TLB mode. -- All Rights Reversed. [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
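For the PCID question, a condensed sketch of how the need_flush result drives the %CR3 write; this assumes the load_new_mm_cr3() helper already present in arch/x86/mm/tlb.c and simplifies the flow, so treat it as an illustration rather than the exact upstream code:

	choose_new_asid(next, next_tlb_gen, &new_asid, &need_flush);

	/*
	 * With PCID, need_flush selects between the two kinds of %CR3
	 * write Dave mentions: the non-flushing variant keeps the
	 * ASID's cached translations, the flushing one discards them.
	 */
	load_new_mm_cr3(next->pgd, new_asid, need_flush);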