* [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup
From: Nicholas Piggin @ 2020-07-10  1:56 UTC
To: linux-arch
Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard

This blew up a bit bigger than I thought, so I'd like to get some comments
as to whether people agree with the direction it's going. The patches
aren't cleanly split out by arch, but as it is now it's probably easier to
get a quick overview of the changes at a glance anyway.

So there are a few different things here.

1. Clean up and use asm-generic for no-op mmu context functions (so
   not just for nommu architectures). This should be functionally a
   no-op for everybody. This allows exit_lazy_tlb to easily be added.

2. Add exit_lazy_tlb and use it for x86, so this is x86 and membarrier
   specific changes. I _may_ have spotted a small membarrier / core sync
   bug here when adding exit_lazy_tlb.

3. Tidy up lazy tlb a little bit, have its own refcount function and
   allow it to be selected out. We can audit the nommu archs and
   deselect it for those.

4. Add a non-refcounting lazy mmu mode, to help scalability when the
   same mm is used for a lot of lazy mmu switching.

Comments, questions on anything would be much appreciated.
Thanks,
Nick

Nicholas Piggin (7):
  asm-generic: add generic MMU versions of mmu context functions
  arch: use asm-generic mmu context for no-op implementations
  mm: introduce exit_lazy_tlb
  x86: use exit_lazy_tlb rather than
    membarrier_mm_sync_core_before_usermode
  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm switching to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option

 .../membarrier-sync-core/arch-support.txt    |  6 +-
 arch/Kconfig                                 | 23 +++++
 arch/alpha/include/asm/mmu_context.h         | 12 +--
 arch/arc/include/asm/mmu_context.h           | 16 ++--
 arch/arm/include/asm/mmu_context.h           | 26 +-----
 arch/arm64/include/asm/mmu_context.h         |  7 +-
 arch/csky/include/asm/mmu_context.h          |  8 +-
 arch/hexagon/include/asm/mmu_context.h       | 33 ++------
 arch/ia64/include/asm/mmu_context.h          | 17 +---
 arch/m68k/include/asm/mmu_context.h          | 47 ++---------
 arch/microblaze/include/asm/mmu_context.h    |  2 +-
 arch/microblaze/include/asm/mmu_context_mm.h |  8 +-
 arch/microblaze/include/asm/processor.h      |  3 -
 arch/mips/include/asm/mmu_context.h          | 11 +--
 arch/nds32/include/asm/mmu_context.h         | 10 +--
 arch/nios2/include/asm/mmu_context.h         | 21 +----
 arch/nios2/mm/mmu_context.c                  |  1 +
 arch/openrisc/include/asm/mmu_context.h      |  8 +-
 arch/openrisc/mm/tlb.c                       |  2 +
 arch/parisc/include/asm/mmu_context.h        | 12 +--
 arch/powerpc/Kconfig                         |  1 +
 arch/powerpc/include/asm/mmu_context.h       | 22 ++---
 arch/powerpc/kernel/smp.c                    |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c         |  4 +-
 arch/riscv/include/asm/mmu_context.h         | 22 +----
 arch/s390/include/asm/mmu_context.h          |  9 +-
 arch/sh/include/asm/mmu_context.h            |  7 +-
 arch/sh/include/asm/mmu_context_32.h         |  9 --
 arch/sparc/include/asm/mmu_context_32.h      | 10 +--
 arch/sparc/include/asm/mmu_context_64.h      | 10 +--
 arch/um/include/asm/mmu_context.h            | 12 ++-
 arch/unicore32/include/asm/mmu_context.h     | 24 +-----
 arch/x86/include/asm/mmu_context.h           | 41 +++++++++
 arch/x86/include/asm/sync_core.h             | 28 -------
 arch/xtensa/include/asm/mmu_context.h        | 11 +--
 arch/xtensa/include/asm/nommu_context.h      | 26 +-----
 fs/exec.c                                    |  5 +-
 include/asm-generic/mmu_context.h            | 77 +++++++++++++---
 include/asm-generic/nommu_context.h          | 19 +++++
 include/linux/sched/mm.h                     | 35 ++++----
 include/linux/sync_core.h                    | 21 -----
 kernel/cpu.c                                 |  6 +-
 kernel/exit.c                                |  2 +-
 kernel/fork.c                                | 39 +++++++++
 kernel/kthread.c                             | 12 ++-
 kernel/sched/core.c                          | 84 ++++++++++++-------
 kernel/sched/sched.h                         |  4 +-
 47 files changed, 388 insertions(+), 427 deletions(-)
 delete mode 100644 arch/x86/include/asm/sync_core.h
 create mode 100644 include/asm-generic/nommu_context.h
 delete mode 100644 include/linux/sync_core.h

-- 
2.23.0
* [RFC PATCH 1/7] asm-generic: add generic MMU versions of mmu context functions
From: Nicholas Piggin @ 2020-07-10  1:56 UTC
To: linux-arch
Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard,
	Remis Lima Baima

Many of these are no-ops on many architectures, so extend mmu_context.h
to cover MMU and NOMMU, and split the NOMMU bits out to nommu_context.h

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Remis Lima Baima <remis.developer@googlemail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/microblaze/include/asm/mmu_context.h |  2 +-
 arch/sh/include/asm/mmu_context.h         |  2 +-
 include/asm-generic/mmu_context.h         | 57 +++++++++++++++++------
 include/asm-generic/nommu_context.h       | 19 ++++++++
 4 files changed, 64 insertions(+), 16 deletions(-)
 create mode 100644 include/asm-generic/nommu_context.h

diff --git a/arch/microblaze/include/asm/mmu_context.h b/arch/microblaze/include/asm/mmu_context.h
index f74f9da07fdc..34004efb3def 100644
--- a/arch/microblaze/include/asm/mmu_context.h
+++ b/arch/microblaze/include/asm/mmu_context.h
@@ -2,5 +2,5 @@
 #ifdef CONFIG_MMU
 # include <asm/mmu_context_mm.h>
 #else
-# include <asm-generic/mmu_context.h>
+# include <asm-generic/nommu_context.h>
 #endif

diff --git a/arch/sh/include/asm/mmu_context.h b/arch/sh/include/asm/mmu_context.h
index 48e67d544d53..9470d17c71c2 100644
--- a/arch/sh/include/asm/mmu_context.h
+++ b/arch/sh/include/asm/mmu_context.h
@@ -134,7 +134,7 @@ static inline void switch_mm(struct mm_struct *prev,
 #define set_TTB(pgd)	do { } while (0)
 #define get_TTB()	(0)
 
-#include <asm-generic/mmu_context.h>
+#include <asm-generic/nommu_context.h>
 
 #endif /* CONFIG_MMU */

diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index 6be9106fb6fb..86cea80a50df 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -3,44 +3,73 @@
 #define __ASM_GENERIC_MMU_CONTEXT_H
 
 /*
- * Generic hooks for NOMMU architectures, which do not need to do
- * anything special here.
+ * Generic hooks to implement no-op functionality.
  */
 
-#include <asm-generic/mm_hooks.h>
-
 struct task_struct;
 struct mm_struct;
 
+/*
+ * enter_lazy_tlb - Called when "tsk" is about to enter lazy TLB mode.
+ *
+ * @mm:  the currently active mm context which is becoming lazy
+ * @tsk: task which is entering lazy tlb
+ *
+ * tsk->mm will be NULL
+ */
+#ifndef enter_lazy_tlb
 static inline void enter_lazy_tlb(struct mm_struct *mm,
 			struct task_struct *tsk)
 {
 }
+#endif
 
+/**
+ * init_new_context - Initialize context of a new mm_struct.
+ * @tsk: task struct for the mm
+ * @mm: the new mm struct
+ */
+#ifndef init_new_context
 static inline int init_new_context(struct task_struct *tsk,
 			struct mm_struct *mm)
 {
 	return 0;
 }
+#endif
 
+/**
+ * destroy_context - Undo init_new_context when the mm is going away
+ * @mm: old mm struct
+ */
+#ifndef destroy_context
 static inline void destroy_context(struct mm_struct *mm)
 {
 }
+#endif
 
-static inline void deactivate_mm(struct task_struct *task,
-			struct mm_struct *mm)
-{
-}
-
-static inline void switch_mm(struct mm_struct *prev,
-			struct mm_struct *next,
-			struct task_struct *tsk)
+/**
+ * activate_mm - called after exec switches the current task to a new mm, to switch to it
+ * @prev_mm: previous mm of this task
+ * @next_mm: new mm
+ */
+#ifndef activate_mm
+static inline void activate_mm(struct mm_struct *prev_mm,
+			struct mm_struct *next_mm)
 {
+	switch_mm(prev_mm, next_mm, current);
 }
+#endif
 
-static inline void activate_mm(struct mm_struct *prev_mm,
-			struct mm_struct *next_mm)
+/**
+ * deactivate_mm - called when an mm is released after exit or exec switches away from it
+ * @tsk: the task
+ * @mm: the old mm
+ */
+#ifndef deactivate_mm
+static inline void deactivate_mm(struct task_struct *tsk,
+			struct mm_struct *mm)
 {
 }
+#endif
 
 #endif /* __ASM_GENERIC_MMU_CONTEXT_H */

diff --git a/include/asm-generic/nommu_context.h b/include/asm-generic/nommu_context.h
new file mode 100644
index 000000000000..72b8d8b1d81e
--- /dev/null
+++ b/include/asm-generic/nommu_context.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_NOMMU_H
+#define __ASM_GENERIC_NOMMU_H
+
+/*
+ * Generic hooks for NOMMU architectures, which do not need to do
+ * anything special here.
+ */
+
+#include <asm-generic/mm_hooks.h>
+#include <asm-generic/mmu_context.h>
+
+static inline void switch_mm(struct mm_struct *prev,
+			struct mm_struct *next,
+			struct task_struct *tsk)
+{
+}
+
+#endif /* __ASM_GENERIC_NOMMU_H */
-- 
2.23.0
* [RFC PATCH 2/7] arch: use asm-generic mmu context for no-op implementations
From: Nicholas Piggin @ 2020-07-10  1:56 UTC
To: linux-arch
Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann,
	Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard

This patch bunches all architectures together. If the general idea is
accepted I will split them individually.

Some architectures can go further e.g., with consolidating switch_mm
and activate_mm but I only did the more obvious ones.
---
 arch/alpha/include/asm/mmu_context.h         | 12 ++---
 arch/arc/include/asm/mmu_context.h           | 16 +++---
 arch/arm/include/asm/mmu_context.h           | 26 ++---------
 arch/arm64/include/asm/mmu_context.h         |  7 ++-
 arch/csky/include/asm/mmu_context.h          |  8 ++--
 arch/hexagon/include/asm/mmu_context.h       | 33 +++-----------
 arch/ia64/include/asm/mmu_context.h          | 17 ++-----
 arch/m68k/include/asm/mmu_context.h          | 47 ++++----------------
 arch/microblaze/include/asm/mmu_context_mm.h |  8 ++--
 arch/microblaze/include/asm/processor.h      |  3 --
 arch/mips/include/asm/mmu_context.h          | 11 ++---
 arch/nds32/include/asm/mmu_context.h         | 10 +----
 arch/nios2/include/asm/mmu_context.h         | 21 ++-------
 arch/nios2/mm/mmu_context.c                  |  1 +
 arch/openrisc/include/asm/mmu_context.h      |  8 ++--
 arch/openrisc/mm/tlb.c                       |  2 +
 arch/parisc/include/asm/mmu_context.h        | 12 ++---
 arch/powerpc/include/asm/mmu_context.h       | 22 +++------
 arch/riscv/include/asm/mmu_context.h         | 22 +--------
 arch/s390/include/asm/mmu_context.h          |  9 ++--
 arch/sh/include/asm/mmu_context.h            |  5 +--
 arch/sh/include/asm/mmu_context_32.h         |  9 ----
 arch/sparc/include/asm/mmu_context_32.h      | 10 ++---
 arch/sparc/include/asm/mmu_context_64.h      | 10 ++---
 arch/um/include/asm/mmu_context.h            | 12 +++--
 arch/unicore32/include/asm/mmu_context.h     | 24 ++--------
 arch/x86/include/asm/mmu_context.h           |  6 +++
 arch/xtensa/include/asm/mmu_context.h        | 11 ++---
 arch/xtensa/include/asm/nommu_context.h      | 26 +---------
 29 files changed, 106 insertions(+), 302 deletions(-)

diff --git a/arch/alpha/include/asm/mmu_context.h b/arch/alpha/include/asm/mmu_context.h
index 6d7d9bc1b4b8..4eea7c616992 100644
--- a/arch/alpha/include/asm/mmu_context.h
+++ b/arch/alpha/include/asm/mmu_context.h
@@ -214,8 +214,6 @@ ev4_activate_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm)
 	tbiap();
 }
 
-#define deactivate_mm(tsk,mm)	do { } while (0)
-
 #ifdef CONFIG_ALPHA_GENERIC
 # define switch_mm(a,b,c)	alpha_mv.mv_switch_mm((a),(b),(c))
 # define activate_mm(x,y)	alpha_mv.mv_activate_mm((x),(y))
@@ -229,6 +227,7 @@ ev4_activate_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm)
 # endif
 #endif
 
+#define init_new_context init_new_context
 static inline int
 init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 {
@@ -242,12 +241,7 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	return 0;
 }
 
-extern inline void
-destroy_context(struct mm_struct *mm)
-{
-	/* Nothing to do.
*/ -} - +#define enter_lazy_tlb enter_lazy_tlb static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { @@ -255,6 +249,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) = ((unsigned long)mm->pgd - IDENT_ADDR) >> PAGE_SHIFT; } +#include <asm-generic/mmu_context.h> + #ifdef __MMU_EXTERN_INLINE #undef __EXTERN_INLINE #undef __MMU_EXTERN_INLINE diff --git a/arch/arc/include/asm/mmu_context.h b/arch/arc/include/asm/mmu_context.h index 3a5e6a5b9ed6..586d31902a99 100644 --- a/arch/arc/include/asm/mmu_context.h +++ b/arch/arc/include/asm/mmu_context.h @@ -102,6 +102,7 @@ static inline void get_new_mmu_context(struct mm_struct *mm) * Initialize the context related info for a new mm_struct * instance. */ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -113,6 +114,7 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm) return 0; } +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { unsigned long flags; @@ -153,13 +155,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, } /* - * Called at the time of execve() to get a new ASID - * Note the subtlety here: get_new_mmu_context() behaves differently here - * vs. in switch_mm(). Here it always returns a new ASID, because mm has - * an unallocated "initial" value, while in latter, it moves to a new ASID, - * only if it was unallocated + * activate_mm defaults to switch_mm and is called at the time of execve() to + * get a new ASID Note the subtlety here: get_new_mmu_context() behaves + * differently here vs. in switch_mm(). 
Here it always returns a new ASID, + * because mm has an unallocated "initial" value, while in latter, it moves to + * a new ASID, only if it was unallocated */ -#define activate_mm(prev, next) switch_mm(prev, next, NULL) /* it seemed that deactivate_mm( ) is a reasonable place to do book-keeping * for retiring-mm. However destroy_context( ) still needs to do that because @@ -168,8 +169,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, * there is a good chance that task gets sched-out/in, making it's ASID valid * again (this teased me for a whole day). */ -#define deactivate_mm(tsk, mm) do { } while (0) -#define enter_lazy_tlb(mm, tsk) +#include <asm-generic/mmu_context.h> #endif /* __ASM_ARC_MMU_CONTEXT_H */ diff --git a/arch/arm/include/asm/mmu_context.h b/arch/arm/include/asm/mmu_context.h index f99ed524fe41..84e58956fcab 100644 --- a/arch/arm/include/asm/mmu_context.h +++ b/arch/arm/include/asm/mmu_context.h @@ -26,6 +26,8 @@ void __check_vmalloc_seq(struct mm_struct *mm); #ifdef CONFIG_CPU_HAS_ASID void check_and_switch_context(struct mm_struct *mm, struct task_struct *tsk); + +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -92,32 +94,10 @@ static inline void finish_arch_post_lock_switch(void) #endif /* CONFIG_MMU */ -static inline int -init_new_context(struct task_struct *tsk, struct mm_struct *mm) -{ - return 0; -} - - #endif /* CONFIG_CPU_HAS_ASID */ -#define destroy_context(mm) do { } while(0) #define activate_mm(prev,next) switch_mm(prev, next, NULL) -/* - * This is called when "tsk" is about to enter lazy TLB mode. 
- * - * mm: describes the currently active mm context - * tsk: task which is entering lazy tlb - * cpu: cpu number which is entering lazy tlb - * - * tsk->mm will be NULL - */ -static inline void -enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* * This is the actual mm switch as far as the scheduler * is concerned. No registers are touched. We avoid @@ -149,6 +129,6 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, #endif } -#define deactivate_mm(tsk,mm) do { } while (0) +#include <asm-generic/mmu_context.h> #endif diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h index b0bd9b55594c..0f5e351f586a 100644 --- a/arch/arm64/include/asm/mmu_context.h +++ b/arch/arm64/include/asm/mmu_context.h @@ -174,7 +174,6 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp) * Setting a reserved TTBR0 or EPD0 would work, but it all gets ugly when you * take CPU migration into account. */ -#define destroy_context(mm) do { } while(0) void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); #define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.id, 0); 0; }) @@ -202,6 +201,7 @@ static inline void update_saved_ttbr0(struct task_struct *tsk, } #endif +#define enter_lazy_tlb enter_lazy_tlb static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { @@ -244,12 +244,11 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, update_saved_ttbr0(tsk, next); } -#define deactivate_mm(tsk,mm) do { } while (0) -#define activate_mm(prev,next) switch_mm(prev, next, current) - void verify_cpu_asid_bits(void); void post_ttbr_update_workaround(void); +#include <asm-generic/mmu_context.h> + #endif /* !__ASSEMBLY__ */ #endif /* !__ASM_MMU_CONTEXT_H */ diff --git a/arch/csky/include/asm/mmu_context.h b/arch/csky/include/asm/mmu_context.h index abdf1f1cb6ec..b227d29393a8 100644 --- a/arch/csky/include/asm/mmu_context.h +++ b/arch/csky/include/asm/mmu_context.h @@ -24,11 +24,6 @@ 
#define cpu_asid(mm) (atomic64_read(&mm->context.asid) & ASID_MASK) #define init_new_context(tsk,mm) ({ atomic64_set(&(mm)->context.asid, 0); 0; }) -#define activate_mm(prev,next) switch_mm(prev, next, current) - -#define destroy_context(mm) do {} while (0) -#define enter_lazy_tlb(mm, tsk) do {} while (0) -#define deactivate_mm(tsk, mm) do {} while (0) void check_and_switch_context(struct mm_struct *mm, unsigned int cpu); @@ -46,4 +41,7 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, flush_icache_deferred(next); } + +#include <asm-generic/mmu_context.h> + #endif /* __ASM_CSKY_MMU_CONTEXT_H */ diff --git a/arch/hexagon/include/asm/mmu_context.h b/arch/hexagon/include/asm/mmu_context.h index cdc4adc0300a..81947764c47d 100644 --- a/arch/hexagon/include/asm/mmu_context.h +++ b/arch/hexagon/include/asm/mmu_context.h @@ -15,39 +15,13 @@ #include <asm/pgalloc.h> #include <asm/mem-layout.h> -static inline void destroy_context(struct mm_struct *mm) -{ -} - /* * VM port hides all TLB management, so "lazy TLB" isn't very * meaningful. Even for ports to architectures with visble TLBs, * this is almost invariably a null function. + * + * mm->context is set up by pgd_alloc, so no init_new_context required. */ -static inline void enter_lazy_tlb(struct mm_struct *mm, - struct task_struct *tsk) -{ -} - -/* - * Architecture-specific actions, if any, for memory map deactivation. 
- */ -static inline void deactivate_mm(struct task_struct *tsk, - struct mm_struct *mm) -{ -} - -/** - * init_new_context - initialize context related info for new mm_struct instance - * @tsk: pointer to a task struct - * @mm: pointer to a new mm struct - */ -static inline int init_new_context(struct task_struct *tsk, - struct mm_struct *mm) -{ - /* mm->context is set up by pgd_alloc */ - return 0; -} /* * Switch active mm context @@ -74,6 +48,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, /* * Activate new memory map for task */ +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) { unsigned long flags; @@ -86,4 +61,6 @@ static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) /* Generic hooks for arch_dup_mmap and arch_exit_mmap */ #include <asm-generic/mm_hooks.h> +#include <asm-generic/mmu_context.h> + #endif diff --git a/arch/ia64/include/asm/mmu_context.h b/arch/ia64/include/asm/mmu_context.h index 2da0e2eb036b..87a0d5bc11ef 100644 --- a/arch/ia64/include/asm/mmu_context.h +++ b/arch/ia64/include/asm/mmu_context.h @@ -49,11 +49,6 @@ DECLARE_PER_CPU(u8, ia64_need_tlb_flush); extern void mmu_context_init (void); extern void wrap_mmu_context (struct mm_struct *mm); -static inline void -enter_lazy_tlb (struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* * When the context counter wraps around all TLBs need to be flushed because * an old context number might have been reused. This is signalled by the @@ -116,6 +111,7 @@ get_mmu_context (struct mm_struct *mm) * Initialize context number to some sane value. MM is guaranteed to be a * brand-new address-space, so no TLB flushing is needed, ever. 
*/ +#define init_new_context init_new_context static inline int init_new_context (struct task_struct *p, struct mm_struct *mm) { @@ -123,12 +119,6 @@ init_new_context (struct task_struct *p, struct mm_struct *mm) return 0; } -static inline void -destroy_context (struct mm_struct *mm) -{ - /* Nothing to do. */ -} - static inline void reload_context (nv_mm_context_t context) { @@ -178,11 +168,10 @@ activate_context (struct mm_struct *mm) } while (unlikely(context != mm->context)); } -#define deactivate_mm(tsk,mm) do { } while (0) - /* * Switch from address space PREV to address space NEXT. */ +#define activate_mm activate_mm static inline void activate_mm (struct mm_struct *prev, struct mm_struct *next) { @@ -196,5 +185,7 @@ activate_mm (struct mm_struct *prev, struct mm_struct *next) #define switch_mm(prev_mm,next_mm,next_task) activate_mm(prev_mm, next_mm) +#include <asm-generic/mmu_context.h> + # endif /* ! __ASSEMBLY__ */ #endif /* _ASM_IA64_MMU_CONTEXT_H */ diff --git a/arch/m68k/include/asm/mmu_context.h b/arch/m68k/include/asm/mmu_context.h index cac9f289d1f6..56ae27322178 100644 --- a/arch/m68k/include/asm/mmu_context.h +++ b/arch/m68k/include/asm/mmu_context.h @@ -5,10 +5,6 @@ #include <asm-generic/mm_hooks.h> #include <linux/mm_types.h> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - #ifdef CONFIG_MMU #if defined(CONFIG_COLDFIRE) @@ -58,6 +54,7 @@ static inline void get_mmu_context(struct mm_struct *mm) /* * We're finished using the context for an address space. */ +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { if (mm->context != NO_CONTEXT) { @@ -79,19 +76,6 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, set_context(tsk->mm->context, next->pgd); } -/* - * After we have set current->mm to a new value, this activates - * the context for the new mm so we see the new mappings. 
- */ -static inline void activate_mm(struct mm_struct *active_mm, - struct mm_struct *mm) -{ - get_mmu_context(mm); - set_context(mm->context, mm->pgd); -} - -#define deactivate_mm(tsk, mm) do { } while (0) - #define prepare_arch_switch(next) load_ksp_mmu(next) static inline void load_ksp_mmu(struct task_struct *task) @@ -176,6 +160,7 @@ extern unsigned long get_free_context(struct mm_struct *mm); extern void clear_context(unsigned long context); /* set the context for a new task to unmapped */ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -210,8 +195,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, activate_context(tsk->mm); } -#define deactivate_mm(tsk, mm) do { } while (0) - +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm) { @@ -224,6 +208,7 @@ static inline void activate_mm(struct mm_struct *prev_mm, #include <asm/page.h> #include <asm/pgalloc.h> +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -231,8 +216,6 @@ static inline int init_new_context(struct task_struct *tsk, return 0; } -#define destroy_context(mm) do { } while(0) - static inline void switch_mm_0230(struct mm_struct *mm) { unsigned long crp[2] = { @@ -300,8 +283,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, str } } -#define deactivate_mm(tsk,mm) do { } while (0) - +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm) { @@ -315,24 +297,11 @@ static inline void activate_mm(struct mm_struct *prev_mm, #endif -#else /* !CONFIG_MMU */ +#include <asm-generic/mmu_context.h> -static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) -{ - return 0; -} - - -static inline void switch_mm(struct mm_struct *prev, struct mm_struct 
*next, struct task_struct *tsk) -{ -} - -#define destroy_context(mm) do { } while (0) -#define deactivate_mm(tsk,mm) do { } while (0) +#else /* !CONFIG_MMU */ -static inline void activate_mm(struct mm_struct *prev_mm, struct mm_struct *next_mm) -{ -} +#include <asm-generic/nommu_context.h> #endif /* CONFIG_MMU */ #endif /* __M68K_MMU_CONTEXT_H */ diff --git a/arch/microblaze/include/asm/mmu_context_mm.h b/arch/microblaze/include/asm/mmu_context_mm.h index a1c7dd48454c..c2c77f708455 100644 --- a/arch/microblaze/include/asm/mmu_context_mm.h +++ b/arch/microblaze/include/asm/mmu_context_mm.h @@ -33,10 +33,6 @@ to represent all kernel pages as shared among all contexts. */ -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - # define NO_CONTEXT 256 # define LAST_CONTEXT 255 # define FIRST_CONTEXT 1 @@ -105,6 +101,7 @@ static inline void get_mmu_context(struct mm_struct *mm) /* * We're finished using the context for an address space. */ +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { if (mm->context != NO_CONTEXT) { @@ -126,6 +123,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, * After we have set current->mm to a new value, this activates * the context for the new mm so we see the new mappings. 
*/ +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *active_mm, struct mm_struct *mm) { @@ -136,5 +134,7 @@ static inline void activate_mm(struct mm_struct *active_mm, extern void mmu_context_init(void); +#include <asm-generic/mmu_context.h> + # endif /* __KERNEL__ */ #endif /* _ASM_MICROBLAZE_MMU_CONTEXT_H */ diff --git a/arch/microblaze/include/asm/processor.h b/arch/microblaze/include/asm/processor.h index 1ff5a82b76b6..616211871a6e 100644 --- a/arch/microblaze/include/asm/processor.h +++ b/arch/microblaze/include/asm/processor.h @@ -122,9 +122,6 @@ unsigned long get_wchan(struct task_struct *p); # define KSTK_EIP(task) (task_pc(task)) # define KSTK_ESP(task) (task_sp(task)) -/* FIXME */ -# define deactivate_mm(tsk, mm) do { } while (0) - # define STACK_TOP TASK_SIZE # define STACK_TOP_MAX STACK_TOP diff --git a/arch/mips/include/asm/mmu_context.h b/arch/mips/include/asm/mmu_context.h index cddead91acd4..ed9f2d748f63 100644 --- a/arch/mips/include/asm/mmu_context.h +++ b/arch/mips/include/asm/mmu_context.h @@ -124,10 +124,6 @@ static inline void set_cpu_context(unsigned int cpu, #define cpu_asid(cpu, mm) \ (cpu_context((cpu), (mm)) & cpu_asid_mask(&cpu_data[cpu])) -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - extern void get_new_mmu_context(struct mm_struct *mm); extern void check_mmu_context(struct mm_struct *mm); extern void check_switch_mmu_context(struct mm_struct *mm); @@ -136,6 +132,7 @@ extern void check_switch_mmu_context(struct mm_struct *mm); * Initialize the context related info for a new mm_struct * instance. */ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -180,14 +177,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, * Destroy context related info for an mm_struct that is about * to be put to rest. 
*/ +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { dsemul_mm_cleanup(mm); } -#define activate_mm(prev, next) switch_mm(prev, next, current) -#define deactivate_mm(tsk, mm) do { } while (0) - static inline void drop_mmu_context(struct mm_struct *mm) { @@ -237,4 +232,6 @@ drop_mmu_context(struct mm_struct *mm) local_irq_restore(flags); } +#include <asm-generic/mmu_context.h> + #endif /* _ASM_MMU_CONTEXT_H */ diff --git a/arch/nds32/include/asm/mmu_context.h b/arch/nds32/include/asm/mmu_context.h index b8fd3d189fdc..c651bc8cacdc 100644 --- a/arch/nds32/include/asm/mmu_context.h +++ b/arch/nds32/include/asm/mmu_context.h @@ -9,6 +9,7 @@ #include <asm/proc-fns.h> #include <asm-generic/mm_hooks.h> +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -16,8 +17,6 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm) return 0; } -#define destroy_context(mm) do { } while(0) - #define CID_BITS 9 extern spinlock_t cid_lock; extern unsigned int cpu_last_cid; @@ -47,10 +46,6 @@ static inline void check_context(struct mm_struct *mm) __new_context(mm); } -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { @@ -62,7 +57,6 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, } } -#define deactivate_mm(tsk,mm) do { } while (0) -#define activate_mm(prev,next) switch_mm(prev, next, NULL) +#include <asm-generic/mmu_context.h> #endif diff --git a/arch/nios2/include/asm/mmu_context.h b/arch/nios2/include/asm/mmu_context.h index 78ab3dacf579..4f99ed09b5a7 100644 --- a/arch/nios2/include/asm/mmu_context.h +++ b/arch/nios2/include/asm/mmu_context.h @@ -26,16 +26,13 @@ extern unsigned long get_pid_from_context(mm_context_t *ctx); */ extern pgd_t *pgd_current; -static inline void 
enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* * Initialize the context related info for a new mm_struct instance. * * Set all new contexts to 0, that way the generation will never match * the currently running generation when this context is switched in. */ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -43,26 +40,16 @@ static inline int init_new_context(struct task_struct *tsk, return 0; } -/* - * Destroy context related info for an mm_struct that is about - * to be put to rest. - */ -static inline void destroy_context(struct mm_struct *mm) -{ -} - void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk); -static inline void deactivate_mm(struct task_struct *tsk, - struct mm_struct *mm) -{ -} - /* * After we have set current->mm to a new value, this activates * the context for the new mm so we see the new mappings. */ +#define activate_mm activate_mm void activate_mm(struct mm_struct *prev, struct mm_struct *next); +#include <asm-generic/mmu_context.h> + #endif /* _ASM_NIOS2_MMU_CONTEXT_H */ diff --git a/arch/nios2/mm/mmu_context.c b/arch/nios2/mm/mmu_context.c index 45d6b9c58d67..d77aa542deb2 100644 --- a/arch/nios2/mm/mmu_context.c +++ b/arch/nios2/mm/mmu_context.c @@ -103,6 +103,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * After we have set current->mm to a new value, this activates * the context for the new mm so we see the new mappings. 
*/ +#define activate_mm activate_mm void activate_mm(struct mm_struct *prev, struct mm_struct *next) { next->context = get_new_context(); diff --git a/arch/openrisc/include/asm/mmu_context.h b/arch/openrisc/include/asm/mmu_context.h index ced577542e29..a6702384c77d 100644 --- a/arch/openrisc/include/asm/mmu_context.h +++ b/arch/openrisc/include/asm/mmu_context.h @@ -17,13 +17,13 @@ #include <asm-generic/mm_hooks.h> +#define init_new_context init_new_context extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); +#define destroy_context destroy_context extern void destroy_context(struct mm_struct *mm); extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk); -#define deactivate_mm(tsk, mm) do { } while (0) - #define activate_mm(prev, next) switch_mm((prev), (next), NULL) /* current active pgd - this is similar to other processors pgd @@ -32,8 +32,6 @@ extern void switch_mm(struct mm_struct *prev, struct mm_struct *next, extern volatile pgd_t *current_pgd[]; /* defined in arch/openrisc/mm/fault.c */ -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} +#include <asm-generic/mmu_context.h> #endif diff --git a/arch/openrisc/mm/tlb.c b/arch/openrisc/mm/tlb.c index 4b680aed8f5f..821aab4cf3be 100644 --- a/arch/openrisc/mm/tlb.c +++ b/arch/openrisc/mm/tlb.c @@ -159,6 +159,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next, * instance. */ +#define init_new_context init_new_context int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { mm->context = NO_CONTEXT; @@ -170,6 +171,7 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) * drops it. 
*/ +#define destroy_context destroy_context void destroy_context(struct mm_struct *mm) { flush_tlb_mm(mm); diff --git a/arch/parisc/include/asm/mmu_context.h b/arch/parisc/include/asm/mmu_context.h index 07b89c74abeb..71f8a3679b83 100644 --- a/arch/parisc/include/asm/mmu_context.h +++ b/arch/parisc/include/asm/mmu_context.h @@ -8,16 +8,13 @@ #include <asm/pgalloc.h> #include <asm-generic/mm_hooks.h> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* on PA-RISC, we actually have enough contexts to justify an allocator * for them. prumpf */ extern unsigned long alloc_sid(void); extern void free_sid(unsigned long); +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -27,6 +24,7 @@ init_new_context(struct task_struct *tsk, struct mm_struct *mm) return 0; } +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { @@ -72,8 +70,7 @@ static inline void switch_mm(struct mm_struct *prev, } #define switch_mm_irqs_off switch_mm_irqs_off -#define deactivate_mm(tsk,mm) do { } while (0) - +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) { /* @@ -91,4 +88,7 @@ static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) switch_mm(prev,next,current); } + +#include <asm-generic/mmu_context.h> + #endif diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 1a474f6b1992..242bd987247b 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -14,7 +14,9 @@ /* * Most if the context management is out of line */ +#define init_new_context init_new_context extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); +#define destroy_context destroy_context extern void destroy_context(struct mm_struct *mm); #ifdef CONFIG_SPAPR_TCE_IOMMU struct 
mm_iommu_table_group_mem_t; @@ -237,27 +239,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, } #define switch_mm_irqs_off switch_mm_irqs_off - -#define deactivate_mm(tsk,mm) do { } while (0) - -/* - * After we have set current->mm to a new value, this activates - * the context for the new mm so we see the new mappings. - */ -static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) -{ - switch_mm(prev, next, current); -} - -/* We don't currently use enter_lazy_tlb() for anything */ +#ifdef CONFIG_PPC_BOOK3E_64 +#define enter_lazy_tlb enter_lazy_tlb static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { /* 64-bit Book3E keeps track of current PGD in the PACA */ -#ifdef CONFIG_PPC_BOOK3E_64 get_paca()->pgd = NULL; -#endif } +#endif extern void arch_exit_mmap(struct mm_struct *mm); @@ -300,5 +290,7 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm, return 0; } +#include <asm-generic/mmu_context.h> + #endif /* __KERNEL__ */ #endif /* __ASM_POWERPC_MMU_CONTEXT_H */ diff --git a/arch/riscv/include/asm/mmu_context.h b/arch/riscv/include/asm/mmu_context.h index 67c463812e2d..250defa06f3a 100644 --- a/arch/riscv/include/asm/mmu_context.h +++ b/arch/riscv/include/asm/mmu_context.h @@ -13,34 +13,16 @@ #include <linux/mm.h> #include <linux/sched.h> -static inline void enter_lazy_tlb(struct mm_struct *mm, - struct task_struct *task) -{ -} - -/* Initialize context-related info for a new mm_struct */ -static inline int init_new_context(struct task_struct *task, - struct mm_struct *mm) -{ - return 0; -} - -static inline void destroy_context(struct mm_struct *mm) -{ -} - void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *task); +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) { switch_mm(prev, next, NULL); } -static inline void deactivate_mm(struct task_struct *task, - struct mm_struct *mm) -{ -} 
+#include <asm-generic/mmu_context.h> #endif /* _ASM_RISCV_MMU_CONTEXT_H */ diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h index c9f3d8a52756..66f9cf0a07e3 100644 --- a/arch/s390/include/asm/mmu_context.h +++ b/arch/s390/include/asm/mmu_context.h @@ -15,6 +15,7 @@ #include <asm/ctl_reg.h> #include <asm-generic/mm_hooks.h> +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -69,8 +70,6 @@ static inline int init_new_context(struct task_struct *tsk, return 0; } -#define destroy_context(mm) do { } while (0) - static inline void set_user_asce(struct mm_struct *mm) { S390_lowcore.user_asce = mm->context.asce; @@ -125,9 +124,7 @@ static inline void finish_arch_post_lock_switch(void) set_fs(current->thread.mm_segment); } -#define enter_lazy_tlb(mm,tsk) do { } while (0) -#define deactivate_mm(tsk,mm) do { } while (0) - +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) { @@ -136,4 +133,6 @@ static inline void activate_mm(struct mm_struct *prev, set_user_asce(next); } +#include <asm-generic/mmu_context.h> + #endif /* __S390_MMU_CONTEXT_H */ diff --git a/arch/sh/include/asm/mmu_context.h b/arch/sh/include/asm/mmu_context.h index 9470d17c71c2..ce40147d4a7d 100644 --- a/arch/sh/include/asm/mmu_context.h +++ b/arch/sh/include/asm/mmu_context.h @@ -85,6 +85,7 @@ static inline void get_mmu_context(struct mm_struct *mm, unsigned int cpu) * Initialize the context related info for a new mm_struct * instance. 
*/ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -121,9 +122,7 @@ static inline void switch_mm(struct mm_struct *prev, activate_context(next, cpu); } -#define activate_mm(prev, next) switch_mm((prev),(next),NULL) -#define deactivate_mm(tsk,mm) do { } while (0) -#define enter_lazy_tlb(mm,tsk) do { } while (0) +#include <asm-generic/mmu_context.h> #else diff --git a/arch/sh/include/asm/mmu_context_32.h b/arch/sh/include/asm/mmu_context_32.h index 71bf12ef1f65..bc5034fa6249 100644 --- a/arch/sh/include/asm/mmu_context_32.h +++ b/arch/sh/include/asm/mmu_context_32.h @@ -2,15 +2,6 @@ #ifndef __ASM_SH_MMU_CONTEXT_32_H #define __ASM_SH_MMU_CONTEXT_32_H -/* - * Destroy context related info for an mm_struct that is about - * to be put to rest. - */ -static inline void destroy_context(struct mm_struct *mm) -{ - /* Do nothing */ -} - #ifdef CONFIG_CPU_HAS_PTEAEX static inline void set_asid(unsigned long asid) { diff --git a/arch/sparc/include/asm/mmu_context_32.h b/arch/sparc/include/asm/mmu_context_32.h index 7ddcb8badf70..509043f81560 100644 --- a/arch/sparc/include/asm/mmu_context_32.h +++ b/arch/sparc/include/asm/mmu_context_32.h @@ -6,13 +6,10 @@ #include <asm-generic/mm_hooks.h> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* Initialize a new mmu context. This is invoked when a new * address space instance (unique or shared) is instantiated. */ +#define init_new_context init_new_context int init_new_context(struct task_struct *tsk, struct mm_struct *mm); /* Destroy a dead context. This occurs when mmput drops the @@ -20,17 +17,18 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm); * all the page tables have been flushed. Our job is to destroy * any remaining processor-specific state. */ +#define destroy_context destroy_context void destroy_context(struct mm_struct *mm); /* Switch the current MM context. 
*/ void switch_mm(struct mm_struct *old_mm, struct mm_struct *mm, struct task_struct *tsk); -#define deactivate_mm(tsk,mm) do { } while (0) - /* Activate a new MM instance for the current task. */ #define activate_mm(active_mm, mm) switch_mm((active_mm), (mm), NULL) +#include <asm-generic/mmu_context.h> + #endif /* !(__ASSEMBLY__) */ #endif /* !(__SPARC_MMU_CONTEXT_H) */ diff --git a/arch/sparc/include/asm/mmu_context_64.h b/arch/sparc/include/asm/mmu_context_64.h index 312fcee8df2b..7a8380c63aab 100644 --- a/arch/sparc/include/asm/mmu_context_64.h +++ b/arch/sparc/include/asm/mmu_context_64.h @@ -16,17 +16,16 @@ #include <asm-generic/mm_hooks.h> #include <asm/percpu.h> -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - extern spinlock_t ctx_alloc_lock; extern unsigned long tlb_context_cache; extern unsigned long mmu_context_bmap[]; DECLARE_PER_CPU(struct mm_struct *, per_cpu_secondary_mm); void get_new_mmu_context(struct mm_struct *mm); + +#define init_new_context init_new_context int init_new_context(struct task_struct *tsk, struct mm_struct *mm); +#define destroy_context destroy_context void destroy_context(struct mm_struct *mm); void __tsb_context_switch(unsigned long pgd_pa, @@ -136,7 +135,6 @@ static inline void switch_mm(struct mm_struct *old_mm, struct mm_struct *mm, str spin_unlock_irqrestore(&mm->context.lock, flags); } -#define deactivate_mm(tsk,mm) do { } while (0) #define activate_mm(active_mm, mm) switch_mm(active_mm, mm, NULL) #define __HAVE_ARCH_START_CONTEXT_SWITCH @@ -187,6 +185,8 @@ static inline void finish_arch_post_lock_switch(void) } } +#include <asm-generic/mmu_context.h> + #endif /* !(__ASSEMBLY__) */ #endif /* !(__SPARC64_MMU_CONTEXT_H) */ diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h index 17ddd4edf875..f8a100770691 100644 --- a/arch/um/include/asm/mmu_context.h +++ b/arch/um/include/asm/mmu_context.h @@ -37,10 +37,9 @@ static inline bool 
arch_vma_access_permitted(struct vm_area_struct *vma, * end asm-generic/mm_hooks.h functions */ -#define deactivate_mm(tsk,mm) do { } while (0) - extern void force_flush_all(void); +#define activate_mm activate_mm static inline void activate_mm(struct mm_struct *old, struct mm_struct *new) { /* @@ -66,13 +65,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, } } -static inline void enter_lazy_tlb(struct mm_struct *mm, - struct task_struct *tsk) -{ -} - +#define init_new_context init_new_context extern int init_new_context(struct task_struct *task, struct mm_struct *mm); +#define destroy_context destroy_context extern void destroy_context(struct mm_struct *mm); +#include <asm-generic/mmu_context.h> + #endif diff --git a/arch/unicore32/include/asm/mmu_context.h b/arch/unicore32/include/asm/mmu_context.h index 388c0c811c68..e1751cb5439c 100644 --- a/arch/unicore32/include/asm/mmu_context.h +++ b/arch/unicore32/include/asm/mmu_context.h @@ -18,24 +18,6 @@ #include <asm/cacheflush.h> #include <asm/cpu-single.h> -#define init_new_context(tsk, mm) 0 - -#define destroy_context(mm) do { } while (0) - -/* - * This is called when "tsk" is about to enter lazy TLB mode. - * - * mm: describes the currently active mm context - * tsk: task which is entering lazy tlb - * cpu: cpu number which is entering lazy tlb - * - * tsk->mm will be NULL - */ -static inline void -enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - /* * This is the actual mm switch as far as the scheduler * is concerned. No registers are touched. We avoid @@ -52,9 +34,6 @@ switch_mm(struct mm_struct *prev, struct mm_struct *next, cpu_switch_mm(next->pgd, next); } -#define deactivate_mm(tsk, mm) do { } while (0) -#define activate_mm(prev, next) switch_mm(prev, next, NULL) - /* * We are inserting a "fake" vma for the user-accessible vector page so * gdb and friends can get to it through ptrace and /proc/<pid>/mem. 
@@ -95,4 +74,7 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, /* by default, allow everything */ return true; } + +#include <asm-generic/mmu_context.h> + #endif diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 47562147e70b..255750548433 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -92,12 +92,14 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) } #endif +#define enter_lazy_tlb enter_lazy_tlb extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk); /* * Init a new mm. Used on mm copies, like at fork() * and on mm's that are brand-new, like at execve(). */ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -117,6 +119,8 @@ static inline int init_new_context(struct task_struct *tsk, init_new_context_ldt(mm); return 0; } + +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { destroy_context_ldt(mm); @@ -215,4 +219,6 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma, unsigned long __get_current_cr3_fast(void); +#include <asm-generic/mmu_context.h> + #endif /* _ASM_X86_MMU_CONTEXT_H */ diff --git a/arch/xtensa/include/asm/mmu_context.h b/arch/xtensa/include/asm/mmu_context.h index 74923ef3b228..e337ba9686e9 100644 --- a/arch/xtensa/include/asm/mmu_context.h +++ b/arch/xtensa/include/asm/mmu_context.h @@ -111,6 +111,7 @@ static inline void activate_context(struct mm_struct *mm, unsigned int cpu) * to -1 says the process has never run on any core. 
*/ +#define init_new_context init_new_context static inline int init_new_context(struct task_struct *tsk, struct mm_struct *mm) { @@ -136,24 +137,18 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, activate_context(next, cpu); } -#define activate_mm(prev, next) switch_mm((prev), (next), NULL) -#define deactivate_mm(tsk, mm) do { } while (0) - /* * Destroy context related info for an mm_struct that is about * to be put to rest. */ +#define destroy_context destroy_context static inline void destroy_context(struct mm_struct *mm) { invalidate_page_directory(); } -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ - /* Nothing to do. */ - -} +#include <asm-generic/mmu_context.h> #endif /* CONFIG_MMU */ #endif /* _XTENSA_MMU_CONTEXT_H */ diff --git a/arch/xtensa/include/asm/nommu_context.h b/arch/xtensa/include/asm/nommu_context.h index 37251b2ef871..7c9d1918dc41 100644 --- a/arch/xtensa/include/asm/nommu_context.h +++ b/arch/xtensa/include/asm/nommu_context.h @@ -7,28 +7,4 @@ static inline void init_kio(void) { } -static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) -{ -} - -static inline int init_new_context(struct task_struct *tsk,struct mm_struct *mm) -{ - return 0; -} - -static inline void destroy_context(struct mm_struct *mm) -{ -} - -static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next) -{ -} - -static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, - struct task_struct *tsk) -{ -} - -static inline void deactivate_mm(struct task_struct *tsk, struct mm_struct *mm) -{ -} +#include <asm-generic/nommu_context.h> -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [RFC PATCH 3/7] mm: introduce exit_lazy_tlb 2020-07-10 1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin 2020-07-10 1:56 ` [RFC PATCH 1/7] asm-generic: add generic MMU versions of mmu context functions Nicholas Piggin 2020-07-10 1:56 ` [RFC PATCH 2/7] arch: use asm-generic mmu context for no-op implementations Nicholas Piggin @ 2020-07-10 1:56 ` Nicholas Piggin 2020-07-10 1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin ` (3 subsequent siblings) 6 siblings, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-10 1:56 UTC (permalink / raw) To: linux-arch Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- fs/exec.c | 5 +++-- include/asm-generic/mmu_context.h | 20 ++++++++++++++++++++ kernel/kthread.c | 1 + kernel/sched/core.c | 2 ++ 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index e6e8a9a70327..e2ab71e88293 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1117,9 +1117,10 @@ static int exec_mmap(struct mm_struct *mm) setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm); mm_update_next_owner(old_mm); mmput(old_mm); - return 0; + } else { + exit_lazy_tlb(active_mm, tsk); + mmdrop(active_mm); } - mmdrop(active_mm); return 0; } diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h index 86cea80a50df..3fc4c3879b79 100644 --- a/include/asm-generic/mmu_context.h +++ b/include/asm-generic/mmu_context.h @@ -24,6 +24,26 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, } #endif +/* + * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm. + * + * mm: the lazy mm context that was switched away from + * tsk: the task that was switched to non-lazy mm + * + * tsk->mm will not be NULL. 
+ * + * Note this is not symmetrical to enter_lazy_tlb, this is not + * called when tasks switch into the lazy mm, it's called after the + * lazy mm becomes non-lazy (either switched to a different mm or the + * owner of the mm returns). + */ +#ifndef exit_lazy_tlb +static inline void exit_lazy_tlb(struct mm_struct *mm, + struct task_struct *tsk) +{ +} +#endif + /** * init_new_context - Initialize context of a new mm_struct. * @tsk: task struct for the mm diff --git a/kernel/kthread.c b/kernel/kthread.c index 132f84a5fde3..e813d92f2eab 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1253,6 +1253,7 @@ void kthread_use_mm(struct mm_struct *mm) if (active_mm != mm) mmdrop(active_mm); + exit_lazy_tlb(active_mm, tsk); to_kthread(tsk)->oldfs = get_fs(); set_fs(USER_DS); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca5db40392d4..debc917bc69b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3439,6 +3439,8 @@ context_switch(struct rq *rq, struct task_struct *prev, switch_mm_irqs_off(prev->active_mm, next->mm, next); if (!prev->mm) { // from kernel + exit_lazy_tlb(prev->active_mm, next); + /* will mmdrop() in finish_task_switch(). */ rq->prev_mm = prev->active_mm; prev->active_mm = NULL; -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-10 1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin ` (2 preceding siblings ...) 2020-07-10 1:56 ` [RFC PATCH 3/7] mm: introduce exit_lazy_tlb Nicholas Piggin @ 2020-07-10 1:56 ` Nicholas Piggin 2020-07-10 9:42 ` Peter Zijlstra ` (2 more replies) 2020-07-10 1:56 ` [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions Nicholas Piggin ` (2 subsequent siblings) 6 siblings, 3 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-10 1:56 UTC (permalink / raw) To: linux-arch Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard And get rid of the generic sync_core_before_usermode facility. This helper is the wrong way around, I think. The idea that membarrier state requires a core sync before returning to user is the easy part that does not need hiding behind membarrier calls. The gap in core synchronization due to x86's sysret/sysexit and lazy tlb mode is the tricky detail that is better put in x86 lazy tlb code. Consider if an arch did not synchronize core in switch_mm either: then membarrier_mm_sync_core_before_usermode would be in the wrong place, but arch-specific mmu context functions would still be the right place. There is also an exit_lazy_tlb case that is not covered by this call, which could be a bug (a kthread uses the membarrier process's mm, then context switches back to the process without a switch_mm or lazy mm switch). This makes lazy tlb code a bit more modular. 
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- .../membarrier-sync-core/arch-support.txt | 6 +++- arch/x86/include/asm/mmu_context.h | 35 +++++++++++++++++++ arch/x86/include/asm/sync_core.h | 28 --------------- include/linux/sched/mm.h | 14 -------- include/linux/sync_core.h | 21 ----------- kernel/cpu.c | 4 ++- kernel/kthread.c | 2 +- kernel/sched/core.c | 16 ++++----- 8 files changed, 51 insertions(+), 75 deletions(-) delete mode 100644 arch/x86/include/asm/sync_core.h delete mode 100644 include/linux/sync_core.h diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt index 52ad74a25f54..bd43fb1f5986 100644 --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt @@ -5,6 +5,10 @@ # # Architecture requirements # +# If your architecture returns to user-space through non-core-serializing +# instructions, you need to ensure these are done in switch_mm and exit_lazy_tlb +# (if lazy tlb switching is implemented). +# # * arm/arm64/powerpc # # Rely on implicit context synchronization as a result of exception return @@ -24,7 +28,7 @@ # instead on write_cr3() performed by switch_mm() to provide core serialization # after changing the current mm, and deal with the special case of kthread -> # uthread (temporarily keeping current mm into active_mm) by issuing a -# sync_core_before_usermode() in that specific case. +# serializing instruction in exit_lazy_mm() in that specific case. 
# ----------------------- | arch |status| diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 255750548433..5263863a9be8 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -6,6 +6,7 @@ #include <linux/atomic.h> #include <linux/mm_types.h> #include <linux/pkeys.h> +#include <linux/sched/mm.h> #include <trace/events/tlb.h> @@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) #define enter_lazy_tlb enter_lazy_tlb extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk); +#ifdef CONFIG_MEMBARRIER +/* + * Ensure that a core serializing instruction is issued before returning + * to user-mode, if a SYNC_CORE was requested. x86 implements return to + * user-space through sysexit, sysrel, and sysretq, which are not core + * serializing. + * + * See the membarrier comment in finish_task_switch as to why this is done + * in exit_lazy_tlb. + */ +#define exit_lazy_tlb exit_lazy_tlb +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) +{ + /* Switching mm is serializing with write_cr3 */ + if (tsk->mm != mm) + return; + + if (likely(!(atomic_read(&mm->membarrier_state) & + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) + return; + + /* With PTI, we unconditionally serialize before running user code. */ + if (static_cpu_has(X86_FEATURE_PTI)) + return; + /* + * Return from interrupt and NMI is done through iret, which is core + * serializing. + */ + if (in_irq() || in_nmi()) + return; + sync_core(); +} +#endif + /* * Init a new mm. Used on mm copies, like at fork() * and on mm's that are brand-new, like at execve(). 
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h deleted file mode 100644 index c67caafd3381..000000000000 --- a/arch/x86/include/asm/sync_core.h +++ /dev/null @@ -1,28 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _ASM_X86_SYNC_CORE_H -#define _ASM_X86_SYNC_CORE_H - -#include <linux/preempt.h> -#include <asm/processor.h> -#include <asm/cpufeature.h> - -/* - * Ensure that a core serializing instruction is issued before returning - * to user-mode. x86 implements return to user-space through sysexit, - * sysrel, and sysretq, which are not core serializing. - */ -static inline void sync_core_before_usermode(void) -{ - /* With PTI, we unconditionally serialize before running user code. */ - if (static_cpu_has(X86_FEATURE_PTI)) - return; - /* - * Return from interrupt and NMI is done through iret, which is core - * serializing. - */ - if (in_irq() || in_nmi()) - return; - sync_core(); -} - -#endif /* _ASM_X86_SYNC_CORE_H */ diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 480a4d1b7dd8..9b026264b445 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -7,7 +7,6 @@ #include <linux/sched.h> #include <linux/mm_types.h> #include <linux/gfp.h> -#include <linux/sync_core.h> /* * Routines for handling mm_structs @@ -364,16 +363,6 @@ enum { #include <asm/membarrier.h> #endif -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ - if (current->mm != mm) - return; - if (likely(!(atomic_read(&mm->membarrier_state) & - MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) - return; - sync_core_before_usermode(); -} - extern void membarrier_exec_mmap(struct mm_struct *mm); #else @@ -387,9 +376,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev, static inline void membarrier_exec_mmap(struct mm_struct *mm) { } -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ -} #endif #endif /* _LINUX_SCHED_MM_H */ diff --git 
a/include/linux/sync_core.h b/include/linux/sync_core.h deleted file mode 100644 index 013da4b8b327..000000000000 --- a/include/linux/sync_core.h +++ /dev/null @@ -1,21 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _LINUX_SYNC_CORE_H -#define _LINUX_SYNC_CORE_H - -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE -#include <asm/sync_core.h> -#else -/* - * This is a dummy sync_core_before_usermode() implementation that can be used - * on all architectures which return to user-space through core serializing - * instructions. - * If your architecture returns to user-space through non-core-serializing - * instructions, you need to write your own functions. - */ -static inline void sync_core_before_usermode(void) -{ -} -#endif - -#endif /* _LINUX_SYNC_CORE_H */ - diff --git a/kernel/cpu.c b/kernel/cpu.c index 6ff2578ecf17..134688d79589 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu) /* * idle_task_exit() will have switched to &init_mm, now - * clean up any remaining active_mm state. + * clean up any remaining active_mm state. exit_lazy_tlb + * is not done, if an arch did any accounting in these + * functions it would have to be added. 
*/ if (mm != &init_mm) idle->active_mm = &init_mm; diff --git a/kernel/kthread.c b/kernel/kthread.c index e813d92f2eab..6f93c649aa97 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1251,9 +1251,9 @@ void kthread_use_mm(struct mm_struct *mm) finish_arch_post_lock_switch(); #endif + exit_lazy_tlb(active_mm, tsk); if (active_mm != mm) mmdrop(active_mm); - exit_lazy_tlb(active_mm, tsk); to_kthread(tsk)->oldfs = get_fs(); set_fs(USER_DS); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index debc917bc69b..31e22c79826c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3294,22 +3294,19 @@ static struct rq *finish_task_switch(struct task_struct *prev) kcov_finish_switch(current); fire_sched_in_preempt_notifiers(current); + /* * When switching through a kernel thread, the loop in * membarrier_{private,global}_expedited() may have observed that * kernel thread and not issued an IPI. It is therefore possible to * schedule between user->kernel->user threads without passing though - * switch_mm(). Membarrier requires a barrier after storing to - * rq->curr, before returning to userspace, so provide them here: - * - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly - * provided by mmdrop(), - * - a sync_core for SYNC_CORE. + * switch_mm(). Membarrier requires a full barrier after storing to + * rq->curr, before returning to userspace, for + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). */ - if (mm) { - membarrier_mm_sync_core_before_usermode(mm); + if (mm) mmdrop(mm); - } + if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) prev->sched_class->task_dead(prev); @@ -6292,6 +6289,7 @@ void idle_task_exit(void) BUG_ON(current != this_rq()->idle); if (mm != &init_mm) { + /* enter_lazy_tlb is not done because we're about to go down */ switch_mm(mm, &init_mm, current); finish_arch_post_lock_switch(); } -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-10 1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin @ 2020-07-10 9:42 ` Peter Zijlstra 2020-07-10 14:02 ` Mathieu Desnoyers 2020-07-10 17:04 ` Andy Lutomirski 2 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2020-07-10 9:42 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, x86, Mathieu Desnoyers, Arnd Bergmann, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard On Fri, Jul 10, 2020 at 11:56:43AM +1000, Nicholas Piggin wrote: > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. Hurmph, I know I've been staring at this at some point. I think I meant to have a TIF to force the IRET path in the case of MEMBAR_SYNC_CORE. But I was discouraged by amluto. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-10 1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin 2020-07-10 9:42 ` Peter Zijlstra @ 2020-07-10 14:02 ` Mathieu Desnoyers 2020-07-10 17:04 ` Andy Lutomirski 2 siblings, 0 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-10 14:02 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, x86, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard ----- On Jul 9, 2020, at 9:56 PM, Nicholas Piggin npiggin@gmail.com wrote: > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. I agree that moving this logic to exit_lazy_tlb is much more modular and cleaner. 
Thanks, Mathieu

> [... rest of quoted patch trimmed; it duplicates the diff posted above ...]
inline void membarrier_mm_sync_core_before_usermode(struct mm_struct > *mm) > -{ > -} > #endif > > #endif /* _LINUX_SCHED_MM_H */ > diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h > deleted file mode 100644 > index 013da4b8b327..000000000000 > --- a/include/linux/sync_core.h > +++ /dev/null > @@ -1,21 +0,0 @@ > -/* SPDX-License-Identifier: GPL-2.0 */ > -#ifndef _LINUX_SYNC_CORE_H > -#define _LINUX_SYNC_CORE_H > - > -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE > -#include <asm/sync_core.h> > -#else > -/* > - * This is a dummy sync_core_before_usermode() implementation that can be used > - * on all architectures which return to user-space through core serializing > - * instructions. > - * If your architecture returns to user-space through non-core-serializing > - * instructions, you need to write your own functions. > - */ > -static inline void sync_core_before_usermode(void) > -{ > -} > -#endif > - > -#endif /* _LINUX_SYNC_CORE_H */ > - > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 6ff2578ecf17..134688d79589 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -572,7 +572,9 @@ static int finish_cpu(unsigned int cpu) > > /* > * idle_task_exit() will have switched to &init_mm, now > - * clean up any remaining active_mm state. > + * clean up any remaining active_mm state. exit_lazy_tlb > + * is not done, if an arch did any accounting in these > + * functions it would have to be added. 
> */ > if (mm != &init_mm) > idle->active_mm = &init_mm; > diff --git a/kernel/kthread.c b/kernel/kthread.c > index e813d92f2eab..6f93c649aa97 100644 > --- a/kernel/kthread.c > +++ b/kernel/kthread.c > @@ -1251,9 +1251,9 @@ void kthread_use_mm(struct mm_struct *mm) > finish_arch_post_lock_switch(); > #endif > > + exit_lazy_tlb(active_mm, tsk); > if (active_mm != mm) > mmdrop(active_mm); > - exit_lazy_tlb(active_mm, tsk); > > to_kthread(tsk)->oldfs = get_fs(); > set_fs(USER_DS); > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index debc917bc69b..31e22c79826c 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -3294,22 +3294,19 @@ static struct rq *finish_task_switch(struct task_struct > *prev) > kcov_finish_switch(current); > > fire_sched_in_preempt_notifiers(current); > + > /* > * When switching through a kernel thread, the loop in > * membarrier_{private,global}_expedited() may have observed that > * kernel thread and not issued an IPI. It is therefore possible to > * schedule between user->kernel->user threads without passing though > - * switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > - * > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * switch_mm(). Membarrier requires a full barrier after storing to > + * rq->curr, before returning to userspace, for > + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). 
> */ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > @@ -6292,6 +6289,7 @@ void idle_task_exit(void) > BUG_ON(current != this_rq()->idle); > > if (mm != &init_mm) { > + /* enter_lazy_tlb is not done because we're about to go down */ > switch_mm(mm, &init_mm, current); > finish_arch_post_lock_switch(); > } > -- > 2.23.0 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-10 1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin 2020-07-10 9:42 ` Peter Zijlstra 2020-07-10 14:02 ` Mathieu Desnoyers @ 2020-07-10 17:04 ` Andy Lutomirski 2020-07-13 4:45 ` Nicholas Piggin 2 siblings, 1 reply; 59+ messages in thread From: Andy Lutomirski @ 2020-07-10 17:04 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, X86 ML, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, LKML, linuxppc-dev, Linux-MM, Anton Blanchard On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > And get rid of the generic sync_core_before_usermode facility. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. > > Consider if an arch did not synchronize core in switch_mm either, then > membarrier_mm_sync_core_before_usermode would be in the wrong place > but arch specific mmu context functions would still be the right place. > There is also a exit_lazy_tlb case that is not covered by this call, which > could be a bugs (kthread use mm the membarrier process's mm then context > switch back to the process without switching mm or lazy mm switch). > > This makes lazy tlb code a bit more modular. The mm-switching and TLB-management has often had the regrettable property that it gets wired up in a way that seems to work at the time but doesn't have clear semantics, and I'm a bit concerned that this patch is in that category. If I'm understanding right, you're trying to enforce the property that exiting lazy TLB mode will promise to sync the core eventually. 
But this has all kinds of odd properties: - Why is exit_lazy_tlb() getting called at all in the relevant cases? When is it permissible to call it? I look at your new code and see: > +/* > + * Ensure that a core serializing instruction is issued before returning > + * to user-mode, if a SYNC_CORE was requested. x86 implements return to > + * user-space through sysexit, sysrel, and sysretq, which are not core > + * serializing. > + * > + * See the membarrier comment in finish_task_switch as to why this is done > + * in exit_lazy_tlb. > + */ > +#define exit_lazy_tlb exit_lazy_tlb > +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) > +{ > + /* Switching mm is serializing with write_cr3 */ > + if (tsk->mm != mm) > + return; And my brain says WTF? Surely you meant something like if (WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened? how do we try to recover well enough to get a crash logged at least? */ } So this needs actual documentation, preferably in comments near the function, of what the preconditions are and what this mm parameter is. Once that's done, then we could consider whether it's appropriate to have this function promise to sync the core under some conditions. - This whole structure seems to rely on the idea that switching mm syncs something. I periodically ask chip vendors for non-serializing mm switches. Specifically, in my dream world, we have totally separate user and kernel page tables. Changing out the user tables doesn't serialize or even create a fence. Instead it creates the minimum required pipeline hazard such that user memory access and switches to usermode will make sure they access memory through the correct page tables. I haven't convinced a chip vendor yet, but there are quite a few hundreds of cycles to be saved here. With this in mind, I see the fencing aspects of the TLB handling code as somewhat of an accident.
I'm fine with documenting them and using them to optimize other paths, but I think it should be explicit. For example: /* Also does a full barrier? (Or a sync_core()-style barrier.) However, if you rely on this, you must document it in a comment where you call this function. */ void switch_mm_irqs_off() { } This is kind of like how we strongly encourage anyone using smp_?mb() to document what they are fencing against. Also, as it stands, I can easily see in_irq() ceasing to promise to serialize. There are older kernels for which it does not promise to serialize. And I have plans to make it stop serializing in the nearish future. --Andy ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-10 17:04 ` Andy Lutomirski @ 2020-07-13 4:45 ` Nicholas Piggin 2020-07-13 13:47 ` Nicholas Piggin 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-13 4:45 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Mathieu Desnoyers, Peter Zijlstra, X86 ML Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: > On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> And get rid of the generic sync_core_before_usermode facility. >> >> This helper is the wrong way around I think. The idea that membarrier >> state requires a core sync before returning to user is the easy one >> that does not need hiding behind membarrier calls. The gap in core >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the >> tricky detail that is better put in x86 lazy tlb code. >> >> Consider if an arch did not synchronize core in switch_mm either, then >> membarrier_mm_sync_core_before_usermode would be in the wrong place >> but arch specific mmu context functions would still be the right place. >> There is also a exit_lazy_tlb case that is not covered by this call, which >> could be a bugs (kthread use mm the membarrier process's mm then context >> switch back to the process without switching mm or lazy mm switch). >> >> This makes lazy tlb code a bit more modular. > > The mm-switching and TLB-management has often had the regrettable > property that it gets wired up in a way that seems to work at the time > but doesn't have clear semantics, and I'm a bit concerned that this > patch is in that category. It's much more explicit in the core code about where hooks are called after this patch. 
And then the x86 membarrier implementation details are contained to the x86 code where they belong, and we don't have the previous hook with unclear semantics missing from core code. > If I'm understanding right, you're trying > to enforce the property that exiting lazy TLB mode will promise to > sync the core eventually. But this has all kinds of odd properties: > > - Why is exit_lazy_tlb() getting called at all in the relevant cases? It's a property of how MEMBARRIER_SYNC_CORE is implemented by arch/x86, see the membarrier comment in finish_task_switch (for analogous reason). > When is it permissible to call it? Comment for the asm-generic code says it's to be called when the lazy active mm becomes non-lazy. > I look at your new code and see: > >> +/* >> + * Ensure that a core serializing instruction is issued before returning >> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to >> + * user-space through sysexit, sysrel, and sysretq, which are not core >> + * serializing. >> + * >> + * See the membarrier comment in finish_task_switch as to why this is done >> + * in exit_lazy_tlb. >> + */ >> +#define exit_lazy_tlb exit_lazy_tlb >> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) >> +{ >> + /* Switching mm is serializing with write_cr3 */ >> + if (tsk->mm != mm) >> + return; > > And my brain says WTF? Surely you meant something like if > (WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened? how do > we try to recover well enough to get a crashed logged at least? */ } No, the active mm can be unlazied by switching to a different mm. > So this needs actual documentation, preferably in comments near the > function, of what the preconditions are and what this mm parameter is. > Once that's done, then we could consider whether it's appropriate to > have this function promise to sync the core under some conditions. It's documented in generic code. 
I prefer not to duplicate comments too much but I can add a "refer to asm-generic version for usage" or something if you'd like? > - This whole structure seems to rely on the idea that switching mm > syncs something. Which whole structure? The x86 implementation of sync core explicitly does rely on that, yes. But I've pulled that out of core code with this patch. > I periodically ask chip vendor for non-serializing > mm switches. Specifically, in my dream world, we have totally > separate user and kernel page tables. Changing out the user tables > doesn't serialize or even create a fence. Instead it creates the > minimum required pipeline hazard such that user memory access and > switches to usermode will make sure they access memory through the > correct page tables. I haven't convinced a chip vendor yet, but there > are quite a few hundreds of cycles to be saved here. The fundamental difficulty is that the kernel can still access user mappings any time after the switch. We can probably handwave ways around it by serializing lazily when encountering the next user access and hoping that most of your mm switches result in a kernel exit that serializes or some other unavoidable serialization point, so you can avoid the one at the mm switch. In practice it sounds like a lot of trouble. But anyway the sync core could presumably be adjusted or reasoned to still be correct, depending on how it works. > With this in > mind, I see the fencing aspects of the TLB handling code as somewhat > of an accident. I'm fine with documenting them and using them to > optimize other paths, but I think it should be explicit. For example: > > /* Also does a full barrier? (Or a sync_core()-style barrier.) > However, if you rely on this, you must document it in a comment where > you call this function. *? > void switch_mm_irqs_off() > { > } > > This is kind of like how we strongly encourage anyone using smp_?mb() > to document what they are fencing against. Hmm.
I don't think anything outside core scheduler/arch code is allowed to assume that, because they don't really know if schedule() will cause a switch. Hopefully nobody does, I would agree it shouldn't be encouraged. It is pretty fundamental to how we do task CPU migration so I don't see it ever going away. A push model where the source CPU has to release tasks that it last ran before they can be run elsewhere is unworkable. (Or maybe it's not, but there's no getting around that it would require careful audits of said low-level code.) > Also, as it stands, I can easily see in_irq() ceasing to promise to > serialize. There are older kernels for which it does not promise to > serialize. And I have plans to make it stop serializing in the > nearish future. You mean x86's return from interrupt? Sounds fun... you'll know where to update the membarrier sync code, at least :) Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-13 4:45 ` Nicholas Piggin @ 2020-07-13 13:47 ` Nicholas Piggin 2020-07-13 14:13 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-13 13:47 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Mathieu Desnoyers, Peter Zijlstra, X86 ML Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: > Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >> Also, as it stands, I can easily see in_irq() ceasing to promise to >> serialize. There are older kernels for which it does not promise to >> serialize. And I have plans to make it stop serializing in the >> nearish future. > > You mean x86's return from interrupt? Sounds fun... you'll konw where to > update the membarrier sync code, at least :) Oh, I should actually say Mathieu recently clarified a return from interrupt doesn't fundamentally need to serialize in order to support membarrier sync core. https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html So you may not need to do anything more if you relaxed it. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-13 13:47 ` Nicholas Piggin @ 2020-07-13 14:13 ` Mathieu Desnoyers 2020-07-13 15:48 ` Andy Lutomirski 2020-07-16 4:15 ` Nicholas Piggin 0 siblings, 2 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-13 14:13 UTC (permalink / raw) To: Nicholas Piggin Cc: Andy Lutomirski, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Peter Zijlstra, x86 ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>> serialize. There are older kernels for which it does not promise to >>> serialize. And I have plans to make it stop serializing in the >>> nearish future. >> >> You mean x86's return from interrupt? Sounds fun... you'll konw where to >> update the membarrier sync code, at least :) > > Oh, I should actually say Mathieu recently clarified a return from > interrupt doesn't fundamentally need to serialize in order to support > membarrier sync core. Clarification to your statement: Return from interrupt to kernel code does not need to be context serializing as long as kernel serializes before returning to user-space. However, return from interrupt to user-space needs to be context serializing. Thanks, Mathieu > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html > > So you may not need to do anything more if you relaxed it. > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-13 14:13 ` Mathieu Desnoyers @ 2020-07-13 15:48 ` Andy Lutomirski 2020-07-13 16:37 ` Nicholas Piggin 2020-07-16 4:15 ` Nicholas Piggin 1 sibling, 1 reply; 59+ messages in thread From: Andy Lutomirski @ 2020-07-13 15:48 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, Andy Lutomirski, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Peter Zijlstra, x86 On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > > > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: > >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: > >>> Also, as it stands, I can easily see in_irq() ceasing to promise to > >>> serialize. There are older kernels for which it does not promise to > >>> serialize. And I have plans to make it stop serializing in the > >>> nearish future. > >> > >> You mean x86's return from interrupt? Sounds fun... you'll konw where to > >> update the membarrier sync code, at least :) > > > > Oh, I should actually say Mathieu recently clarified a return from > > interrupt doesn't fundamentally need to serialize in order to support > > membarrier sync core. > > Clarification to your statement: > > Return from interrupt to kernel code does not need to be context serializing > as long as kernel serializes before returning to user-space. > > However, return from interrupt to user-space needs to be context serializing. > Indeed, and I figured this out on the first read through because I'm quite familiar with the x86 entry code. But Nick somehow missed this, and Nick is the one who wrote the patch. Nick, I think this helps prove my point. The code you're submitting may well be correct, but it's unmaintainable. 
At the very least, this needs a comment explaining, from the perspective of x86, *exactly* what exit_lazy_tlb() is promising, why it's promising it, how it achieves that promise, and what code cares about it. Or we could do something with TIF flags and make this all less magical, although that will probably end up very slightly slower. --Andy ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-13 15:48 ` Andy Lutomirski @ 2020-07-13 16:37 ` Nicholas Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-13 16:37 UTC (permalink / raw) To: Andy Lutomirski, Mathieu Desnoyers Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Peter Zijlstra, x86 Excerpts from Andy Lutomirski's message of July 14, 2020 1:48 am: > On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >> > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >> >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >> >>> serialize. There are older kernels for which it does not promise to >> >>> serialize. And I have plans to make it stop serializing in the >> >>> nearish future. >> >> >> >> You mean x86's return from interrupt? Sounds fun... you'll konw where to >> >> update the membarrier sync code, at least :) >> > >> > Oh, I should actually say Mathieu recently clarified a return from >> > interrupt doesn't fundamentally need to serialize in order to support >> > membarrier sync core. >> >> Clarification to your statement: >> >> Return from interrupt to kernel code does not need to be context serializing >> as long as kernel serializes before returning to user-space. >> >> However, return from interrupt to user-space needs to be context serializing. >> > > Indeed, and I figured this out on the first read through because I'm > quite familiar with the x86 entry code. But Nick somehow missed this, > and Nick is the one who wrote the patch. > > Nick, I think this helps prove my point. The code you're submitting > may well be correct, but it's unmaintainable. It's not. 
The patch I wrote for x86 is a no-op, it just moves the existing x86 hook and code that's already there to a different name. Actually it's not quite a no-op: it changes it to use hooks that are actually called in the right places. Because previously it was unmaintainable from the point of view of the generic mm code -- it was not clear at all that the old one should have been called in other places where the mm goes non-lazy. Now with the exit_lazy_tlb hook, it can quite easily be spotted where it is missing. And x86 keeps its membarrier code in x86, and uses nice, well-defined lazy tlb mm hooks. > At the very least, this > needs a comment explaining, from the perspective of x86, *exactly* > what exit_lazy_tlb() is promising, why it's promising it, how it > achieves that promise, and what code cares about it. Or we could do > something with TIF flags and make this all less magical, although that > will probably end up very slightly slower. It's all documented there in existing comments plus the asm-generic exit_lazy_tlb specification that was added, AFAIKS. Is the membarrier comment in finish_task_switch plus these ones not enough? Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-13 14:13 ` Mathieu Desnoyers 2020-07-13 15:48 ` Andy Lutomirski @ 2020-07-16 4:15 ` Nicholas Piggin 2020-07-16 4:42 ` Nicholas Piggin 2020-07-16 5:18 ` Andy Lutomirski 1 sibling, 2 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-16 4:15 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: > ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: > >> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>> serialize. There are older kernels for which it does not promise to >>>> serialize. And I have plans to make it stop serializing in the >>>> nearish future. >>> >>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>> update the membarrier sync code, at least :) >> >> Oh, I should actually say Mathieu recently clarified a return from >> interrupt doesn't fundamentally need to serialize in order to support >> membarrier sync core. > > Clarification to your statement: > > Return from interrupt to kernel code does not need to be context serializing > as long as kernel serializes before returning to user-space. > > However, return from interrupt to user-space needs to be context serializing. Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb in the right places. 
A kernel thread does a use_mm, then it blocks and the user process with the same mm runs on that CPU, and then it calls into the kernel, blocks, the kernel thread runs again, another CPU issues a membarrier which does not IPI this one because it's running a kthread, and then the kthread switches back to the user process (still without having unused the mm), and then the user process returns from syscall without having done a core synchronising instruction.

The cause of the problem is you want to avoid IPI'ing kthreads. Why? I'm guessing it really only matters as an optimisation in case of idle threads. The idle thread is easy (well, easier) because it won't use_mm, so you could check for rq->curr == rq->idle in your loop (in a suitable sched accessor function).

But... I'm not really liking this subtlety in the scheduler for all this (the scheduler still needs the barriers when switching out of idle). Can it be improved somehow? Let me forget the x86 core sync problem for now (that _may_ be a bit harder), and step back and look at what we're doing. The memory barrier case would actually suffer from the same problem as core sync, because in the same situation it has no implicit mmdrop in the scheduler switch code either.

So what are we doing with membarrier? We want any activity caused by the set of CPUs/threads specified that can be observed by this thread before calling membarrier to be appropriately fenced from activity that can be observed to happen after the call returns.

CPU0                     CPU1
                         1. user stuff
a. membarrier()          2. enter kernel
b. read rq->curr         3. rq->curr switched to kthread
c. is kthread, skip IPI  4. switch_to kthread
d. return to user        5. rq->curr switched to user thread
                         6. switch_to user thread
                         7. exit kernel
                         8. more user stuff

As far as I can see, the problem is CPU1 might reorder step 5 and step 8, so you have the mmdrop of the lazy mm be a mb after step 6. But why? The membarrier call only cares that there is a full barrier between 1 and 8, right?
Which it will get from the previous context switch to the kthread. I must say the memory barrier comments in membarrier could be improved a bit (unless I'm missing where the main comment is). It's fine to know what barriers pair with one another, but we need to know which exact memory accesses it is ordering /* * Matches memory barriers around rq->curr modification in * scheduler. */ Sure, but it doesn't say what else is being ordered. I think it's just the user memory accesses, but would be nice to make that a bit more explicit. If we had such comments then we might know this case is safe. I think the funny powerpc barrier is a similar case of this. If we ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that CPU has or will have issued a memory barrier between running user code. So AFAIKS all this membarrier stuff in kernel/sched/core.c could just go away. Except x86 because thread switch doesn't imply core sync, so CPU1 between 1 and 8 may never issue a core sync instruction the same way a context switch must be a full mb. Before getting to x86 -- Am I right, or way off track here? Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
  2020-07-16  4:15 ` Nicholas Piggin
@ 2020-07-16  4:42   ` Nicholas Piggin
  2020-07-16 15:46     ` Mathieu Desnoyers
  2020-07-16  5:18   ` Andy Lutomirski
  1 sibling, 1 reply; 59+ messages in thread
From: Nicholas Piggin @ 2020-07-16  4:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel,
	linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86

Excerpts from Nicholas Piggin's message of July 16, 2020 2:15 pm:
> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote:
>>
>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>>>> serialize. There are older kernels for which it does not promise to
>>>>> serialize. And I have plans to make it stop serializing in the
>>>>> nearish future.
>>>>
>>>> You mean x86's return from interrupt? Sounds fun... you'll know where to
>>>> update the membarrier sync code, at least :)
>>>
>>> Oh, I should actually say Mathieu recently clarified a return from
>>> interrupt doesn't fundamentally need to serialize in order to support
>>> membarrier sync core.
>>
>> Clarification to your statement:
>>
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as kernel serializes before returning to user-space.
>>
>> However, return from interrupt to user-space needs to be context serializing.
>
> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
> in the right places.
> > A kernel thread does a use_mm, then it blocks and the user process with > the same mm runs on that CPU, and then it calls into the kernel, blocks, > the kernel thread runs again, another CPU issues a membarrier which does > not IPI this one because it's running a kthread, and then the kthread > switches back to the user process (still without having unused the mm), > and then the user process returns from syscall without having done a > core synchronising instruction. > > The cause of the problem is you want to avoid IPI'ing kthreads. Why? > I'm guessing it really only matters as an optimisation in case of idle > threads. Idle thread is easy (well, easier) because it won't use_mm, so > you could check for rq->curr == rq->idle in your loop (in a suitable > sched accessor function). > > But... I'm not really liking this subtlety in the scheduler for all this > (the scheduler still needs the barriers when switching out of idle). > > Can it be improved somehow? Let me forget x86 core sync problem for now > (that _may_ be a bit harder), and step back and look at what we're doing. > The memory barrier case would actually suffer from the same problem as > core sync, because in the same situation it has no implicit mmdrop in > the scheduler switch code either. > > So what are we doing with membarrier? We want any activity caused by the > set of CPUs/threads specified that can be observed by this thread before > calling membarrier is appropriately fenced from activity that can be > observed to happen after the call returns. > > CPU0 CPU1 > 1. user stuff > a. membarrier() 2. enter kernel > b. read rq->curr 3. rq->curr switched to kthread > c. is kthread, skip IPI 4. switch_to kthread > d. return to user 5. rq->curr switched to user thread > 6. switch_to user thread > 7. exit kernel > 8. more user stuff > > As far as I can see, the problem is CPU1 might reorder step 5 and step > 8, so you have mmdrop of lazy mm be a mb after step 6. > > But why? 
> The membarrier call only cares that there is a full barrier
> between 1 and 8, right? Which it will get from the previous context
> switch to the kthread.

I should be more complete here, especially since I was complaining
about unclear barrier comment :)

	CPU0                     CPU1
	a. user stuff            1. user stuff
	b. membarrier()          2. enter kernel
	c. smp_mb()              3. smp_mb__after_spinlock(); // in __schedule
	d. read rq->curr         4. rq->curr switched to kthread
	e. is kthread, skip IPI  5. switch_to kthread
	f. return to user        6. rq->curr switched to user thread
	g. user stuff            7. switch_to user thread
	                         8. exit kernel
	                         9. more user stuff

What you're really ordering is a, g vs 1, 9 right?

In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
etc.

Userspace does not care where the barriers are exactly or what kernel
memory accesses might be being ordered by them, so long as there is a
mb somewhere between a and g, and 1 and 9. Right?

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 4:42 ` Nicholas Piggin @ 2020-07-16 15:46 ` Mathieu Desnoyers 2020-07-16 16:03 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-16 15:46 UTC (permalink / raw) To: Nicholas Piggin Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > I should be more complete here, especially since I was complaining > about unclear barrier comment :) > > > CPU0 CPU1 > a. user stuff 1. user stuff > b. membarrier() 2. enter kernel > c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > d. read rq->curr 4. rq->curr switched to kthread > e. is kthread, skip IPI 5. switch_to kthread > f. return to user 6. rq->curr switched to user thread > g. user stuff 7. switch_to user thread > 8. exit kernel > 9. more user stuff > > What you're really ordering is a, g vs 1, 9 right? > > In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > etc. > > Userspace does not care where the barriers are exactly or what kernel > memory accesses might be being ordered by them, so long as there is a > mb somewhere between a and g, and 1 and 9. Right? This is correct. Note that the accesses to user-space memory can be done either by user-space code or kernel code, it doesn't matter. However, in order to be considered as happening before/after either membarrier or the matching compiler barrier, kernel code needs to have causality relationship with user-space execution, e.g. user-space does a system call, or returns from a system call. In the case of io_uring, submitting a request or returning from waiting on request completion appear to provide this causality relationship. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. 
http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 15:46 ` Mathieu Desnoyers @ 2020-07-16 16:03 ` Mathieu Desnoyers 2020-07-16 18:58 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-16 16:03 UTC (permalink / raw) To: Nicholas Piggin, paulmck Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> I should be more complete here, especially since I was complaining >> about unclear barrier comment :) >> >> >> CPU0 CPU1 >> a. user stuff 1. user stuff >> b. membarrier() 2. enter kernel >> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> d. read rq->curr 4. rq->curr switched to kthread >> e. is kthread, skip IPI 5. switch_to kthread >> f. return to user 6. rq->curr switched to user thread >> g. user stuff 7. switch_to user thread >> 8. exit kernel >> 9. more user stuff >> >> What you're really ordering is a, g vs 1, 9 right? >> >> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> etc. >> >> Userspace does not care where the barriers are exactly or what kernel >> memory accesses might be being ordered by them, so long as there is a >> mb somewhere between a and g, and 1 and 9. Right? > > This is correct. Actually, sorry, the above is not quite right. It's been a while since I looked into the details of membarrier. The smp_mb() at the beginning of membarrier() needs to be paired with a smp_mb() _after_ rq->curr is switched back to the user thread, so the memory barrier is between store to rq->curr and following user-space accesses. 
The smp_mb() at the end of membarrier() needs to be paired with the smp_mb__after_spinlock() at the beginning of schedule, which is between accesses to userspace memory and switching rq->curr to kthread. As to *why* this ordering is needed, I'd have to dig through additional scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thanks, Mathieu > Note that the accesses to user-space memory can be > done either by user-space code or kernel code, it doesn't matter. > However, in order to be considered as happening before/after > either membarrier or the matching compiler barrier, kernel code > needs to have causality relationship with user-space execution, > e.g. user-space does a system call, or returns from a system call. > > In the case of io_uring, submitting a request or returning from waiting > on request completion appear to provide this causality relationship. > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 16:03 ` Mathieu Desnoyers @ 2020-07-16 18:58 ` Mathieu Desnoyers 2020-07-16 21:24 ` Alan Stern 2020-07-17 0:00 ` Nicholas Piggin 0 siblings, 2 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-16 18:58 UTC (permalink / raw) To: Nicholas Piggin, paulmck, Alan Stern Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > mathieu.desnoyers@efficios.com wrote: > >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >>> I should be more complete here, especially since I was complaining >>> about unclear barrier comment :) >>> >>> >>> CPU0 CPU1 >>> a. user stuff 1. user stuff >>> b. membarrier() 2. enter kernel >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>> d. read rq->curr 4. rq->curr switched to kthread >>> e. is kthread, skip IPI 5. switch_to kthread >>> f. return to user 6. rq->curr switched to user thread >>> g. user stuff 7. switch_to user thread >>> 8. exit kernel >>> 9. more user stuff >>> >>> What you're really ordering is a, g vs 1, 9 right? >>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>> etc. >>> >>> Userspace does not care where the barriers are exactly or what kernel >>> memory accesses might be being ordered by them, so long as there is a >>> mb somewhere between a and g, and 1 and 9. Right? >> >> This is correct. > > Actually, sorry, the above is not quite right. It's been a while > since I looked into the details of membarrier. 
>
> The smp_mb() at the beginning of membarrier() needs to be paired with a
> smp_mb() _after_ rq->curr is switched back to the user thread, so the
> memory barrier is between store to rq->curr and following user-space
> accesses.
>
> The smp_mb() at the end of membarrier() needs to be paired with the
> smp_mb__after_spinlock() at the beginning of schedule, which is
> between accesses to userspace memory and switching rq->curr to kthread.
>
> As to *why* this ordering is needed, I'd have to dig through additional
> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thinking further about this, I'm beginning to consider that maybe we
have been overly cautious by requiring memory barriers before and after
the store to rq->curr.

If CPU0 observes CPU1's rq->curr->mm differing from its own process
(current) while running the membarrier system call, it necessarily means
that CPU1 had to issue smp_mb__after_spinlock when entering the
scheduler, between any user-space loads/stores and the update of
rq->curr.

Requiring a memory barrier between the update of rq->curr (back to the
current process's thread) and following user-space memory accesses does
not seem to guarantee anything more than what the initial barrier at the
beginning of __schedule already provides, because the guarantees are
only about accesses to user-space memory.

Therefore, with the memory barrier at the beginning of __schedule, just
observing that CPU1's rq->curr differs from current should guarantee
that a memory barrier was issued between any sequentially consistent
instructions belonging to the current process on CPU1.

Or am I missing/misremembering an important point here ?

Thanks,

Mathieu

>
> Thanks,
>
> Mathieu
>
>
>> Note that the accesses to user-space memory can be
>> done either by user-space code or kernel code, it doesn't matter.
>> However, in order to be considered as happening before/after >> either membarrier or the matching compiler barrier, kernel code >> needs to have causality relationship with user-space execution, >> e.g. user-space does a system call, or returns from a system call. >> >> In the case of io_uring, submitting a request or returning from waiting >> on request completion appear to provide this causality relationship. >> >> Thanks, >> >> Mathieu >> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> http://www.efficios.com > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 18:58 ` Mathieu Desnoyers @ 2020-07-16 21:24 ` Alan Stern 2020-07-17 13:39 ` Mathieu Desnoyers 2020-07-17 0:00 ` Nicholas Piggin 1 sibling, 1 reply; 59+ messages in thread From: Alan Stern @ 2020-07-16 21:24 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: > ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > > > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > > mathieu.desnoyers@efficios.com wrote: > > > >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > >>> I should be more complete here, especially since I was complaining > >>> about unclear barrier comment :) > >>> > >>> > >>> CPU0 CPU1 > >>> a. user stuff 1. user stuff > >>> b. membarrier() 2. enter kernel > >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > >>> d. read rq->curr 4. rq->curr switched to kthread > >>> e. is kthread, skip IPI 5. switch_to kthread > >>> f. return to user 6. rq->curr switched to user thread > >>> g. user stuff 7. switch_to user thread > >>> 8. exit kernel > >>> 9. more user stuff > >>> > >>> What you're really ordering is a, g vs 1, 9 right? > >>> > >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > >>> etc. > >>> > >>> Userspace does not care where the barriers are exactly or what kernel > >>> memory accesses might be being ordered by them, so long as there is a > >>> mb somewhere between a and g, and 1 and 9. Right? > >> > >> This is correct. > > > > Actually, sorry, the above is not quite right. It's been a while > > since I looked into the details of membarrier. 
> > > > The smp_mb() at the beginning of membarrier() needs to be paired with a > > smp_mb() _after_ rq->curr is switched back to the user thread, so the > > memory barrier is between store to rq->curr and following user-space > > accesses. > > > > The smp_mb() at the end of membarrier() needs to be paired with the > > smp_mb__after_spinlock() at the beginning of schedule, which is > > between accesses to userspace memory and switching rq->curr to kthread. > > > > As to *why* this ordering is needed, I'd have to dig through additional > > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? > > Thinking further about this, I'm beginning to consider that maybe we have been > overly cautious by requiring memory barriers before and after store to rq->curr. > > If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) > while running the membarrier system call, it necessarily means that CPU1 had > to issue smp_mb__after_spinlock when entering the scheduler, between any user-space > loads/stores and update of rq->curr. > > Requiring a memory barrier between update of rq->curr (back to current process's > thread) and following user-space memory accesses does not seem to guarantee > anything more than what the initial barrier at the beginning of __schedule already > provides, because the guarantees are only about accesses to user-space memory. > > Therefore, with the memory barrier at the beginning of __schedule, just observing that > CPU1's rq->curr differs from current should guarantee that a memory barrier was issued > between any sequentially consistent instructions belonging to the current process on > CPU1. > > Or am I missing/misremembering an important point here ? Is it correct to say that the switch_to operations in 5 and 7 include memory barriers? If they do, then skipping the IPI should be okay. 
The reason is as follows: The guarantee you need to enforce is that
anything written by CPU0 before the membarrier() will be visible to CPU1
after it returns to user mode. Let's say that a writes to X and 9
reads from X.

Then we have an instance of the Store Buffer pattern:

	CPU0                CPU1
	a. Write X          6. Write rq->curr for user thread
	c. smp_mb()         7. switch_to memory barrier
	d. Read rq->curr    9. Read X

In this pattern, the memory barriers make it impossible for both reads
to miss their corresponding writes. Since d does fail to read 6 (it
sees the earlier value stored by 4), 9 must read a.

The other guarantee you need is that g on CPU0 will observe anything
written by CPU1 in 1. This is easier to see, using the fact that 3 is
a memory barrier and d reads from 4.

Alan Stern

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 21:24 ` Alan Stern @ 2020-07-17 13:39 ` Mathieu Desnoyers 2020-07-17 14:51 ` Alan Stern 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-17 13:39 UTC (permalink / raw) To: Alan Stern Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> mathieu.desnoyers@efficios.com wrote: >> >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> > mathieu.desnoyers@efficios.com wrote: >> > >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >>> I should be more complete here, especially since I was complaining >> >>> about unclear barrier comment :) >> >>> >> >>> >> >>> CPU0 CPU1 >> >>> a. user stuff 1. user stuff >> >>> b. membarrier() 2. enter kernel >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >>> f. return to user 6. rq->curr switched to user thread >> >>> g. user stuff 7. switch_to user thread >> >>> 8. exit kernel >> >>> 9. more user stuff >> >>> >> >>> What you're really ordering is a, g vs 1, 9 right? >> >>> >> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> >>> etc. >> >>> >> >>> Userspace does not care where the barriers are exactly or what kernel >> >>> memory accesses might be being ordered by them, so long as there is a >> >>> mb somewhere between a and g, and 1 and 9. Right? >> >> >> >> This is correct. >> > >> > Actually, sorry, the above is not quite right. It's been a while >> > since I looked into the details of membarrier. 
>> > >> > The smp_mb() at the beginning of membarrier() needs to be paired with a >> > smp_mb() _after_ rq->curr is switched back to the user thread, so the >> > memory barrier is between store to rq->curr and following user-space >> > accesses. >> > >> > The smp_mb() at the end of membarrier() needs to be paired with the >> > smp_mb__after_spinlock() at the beginning of schedule, which is >> > between accesses to userspace memory and switching rq->curr to kthread. >> > >> > As to *why* this ordering is needed, I'd have to dig through additional >> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? >> >> Thinking further about this, I'm beginning to consider that maybe we have been >> overly cautious by requiring memory barriers before and after store to rq->curr. >> >> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process >> (current) >> while running the membarrier system call, it necessarily means that CPU1 had >> to issue smp_mb__after_spinlock when entering the scheduler, between any >> user-space >> loads/stores and update of rq->curr. >> >> Requiring a memory barrier between update of rq->curr (back to current process's >> thread) and following user-space memory accesses does not seem to guarantee >> anything more than what the initial barrier at the beginning of __schedule >> already >> provides, because the guarantees are only about accesses to user-space memory. >> >> Therefore, with the memory barrier at the beginning of __schedule, just >> observing that >> CPU1's rq->curr differs from current should guarantee that a memory barrier was >> issued >> between any sequentially consistent instructions belonging to the current >> process on >> CPU1. >> >> Or am I missing/misremembering an important point here ? > > Is it correct to say that the switch_to operations in 5 and 7 include > memory barriers? If they do, then skipping the IPI should be okay. 
> > The reason is as follows: The guarantee you need to enforce is that > anything written by CPU0 before the membarrier() will be visible to CPU1 > after it returns to user mode. Let's say that a writes to X and 9 > reads from X. > > Then we have an instance of the Store Buffer pattern: > > CPU0 CPU1 > a. Write X 6. Write rq->curr for user thread > c. smp_mb() 7. switch_to memory barrier > d. Read rq->curr 9. Read X > > In this pattern, the memory barriers make it impossible for both reads > to miss their corresponding writes. Since d does fail to read 6 (it > sees the earlier value stored by 4), 9 must read a. > > The other guarantee you need is that g on CPU0 will observe anything > written by CPU1 in 1. This is easier to see, using the fact that 3 is a > memory barrier and d reads from 4. Right, and Nick's reply involving pairs of loads/stores on each side clarifies the situation even further. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 13:39 ` Mathieu Desnoyers @ 2020-07-17 14:51 ` Alan Stern 2020-07-17 15:39 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Alan Stern @ 2020-07-17 14:51 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: > ----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: > > > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: > >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers > >> mathieu.desnoyers@efficios.com wrote: > >> > >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > >> > mathieu.desnoyers@efficios.com wrote: > >> > > >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: > >> >>> I should be more complete here, especially since I was complaining > >> >>> about unclear barrier comment :) > >> >>> > >> >>> > >> >>> CPU0 CPU1 > >> >>> a. user stuff 1. user stuff > >> >>> b. membarrier() 2. enter kernel > >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > >> >>> d. read rq->curr 4. rq->curr switched to kthread > >> >>> e. is kthread, skip IPI 5. switch_to kthread > >> >>> f. return to user 6. rq->curr switched to user thread > >> >>> g. user stuff 7. switch_to user thread > >> >>> 8. exit kernel > >> >>> 9. more user stuff ... > >> Requiring a memory barrier between update of rq->curr (back to current process's > >> thread) and following user-space memory accesses does not seem to guarantee > >> anything more than what the initial barrier at the beginning of __schedule > >> already > >> provides, because the guarantees are only about accesses to user-space memory. ... 
> > Is it correct to say that the switch_to operations in 5 and 7 include
> > memory barriers? If they do, then skipping the IPI should be okay.
> >
> > The reason is as follows: The guarantee you need to enforce is that
> > anything written by CPU0 before the membarrier() will be visible to CPU1
> > after it returns to user mode. Let's say that a writes to X and 9
> > reads from X.
> >
> > Then we have an instance of the Store Buffer pattern:
> >
> > 	CPU0                CPU1
> > 	a. Write X          6. Write rq->curr for user thread
> > 	c. smp_mb()         7. switch_to memory barrier
> > 	d. Read rq->curr    9. Read X
> >
> > In this pattern, the memory barriers make it impossible for both reads
> > to miss their corresponding writes. Since d does fail to read 6 (it
> > sees the earlier value stored by 4), 9 must read a.
> >
> > The other guarantee you need is that g on CPU0 will observe anything
> > written by CPU1 in 1. This is easier to see, using the fact that 3 is a
> > memory barrier and d reads from 4.
>
> Right, and Nick's reply involving pairs of loads/stores on each side
> clarifies the situation even further.

The key part of my reply was the question: "Is it correct to say that
the switch_to operations in 5 and 7 include memory barriers?" From the
text quoted above and from Nick's reply, it seems clear that they do
not.

I agree with Nick: A memory barrier is needed somewhere between the
assignment at 6 and the return to user mode at 8. Otherwise you end up
with the Store Buffer pattern having a memory barrier on only one side,
and it is well known that this arrangement does not guarantee any
ordering.

One thing I don't understand about all this: Any context switch has to
include a memory barrier somewhere, but both you and Nick seem to be
saying that steps 6 and 7 don't include (or don't need) any memory
barriers. What am I missing?

Alan Stern

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 14:51 ` Alan Stern @ 2020-07-17 15:39 ` Mathieu Desnoyers 2020-07-17 16:11 ` Alan Stern 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-17 15:39 UTC (permalink / raw) To: Alan Stern Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 17, 2020, at 10:51 AM, Alan Stern stern@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 16, 2020, at 5:24 PM, Alan Stern stern@rowland.harvard.edu wrote: >> >> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> >> mathieu.desnoyers@efficios.com wrote: >> >> >> >> > ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> >> > mathieu.desnoyers@efficios.com wrote: >> >> > >> >> >> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >> >>> I should be more complete here, especially since I was complaining >> >> >>> about unclear barrier comment :) >> >> >>> >> >> >>> >> >> >>> CPU0 CPU1 >> >> >>> a. user stuff 1. user stuff >> >> >>> b. membarrier() 2. enter kernel >> >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >> >>> f. return to user 6. rq->curr switched to user thread >> >> >>> g. user stuff 7. switch_to user thread >> >> >>> 8. exit kernel >> >> >>> 9. more user stuff > > ... 
> >> >> Requiring a memory barrier between update of rq->curr (back to current process's >> >> thread) and following user-space memory accesses does not seem to guarantee >> >> anything more than what the initial barrier at the beginning of __schedule >> >> already >> >> provides, because the guarantees are only about accesses to user-space memory. > > ... > >> > Is it correct to say that the switch_to operations in 5 and 7 include >> > memory barriers? If they do, then skipping the IPI should be okay. >> > >> > The reason is as follows: The guarantee you need to enforce is that >> > anything written by CPU0 before the membarrier() will be visible to CPU1 >> > after it returns to user mode. Let's say that a writes to X and 9 >> > reads from X. >> > >> > Then we have an instance of the Store Buffer pattern: >> > >> > CPU0 CPU1 >> > a. Write X 6. Write rq->curr for user thread >> > c. smp_mb() 7. switch_to memory barrier >> > d. Read rq->curr 9. Read X >> > >> > In this pattern, the memory barriers make it impossible for both reads >> > to miss their corresponding writes. Since d does fail to read 6 (it >> > sees the earlier value stored by 4), 9 must read a. >> > >> > The other guarantee you need is that g on CPU0 will observe anything >> > written by CPU1 in 1. This is easier to see, using the fact that 3 is a >> > memory barrier and d reads from 4. >> >> Right, and Nick's reply involving pairs of loads/stores on each side >> clarifies the situation even further. > > The key part of my reply was the question: "Is it correct to say that > the switch_to operations in 5 and 7 include memory barriers?" From the > text quoted above and from Nick's reply, it seems clear that they do > not. I remember that switch_mm implies it, but not switch_to. The scenario that triggered this discussion is when the scheduler does a lazy tlb entry/exit, which is basically switch from a user task to a kernel thread without changing the mm, and usually switching back afterwards. 
This optimization means the rq->curr mm temporarily differs, which
prevents IPIs from being sent by membarrier, but without involving a
switch_mm. This requires explicit memory barriers either on entry/exit
of lazy tlb mode, or explicit barriers in the scheduler for those
special cases.

> I agree with Nick: A memory barrier is needed somewhere between the
> assignment at 6 and the return to user mode at 8. Otherwise you end up
> with the Store Buffer pattern having a memory barrier on only one side,
> and it is well known that this arrangement does not guarantee any
> ordering.

Yes, I see this now. I'm still trying to wrap my head around why the
memory barrier at the end of membarrier() needs to be paired with a
scheduler barrier though.

> One thing I don't understand about all this: Any context switch has to
> include a memory barrier somewhere, but both you and Nick seem to be
> saying that steps 6 and 7 don't include (or don't need) any memory
> barriers. What am I missing?

All context switches have the smp_mb__after_spinlock() at the beginning
of __schedule(), which I suspect is what you refer to. However this
barrier is before the store to rq->curr, not after.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 59+ messages in thread
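The lazy tlb special case described above can be sketched schematically. This is C-flavoured pseudocode only, heavily simplified and not the actual kernel code paths; `exit_lazy_tlb` is the hook proposed by this patch series, and the claim that `switch_mm` implies a barrier is taken from the discussion above, not verified per-architecture:

```
__schedule():
    rq_lock(rq)
    smp_mb__after_spinlock()       /* full mb: orders prev's user accesses
                                      BEFORE the rq->curr update */
    ...
    rq->curr = next                /* membarrier reads this remotely */
    if (!next->mm):                /* switching to a kernel thread */
        next->active_mm = prev->active_mm    /* keep mm lazily */
        enter_lazy_tlb(prev->active_mm, next)   /* no switch_mm() */
    else:
        if (next->mm == prev->active_mm):
            /* kthread -> user thread of the same mm: no switch_mm(),
               hence no implied barrier -- the problematic path */
        else:
            switch_mm(prev->active_mm, next->mm, next)  /* implies mb
                                                           per discussion */
        if (!prev->mm):            /* prev was a lazy tlb kthread */
            exit_lazy_tlb(prev->active_mm, next)
            /* proposed spot for the barrier / core sync needed before
               the user thread returns to user mode */
    switch_to(prev, next)          /* no mb implied here (see above) */
```

The point of the sketch is positional: the only guaranteed barrier (`smp_mb__after_spinlock`) sits before the store to `rq->curr`, so the same-mm path from kthread back to user thread has nothing after that store unless `exit_lazy_tlb` supplies it.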
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 15:39 ` Mathieu Desnoyers @ 2020-07-17 16:11 ` Alan Stern 2020-07-17 16:22 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Alan Stern @ 2020-07-17 16:11 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 > > I agree with Nick: A memory barrier is needed somewhere between the > > assignment at 6 and the return to user mode at 8. Otherwise you end up > > with the Store Buffer pattern having a memory barrier on only one side, > > and it is well known that this arrangement does not guarantee any > > ordering. > > Yes, I see this now. I'm still trying to wrap my head around why the memory > barrier at the end of membarrier() needs to be paired with a scheduler > barrier though. The memory barrier at the end of membarrier() on CPU0 is necessary in order to enforce the guarantee that any writes occurring on CPU1 before the membarrier() is executed will be visible to any code executing on CPU0 after the membarrier(). Ignoring the kthread issue, we can have: CPU0 CPU1 x = 1 barrier() y = 1 r2 = y membarrier(): a: smp_mb() b: send IPI IPI-induced mb c: smp_mb() r1 = x The writes to x and y are unordered by the hardware, so it's possible to have r2 = 1 even though the write to x doesn't execute until b. If the memory barrier at c is omitted then "r1 = x" can be reordered before b (although not before a), so we get r1 = 0. This violates the guarantee that membarrier() is supposed to provide. The timing of the memory barrier at c has to ensure that it executes after the IPI-induced memory barrier on CPU1. If it happened before then we could still end up with r1 = 0. That's why the pairing matters. I hope this helps your head get properly wrapped. 
:-) Alan Stern ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 16:11 ` Alan Stern @ 2020-07-17 16:22 ` Mathieu Desnoyers 2020-07-17 17:44 ` Alan Stern 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-17 16:22 UTC (permalink / raw) To: Alan Stern Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: >> > I agree with Nick: A memory barrier is needed somewhere between the >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> > with the Store Buffer pattern having a memory barrier on only one side, >> > and it is well known that this arrangement does not guarantee any >> > ordering. >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> barrier at the end of membarrier() needs to be paired with a scheduler >> barrier though. > > The memory barrier at the end of membarrier() on CPU0 is necessary in > order to enforce the guarantee that any writes occurring on CPU1 before > the membarrier() is executed will be visible to any code executing on > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > CPU0 CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > b: send IPI IPI-induced mb > c: smp_mb() > r1 = x > > The writes to x and y are unordered by the hardware, so it's possible to > have r2 = 1 even though the write to x doesn't execute until b. If the > memory barrier at c is omitted then "r1 = x" can be reordered before b > (although not before a), so we get r1 = 0. This violates the guarantee > that membarrier() is supposed to provide. > > The timing of the memory barrier at c has to ensure that it executes > after the IPI-induced memory barrier on CPU1. If it happened before > then we could still end up with r1 = 0. 
That's why the pairing matters. > > I hope this helps your head get properly wrapped. :-) It does help a bit! ;-) This explains this part of the comment near the smp_mb at the end of membarrier: * Memory barrier on the caller thread _after_ we finished * waiting for the last IPI. [...] However, it does not explain why it needs to be paired with a barrier in the scheduler, clearly for the case where the IPI is skipped. I wonder whether this part of the comment is factually correct: * [...] Matches memory barriers around rq->curr modification in scheduler. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 16:22 ` Mathieu Desnoyers @ 2020-07-17 17:44 ` Alan Stern 2020-07-17 17:52 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Alan Stern @ 2020-07-17 17:44 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: > ----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: > > >> > I agree with Nick: A memory barrier is needed somewhere between the > >> > assignment at 6 and the return to user mode at 8. Otherwise you end up > >> > with the Store Buffer pattern having a memory barrier on only one side, > >> > and it is well known that this arrangement does not guarantee any > >> > ordering. > >> > >> Yes, I see this now. I'm still trying to wrap my head around why the memory > >> barrier at the end of membarrier() needs to be paired with a scheduler > >> barrier though. > > > > The memory barrier at the end of membarrier() on CPU0 is necessary in > > order to enforce the guarantee that any writes occurring on CPU1 before > > the membarrier() is executed will be visible to any code executing on > > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > > > CPU0 CPU1 > > x = 1 > > barrier() > > y = 1 > > r2 = y > > membarrier(): > > a: smp_mb() > > b: send IPI IPI-induced mb > > c: smp_mb() > > r1 = x > > > > The writes to x and y are unordered by the hardware, so it's possible to > > have r2 = 1 even though the write to x doesn't execute until b. If the > > memory barrier at c is omitted then "r1 = x" can be reordered before b > > (although not before a), so we get r1 = 0. This violates the guarantee > > that membarrier() is supposed to provide. 
> > > > The timing of the memory barrier at c has to ensure that it executes > > after the IPI-induced memory barrier on CPU1. If it happened before > > then we could still end up with r1 = 0. That's why the pairing matters. > > > > I hope this helps your head get properly wrapped. :-) > > It does help a bit! ;-) > > This explains this part of the comment near the smp_mb at the end of membarrier: > > * Memory barrier on the caller thread _after_ we finished > * waiting for the last IPI. [...] > > However, it does not explain why it needs to be paired with a barrier in the > scheduler, clearly for the case where the IPI is skipped. I wonder whether this part > of the comment is factually correct: > > * [...] Matches memory barriers around rq->curr modification in scheduler. The reasoning is pretty much the same as above: CPU0 CPU1 x = 1 barrier() y = 1 r2 = y membarrier(): a: smp_mb() switch to kthread (includes mb) b: read rq->curr == kthread switch to user (includes mb) c: smp_mb() r1 = x Once again, it is possible that x = 1 doesn't become visible to CPU0 until shortly before b. But if c is omitted then "r1 = x" can be reordered before b (to any time after a), so we can have r1 = 0. Here the timing requirement is that c executes after the first memory barrier on CPU1 -- which is one of the ones around the rq->curr modification. (In fact, in this scenario CPU1's switch back to the user process is irrelevant.) Alan Stern ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-17 17:44 ` Alan Stern @ 2020-07-17 17:52 ` Mathieu Desnoyers 0 siblings, 0 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-17 17:52 UTC (permalink / raw) To: Alan Stern Cc: Nicholas Piggin, paulmck, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 17, 2020, at 1:44 PM, Alan Stern stern@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 17, 2020, at 12:11 PM, Alan Stern stern@rowland.harvard.edu wrote: >> >> >> > I agree with Nick: A memory barrier is needed somewhere between the >> >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> >> > with the Store Buffer pattern having a memory barrier on only one side, >> >> > and it is well known that this arrangement does not guarantee any >> >> > ordering. >> >> >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> >> barrier at the end of membarrier() needs to be paired with a scheduler >> >> barrier though. >> > >> > The memory barrier at the end of membarrier() on CPU0 is necessary in >> > order to enforce the guarantee that any writes occurring on CPU1 before >> > the membarrier() is executed will be visible to any code executing on >> > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: >> > >> > CPU0 CPU1 >> > x = 1 >> > barrier() >> > y = 1 >> > r2 = y >> > membarrier(): >> > a: smp_mb() >> > b: send IPI IPI-induced mb >> > c: smp_mb() >> > r1 = x >> > >> > The writes to x and y are unordered by the hardware, so it's possible to >> > have r2 = 1 even though the write to x doesn't execute until b. If the >> > memory barrier at c is omitted then "r1 = x" can be reordered before b >> > (although not before a), so we get r1 = 0. 
This violates the guarantee >> > that membarrier() is supposed to provide. >> > >> > The timing of the memory barrier at c has to ensure that it executes >> > after the IPI-induced memory barrier on CPU1. If it happened before >> > then we could still end up with r1 = 0. That's why the pairing matters. >> > >> > I hope this helps your head get properly wrapped. :-) >> >> It does help a bit! ;-) >> >> This explains this part of the comment near the smp_mb at the end of membarrier: >> >> * Memory barrier on the caller thread _after_ we finished >> * waiting for the last IPI. [...] >> >> However, it does not explain why it needs to be paired with a barrier in the >> scheduler, clearly for the case where the IPI is skipped. I wonder whether this >> part >> of the comment is factually correct: >> >> * [...] Matches memory barriers around rq->curr modification in scheduler. > > The reasoning is pretty much the same as above: > > CPU0 CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > switch to kthread (includes mb) > b: read rq->curr == kthread > switch to user (includes mb) > c: smp_mb() > r1 = x > > Once again, it is possible that x = 1 doesn't become visible to CPU0 > until shortly before b. But if c is omitted then "r1 = x" can be > reordered before b (to any time after a), so we can have r1 = 0. > > Here the timing requirement is that c executes after the first memory > barrier on CPU1 -- which is one of the ones around the rq->curr > modification. (In fact, in this scenario CPU1's switch back to the user > process is irrelevant.) That indeed covers the last scenario I was wondering about. Thanks Alan! Mathieu > > Alan Stern -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 18:58 ` Mathieu Desnoyers 2020-07-16 21:24 ` Alan Stern @ 2020-07-17 0:00 ` Nicholas Piggin 1 sibling, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-17 0:00 UTC (permalink / raw) To: Mathieu Desnoyers, paulmck, Alan Stern Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 Excerpts from Mathieu Desnoyers's message of July 17, 2020 4:58 am: > ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote: > >> ----- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> mathieu.desnoyers@efficios.com wrote: >> >>> ----- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npiggin@gmail.com wrote: >>>> I should be more complete here, especially since I was complaining >>>> about unclear barrier comment :) >>>> >>>> >>>> CPU0 CPU1 >>>> a. user stuff 1. user stuff >>>> b. membarrier() 2. enter kernel >>>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>>> d. read rq->curr 4. rq->curr switched to kthread >>>> e. is kthread, skip IPI 5. switch_to kthread >>>> f. return to user 6. rq->curr switched to user thread >>>> g. user stuff 7. switch_to user thread >>>> 8. exit kernel >>>> 9. more user stuff >>>> >>>> What you're really ordering is a, g vs 1, 9 right? >>>> >>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>>> etc. >>>> >>>> Userspace does not care where the barriers are exactly or what kernel >>>> memory accesses might be being ordered by them, so long as there is a >>>> mb somewhere between a and g, and 1 and 9. Right? >>> >>> This is correct. >> >> Actually, sorry, the above is not quite right. It's been a while >> since I looked into the details of membarrier. 
>> >> The smp_mb() at the beginning of membarrier() needs to be paired with a >> smp_mb() _after_ rq->curr is switched back to the user thread, so the >> memory barrier is between store to rq->curr and following user-space >> accesses. >> >> The smp_mb() at the end of membarrier() needs to be paired with the >> smp_mb__after_spinlock() at the beginning of schedule, which is >> between accesses to userspace memory and switching rq->curr to kthread. >> >> As to *why* this ordering is needed, I'd have to dig through additional >> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? > > Thinking further about this, I'm beginning to consider that maybe we have been > overly cautious by requiring memory barriers before and after store to rq->curr. > > If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) > while running the membarrier system call, it necessarily means that CPU1 had > to issue smp_mb__after_spinlock when entering the scheduler, between any user-space > loads/stores and update of rq->curr. > > Requiring a memory barrier between update of rq->curr (back to current process's > thread) and following user-space memory accesses does not seem to guarantee > anything more than what the initial barrier at the beginning of __schedule already > provides, because the guarantees are only about accesses to user-space memory. > > Therefore, with the memory barrier at the beginning of __schedule, just observing that > CPU1's rq->curr differs from current should guarantee that a memory barrier was issued > between any sequentially consistent instructions belonging to the current process on > CPU1. > > Or am I missing/misremembering an important point here ? I might have mislead you. CPU0 CPU1 r1=y x=1 membarrier() y=1 r2=x membarrier provides if r1==1 then r2==1 (right?) CPU0 r1=y membarrier() smp_mb(); t = cpu_rq(1)->curr; if (t->mm == mm) IPI(CPU1); smp_mb() r2=x vs CPU1 ... 
__schedule() smp_mb__after_spinlock() rq->curr = kthread ... __schedule() smp_mb__after_spinlock() rq->curr = user thread exit kernel x=1 y=1 Now these last 3 stores are not ordered, so CPU0 might see y==1 but rq->curr == kthread, right? Then it will skip the IPI and stores to x and y will not be ordered. So we do need a mb after rq->curr store when mm is switching. I believe for the global membarrier PF_KTHREAD optimisation, we also need a barrier when switching from a kernel thread to user, for the same reason. So I think I was wrong to say the barrier is not necessary. I haven't quite worked out why two mb()s are required in membarrier(), but at least that's less of a performance concern. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
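The "two mb()s" Nick mentions can be located against a simplified sketch of membarrier's private-expedited path. This is pseudocode paraphrasing kernel/sched/membarrier.c around v5.8, not the verbatim kernel code, with the pairing each barrier is expected to provide noted inline:

```c
/* Pseudocode sketch -- simplified paraphrase, not the actual kernel source. */
static int membarrier_private_expedited(void)
{
	smp_mb();	/* A: orders the caller's prior accesses before the
			 * rq->curr reads below; pairs with the barrier CPU1
			 * issues before storing rq->curr
			 * (smp_mb__after_spinlock() in __schedule()). */
	for_each_online_cpu(cpu) {
		p = rcu_dereference(cpu_rq(cpu)->curr);
		if (p && p->mm == current->mm)
			ipi_mb(cpu);		/* remote smp_mb() */
		/* If rq->curr is a kthread (lazy tlb), the IPI is skipped:
		 * correctness then depends on a barrier between the later
		 * rq->curr store (back to a user thread) and that thread's
		 * subsequent user accesses -- the barrier Nick argues must
		 * exist after the store, not only before it. */
	}
	smp_mb();	/* B: orders the rq->curr reads / IPI completion
			 * before the caller's subsequent accesses; pairs
			 * with the remote barrier (IPI or scheduler). */
	return 0;
}
```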
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 4:15 ` Nicholas Piggin 2020-07-16 4:42 ` Nicholas Piggin @ 2020-07-16 5:18 ` Andy Lutomirski 2020-07-16 6:06 ` Nicholas Piggin 2020-07-16 8:50 ` Peter Zijlstra 1 sibling, 2 replies; 59+ messages in thread From: Andy Lutomirski @ 2020-07-16 5:18 UTC (permalink / raw) To: Nicholas Piggin Cc: Mathieu Desnoyers, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Peter Zijlstra, x86 > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: >> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >> >>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>>> serialize. There are older kernels for which it does not promise to >>>>> serialize. And I have plans to make it stop serializing in the >>>>> nearish future. >>>> >>>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>>> update the membarrier sync code, at least :) >>> >>> Oh, I should actually say Mathieu recently clarified a return from >>> interrupt doesn't fundamentally need to serialize in order to support >>> membarrier sync core. >> >> Clarification to your statement: >> >> Return from interrupt to kernel code does not need to be context serializing >> as long as kernel serializes before returning to user-space. >> >> However, return from interrupt to user-space needs to be context serializing. > > Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb > in the right places. 
> > A kernel thread does a use_mm, then it blocks and the user process with > the same mm runs on that CPU, and then it calls into the kernel, blocks, > the kernel thread runs again, another CPU issues a membarrier which does > not IPI this one because it's running a kthread, and then the kthread > switches back to the user process (still without having unused the mm), > and then the user process returns from syscall without having done a > core synchronising instruction. > > The cause of the problem is you want to avoid IPI'ing kthreads. Why? > I'm guessing it really only matters as an optimisation in case of idle > threads. Idle thread is easy (well, easier) because it won't use_mm, so > you could check for rq->curr == rq->idle in your loop (in a suitable > sched accessor function). > > But... I'm not really liking this subtlety in the scheduler for all this > (the scheduler still needs the barriers when switching out of idle). > > Can it be improved somehow? Let me forget x86 core sync problem for now > (that _may_ be a bit harder), and step back and look at what we're doing. > The memory barrier case would actually suffer from the same problem as > core sync, because in the same situation it has no implicit mmdrop in > the scheduler switch code either. > > So what are we doing with membarrier? We want any activity caused by the > set of CPUs/threads specified that can be observed by this thread before > calling membarrier is appropriately fenced from activity that can be > observed to happen after the call returns. > > CPU0 CPU1 > 1. user stuff > a. membarrier() 2. enter kernel > b. read rq->curr 3. rq->curr switched to kthread > c. is kthread, skip IPI 4. switch_to kthread > d. return to user 5. rq->curr switched to user thread > 6. switch_to user thread > 7. exit kernel > 8. more user stuff > > As far as I can see, the problem is CPU1 might reorder step 5 and step > 8, so you have mmdrop of lazy mm be a mb after step 6. > > But why? 
The membarrier call only cares that there is a full barrier > between 1 and 8, right? Which it will get from the previous context > switch to the kthread. > > I must say the memory barrier comments in membarrier could be improved > a bit (unless I'm missing where the main comment is). It's fine to know > what barriers pair with one another, but we need to know which exact > memory accesses it is ordering > > /* > * Matches memory barriers around rq->curr modification in > * scheduler. > */ > > Sure, but it doesn't say what else is being ordered. I think it's just > the user memory accesses, but would be nice to make that a bit more > explicit. If we had such comments then we might know this case is safe. > > I think the funny powerpc barrier is a similar case of this. If we > ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that > CPU has or will have issued a memory barrier between running user > code. > > So AFAIKS all this membarrier stuff in kernel/sched/core.c could > just go away. Except x86 because thread switch doesn't imply core > sync, so CPU1 between 1 and 8 may never issue a core sync instruction > the same way a context switch must be a full mb. > > Before getting to x86 -- Am I right, or way off track here? I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupid expensive core sync that is heavier than smp_mb()? But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: int a[2]; Thread A: a[0] = 1; a[1] = 2: Thread B: write(fd, a, sizeof(a)); Doesn’t do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone. —Andy > > Thanks, > Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 5:18 ` Andy Lutomirski @ 2020-07-16 6:06 ` Nicholas Piggin 2020-07-16 8:50 ` Peter Zijlstra 1 sibling, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-16 6:06 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, x86 Excerpts from Andy Lutomirski's message of July 16, 2020 3:18 pm: > > >> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am: >>> ----- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npiggin@gmail.com wrote: >>> >>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >>>>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>>>>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>>>>> serialize. There are older kernels for which it does not promise to >>>>>> serialize. And I have plans to make it stop serializing in the >>>>>> nearish future. >>>>> >>>>> You mean x86's return from interrupt? Sounds fun... you'll konw where to >>>>> update the membarrier sync code, at least :) >>>> >>>> Oh, I should actually say Mathieu recently clarified a return from >>>> interrupt doesn't fundamentally need to serialize in order to support >>>> membarrier sync core. >>> >>> Clarification to your statement: >>> >>> Return from interrupt to kernel code does not need to be context serializing >>> as long as kernel serializes before returning to user-space. >>> >>> However, return from interrupt to user-space needs to be context serializing. >> >> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb >> in the right places. 
>> >> A kernel thread does a use_mm, then it blocks and the user process with >> the same mm runs on that CPU, and then it calls into the kernel, blocks, >> the kernel thread runs again, another CPU issues a membarrier which does >> not IPI this one because it's running a kthread, and then the kthread >> switches back to the user process (still without having unused the mm), >> and then the user process returns from syscall without having done a >> core synchronising instruction. >> >> The cause of the problem is you want to avoid IPI'ing kthreads. Why? >> I'm guessing it really only matters as an optimisation in case of idle >> threads. Idle thread is easy (well, easier) because it won't use_mm, so >> you could check for rq->curr == rq->idle in your loop (in a suitable >> sched accessor function). >> >> But... I'm not really liking this subtlety in the scheduler for all this >> (the scheduler still needs the barriers when switching out of idle). >> >> Can it be improved somehow? Let me forget x86 core sync problem for now >> (that _may_ be a bit harder), and step back and look at what we're doing. >> The memory barrier case would actually suffer from the same problem as >> core sync, because in the same situation it has no implicit mmdrop in >> the scheduler switch code either. >> >> So what are we doing with membarrier? We want any activity caused by the >> set of CPUs/threads specified that can be observed by this thread before >> calling membarrier is appropriately fenced from activity that can be >> observed to happen after the call returns. >> >> CPU0 CPU1 >> 1. user stuff >> a. membarrier() 2. enter kernel >> b. read rq->curr 3. rq->curr switched to kthread >> c. is kthread, skip IPI 4. switch_to kthread >> d. return to user 5. rq->curr switched to user thread >> 6. switch_to user thread >> 7. exit kernel >> 8. 
more user stuff >> >> As far as I can see, the problem is CPU1 might reorder step 5 and step >> 8, so you have mmdrop of lazy mm be a mb after step 6. >> >> But why? The membarrier call only cares that there is a full barrier >> between 1 and 8, right? Which it will get from the previous context >> switch to the kthread. >> >> I must say the memory barrier comments in membarrier could be improved >> a bit (unless I'm missing where the main comment is). It's fine to know >> what barriers pair with one another, but we need to know which exact >> memory accesses it is ordering >> >> /* >> * Matches memory barriers around rq->curr modification in >> * scheduler. >> */ >> >> Sure, but it doesn't say what else is being ordered. I think it's just >> the user memory accesses, but would be nice to make that a bit more >> explicit. If we had such comments then we might know this case is safe. >> >> I think the funny powerpc barrier is a similar case of this. If we >> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that >> CPU has or will have issued a memory barrier between running user >> code. >> >> So AFAIKS all this membarrier stuff in kernel/sched/core.c could >> just go away. Except x86 because thread switch doesn't imply core >> sync, so CPU1 between 1 and 8 may never issue a core sync instruction >> the same way a context switch must be a full mb. >> >> Before getting to x86 -- Am I right, or way off track here? > > I find it hard to believe that this is x86 only. Why would thread switch imply core sync on any architecture? Is x86 unique in having a stupid expensive core sync that is heavier than smp_mb()? It's not the thread switch but the return from kernel to user -- at least of architectures that implement membarrier SYNC_CORE, x86 can do that without serializing. 
The thread switch is muddying the waters a bit; it's not the actual thread switch we care about, that just happens to be used as a point where we try to catch the membarrier IPIs that were skipped due to the PF_KTHREAD optimisation. I think that doing said check in the lazy tlb exit code is both unnecessary for the memory ordering and insufficient for pipeline serialization. > But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: > > int a[2]; > > Thread A: > a[0] = 1; > a[1] = 2: > > Thread B: > > write(fd, a, sizeof(a)); > > Doesn’t do what thread A is expecting. Admittedly this particular example is nonsense, but maybe there are sensible cases that matter to someone. I think kernel accesses probably do matter (or at least they should by principle of least surprise). And so I was doubly misleading by labeling it as "user stuff". I should have distinguished between previous user or kernel accesses, as opposed to the kernel accesses specifically for the implementation of the membarrier call. So I think the membarrier code gets *that* part right (modulo what we have seen already) if the kernel access is being done from process context. But yes, if the access is coming from io_uring that has done kthread_use_mm, or some other random code running in a kernel thread working on get_user_pages memory or any similar shared vm_insert_pfn memory, then it goes completely to hell. So good catch, the PF_KTHREAD check is problematic there even if no actual users exist today. The rq->curr == rq->idle test might be better, but can we have interrupts writing completions into user memory? For performance I would hope so, so that makes even that test problematic. Maybe membarrier should close that gap entirely, and work around the performance issue by adding _USER_ONLY flags which explicitly only order user mode accesses vs other user accesses.
Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 5:18 ` Andy Lutomirski 2020-07-16 6:06 ` Nicholas Piggin @ 2020-07-16 8:50 ` Peter Zijlstra 2020-07-16 10:03 ` Nicholas Piggin 1 sibling, 1 reply; 59+ messages in thread From: Peter Zijlstra @ 2020-07-16 8:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Nicholas Piggin, Mathieu Desnoyers, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, x86 On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: > > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > CPU0 CPU1 > > 1. user stuff > > a. membarrier() 2. enter kernel > > b. read rq->curr 3. rq->curr switched to kthread > > c. is kthread, skip IPI 4. switch_to kthread > > d. return to user 5. rq->curr switched to user thread > > 6. switch_to user thread > > 7. exit kernel > > 8. more user stuff > I find it hard to believe that this is x86 only. Why would thread > switch imply core sync on any architecture? Is x86 unique in having a > stupid expensive core sync that is heavier than smp_mb()? smp_mb() is nowhere near the most expensive barrier we have in Linux; mb() might qualify, since it has completion requirements: it needs to serialize against external actors. On x86_64 things are rather murky, we have: LOCK prefix -- which implies smp_mb() before and after RmW LFENCE -- which used to be rmb like, until Spectre, and now it is ISYNC like. Since ISYNC ensures an empty pipeline, it also implies all loads are retired (and therefore complete), so it implies rmb. MFENCE -- which is a memop completion barrier like, it makes sure all previously issued memops are complete. If you read that carefully, you'll note you'll have to use LFENCE + MFENCE to order against non-memop instructions.
But none of them imply dumping the instruction decoder caches; that only happens on core serializing instructions like CR3 writes, IRET, CPUID and a few others. I think we recently got a SERIALIZE instruction to add to this list. On ARM64 there's a whole different set of barriers, and again smp_mb() isn't anywhere near the top of the list. They have roughly 3 classes: ISB -- instruction sync barrier DMB(x) -- memory ordering in domain x DSB(x) -- memory completion in domain x And they have at least 3 domains (IIRC), system, outer, inner. The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to have a SYNC, but since PowerPC is rare for only having one ridiculously heavy serializing instruction, we got to re-use the smp_mb() early in __schedule() instead, but ARM64 can't do that. So rather than say that x86 is special here, I'd say that PowerPC is special here. > But I’m wondering if all this deferred sync stuff is wrong. In the brave new world of io_uring and such, perhaps kernel access matter too. Heck, even: IIRC the membarrier SYNC_CORE use-case is about user-space self-modifying code. Userspace re-uses a text address and needs to SYNC_CORE before it can be sure the old text is forgotten. Nothing the kernel does matters there. I suppose the manpage could be clearer there. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 8:50 ` Peter Zijlstra @ 2020-07-16 10:03 ` Nicholas Piggin 2020-07-16 11:00 ` peterz 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-16 10:03 UTC (permalink / raw) To: Andy Lutomirski, Peter Zijlstra Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, x86 Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> > CPU0 CPU1 >> > 1. user stuff >> > a. membarrier() 2. enter kernel >> > b. read rq->curr 3. rq->curr switched to kthread >> > c. is kthread, skip IPI 4. switch_to kthread >> > d. return to user 5. rq->curr switched to user thread >> > 6. switch_to user thread >> > 7. exit kernel >> > 8. more user stuff > >> I find it hard to believe that this is x86 only. Why would thread >> switch imply core sync on any architecture? Is x86 unique in having a >> stupid expensive core sync that is heavier than smp_mb()? > > smp_mb() is nowhere near the most expensive barrier we have in Linux, > mb() might qualify, since that has some completion requirements since it > needs to serialize against external actors. > > On x86_64 things are rather murky, we have: > > LOCK prefix -- which implies smp_mb() before and after RmW > LFENCE -- which used to be rmb like, until Spectre, and now it > is ISYNC like. Since ISYNC ensures an empty pipeline, > it also implies all loads are retired (and therefore > complete) it implies rmb. > MFENCE -- which is a memop completion barrier like, it makes > sure all previously issued memops are complete. > > if you read that carefully, you'll note you'll have to use LFENCE + > MFENCE to order against non-memops instructions. 
>
> But none of them imply dumping the instruction decoder caches; that only
> happens on core serializing instructions like CR3 writes, IRET, CPUID
> and a few others. I think we recently got a SERIALIZE instruction to add
> to this list.
>
> On ARM64 there's a whole different set of barriers, and again
> smp_mb() isn't anywhere near the top of the list. They have roughly 3
> classes:
>
> ISB    -- instruction sync barrier
> DMB(x) -- memory ordering in domain x
> DSB(x) -- memory completion in domain x
>
> And they have at least 3 domains (IIRC): system, outer, inner.
>
> The ARM64 __switch_to() includes a dsb(sy), just like PowerPC used to
> have a SYNC, but since PowerPC is rare for only having one ridiculously
> heavy serializing instruction, we got to re-use the smp_mb() early in
> __schedule() instead, but ARM64 can't do that.
>
> So rather than say that x86 is special here, I'd say that PowerPC is
> special here.

PowerPC is "special", I'll agree with you there :)

It does have a SYNC (HWSYNC) instruction that is mb(). It does not serialize the core. ISYNC is a nop. ICBI ; ISYNC does serialize the core. The difference between them is probably much the same as the difference between MFENCE and CPUID on x86 CPUs.

Serializing the core is almost always pretty expensive. HWSYNC/MFENCE can be expensive if you have a lot of outstanding stores, or difficult ones (not exclusive in cache), with critical reads after the barrier, but it can also be somewhat cheap if there are few writes; once executed past, it only needs to hold up subsequent reads.

That said... implementation details. powerpc CPUs have traditionally had fairly costly HWSYNC.

>> But I’m wondering if all this deferred sync stuff is wrong. In the
>> brave new world of io_uring and such, perhaps kernel access matter
>> too. Heck, even:
>
> IIRC the membarrier SYNC_CORE use-case is about user-space
> self-modifying code.
> > Userspace re-uses a text address and needs to SYNC_CORE before it can be > sure the old text is forgotten. Nothing the kernel does matters there. > > I suppose the manpage could be more clear there. True, but memory ordering of kernel stores from kernel threads for regular mem barrier is the concern here. Does io_uring update completion queue from kernel thread or interrupt, for example? If it does, then membarrier will not order such stores with user memory accesses. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
2020-07-16 10:03 ` Nicholas Piggin @ 2020-07-16 11:00 ` peterz 2020-07-16 15:34 ` Mathieu Desnoyers 2020-07-16 23:26 ` Nicholas Piggin 0 siblings, 2 replies; 59+ messages in thread
From: peterz @ 2020-07-16 11:00 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Andy Lutomirski, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, x86

On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> >> But I’m wondering if all this deferred sync stuff is wrong. In the
> >> brave new world of io_uring and such, perhaps kernel access matter
> >> too. Heck, even:
> >
> > IIRC the membarrier SYNC_CORE use-case is about user-space
> > self-modifying code.
> >
> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
> > sure the old text is forgotten. Nothing the kernel does matters there.
> >
> > I suppose the manpage could be more clear there.
>
> True, but memory ordering of kernel stores from kernel threads for
> regular mem barrier is the concern here.
>
> Does io_uring update completion queue from kernel thread or interrupt,
> for example? If it does, then membarrier will not order such stores
> with user memory accesses.

So we're talking about regular membarrier() then? Not the SYNC_CORE variant per-se.

Even there, I'll argue we don't care, but perhaps Mathieu has a different opinion. All we care about is that all other threads (or CPUs for GLOBAL) observe an smp_mb() before it returns.

Any serialization against whatever those other threads/CPUs are running at the instant of the syscall is external to the syscall; we make no guarantees about that.
That is, we can fundamentally not say what another CPU is executing concurrently. Nor should we want to.

So if you feel that your membarrier() ought to serialize against remote execution, you need to arrange a quiescent state on the remote side yourself.

Now, normally membarrier() is used to implement userspace RCU-like things, and there all that matters is that the remote CPUs observe the beginning of the new grace period, i.e. the counter flip, and we observe their read-side critical sections, or something like that; it's been a while since I looked at all that.

It's always been the case that concurrent syscalls could change user memory; io_uring doesn't change that, it just makes it even less well defined when that would happen. If you want to serialize against that, you need to arrange that externally.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 11:00 ` peterz @ 2020-07-16 15:34 ` Mathieu Desnoyers 2020-07-16 23:26 ` Nicholas Piggin 1 sibling, 0 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-16 15:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Nicholas Piggin, Andy Lutomirski, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, x86 ----- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra peterz@infradead.org wrote: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. I agree with Peter that we don't care about accesses to user-space memory performed concurrently with membarrier. 
What we'd care about in terms of accesses to user-space memory from the kernel is something that would be clearly ordered as happening before or after the membarrier call, for instance a read(2) followed by membarrier(2) after the read returns, or a read(2) issued after return from membarrier(2). The other scenario we'd care about is with the compiler barrier paired with membarrier: e.g. read(2) returns, compiler barrier, followed by a store. Or load, compiler barrier, followed by write(2). All those scenarios imply before/after ordering wrt either membarrier or the compiler barrier.

I notice that io_uring has a "completion" queue. Let's try to come up with realistic usage scenarios. So the dependency chain would be provided by e.g.:

* Infrequent read / Frequent write, communicating read completion through variable X:

  wait for io_uring read request completion -> membarrier -> store X=1

  with matching

  load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit io_uring write request

or this other scenario:

* Frequent read / Infrequent write, communicating read completion through variable X:

  load from X (waiting for X==1) -> membarrier -> submit io_uring write request

  with matching

  wait for io_uring read request completion -> asm volatile (::: "memory") -> store X=1

Thanks, Mathieu

-- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 11:00 ` peterz 2020-07-16 15:34 ` Mathieu Desnoyers @ 2020-07-16 23:26 ` Nicholas Piggin 2020-07-17 13:42 ` Mathieu Desnoyers 1 sibling, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-16 23:26 UTC (permalink / raw) To: peterz Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Mathieu Desnoyers, x86 Excerpts from peterz@infradead.org's message of July 16, 2020 9:00 pm: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. Well, both but Andy in this case was wondering about kernel writes vs user. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. All we care about is that all other threads (or CPUs > for GLOBAL) observe an smp_mb() before it returns. 
>
> Any serialization against whatever those other threads/CPUs are running
> at the instant of the syscall is external to the syscall, we make no
> guarantees about that. That is, we can fundamentally not say what
> another CPU is executing concurrently. Nor should we want to.
>
> So if you feel that your membarrier() ought to serialize against remote
> execution, you need to arrange a quiescent state on the remote side
> yourself.
>
> Now, normally membarrier() is used to implement userspace RCU like
> things, and there all that matters is that the remote CPUs observe the
> beginning of the new grace-period, ie counter flip, and we observe their
> read-side critical sections, or something like that, it's been a while
> since I looked at all that.
>
> It's always been the case that concurrent syscalls could change user
> memory, io_uring doesn't change that, it just makes it even less well
> defined when that would happen. If you want to serialize against that,
> you need to arrange that externally.

membarrier does replace barrier instructions on remote CPUs, which do order accesses performed by the kernel on the user address space. So membarrier should too, I guess.

Normal process context accesses like read(2) will do so because they don't get filtered out from IPIs, but kernel threads using the mm may not.

Thanks, Nick

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-16 23:26 ` Nicholas Piggin @ 2020-07-17 13:42 ` Mathieu Desnoyers 2020-07-20 3:03 ` Nicholas Piggin 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-17 13:42 UTC (permalink / raw) To: Nicholas Piggin Cc: Peter Zijlstra, Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, x86 ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: [...] > > membarrier does replace barrier instructions on remote CPUs, which do > order accesses performed by the kernel on the user address space. So > membarrier should too I guess. > > Normal process context accesses like read(2) will do so because they > don't get filtered out from IPIs, but kernel threads using the mm may > not. But it should not be an issue, because membarrier's ordering is only with respect to submit and completion of io_uring requests, which are performed through system calls from the context of user-space threads, which are called from the right mm. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
2020-07-17 13:42 ` Mathieu Desnoyers @ 2020-07-20 3:03 ` Nicholas Piggin 2020-07-20 16:46 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread
From: Nicholas Piggin @ 2020-07-20 3:03 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, x86, Jens Axboe

Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote:
> [...]
>>
>> membarrier does replace barrier instructions on remote CPUs, which do
>> order accesses performed by the kernel on the user address space. So
>> membarrier should too I guess.
>>
>> Normal process context accesses like read(2) will do so because they
>> don't get filtered out from IPIs, but kernel threads using the mm may
>> not.
>
> But it should not be an issue, because membarrier's ordering is only with respect
> to submit and completion of io_uring requests, which are performed through
> system calls from the context of user-space threads, which are called from the
> right mm.

Is that true? Can io completions be written into an address space via a kernel thread? I don't know the io_uring code well, but it looks like that's asynchronously using the user mm context.

How about other memory accesses via kthread_use_mm? Presumably there is still an ordering requirement there for membarrier, so I really think it's a fragile interface with no real way for the user to know how kernel threads may use its mm for any particular reason, so membarrier should synchronize all possible kernel users as well.

Thanks, Nick

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
2020-07-20 3:03 ` Nicholas Piggin @ 2020-07-20 16:46 ` Mathieu Desnoyers 2020-07-21 10:04 ` Nicholas Piggin 0 siblings, 1 reply; 59+ messages in thread
From: Mathieu Desnoyers @ 2020-07-20 16:46 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, x86, Jens Axboe

----- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npiggin@gmail.com wrote:
> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
>> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote:
>> [...]
>>>
>>> membarrier does replace barrier instructions on remote CPUs, which do
>>> order accesses performed by the kernel on the user address space. So
>>> membarrier should too I guess.
>>>
>>> Normal process context accesses like read(2) will do so because they
>>> don't get filtered out from IPIs, but kernel threads using the mm may
>>> not.
>>
>> But it should not be an issue, because membarrier's ordering is only with
>> respect
>> to submit and completion of io_uring requests, which are performed through
>> system calls from the context of user-space threads, which are called from the
>> right mm.
>
> Is that true? Can io completions be written into an address space via a
> kernel thread? I don't know the io_uring code well but it looks like
> that's asynchronously using the user mm context.

Indeed, the io completion appears to be signaled asynchronously between kernel and user-space. Therefore, both kernel and userspace code need to have proper memory barriers in place to signal completion, otherwise user-space could read garbage after it notices completion of a read.
I did not review the entire io_uring implementation, but the publish side for completion appears to be:

static void __io_commit_cqring(struct io_ring_ctx *ctx)
{
	struct io_rings *rings = ctx->rings;

	/* order cqe stores with ring update */
	smp_store_release(&rings->cq.tail, ctx->cached_cq_tail);

	if (wq_has_sleeper(&ctx->cq_wait)) {
		wake_up_interruptible(&ctx->cq_wait);
		kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
	}
}

The store-release on tail should be paired with a load_acquire on the reader-side (it's called "read_barrier()" in the code), tools/io_uring/queue.c:

static int __io_uring_get_cqe(struct io_uring *ring,
			      struct io_uring_cqe **cqe_ptr, int wait)
{
	struct io_uring_cq *cq = &ring->cq;
	const unsigned mask = *cq->kring_mask;
	unsigned head;
	int ret;

	*cqe_ptr = NULL;
	head = *cq->khead;
	do {
		/*
		 * It's necessary to use a read_barrier() before reading
		 * the CQ tail, since the kernel updates it locklessly. The
		 * kernel has the matching store barrier for the update. The
		 * kernel also ensures that previous stores to CQEs are ordered
		 * with the tail update.
		 */
		read_barrier();
		if (head != *cq->ktail) {
			*cqe_ptr = &cq->cqes[head & mask];
			break;
		}
		if (!wait)
			break;
		ret = io_uring_enter(ring->ring_fd, 0, 1,
				     IORING_ENTER_GETEVENTS, NULL);
		if (ret < 0)
			return -errno;
	} while (1);

	return 0;
}

So as far as membarrier memory ordering dependencies are concerned, it relies on the store-release/load-acquire dependency chain in the completion queue to order against anything that was done prior to the completed requests.

What is in-flight while the requests are being serviced provides no memory ordering guarantee whatsoever.

> How about other memory accesses via kthread_use_mm? Presumably there is
> still ordering requirement there for membarrier,

Please provide an example case with memory accesses via kthread_use_mm where ordering matters to support your concern.
> so I really think > it's a fragile interface with no real way for the user to know how > kernel threads may use its mm for any particular reason, so membarrier > should synchronize all possible kernel users as well. I strongly doubt so, but perhaps something should be clarified in the documentation if you have that feeling. Thanks, Mathieu > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-20 16:46 ` Mathieu Desnoyers @ 2020-07-21 10:04 ` Nicholas Piggin 2020-07-21 13:11 ` Mathieu Desnoyers 2020-07-21 15:06 ` peterz 0 siblings, 2 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-21 10:04 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, x86 Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: > ----- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npiggin@gmail.com wrote: > >> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: >>> ----- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npiggin@gmail.com wrote: >>> [...] >>>> >>>> membarrier does replace barrier instructions on remote CPUs, which do >>>> order accesses performed by the kernel on the user address space. So >>>> membarrier should too I guess. >>>> >>>> Normal process context accesses like read(2) will do so because they >>>> don't get filtered out from IPIs, but kernel threads using the mm may >>>> not. >>> >>> But it should not be an issue, because membarrier's ordering is only with >>> respect >>> to submit and completion of io_uring requests, which are performed through >>> system calls from the context of user-space threads, which are called from the >>> right mm. >> >> Is that true? Can io completions be written into an address space via a >> kernel thread? I don't know the io_uring code well but it looks like >> that's asynchonously using the user mm context. > > Indeed, the io completion appears to be signaled asynchronously between kernel > and user-space. Yep, many other places do similar with use_mm. 
[snip] > So as far as membarrier memory ordering dependencies are concerned, it relies > on the store-release/load-acquire dependency chain in the completion queue to > order against anything that was done prior to the completed requests. > > What is in-flight while the requests are being serviced provides no memory > ordering guarantee whatsoever. Yeah you're probably right in this case I think. Quite likely most kernel tasks that asynchronously write to user memory would at least have some kind of producer-consumer barriers. But is that restriction of all async modifications documented and enforced anywhere? >> How about other memory accesses via kthread_use_mm? Presumably there is >> still ordering requirement there for membarrier, > > Please provide an example case with memory accesses via kthread_use_mm where > ordering matters to support your concern. I think the concern Andy raised with io_uring was less a specific problem he saw and more a general concern that we have these memory accesses which are not synchronized with membarrier. >> so I really think >> it's a fragile interface with no real way for the user to know how >> kernel threads may use its mm for any particular reason, so membarrier >> should synchronize all possible kernel users as well. > > I strongly doubt so, but perhaps something should be clarified in the documentation > if you have that feeling. I'd rather go the other way and say if you have reasoning or numbers for why PF_KTHREAD is an important optimisation above rq->curr == rq->idle then we could think about keeping this subtlety with appropriate documentation added, otherwise we can just kill it and remove all doubt. That being said, the x86 sync core gap that I imagined could be fixed by changing to rq->curr == rq->idle test does not actually exist because the global membarrier does not have a sync core option. So fixing the exit_lazy_tlb points that this series does *should* fix that. 
So PF_KTHREAD may be less problematic than I thought from an implementation point of view, only in its semantics.

Thanks, Nick

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 10:04 ` Nicholas Piggin @ 2020-07-21 13:11 ` Mathieu Desnoyers 2020-07-21 14:30 ` Nicholas Piggin 2020-07-21 15:06 ` peterz 1 sibling, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-21 13:11 UTC (permalink / raw) To: Nicholas Piggin Cc: Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, x86 ----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: [...] > > Yeah you're probably right in this case I think. Quite likely most kernel > tasks that asynchronously write to user memory would at least have some > kind of producer-consumer barriers. > > But is that restriction of all async modifications documented and enforced > anywhere? > >>> How about other memory accesses via kthread_use_mm? Presumably there is >>> still ordering requirement there for membarrier, >> >> Please provide an example case with memory accesses via kthread_use_mm where >> ordering matters to support your concern. > > I think the concern Andy raised with io_uring was less a specific > problem he saw and more a general concern that we have these memory > accesses which are not synchronized with membarrier. > >>> so I really think >>> it's a fragile interface with no real way for the user to know how >>> kernel threads may use its mm for any particular reason, so membarrier >>> should synchronize all possible kernel users as well. >> >> I strongly doubt so, but perhaps something should be clarified in the >> documentation >> if you have that feeling. 
> I'd rather go the other way and say if you have reasoning or numbers for
> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
> then we could think about keeping this subtlety with appropriate
> documentation added, otherwise we can just kill it and remove all doubt.
>
> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

Today, the membarrier global expedited command explicitly skips kernel threads, but it happens that membarrier private expedited considers those with the same mm as targets for the IPI.

So we already implement a semantic which differs between private and global expedited membarriers. This can be explained in part by the fact that kthread_use_mm was introduced after 4.16, where the most recent membarrier commands were introduced. It seems that the effect on membarrier was not considered when kthread_use_mm was introduced.

Looking at membarrier(2) documentation, it states that IPIs are only sent to threads belonging to the same process as the calling thread. If my understanding of the notion of process is correct, this should rule out sending the IPI to kernel threads, given they are not "part" of the same process, only borrowing the mm. But I agree that the distinction is moot, and should be clarified.

Without a clear use-case to justify adding a constraint on membarrier, I am tempted to simply clarify documentation of current membarrier commands, stating clearly that they are not guaranteed to affect kernel threads.
Then, if we have a compelling use-case to implement a different behavior which covers kthreads, this could be added consistently across membarrier commands with a flag (or by adding new commands).

Does this approach make sense?

Thanks, Mathieu

-- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 13:11 ` Mathieu Desnoyers @ 2020-07-21 14:30 ` Nicholas Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-21 14:30 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, Peter Zijlstra, x86 Excerpts from Mathieu Desnoyers's message of July 21, 2020 11:11 pm: > ----- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npiggin@gmail.com wrote: > >> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: > [...] >> >> Yeah you're probably right in this case I think. Quite likely most kernel >> tasks that asynchronously write to user memory would at least have some >> kind of producer-consumer barriers. >> >> But is that restriction of all async modifications documented and enforced >> anywhere? >> >>>> How about other memory accesses via kthread_use_mm? Presumably there is >>>> still ordering requirement there for membarrier, >>> >>> Please provide an example case with memory accesses via kthread_use_mm where >>> ordering matters to support your concern. >> >> I think the concern Andy raised with io_uring was less a specific >> problem he saw and more a general concern that we have these memory >> accesses which are not synchronized with membarrier. >> >>>> so I really think >>>> it's a fragile interface with no real way for the user to know how >>>> kernel threads may use its mm for any particular reason, so membarrier >>>> should synchronize all possible kernel users as well. >>> >>> I strongly doubt so, but perhaps something should be clarified in the >>> documentation >>> if you have that feeling. 
>> I'd rather go the other way and say if you have reasoning or numbers for
>> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
>> then we could think about keeping this subtlety with appropriate
>> documentation added, otherwise we can just kill it and remove all doubt.
>>
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
>
> Today, the membarrier global expedited command explicitly skips kernel threads,
> but it happens that membarrier private expedited considers those with the
> same mm as target for the IPI.
>
> So we already implement a semantic which differs between private and global
> expedited membarriers.

Which is not a good thing.

> This can be explained in part by the fact that
> kthread_use_mm was introduced after 4.16, where the most recent membarrier
> commands were introduced. It seems that the effect on membarrier was not
> considered when kthread_use_mm was introduced.

No, it was just renamed; it used to be called use_mm and has been in the kernel for ~ever.

That you hadn't considered this actually lends weight to my point, which is that there's so much subtle behaviour that's easy to miss that we're better off with simpler and fewer special cases until it's proven they're needed. Not the other way around.

> Looking at membarrier(2) documentation, it states that IPIs are only sent to
> threads belonging to the same process as the calling thread. If my understanding
> of the notion of process is correct, this should rule out sending the IPI to
> kernel threads, given they are not "part" of the same process, only borrowing
> the mm.
> But I agree that the distinction is moot, and should be clarified. It does if you read it in a user-hostile legalistic way. The reality is userspace shouldn't and can't know about how the kernel might implement functionality. > Without a clear use-case to justify adding a constraint on membarrier, I am > tempted to simply clarify documentation of current membarrier commands, > stating clearly that they are not guaranteed to affect kernel threads. Then, > if we have a compelling use-case to implement a different behavior which covers > kthreads, this could be added consistently across membarrier commands with a > flag (or by adding new commands). > > Does this approach make sense ? The other position is that without a clear use case for PF_KTHREAD, seeing as async kernel accesses had not been considered before now, we limit the optimisation to only skipping the idle thread. I think that makes more sense (unless you have a reason for PF_KTHREAD, but it doesn't seem like there is much of one). Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 10:04 ` Nicholas Piggin 2020-07-21 13:11 ` Mathieu Desnoyers @ 2020-07-21 15:06 ` peterz 2020-07-21 15:15 ` Mathieu Desnoyers 1 sibling, 1 reply; 59+ messages in thread From: peterz @ 2020-07-21 15:06 UTC (permalink / raw) To: Nicholas Piggin Cc: Mathieu Desnoyers, Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, x86 On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: > That being said, the x86 sync core gap that I imagined could be fixed > by changing to rq->curr == rq->idle test does not actually exist because > the global membarrier does not have a sync core option. So fixing the > exit_lazy_tlb points that this series does *should* fix that. So > PF_KTHREAD may be less problematic than I thought from implementation > point of view, only semantics. So I've been trying to figure out where that PF_KTHREAD comes from, commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. So the first version: https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com appears to unconditionally send the IPI and checks p->mm in the IPI context, but then v2: https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com has the current code. But I've been unable to find the reason the 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. The comment doesn't really help either; sure we have the whole lazy mm thing, but that's ->active_mm, not ->mm. Possibly it is because {,un}use_mm() do not have sufficient barriers to make the remote p->mm test work? Or were we over-eager with the !p->mm doesn't imply kthread 'cleanups' at the time? Also, I just realized, I still have a fix for use_mm() now kthread_use_mm() that seems to have been lost. 
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 15:06 ` peterz @ 2020-07-21 15:15 ` Mathieu Desnoyers 2020-07-21 15:19 ` Peter Zijlstra 0 siblings, 1 reply; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-21 15:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Nicholas Piggin, Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, x86 ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote: > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: > >> That being said, the x86 sync core gap that I imagined could be fixed >> by changing to rq->curr == rq->idle test does not actually exist because >> the global membarrier does not have a sync core option. So fixing the >> exit_lazy_tlb points that this series does *should* fix that. So >> PF_KTHREAD may be less problematic than I thought from implementation >> point of view, only semantics. > > So I've been trying to figure out where that PF_KTHREAD comes from, > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. > > So the first version: > > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com > > appears to unconditionally send the IPI and checks p->mm in the IPI > context, but then v2: > > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com > > has the current code. But I've been unable to find the reason the > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. Looking back at my inbox, it seems like you are the one who proposed to skip all kthreads: https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net > > The comment doesn't really help either; sure we have the whole lazy mm > thing, but that's ->active_mm, not ->mm. 
> > Possibly it is because {,un}use_mm() do not have sufficient barriers to > make the remote p->mm test work? Or were we over-eager with the !p->mm > doesn't imply kthread 'cleanups' at the time? The nice thing about adding back kthreads to the threads considered for membarrier IPI is that it has no observable effect on the user-space ABI. No pre-existing kthreads rely on this, and we just provide an additional guarantee for future kthread implementations. > Also, I just realized, I still have a fix for use_mm() now > kthread_use_mm() that seems to have been lost. I suspect we need to at least document the memory barriers in kthread_use_mm and kthread_unuse_mm to state that they are required by membarrier if we want to IPI kthreads as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 15:15 ` Mathieu Desnoyers @ 2020-07-21 15:19 ` Peter Zijlstra 2020-07-21 15:22 ` Mathieu Desnoyers 0 siblings, 1 reply; 59+ messages in thread From: Peter Zijlstra @ 2020-07-21 15:19 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Nicholas Piggin, Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, x86 On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote: > ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote: > > > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: > > > >> That being said, the x86 sync core gap that I imagined could be fixed > >> by changing to rq->curr == rq->idle test does not actually exist because > >> the global membarrier does not have a sync core option. So fixing the > >> exit_lazy_tlb points that this series does *should* fix that. So > >> PF_KTHREAD may be less problematic than I thought from implementation > >> point of view, only semantics. > > > > So I've been trying to figure out where that PF_KTHREAD comes from, > > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy > > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. > > > > So the first version: > > > > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com > > > > appears to unconditionally send the IPI and checks p->mm in the IPI > > context, but then v2: > > > > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com > > > > has the current code. But I've been unable to find the reason the > > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. 
> > Looking back at my inbox, it seems like you are the one who proposed to > skip all kthreads: > > https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net I had a feeling it might've been me ;-) I just couldn't find the email. > > The comment doesn't really help either; sure we have the whole lazy mm > > thing, but that's ->active_mm, not ->mm. > > > > Possibly it is because {,un}use_mm() do not have sufficient barriers to > > make the remote p->mm test work? Or were we over-eager with the !p->mm > > doesn't imply kthread 'cleanups' at the time? > > The nice thing about adding back kthreads to the threads considered for membarrier > IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread > rely on this, and we just provide an additional guarantee for future kthread > implementations. > > > Also, I just realized, I still have a fix for use_mm() now > > kthread_use_mm() that seems to have been lost. > > I suspect we need to at least document the memory barriers in kthread_use_mm and > kthread_unuse_mm to state that they are required by membarrier if we want to > ipi kthreads as well. Right, so going by that email you found it was mostly a case of being lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add any other bits that might be needed, covering kthreads should be possible. No objections from me for making it so. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode 2020-07-21 15:19 ` Peter Zijlstra @ 2020-07-21 15:22 ` Mathieu Desnoyers 0 siblings, 0 replies; 59+ messages in thread From: Mathieu Desnoyers @ 2020-07-21 15:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Nicholas Piggin, Anton Blanchard, Arnd Bergmann, Jens Axboe, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Andy Lutomirski, Andy Lutomirski, x86 ----- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra peterz@infradead.org wrote: > On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote: >> ----- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra peterz@infradead.org wrote: >> >> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: >> > >> >> That being said, the x86 sync core gap that I imagined could be fixed >> >> by changing to rq->curr == rq->idle test does not actually exist because >> >> the global membarrier does not have a sync core option. So fixing the >> >> exit_lazy_tlb points that this series does *should* fix that. So >> >> PF_KTHREAD may be less problematic than I thought from implementation >> >> point of view, only semantics. >> > >> > So I've been trying to figure out where that PF_KTHREAD comes from, >> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy >> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. >> > >> > So the first version: >> > >> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoyers@efficios.com >> > >> > appears to unconditionally send the IPI and checks p->mm in the IPI >> > context, but then v2: >> > >> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoyers@efficios.com >> > >> > has the current code. But I've been unable to find the reason the >> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. 
>> >> Looking back at my inbox, it seems like you are the one who proposed to >> skip all kthreads: >> >> https://lkml.kernel.org/r/20190904124333.GQ2332@hirez.programming.kicks-ass.net > > I had a feeling it might've been me ;-) I just couldn't find the email. > >> > The comment doesn't really help either; sure we have the whole lazy mm >> > thing, but that's ->active_mm, not ->mm. >> > >> > Possibly it is because {,un}use_mm() do not have sufficient barriers to >> > make the remote p->mm test work? Or were we over-eager with the !p->mm >> > doesn't imply kthread 'cleanups' at the time? >> >> The nice thing about adding back kthreads to the threads considered for >> membarrier >> IPI is that it has no observable effect on the user-space ABI. No pre-existing >> kthread >> rely on this, and we just provide an additional guarantee for future kthread >> implementations. >> >> > Also, I just realized, I still have a fix for use_mm() now >> > kthread_use_mm() that seems to have been lost. >> >> I suspect we need to at least document the memory barriers in kthread_use_mm and >> kthread_unuse_mm to state that they are required by membarrier if we want to >> ipi kthreads as well. > > Right, so going by that email you found it was mostly a case of being > lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add > any other bits that might be needed, covering kthreads should be > possible. > > No objections from me for making it so. I'm OK on making membarrier cover kthreads using mm as well, provided we audit kthread_{,un}use_mm() to make sure the proper barriers are in place after setting task->mm and before clearing it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 59+ messages in thread
* [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions 2020-07-10 1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin ` (3 preceding siblings ...) 2020-07-10 1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin @ 2020-07-10 1:56 ` Nicholas Piggin 2020-07-10 9:48 ` Peter Zijlstra 2020-07-10 1:56 ` [RFC PATCH 6/7] lazy tlb: allow lazy tlb mm switching to be configurable Nicholas Piggin 2020-07-10 1:56 ` [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin 6 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-10 1:56 UTC (permalink / raw) To: linux-arch Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard Add explicit _lazy_tlb annotated functions for lazy mm refcounting. This makes things a bit more explicit, and allows explicit refcounting to be removed if it is not used. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- arch/powerpc/kernel/smp.c | 2 +- arch/powerpc/mm/book3s64/radix_tlb.c | 4 ++-- fs/exec.c | 2 +- include/linux/sched/mm.h | 17 +++++++++++++++++ kernel/cpu.c | 2 +- kernel/exit.c | 2 +- kernel/kthread.c | 11 +++++++---- kernel/sched/core.c | 13 +++++++------ 8 files changed, 37 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c index 73199470c265..ad95812d2a3f 100644 --- a/arch/powerpc/kernel/smp.c +++ b/arch/powerpc/kernel/smp.c @@ -1253,7 +1253,7 @@ void start_secondary(void *unused) unsigned int cpu = smp_processor_id(); struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; - mmgrab(&init_mm); + mmgrab(&init_mm); /* XXX: where is the mmput for this? 
*/ current->active_mm = &init_mm; smp_store_cpu_info(cpu); diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c index b5cc9b23cf02..52730629b3eb 100644 --- a/arch/powerpc/mm/book3s64/radix_tlb.c +++ b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -652,10 +652,10 @@ static void do_exit_flush_lazy_tlb(void *arg) * Must be a kernel thread because sender is single-threaded. */ BUG_ON(current->mm); - mmgrab(&init_mm); + mmgrab_lazy_tlb(&init_mm); switch_mm(mm, &init_mm, current); current->active_mm = &init_mm; - mmdrop(mm); + mmdrop_lazy_tlb(mm); } _tlbiel_pid(pid, RIC_FLUSH_ALL); } diff --git a/fs/exec.c b/fs/exec.c index e2ab71e88293..3a01b2751ea9 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1119,7 +1119,7 @@ static int exec_mmap(struct mm_struct *mm) mmput(old_mm); } else { exit_lazy_tlb(active_mm, tsk); - mmdrop(active_mm); + mmdrop_lazy_tlb(active_mm); } return 0; } diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 9b026264b445..110d4ad21de6 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -50,6 +50,23 @@ static inline void mmdrop(struct mm_struct *mm) void mmdrop(struct mm_struct *mm); +/* Helpers for lazy TLB mm refcounting */ +static inline void mmgrab_lazy_tlb(struct mm_struct *mm) +{ + mmgrab(mm); +} + +static inline void mmdrop_lazy_tlb(struct mm_struct *mm) +{ + mmdrop(mm); +} + +static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm) +{ + /* This depends on mmdrop providing a full smp_mb() */ + mmdrop(mm); +} + /* * This has to be called after a get_task_mm()/mmget_not_zero() * followed by taking the mmap_lock for writing before modifying the diff --git a/kernel/cpu.c b/kernel/cpu.c index 134688d79589..ff9fcbc4e76b 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -578,7 +578,7 @@ static int finish_cpu(unsigned int cpu) */ if (mm != &init_mm) idle->active_mm = &init_mm; - mmdrop(mm); + mmdrop_lazy_tlb(mm); return 0; } diff --git a/kernel/exit.c b/kernel/exit.c index 
727150f28103..d535da9fd2f8 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -470,7 +470,7 @@ static void exit_mm(void) __set_current_state(TASK_RUNNING); mmap_read_lock(mm); } - mmgrab(mm); + mmgrab_lazy_tlb(mm); BUG_ON(mm != current->active_mm); /* more a memory barrier than a real lock */ task_lock(current); diff --git a/kernel/kthread.c b/kernel/kthread.c index 6f93c649aa97..a7133cc2ddaf 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1238,12 +1238,12 @@ void kthread_use_mm(struct mm_struct *mm) WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); WARN_ON_ONCE(tsk->mm); + mmgrab(mm); + task_lock(tsk); active_mm = tsk->active_mm; - if (active_mm != mm) { - mmgrab(mm); + if (active_mm != mm) tsk->active_mm = mm; - } tsk->mm = mm; switch_mm(active_mm, mm, tsk); task_unlock(tsk); @@ -1253,7 +1253,7 @@ void kthread_use_mm(struct mm_struct *mm) exit_lazy_tlb(active_mm, tsk); if (active_mm != mm) - mmdrop(active_mm); + mmdrop_lazy_tlb(active_mm); to_kthread(tsk)->oldfs = get_fs(); set_fs(USER_DS); @@ -1276,9 +1276,12 @@ void kthread_unuse_mm(struct mm_struct *mm) task_lock(tsk); sync_mm_rss(mm); tsk->mm = NULL; + mmgrab_lazy_tlb(mm); /* active_mm is still 'mm' */ enter_lazy_tlb(mm, tsk); task_unlock(tsk); + + mmdrop(mm); } EXPORT_SYMBOL_GPL(kthread_unuse_mm); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 31e22c79826c..d19f2f517f6c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3302,10 +3302,11 @@ static struct rq *finish_task_switch(struct task_struct *prev) * schedule between user->kernel->user threads without passing though * switch_mm(). Membarrier requires a full barrier after storing to * rq->curr, before returning to userspace, for - * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by + * mmdrop_lazy_tlb_smp_mb(). 
*/ if (mm) - mmdrop(mm); + mmdrop_lazy_tlb_smp_mb(mm); if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) @@ -3410,9 +3411,9 @@ context_switch(struct rq *rq, struct task_struct *prev, /* * kernel -> kernel lazy + transfer active - * user -> kernel lazy + mmgrab() active + * user -> kernel lazy + mmgrab_lazy_tlb() active * - * kernel -> user switch + mmdrop() active + * kernel -> user switch + mmdrop_lazy_tlb() active * user -> user switch */ if (!next->mm) { // to kernel @@ -3420,7 +3421,7 @@ context_switch(struct rq *rq, struct task_struct *prev, next->active_mm = prev->active_mm; if (prev->mm) // from user - mmgrab(prev->active_mm); + mmgrab_lazy_tlb(prev->active_mm); else prev->active_mm = NULL; } else { // to user @@ -3438,7 +3439,7 @@ context_switch(struct rq *rq, struct task_struct *prev, if (!prev->mm) { // from kernel exit_lazy_tlb(prev->active_mm, next); - /* will mmdrop() in finish_task_switch(). */ + /* will mmdrop_lazy_tlb() in finish_task_switch(). */ rq->prev_mm = prev->active_mm; prev->active_mm = NULL; } -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions 2020-07-10 1:56 ` [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions Nicholas Piggin @ 2020-07-10 9:48 ` Peter Zijlstra 0 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2020-07-10 9:48 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, x86, Mathieu Desnoyers, Arnd Bergmann, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard On Fri, Jul 10, 2020 at 11:56:44AM +1000, Nicholas Piggin wrote: > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c > index 73199470c265..ad95812d2a3f 100644 > --- a/arch/powerpc/kernel/smp.c > +++ b/arch/powerpc/kernel/smp.c > @@ -1253,7 +1253,7 @@ void start_secondary(void *unused) > unsigned int cpu = smp_processor_id(); > struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask; > > - mmgrab(&init_mm); > + mmgrab(&init_mm); /* XXX: where is the mmput for this? */ > current->active_mm = &init_mm; > > smp_store_cpu_info(cpu); Right; so IIRC it should be this one: > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 134688d79589..ff9fcbc4e76b 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -578,7 +578,7 @@ static int finish_cpu(unsigned int cpu) > */ > if (mm != &init_mm) > idle->active_mm = &init_mm; > - mmdrop(mm); > + mmdrop_lazy_tlb(mm); > return 0; > } ^ permalink raw reply [flat|nested] 59+ messages in thread
* [RFC PATCH 6/7] lazy tlb: allow lazy tlb mm switching to be configurable 2020-07-10 1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin ` (4 preceding siblings ...) 2020-07-10 1:56 ` [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions Nicholas Piggin @ 2020-07-10 1:56 ` Nicholas Piggin 2020-07-10 1:56 ` [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin 6 siblings, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-10 1:56 UTC (permalink / raw) To: linux-arch Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard NOMMU systems could easily go without this and save a bit of code and the mm refcounting, because their mm switch is a no-op. I haven't flipped them over because I haven't audited all arch code to convert over to using the _lazy_tlb refcounting. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- arch/Kconfig | 7 +++++ include/linux/sched/mm.h | 12 ++++++--- kernel/sched/core.c | 55 +++++++++++++++++++++++++++------------- kernel/sched/sched.h | 4 ++- 4 files changed, 55 insertions(+), 23 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 8cc35dc556c7..2daf8fe6146a 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -411,6 +411,13 @@ config MMU_GATHER_NO_GATHER bool depends on MMU_GATHER_TABLE_FREE +# Would like to make this depend on MMU, because there is little use for lazy mm switching +# with NOMMU, but have to audit NOMMU architecture code first. +config MMU_LAZY_TLB + def_bool y + help + Enable "lazy TLB" mmu context switching for kernel threads. 
+ config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 110d4ad21de6..2c2b20e2ccc7 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -53,18 +53,22 @@ void mmdrop(struct mm_struct *mm); /* Helpers for lazy TLB mm refcounting */ static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - mmgrab(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + mmgrab(mm); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - mmdrop(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + mmdrop(mm); } static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm) { - /* This depends on mmdrop providing a full smp_mb() */ - mmdrop(mm); + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() */ + else + smp_mb(); } /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index d19f2f517f6c..14b4fae6f6e3 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3253,7 +3253,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) __releases(rq->lock) { struct rq *rq = this_rq(); - struct mm_struct *mm = rq->prev_mm; + struct mm_struct *mm = NULL; long prev_state; /* @@ -3272,7 +3272,10 @@ static struct rq *finish_task_switch(struct task_struct *prev) current->comm, current->pid, preempt_count())) preempt_count_set(FORK_PREEMPT_COUNT); - rq->prev_mm = NULL; +#ifdef CONFIG_MMU_LAZY_TLB + mm = rq->prev_lazy_mm; + rq->prev_lazy_mm = NULL; +#endif /* * A task struct has one reference for the use as "current". @@ -3393,22 +3396,11 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev) calculate_sigpending(); } -/* - * context_switch - switch to the new MM and the new thread's register state. 
- */ -static __always_inline struct rq * -context_switch(struct rq *rq, struct task_struct *prev, - struct task_struct *next, struct rq_flags *rf) +static __always_inline void +context_switch_mm(struct rq *rq, struct task_struct *prev, + struct task_struct *next) { - prepare_task_switch(rq, prev, next); - - /* - * For paravirt, this is coupled with an exit in switch_to to - * combine the page table reload and the switch backend into - * one hypercall. - */ - arch_start_context_switch(prev); - +#ifdef CONFIG_MMU_LAZY_TLB /* * kernel -> kernel lazy + transfer active * user -> kernel lazy + mmgrab_lazy_tlb() active @@ -3440,10 +3432,37 @@ context_switch(struct rq *rq, struct task_struct *prev, exit_lazy_tlb(prev->active_mm, next); /* will mmdrop_lazy_tlb() in finish_task_switch(). */ - rq->prev_mm = prev->active_mm; + rq->prev_lazy_mm = prev->active_mm; prev->active_mm = NULL; } } +#else + if (!next->mm) + next->active_mm = &init_mm; + membarrier_switch_mm(rq, prev->active_mm, next->active_mm); + switch_mm_irqs_off(prev->active_mm, next->active_mm, next); + if (!prev->mm) + prev->active_mm = NULL; +#endif +} + +/* + * context_switch - switch to the new MM and the new thread's register state. + */ +static __always_inline struct rq * +context_switch(struct rq *rq, struct task_struct *prev, + struct task_struct *next, struct rq_flags *rf) +{ + prepare_task_switch(rq, prev, next); + + /* + * For paravirt, this is coupled with an exit in switch_to to + * combine the page table reload and the switch backend into + * one hypercall. 
+ */ + arch_start_context_switch(prev); + + context_switch_mm(rq, prev, next); rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 877fb08eb1b0..b196dd885d33 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -929,7 +929,9 @@ struct rq { struct task_struct *idle; struct task_struct *stop; unsigned long next_balance; - struct mm_struct *prev_mm; +#ifdef CONFIG_MMU_LAZY_TLB + struct mm_struct *prev_lazy_mm; +#endif unsigned int clock_update_flags; u64 clock; -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-10 1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin ` (5 preceding siblings ...) 2020-07-10 1:56 ` [RFC PATCH 6/7] lazy tlb: allow lazy tlb mm switching to be configurable Nicholas Piggin @ 2020-07-10 1:56 ` Nicholas Piggin 2020-07-10 9:35 ` Peter Zijlstra 2020-07-13 15:59 ` Andy Lutomirski 6 siblings, 2 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-10 1:56 UTC (permalink / raw) To: linux-arch Cc: Nicholas Piggin, x86, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard On big systems, the mm refcount can become highly contended when doing a lot of context switching with threaded applications (particularly switching between the idle thread and an application thread). Abandoning lazy tlb slows switching down quite a bit in the important user->idle->user cases, so instead implement a non-refcounted scheme that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any remaining lazy ones. On a 16-socket 192-core POWER8 system, a context switching benchmark with as many software threads as CPUs (so each switch will go in and out of idle), upstream can achieve a rate of about 1 million context switches per second. After this patch it goes up to 118 million. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- arch/Kconfig | 16 ++++++++++++++++ arch/powerpc/Kconfig | 1 + include/linux/sched/mm.h | 6 +++--- kernel/fork.c | 39 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 59 insertions(+), 3 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 2daf8fe6146a..edf69437a971 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -418,6 +418,22 @@ config MMU_LAZY_TLB help Enable "lazy TLB" mmu context switching for kernel threads. 
+config MMU_LAZY_TLB_REFCOUNT + def_bool y + depends on MMU_LAZY_TLB + depends on !MMU_LAZY_TLB_SHOOTDOWN + +config MMU_LAZY_TLB_SHOOTDOWN + bool + depends on MMU_LAZY_TLB + help + Instead of refcounting the "lazy tlb" mm struct, which can cause + contention with multi-threaded apps on large multiprocessor systems, + this option causes __mmdrop to IPI all CPUs in the mm_cpumask and + switch to init_mm if they were using the to-be-freed mm as the lazy + tlb. Architectures which do not track all possible lazy tlb CPUs in + mm_cpumask can not use this (without modification). + config ARCH_HAVE_NMI_SAFE_CMPXCHG bool diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 920c4e3ca4ef..24ac85c868db 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -225,6 +225,7 @@ config PPC select HAVE_PERF_USER_STACK_DUMP select MMU_GATHER_RCU_TABLE_FREE select MMU_GATHER_PAGE_SIZE + select MMU_LAZY_TLB_SHOOTDOWN select HAVE_REGS_AND_STACK_ACCESS_API select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN select HAVE_SYSCALL_TRACEPOINTS diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 2c2b20e2ccc7..1067af8039bd 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -53,19 +53,19 @@ void mmdrop(struct mm_struct *mm); /* Helpers for lazy TLB mm refcounting */ static inline void mmgrab_lazy_tlb(struct mm_struct *mm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) mmgrab(mm); } static inline void mmdrop_lazy_tlb(struct mm_struct *mm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) mmdrop(mm); } static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm) { - if (IS_ENABLED(CONFIG_MMU_LAZY_TLB)) + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() */ else smp_mb(); diff --git a/kernel/fork.c b/kernel/fork.c index 142b23645d82..da0fba9e6079 100644 --- a/kernel/fork.c +++ 
b/kernel/fork.c @@ -685,6 +685,40 @@ static void check_mm(struct mm_struct *mm) #define allocate_mm() (kmem_cache_alloc(mm_cachep, GFP_KERNEL)) #define free_mm(mm) (kmem_cache_free(mm_cachep, (mm))) +static void do_shoot_lazy_tlb(void *arg) +{ + struct mm_struct *mm = arg; + + if (current->active_mm == mm) { + BUG_ON(current->mm); + switch_mm(mm, &init_mm, current); + current->active_mm = &init_mm; + } +} + +static void do_check_lazy_tlb(void *arg) +{ + struct mm_struct *mm = arg; + + BUG_ON(current->active_mm == mm); +} + +static void shoot_lazy_tlbs(struct mm_struct *mm) +{ + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { + smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1); + do_shoot_lazy_tlb(mm); + } +} + +static void check_lazy_tlbs(struct mm_struct *mm) +{ + if (IS_ENABLED(CONFIG_DEBUG_VM)) { + smp_call_function(do_check_lazy_tlb, (void *)mm, 1); + do_check_lazy_tlb(mm); + } +} + /* * Called when the last reference to the mm * is dropped: either by a lazy thread or by @@ -695,6 +729,11 @@ void __mmdrop(struct mm_struct *mm) BUG_ON(mm == &init_mm); WARN_ON_ONCE(mm == current->mm); WARN_ON_ONCE(mm == current->active_mm); + + /* Ensure no CPUs are using this as their lazy tlb mm */ + shoot_lazy_tlbs(mm); + check_lazy_tlbs(mm); + mm_free_pgd(mm); destroy_context(mm); mmu_notifier_subscriptions_destroy(mm); -- 2.23.0 ^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-10 1:56 ` [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin @ 2020-07-10 9:35 ` Peter Zijlstra 2020-07-13 4:58 ` Nicholas Piggin 2020-07-13 15:59 ` Andy Lutomirski 1 sibling, 1 reply; 59+ messages in thread From: Peter Zijlstra @ 2020-07-10 9:35 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, x86, Mathieu Desnoyers, Arnd Bergmann, linux-kernel, linuxppc-dev, linux-mm, Anton Blanchard On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote: > On big systems, the mm refcount can become highly contented when doing > a lot of context switching with threaded applications (particularly > switching between the idle thread and an application thread). > > Abandoning lazy tlb slows switching down quite a bit in the important > user->idle->user cases, so so instead implement a non-refcounted scheme > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down > any remaining lazy ones. > > On a 16-socket 192-core POWER8 system, a context switching benchmark > with as many software threads as CPUs (so each switch will go in and > out of idle), upstream can achieve a rate of about 1 million context > switches per second. After this patch it goes up to 118 million. That's mighty impressive, however: > +static void shoot_lazy_tlbs(struct mm_struct *mm) > +{ > + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { > + smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1); > + do_shoot_lazy_tlb(mm); > + } > +} IIRC you (power) never clear a CPU from that mask, so for other workloads I can see this resulting in massive amounts of IPIs. For instance, take as many processes as you have CPUs. For each, manually walk the task across all CPUs and exit. Again. Clearly, that's an extreme, but still... ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-10 9:35 ` Peter Zijlstra @ 2020-07-13 4:58 ` Nicholas Piggin 0 siblings, 0 replies; 59+ messages in thread From: Nicholas Piggin @ 2020-07-13 4:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Anton Blanchard, Arnd Bergmann, linux-arch, linux-kernel, linux-mm, linuxppc-dev, Mathieu Desnoyers, x86 Excerpts from Peter Zijlstra's message of July 10, 2020 7:35 pm: > On Fri, Jul 10, 2020 at 11:56:46AM +1000, Nicholas Piggin wrote: >> On big systems, the mm refcount can become highly contented when doing >> a lot of context switching with threaded applications (particularly >> switching between the idle thread and an application thread). >> >> Abandoning lazy tlb slows switching down quite a bit in the important >> user->idle->user cases, so so instead implement a non-refcounted scheme >> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >> any remaining lazy ones. >> >> On a 16-socket 192-core POWER8 system, a context switching benchmark >> with as many software threads as CPUs (so each switch will go in and >> out of idle), upstream can achieve a rate of about 1 million context >> switches per second. After this patch it goes up to 118 million. > > That's mighty impressive, however: Well, it's the usual case of "find a bouncing line and scale up the machine size until you achieve your desired improvements" :) But we are looking at some fundamental scalabilities and seeing if we can improve a few things. > >> +static void shoot_lazy_tlbs(struct mm_struct *mm) >> +{ >> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) { >> + smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1); >> + do_shoot_lazy_tlb(mm); >> + } >> +} > > IIRC you (power) never clear a CPU from that mask, so for other > workloads I can see this resulting in massive amounts of IPIs. > > For instance, take as many processes as you have CPUs. 
For each, manually walk the task across all CPUs and exit. Again. > > Clearly, that's an extreme, but still... We do have some issues with that, though it does tend to be very self-limiting: short-lived tasks that can drive lots of exits won't get to run on a lot of cores. It's worth keeping an eye on; it may not be too hard to mitigate the IPIs by doing something dumb like collecting a queue of mms before killing a batch of them. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-10 1:56 ` [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin 2020-07-10 9:35 ` Peter Zijlstra @ 2020-07-13 15:59 ` Andy Lutomirski 2020-07-13 16:48 ` Nicholas Piggin 1 sibling, 1 reply; 59+ messages in thread From: Andy Lutomirski @ 2020-07-13 15:59 UTC (permalink / raw) To: Nicholas Piggin Cc: linux-arch, X86 ML, Mathieu Desnoyers, Arnd Bergmann, Peter Zijlstra, LKML, linuxppc-dev, Linux-MM, Anton Blanchard On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: > > On big systems, the mm refcount can become highly contented when doing > a lot of context switching with threaded applications (particularly > switching between the idle thread and an application thread). > > Abandoning lazy tlb slows switching down quite a bit in the important > user->idle->user cases, so so instead implement a non-refcounted scheme > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down > any remaining lazy ones. > > On a 16-socket 192-core POWER8 system, a context switching benchmark > with as many software threads as CPUs (so each switch will go in and > out of idle), upstream can achieve a rate of about 1 million context > switches per second. After this patch it goes up to 118 million. > I read the patch a couple of times, and I have a suggestion that could be nonsense. You are, effectively, using mm_cpumask() as a sort of refcount. You're saying "hey, this mm has no more references, but it still has nonempty mm_cpumask(), so let's send an IPI and shoot down those references too." I'm wondering whether you actually need the IPI. What if, instead, you actually treated mm_cpumask as a refcount for real? Roughly, in __mmdrop(), you would only free the page tables if mm_cpumask() is empty. 
And, in the code that removes a CPU from mm_cpumask(), you would check if mm_users == 0 and, if so, check if you just removed the last bit from mm_cpumask and potentially free the mm. Getting the locking right here could be a bit tricky -- you need to avoid two CPUs simultaneously exiting lazy TLB and thinking they should free the mm, and you also need to avoid an mm with mm_users hitting zero concurrently with the last remote CPU using it lazily exiting lazy TLB. Perhaps this could be resolved by having mm_count == 1 mean "mm_cpumask() might contain bits and, if so, it owns the mm" and mm_count == 0 meaning "now it's dead" and using some careful cmpxchg or dec_return to make sure that only one CPU frees it. Or maybe you'd need a lock or RCU for this, but the idea would be to only ever take the lock after mm_users goes to zero. --Andy ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-13 15:59 ` Andy Lutomirski @ 2020-07-13 16:48 ` Nicholas Piggin 2020-07-13 18:18 ` Andy Lutomirski 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-13 16:48 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Mathieu Desnoyers, Peter Zijlstra, X86 ML Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: > On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >> >> On big systems, the mm refcount can become highly contented when doing >> a lot of context switching with threaded applications (particularly >> switching between the idle thread and an application thread). >> >> Abandoning lazy tlb slows switching down quite a bit in the important >> user->idle->user cases, so so instead implement a non-refcounted scheme >> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >> any remaining lazy ones. >> >> On a 16-socket 192-core POWER8 system, a context switching benchmark >> with as many software threads as CPUs (so each switch will go in and >> out of idle), upstream can achieve a rate of about 1 million context >> switches per second. After this patch it goes up to 118 million. >> > > I read the patch a couple of times, and I have a suggestion that could > be nonsense. You are, effectively, using mm_cpumask() as a sort of > refcount. You're saying "hey, this mm has no more references, but it > still has nonempty mm_cpumask(), so let's send an IPI and shoot down > those references too." I'm wondering whether you actually need the > IPI. What if, instead, you actually treated mm_cpumask as a refcount > for real? Roughly, in __mmdrop(), you would only free the page tables > if mm_cpumask() is empty. 
And, in the code that removes a CPU from > mm_cpumask(), you would check if mm_users == 0 and, if so, check if > you just removed the last bit from mm_cpumask and potentially free the > mm. > > Getting the locking right here could be a bit tricky -- you need to > avoid two CPUs simultaneously exiting lazy TLB and thinking they > should free the mm, and you also need to avoid an mm with mm_users > hitting zero concurrently with the last remote CPU using it lazily > exiting lazy TLB. Perhaps this could be resolved by having mm_count > == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the > mm" and mm_count == 0 meaning "now it's dead" and using some careful > cmpxchg or dec_return to make sure that only one CPU frees it. > > Or maybe you'd need a lock or RCU for this, but the idea would be to > only ever take the lock after mm_users goes to zero. I don't think it's nonsense, it could be a good way to avoid IPIs. I haven't seen much problem here that made me too concerned about IPIs yet, so I think the simple patch may be good enough to start with for powerpc. I'm looking at avoiding/reducing the IPIs by combining the unlazying with the exit TLB flush without doing anything fancy with ref counting, but we'll see. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-13 16:48 ` Nicholas Piggin @ 2020-07-13 18:18 ` Andy Lutomirski 2020-07-14 5:04 ` Nicholas Piggin 0 siblings, 1 reply; 59+ messages in thread From: Andy Lutomirski @ 2020-07-13 18:18 UTC (permalink / raw) To: Nicholas Piggin Cc: Andy Lutomirski, Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Mathieu Desnoyers, Peter Zijlstra, X86 ML > On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>> >>> On big systems, the mm refcount can become highly contented when doing >>> a lot of context switching with threaded applications (particularly >>> switching between the idle thread and an application thread). >>> >>> Abandoning lazy tlb slows switching down quite a bit in the important >>> user->idle->user cases, so so instead implement a non-refcounted scheme >>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>> any remaining lazy ones. >>> >>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>> with as many software threads as CPUs (so each switch will go in and >>> out of idle), upstream can achieve a rate of about 1 million context >>> switches per second. After this patch it goes up to 118 million. >>> >> >> I read the patch a couple of times, and I have a suggestion that could >> be nonsense. You are, effectively, using mm_cpumask() as a sort of >> refcount. You're saying "hey, this mm has no more references, but it >> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >> those references too." I'm wondering whether you actually need the >> IPI. What if, instead, you actually treated mm_cpumask as a refcount >> for real? Roughly, in __mmdrop(), you would only free the page tables >> if mm_cpumask() is empty. 
And, in the code that removes a CPU from >> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >> you just removed the last bit from mm_cpumask and potentially free the >> mm. >> >> Getting the locking right here could be a bit tricky -- you need to >> avoid two CPUs simultaneously exiting lazy TLB and thinking they >> should free the mm, and you also need to avoid an mm with mm_users >> hitting zero concurrently with the last remote CPU using it lazily >> exiting lazy TLB. Perhaps this could be resolved by having mm_count >> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >> mm" and mm_count == 0 meaning "now it's dead" and using some careful >> cmpxchg or dec_return to make sure that only one CPU frees it. >> >> Or maybe you'd need a lock or RCU for this, but the idea would be to >> only ever take the lock after mm_users goes to zero. > > I don't think it's nonsense, it could be a good way to avoid IPIs. > > I haven't seen much problem here that made me too concerned about IPIs > yet, so I think the simple patch may be good enough to start with > for powerpc. I'm looking at avoiding/reducing the IPIs by combining the > unlazying with the exit TLB flush without doing anything fancy with > ref counting, but we'll see. I would be cautious with benchmarking here. I would expect that the nasty cases may affect power consumption more than performance — the specific issue is IPIs hitting idle cores, and the main effects are to slow down exit() a bit but also to kick the idle core out of idle. Although, if the idle core is in a deep sleep, that IPI could be *very* slow. So I think it’s worth at least giving this a try. > > Thanks, > Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-13 18:18 ` Andy Lutomirski @ 2020-07-14 5:04 ` Nicholas Piggin 2020-07-14 6:31 ` Nicholas Piggin 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-14 5:04 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: > >> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >> >> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>> >>>> On big systems, the mm refcount can become highly contented when doing >>>> a lot of context switching with threaded applications (particularly >>>> switching between the idle thread and an application thread). >>>> >>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>> any remaining lazy ones. >>>> >>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>> with as many software threads as CPUs (so each switch will go in and >>>> out of idle), upstream can achieve a rate of about 1 million context >>>> switches per second. After this patch it goes up to 118 million. >>>> >>> >>> I read the patch a couple of times, and I have a suggestion that could >>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>> refcount. You're saying "hey, this mm has no more references, but it >>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>> those references too." I'm wondering whether you actually need the >>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>> for real? 
Roughly, in __mmdrop(), you would only free the page tables >>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>> you just removed the last bit from mm_cpumask and potentially free the >>> mm. >>> >>> Getting the locking right here could be a bit tricky -- you need to >>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>> should free the mm, and you also need to avoid an mm with mm_users >>> hitting zero concurrently with the last remote CPU using it lazily >>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>> cmpxchg or dec_return to make sure that only one CPU frees it. >>> >>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>> only ever take the lock after mm_users goes to zero. >> >> I don't think it's nonsense, it could be a good way to avoid IPIs. >> >> I haven't seen much problem here that made me too concerned about IPIs >> yet, so I think the simple patch may be good enough to start with >> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >> unlazying with the exit TLB flush without doing anything fancy with >> ref counting, but we'll see. > > I would be cautious with benchmarking here. I would expect that the > nasty cases may affect power consumption more than performance — the > specific issue is IPIs hitting idle cores, and the main effects are to > slow down exit() a bit but also to kick the idle core out of idle. > Although, if the idle core is in a deep sleep, that IPI could be > *very* slow. It will tend to be self-limiting to some degree (deeper idle cores would tend to have less chance of IPI) but we have bigger issues on powerpc with that, like broadcast IPIs to the mm cpumask for THP management. 
Power hasn't really shown up as an issue but powerpc CPUs may have their own requirements and issues there, shall we say. > So I think it’s worth at least giving this a try. To be clear it's not a complete solution itself. The problem is of course that mm cpumask gives you false negatives, so the bits won't always clean up after themselves as CPUs switch away from their lazy tlb mms. I would suspect it _may_ help with garbage collecting some remainders nicely after exit, but only with somewhat of a different accounting system than powerpc uses -- we tie mm_cpumask to TLB valids, so it can become spread over CPUs that haven't (and may never have) used that mm as a lazy mm. I don't know that the self-culling trick would help a great deal within that scheme. So powerpc needs a bit more work on that side of things too, hence looking at doing more of this in the final TLB shootdown. There are actually a lot of other things we can do as well to reduce IPIs: batching being a simple hammer, some kind of quiescing, testing the remote CPU to check what active mm it is using, doing the un-lazy at certain defined points, etc., so I'm actually not worried about IPIs suddenly popping up and rendering the whole concept unworkable. At some point (unless we do something pretty complex like a SRCU type thing, or adding extra locking, e.g. in use_mm()), at least sometimes an IPI will be required, so I think it's reasonable to start here and introduce complexity more slowly if it's justified. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-14 5:04 ` Nicholas Piggin @ 2020-07-14 6:31 ` Nicholas Piggin 2020-07-14 12:46 ` Andy Lutomirski 0 siblings, 1 reply; 59+ messages in thread From: Nicholas Piggin @ 2020-07-14 6:31 UTC (permalink / raw) To: Andy Lutomirski Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: > Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >> >>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>> >>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>> >>>>> On big systems, the mm refcount can become highly contented when doing >>>>> a lot of context switching with threaded applications (particularly >>>>> switching between the idle thread and an application thread). >>>>> >>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>> any remaining lazy ones. >>>>> >>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>> with as many software threads as CPUs (so each switch will go in and >>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>> switches per second. After this patch it goes up to 118 million. >>>>> >>>> >>>> I read the patch a couple of times, and I have a suggestion that could >>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>> refcount. You're saying "hey, this mm has no more references, but it >>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>> those references too." I'm wondering whether you actually need the >>>> IPI. 
What if, instead, you actually treated mm_cpumask as a refcount >>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>> you just removed the last bit from mm_cpumask and potentially free the >>>> mm. >>>> >>>> Getting the locking right here could be a bit tricky -- you need to >>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>> should free the mm, and you also need to avoid an mm with mm_users >>>> hitting zero concurrently with the last remote CPU using it lazily >>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>> >>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>> only ever take the lock after mm_users goes to zero. >>> >>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>> >>> I haven't seen much problem here that made me too concerned about IPIs >>> yet, so I think the simple patch may be good enough to start with >>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>> unlazying with the exit TLB flush without doing anything fancy with >>> ref counting, but we'll see. >> >> I would be cautious with benchmarking here. I would expect that the >> nasty cases may affect power consumption more than performance — the >> specific issue is IPIs hitting idle cores, and the main effects are to >> slow down exit() a bit but also to kick the idle core out of idle. >> Although, if the idle core is in a deep sleep, that IPI could be >> *very* slow. 
> > It will tend to be self-limiting to some degree (deeper idle cores > would tend to have less chance of IPI) but we have bigger issues on > powerpc with that, like broadcast IPIs to the mm cpumask for THP > management. Power hasn't really shown up as an issue but powerpc > CPUs may have their own requirements and issues there, shall we say. > >> So I think it’s worth at least giving this a try. > > To be clear it's not a complete solution itself. The problem is of > course that mm cpumask gives you false negatives, so the bits > won't always clean up after themselves as CPUs switch away from their > lazy tlb mms. ^^ False positives: CPU is in the mm_cpumask, but is not using the mm as a lazy tlb. So there can be bits left and never freed. If you closed the false positives, you're back to a shared mm cache line on lazy mm context switches. Thanks, Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-14 6:31 ` Nicholas Piggin @ 2020-07-14 12:46 ` Andy Lutomirski 2020-07-14 13:23 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 59+ messages in thread From: Andy Lutomirski @ 2020-07-14 12:46 UTC (permalink / raw) To: Nicholas Piggin Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML > On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote: > > Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm: >> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am: >>> >>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote: >>>> >>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am: >>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote: >>>>>> >>>>>> On big systems, the mm refcount can become highly contented when doing >>>>>> a lot of context switching with threaded applications (particularly >>>>>> switching between the idle thread and an application thread). >>>>>> >>>>>> Abandoning lazy tlb slows switching down quite a bit in the important >>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme >>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down >>>>>> any remaining lazy ones. >>>>>> >>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark >>>>>> with as many software threads as CPUs (so each switch will go in and >>>>>> out of idle), upstream can achieve a rate of about 1 million context >>>>>> switches per second. After this patch it goes up to 118 million. >>>>>> >>>>> >>>>> I read the patch a couple of times, and I have a suggestion that could >>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of >>>>> refcount. 
You're saying "hey, this mm has no more references, but it >>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down >>>>> those references too." I'm wondering whether you actually need the >>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount >>>>> for real? Roughly, in __mmdrop(), you would only free the page tables >>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from >>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if >>>>> you just removed the last bit from mm_cpumask and potentially free the >>>>> mm. >>>>> >>>>> Getting the locking right here could be a bit tricky -- you need to >>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they >>>>> should free the mm, and you also need to avoid an mm with mm_users >>>>> hitting zero concurrently with the last remote CPU using it lazily >>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count >>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the >>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful >>>>> cmpxchg or dec_return to make sure that only one CPU frees it. >>>>> >>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to >>>>> only ever take the lock after mm_users goes to zero. >>>> >>>> I don't think it's nonsense, it could be a good way to avoid IPIs. >>>> >>>> I haven't seen much problem here that made me too concerned about IPIs >>>> yet, so I think the simple patch may be good enough to start with >>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the >>>> unlazying with the exit TLB flush without doing anything fancy with >>>> ref counting, but we'll see. >>> >>> I would be cautious with benchmarking here. 
I would expect that the >>> nasty cases may affect power consumption more than performance — the >>> specific issue is IPIs hitting idle cores, and the main effects are to >>> slow down exit() a bit but also to kick the idle core out of idle. >>> Although, if the idle core is in a deep sleep, that IPI could be >>> *very* slow. >> >> It will tend to be self-limiting to some degree (deeper idle cores >> would tend to have less chance of IPI) but we have bigger issues on >> powerpc with that, like broadcast IPIs to the mm cpumask for THP >> management. Power hasn't really shown up as an issue but powerpc >> CPUs may have their own requirements and issues there, shall we say. >> >>> So I think it’s worth at least giving this a try. >> >> To be clear it's not a complete solution itself. The problem is of >> course that mm cpumask gives you false negatives, so the bits >> won't always clean up after themselves as CPUs switch away from their >> lazy tlb mms. > > ^^ > > False positives: CPU is in the mm_cpumask, but is not using the mm > as a lazy tlb. So there can be bits left and never freed. > > If you closed the false positives, you're back to a shared mm cache > line on lazy mm context switches. x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :) Can you share your benchmark? > > Thanks, > Nick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option 2020-07-14 12:46 ` Andy Lutomirski @ 2020-07-14 13:23 ` Peter Zijlstra 2020-07-16 2:26 ` Nicholas Piggin 2020-07-16 2:35 ` Nicholas Piggin 2 siblings, 0 replies; 59+ messages in thread From: Peter Zijlstra @ 2020-07-14 13:23 UTC (permalink / raw) To: Andy Lutomirski Cc: Nicholas Piggin, Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, X86 ML On Tue, Jul 14, 2020 at 05:46:05AM -0700, Andy Lutomirski wrote: > x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :) I've seen patches for a 'sparse' bitmap to solve related problems. It's basically the same code, except it multiplies everything (size, bit-nr) by a constant to reduce the number of active bits per line. This sadly doesn't take topology into account, but reducing contention is still good ofcourse. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  2020-07-14 12:46 ` Andy Lutomirski
  2020-07-14 13:23 ` Peter Zijlstra
@ 2020-07-16  2:26 ` Nicholas Piggin
  2020-07-16  2:35 ` Nicholas Piggin
  2 siblings, 0 replies; 59+ messages in thread
From: Nicholas Piggin @ 2020-07-16 2:26 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM,
    linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML

Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:
>
>> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>>>
>>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>
>>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>>>>
>>>>>>> On big systems, the mm refcount can become highly contended when doing
>>>>>>> a lot of context switching with threaded applications (particularly
>>>>>>> switching between the idle thread and an application thread).
>>>>>>>
>>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>>>> any remaining lazy ones.
>>>>>>>
>>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>>
>>>>>> I read the patch a couple of times, and I have a suggestion that could
>>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>>>> still has a nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>>>> those references too." I'm wondering whether you actually need the
>>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>>>> mm.
>>>>>>
>>>>>> Getting the locking right here could be a bit tricky -- you need to
>>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count
>>>>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>>>>> mm" and mm_count == 0 mean "now it's dead", and using some careful
>>>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>>>
>>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>>>> only ever take the lock after mm_users goes to zero.
>>>>>
>>>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>>>
>>>>> I haven't seen much of a problem here that made me too concerned about
>>>>> IPIs yet, so I think the simple patch may be good enough to start with
>>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>>>> unlazying with the exit TLB flush without doing anything fancy with
>>>>> refcounting, but we'll see.
>>>>
>>>> I would be cautious with benchmarking here. I would expect that the
>>>> nasty cases may affect power consumption more than performance — the
>>>> specific issue is IPIs hitting idle cores, and the main effects are to
>>>> slow down exit() a bit but also to kick the idle core out of idle.
>>>> Although, if the idle core is in a deep sleep, that IPI could be
>>>> *very* slow.
>>>
>>> It will tend to be self-limiting to some degree (deeper idle cores
>>> would tend to have less chance of an IPI), but we have bigger issues on
>>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>>> management. Power hasn't really shown up as an issue, but powerpc
>>> CPUs may have their own requirements and issues there, shall we say.
>>>
>>>> So I think it’s worth at least giving this a try.
>>>
>>> To be clear, it's not a complete solution itself. The problem is of
>>> course that the mm cpumask gives you false negatives, so the bits
>>> won't always clean up after themselves as CPUs switch away from their
>>> lazy tlb mms.
>>
>> ^^
>>
>> False positives: a CPU is in the mm_cpumask, but is not using the mm
>> as a lazy tlb. So there can be bits left and never freed.
>>
>> If you closed the false positives, you're back to a shared mm cache
>> line on lazy mm context switches.
>
> x86 has this exact problem. At least no more than 64*8 CPUs share the cache line :)
>
> Can you share your benchmark?

It's just context_switch1_threads from will-it-scale, running with 1/2
the number of CPUs, and core affinity on an SMT processor (so each
thread of a switching pair gets spread to its own CPU, and you get the
task->idle->task switching). It's really just about the worst case, so
I wouldn't say it's something to panic about.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 59+ messages in thread
* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  2020-07-14 12:46 ` Andy Lutomirski
  2020-07-14 13:23 ` Peter Zijlstra
  2020-07-16  2:26 ` Nicholas Piggin
@ 2020-07-16  2:35 ` Nicholas Piggin
  2 siblings, 0 replies; 59+ messages in thread
From: Nicholas Piggin @ 2020-07-16 2:35 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM,
    linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML

Excerpts from Andy Lutomirski's message of July 14, 2020 10:46 pm:

[quoted context identical to the previous message trimmed]

> Can you share your benchmark?

Just testing the IPI rates (on a smaller 176-CPU system): a kernel
compile causes about 300 shootdown interrupts (not 300 broadcasts, but
300 total interrupts). And very short-lived fork;exec;exit things like
typical scripting commands typically don't generate any.

So yeah, the really high exit-rate things self-limit pretty well. I
documented the concern and added a few of the possible ways to further
reduce IPIs in the comments, though.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 59+ messages in thread
end of thread, other threads:[~2020-07-21 15:22 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-10  1:56 [RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup, Nicholas Piggin
2020-07-10  1:56 ` [RFC PATCH 1/7] asm-generic: add generic MMU versions of mmu context functions Nicholas Piggin
2020-07-10  1:56 ` [RFC PATCH 2/7] arch: use asm-generic mmu context for no-op implementations Nicholas Piggin
2020-07-10  1:56 ` [RFC PATCH 3/7] mm: introduce exit_lazy_tlb Nicholas Piggin
2020-07-10  1:56 ` [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode Nicholas Piggin
2020-07-10  9:42   ` Peter Zijlstra
2020-07-10 14:02   ` Mathieu Desnoyers
2020-07-10 17:04   ` Andy Lutomirski
2020-07-13  4:45   ` Nicholas Piggin
2020-07-13 13:47   ` Nicholas Piggin
2020-07-13 14:13   ` Mathieu Desnoyers
2020-07-13 15:48   ` Andy Lutomirski
2020-07-13 16:37   ` Nicholas Piggin
2020-07-16  4:15   ` Nicholas Piggin
2020-07-16  4:42   ` Nicholas Piggin
2020-07-16 15:46   ` Mathieu Desnoyers
2020-07-16 16:03   ` Mathieu Desnoyers
2020-07-16 18:58   ` Mathieu Desnoyers
2020-07-16 21:24   ` Alan Stern
2020-07-17 13:39   ` Mathieu Desnoyers
2020-07-17 14:51   ` Alan Stern
2020-07-17 15:39   ` Mathieu Desnoyers
2020-07-17 16:11   ` Alan Stern
2020-07-17 16:22   ` Mathieu Desnoyers
2020-07-17 17:44   ` Alan Stern
2020-07-17 17:52   ` Mathieu Desnoyers
2020-07-17  0:00   ` Nicholas Piggin
2020-07-16  5:18   ` Andy Lutomirski
2020-07-16  6:06   ` Nicholas Piggin
2020-07-16  8:50   ` Peter Zijlstra
2020-07-16 10:03   ` Nicholas Piggin
2020-07-16 11:00   ` peterz
2020-07-16 15:34   ` Mathieu Desnoyers
2020-07-16 23:26   ` Nicholas Piggin
2020-07-17 13:42   ` Mathieu Desnoyers
2020-07-20  3:03   ` Nicholas Piggin
2020-07-20 16:46   ` Mathieu Desnoyers
2020-07-21 10:04   ` Nicholas Piggin
2020-07-21 13:11   ` Mathieu Desnoyers
2020-07-21 14:30   ` Nicholas Piggin
2020-07-21 15:06   ` peterz
2020-07-21 15:15   ` Mathieu Desnoyers
2020-07-21 15:19   ` Peter Zijlstra
2020-07-21 15:22   ` Mathieu Desnoyers
2020-07-10  1:56 ` [RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions Nicholas Piggin
2020-07-10  9:48   ` Peter Zijlstra
2020-07-10  1:56 ` [RFC PATCH 6/7] lazy tlb: allow lazy tlb mm switching to be configurable Nicholas Piggin
2020-07-10  1:56 ` [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Nicholas Piggin
2020-07-10  9:35   ` Peter Zijlstra
2020-07-13  4:58   ` Nicholas Piggin
2020-07-13 15:59   ` Andy Lutomirski
2020-07-13 16:48   ` Nicholas Piggin
2020-07-13 18:18   ` Andy Lutomirski
2020-07-14  5:04   ` Nicholas Piggin
2020-07-14  6:31   ` Nicholas Piggin
2020-07-14 12:46   ` Andy Lutomirski
2020-07-14 13:23   ` Peter Zijlstra
2020-07-16  2:26   ` Nicholas Piggin
2020-07-16  2:35   ` Nicholas Piggin