Folks! While working on various preempt count related things, I stumbled (again) over the inconsistency of our preempt count handling. The handling of preempt_count() is inconsistent accross kernel configurations. On kernels which have PREEMPT_COUNT=n preempt_disable/enable() and the lock/unlock functions are not affecting the preempt count, only local_bh_disable/enable() and _bh variants of locking, soft interrupt delivery, hard interrupt and NMI context affect it. It's therefore impossible to have a consistent set of checks which provide information about the context in which a function is called. In many cases it makes sense to have seperate functions for seperate contexts, but there are valid reasons to avoid that and handle different calling contexts conditionally. The lack of such indicators which work on all kernel configuratios is a constant source of trouble because developers either do not understand the implications or try to work around this inconsistency in weird ways. Neither seem these issues be catched by reviewers and testing. Recently merged code does: gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC; Looks obviously correct, except for the fact that preemptible() is unconditionally false for CONFIF_PREEMPT_COUNT=n, i.e. all allocations in that code use GFP_ATOMIC on such kernels. Attempts to make preempt count unconditional and consistent have been rejected in the past with handwaving performance arguments. Freshly conducted benchmarks did not reveal any measurable impact from enabling preempt count unconditionally. On kernels with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY the preempt count is only incremented and decremented but the result of the decrement is not tested. Contrary to that enabling CONFIG_PREEMPT which tests the result has a small but measurable impact due to the conditional branch/call. It's about time to make essential functionality of the kernel consistent accross the various preemption models. The series is also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git preempt That's the first part of a larger effort related to preempt count: 1) The analysis of the usage sites of in_interrupt(), in_atomic(), in_softirq() is still ongoing, but so far the number of buggy users is clearly the vast majority. There will be seperate patch series (currently 46 and counting) to address these issues once the analysis is complete in the next days. 2) The long discussed state tracking of local irq disable in preempt count which accounts interrupt disabled sections as atomic and avoids issuing costly instructions (sti, cli, popf or their non X86 counterparts) when the state does not change, i.e. nested irq_save() or irq_restore(). I have this working on X86 already and contrary to my earlier attempts this was reasonably straight forward due to the recent entry/exit code consolidation. What I've not done yet is to optimize the preempt count handling of the [un]lock_irq* operations so they handle the interrupt disabled state and the preempt count modification in one go. That's an obvious add on, but correctness first ... 3) Lazy interrupt disabling as a straight forward extension to #2. This avoids the actual disabling at the CPU level completely and catches an incoming interrupt in the low level entry code, modifies the interrupt disabled state on the return stack, notes the interrupt as pending in software and raises it again when interrupts are reenabled. This has still a few issues which I'm hunting down (cpuidle is unhappy ...) Thanks, tglx --- arch/arm/include/asm/assembler.h | 11 -- arch/arm/kernel/iwmmxt.S | 2 arch/arm/mach-ep93xx/crunch-bits.S | 2 arch/xtensa/kernel/entry.S | 2 drivers/gpu/drm/i915/Kconfig.debug | 1 drivers/gpu/drm/i915/i915_utils.h | 3 include/linux/bit_spinlock.h | 4 - include/linux/lockdep.h | 6 - include/linux/pagemap.h | 4 - include/linux/preempt.h | 37 +--------- include/linux/uaccess.h | 6 - kernel/Kconfig.preempt | 4 - kernel/sched/core.c | 6 - lib/Kconfig.debug | 3 lib/Kconfig.debug.rej | 14 +-- tools/testing/selftests/rcutorture/configs/rcu/SRCU-t | 1 tools/testing/selftests/rcutorture/configs/rcu/SRCU-u | 1 tools/testing/selftests/rcutorture/configs/rcu/TINY01 | 1 tools/testing/selftests/rcutorture/doc/TINY_RCU.txt | 5 - tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1 tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h | 1 21 files changed, 23 insertions(+), 92 deletions(-) _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
ARCH_NO_PREEMPT disables the selection of CONFIG_PREEMPT_VOLUNTARY and CONFIG_PREEMPT, but architectures which set this config option still support preempt count for hard and softirq accounting. There is absolutely no reason to prevent lockdep from using the preempt counter nor is there a reason to prevent the enablement of CONFIG_DEBUG_ATOMIC_SLEEP on such architectures. Remove the dependencies, which affects ALPHA, HEXAGON, M68K and UM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: linux-alpha@vger.kernel.org Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: linux-um@lists.infradead.org Cc: Brian Cain <bcain@codeaurora.org> Cc: linux-hexagon@vger.kernel.org Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-m68k@lists.linux-m68k.org --- lib/Kconfig.debug | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1161,7 +1161,7 @@ config PROVE_LOCKING select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC - select PREEMPT_COUNT if !ARCH_NO_PREEMPT + select PREEMPT_COUNT select TRACE_IRQFLAGS default n help @@ -1323,7 +1323,6 @@ config DEBUG_ATOMIC_SLEEP bool "Sleep inside atomic section checking" select PREEMPT_COUNT depends on DEBUG_KERNEL - depends on !ARCH_NO_PREEMPT help If you say Y here, various routines which may sleep will become very noisy if they are called inside atomic sections: when a spinlock is _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
The handling of preempt_count() is inconsistent accross kernel configurations. On kernels which have PREEMPT_COUNT=n preempt_disable/enable() and the lock/unlock functions are not affecting the preempt count, only local_bh_disable/enable() and _bh variants of locking, soft interrupt delivery, hard interrupt and NMI context affect it. It's therefore impossible to have a consistent set of checks which provide information about the context in which a function is called. In many cases it makes sense to have seperate functions for seperate contexts, but there are valid reasons to avoid that and handle different calling contexts conditionally. The lack of such indicators which work on all kernel configuratios is a constant source of trouble because developers either do not understand the implications or try to work around this inconsistency in weird ways. Neither seem these issues be catched by reviewers and testing. Recently merged code does: gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC; Looks obviously correct, except for the fact that preemptible() is unconditionally false for CONFIF_PREEMPT_COUNT=n, i.e. all allocations in that code use GFP_ATOMIC on such kernels. Attempts to make preempt count unconditional and consistent have been rejected in the past with handwaving performance arguments. Freshly conducted benchmarks did not reveal any measurable impact from enabling preempt count unconditionally. On kernels with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY the preempt count is only incremented and decremented but the result of the decrement is not tested. Contrary to that enabling CONFIG_PREEMPT which tests the result has a small but measurable impact due to the conditional branch/call. It's about time to make essential functionality of the kernel consistent accross the various preemption models. Enable CONFIG_PREEMPT_COUNT unconditionally. Follow up changes will remove the #ifdeffery and remove the config option at the end. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/Kconfig.preempt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -75,8 +75,7 @@ config PREEMPT_RT endchoice config PREEMPT_COUNT - bool + def_bool y config PREEMPTION bool - select PREEMPT_COUNT _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> --- include/linux/preempt.h | 37 ++++--------------------------------- 1 file changed, 4 insertions(+), 33 deletions(-) --- a/include/linux/preempt.h +++ b/include/linux/preempt.h @@ -56,8 +56,7 @@ #define PREEMPT_DISABLED (PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED) /* - * Disable preemption until the scheduler is running -- use an unconditional - * value so that it also works on !PREEMPT_COUNT kernels. + * Disable preemption until the scheduler is running. * * Reset by start_kernel()->sched_init()->init_idle()->init_idle_preempt_count(). */ @@ -69,7 +68,6 @@ * * preempt_count() == 2*PREEMPT_DISABLE_OFFSET * - * Note: PREEMPT_DISABLE_OFFSET is 0 for !PREEMPT_COUNT kernels. * Note: See finish_task_switch(). */ #define FORK_PREEMPT_COUNT (2*PREEMPT_DISABLE_OFFSET + PREEMPT_ENABLED) @@ -106,11 +104,7 @@ /* * The preempt_count offset after preempt_disable(); */ -#if defined(CONFIG_PREEMPT_COUNT) -# define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET -#else -# define PREEMPT_DISABLE_OFFSET 0 -#endif +#define PREEMPT_DISABLE_OFFSET PREEMPT_OFFSET /* * The preempt_count offset after spin_lock() @@ -122,8 +116,8 @@ * * spin_lock_bh() * - * Which need to disable both preemption (CONFIG_PREEMPT_COUNT) and - * softirqs, such that unlock sequences of: + * Which need to disable both preemption and softirqs, such that unlock + * sequences of: * * spin_unlock(); * local_bh_enable(); @@ -164,8 +158,6 @@ extern void preempt_count_sub(int val); #define preempt_count_inc() preempt_count_add(1) #define preempt_count_dec() preempt_count_sub(1) -#ifdef CONFIG_PREEMPT_COUNT - #define preempt_disable() \ do { \ preempt_count_inc(); \ @@ -231,27 +223,6 @@ do { \ __preempt_count_dec(); \ } while (0) -#else /* !CONFIG_PREEMPT_COUNT */ - -/* - * Even if we don't have any preemption, we need preempt disable/enable - * to be barriers, so that we don't have things like get_user/put_user - * that can cause faults and scheduling migrate into our preempt-protected - * region. - */ -#define preempt_disable() barrier() -#define sched_preempt_enable_no_resched() barrier() -#define preempt_enable_no_resched() barrier() -#define preempt_enable() barrier() -#define preempt_check_resched() do { } while (0) - -#define preempt_disable_notrace() barrier() -#define preempt_enable_no_resched_notrace() barrier() -#define preempt_enable_notrace() barrier() -#define preemptible() 0 - -#endif /* CONFIG_PREEMPT_COUNT */ - #ifdef MODULE /* * Modules have no business playing preemption tricks. _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Will Deacon <will@kernel.org> --- include/linux/lockdep.h | 6 ++---- lib/Kconfig.debug | 1 - 2 files changed, 2 insertions(+), 5 deletions(-) --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -585,16 +585,14 @@ do { \ #define lockdep_assert_preemption_enabled() \ do { \ - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ - debug_locks && \ + WARN_ON_ONCE(debug_locks && \ (preempt_count() != 0 || \ !raw_cpu_read(hardirqs_enabled))); \ } while (0) #define lockdep_assert_preemption_disabled() \ do { \ - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ - debug_locks && \ + WARN_ON_ONCE(debug_locks && \ (preempt_count() == 0 && \ raw_cpu_read(hardirqs_enabled))); \ } while (0) --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1161,7 +1161,6 @@ config PROVE_LOCKING select DEBUG_RWSEMS select DEBUG_WW_MUTEX_SLOWPATH select DEBUG_LOCK_ALLOC - select PREEMPT_COUNT select TRACE_IRQFLAGS default n help _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org --- include/linux/pagemap.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -168,9 +168,7 @@ void release_pages(struct page **pages, static inline int __page_cache_add_speculative(struct page *page, int count) { #ifdef CONFIG_TINY_RCU -# ifdef CONFIG_PREEMPT_COUNT - VM_BUG_ON(!in_atomic() && !irqs_disabled()); -# endif + VM_BUG_ON(preemptible()) /* * Preempt must be disabled here - we rely on rcu_read_lock doing * this for us. _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/bit_spinlock.h | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) --- a/include/linux/bit_spinlock.h +++ b/include/linux/bit_spinlock.h @@ -90,10 +90,8 @@ static inline int bit_spin_is_locked(int { #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) return test_bit(bitnum, addr); -#elif defined CONFIG_PREEMPT_COUNT - return preempt_count(); #else - return 1; + return preempt_count(); #endif } _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/linux/uaccess.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -230,9 +230,9 @@ static inline bool pagefault_disabled(vo * * This function should only be used by the fault handlers. Other users should * stick to pagefault_disabled(). - * Please NEVER use preempt_disable() to disable the fault handler. With - * !CONFIG_PREEMPT_COUNT, this is like a NOP. So the handler won't be disabled. - * in_atomic() will report different values based on !CONFIG_PREEMPT_COUNT. + * + * Please NEVER use preempt_disable() or local_irq_disable() to disable the + * fault handler. */ #define faulthandler_disabled() (pagefault_disabled() || in_atomic()) _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> --- kernel/sched/core.c | 6 +----- lib/Kconfig.debug | 1 - 2 files changed, 1 insertion(+), 6 deletions(-) --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3706,8 +3706,7 @@ asmlinkage __visible void schedule_tail( * finish_task_switch() for details. * * finish_task_switch() will drop rq->lock() and lower preempt_count - * and the preempt_enable() will end up enabling preemption (on - * PREEMPT_COUNT kernels). + * and the preempt_enable() will end up enabling preemption. */ rq = finish_task_switch(prev); @@ -7311,9 +7310,6 @@ void __cant_sleep(const char *file, int if (irqs_disabled()) return; - if (!IS_ENABLED(CONFIG_PREEMPT_COUNT)) - return; - if (preempt_count() > preempt_offset) return; --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1320,7 +1320,6 @@ config DEBUG_LOCKDEP config DEBUG_ATOMIC_SLEEP bool "Sleep inside atomic section checking" - select PREEMPT_COUNT depends on DEBUG_KERNEL help If you say Y here, various routines which may sleep will become very _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <linux@armlinux.org.uk> Cc: linux-arm-kernel@lists.infradead.org --- arch/arm/include/asm/assembler.h | 11 ----------- arch/arm/kernel/iwmmxt.S | 2 -- arch/arm/mach-ep93xx/crunch-bits.S | 2 -- 3 files changed, 15 deletions(-) --- a/arch/arm/include/asm/assembler.h +++ b/arch/arm/include/asm/assembler.h @@ -212,7 +212,6 @@ /* * Increment/decrement the preempt count. */ -#ifdef CONFIG_PREEMPT_COUNT .macro inc_preempt_count, ti, tmp ldr \tmp, [\ti, #TI_PREEMPT] @ get preempt count add \tmp, \tmp, #1 @ increment it @@ -229,16 +228,6 @@ get_thread_info \ti dec_preempt_count \ti, \tmp .endm -#else - .macro inc_preempt_count, ti, tmp - .endm - - .macro dec_preempt_count, ti, tmp - .endm - - .macro dec_preempt_count_ti, ti, tmp - .endm -#endif #define USERL(l, x...) \ 9999: x; \ --- a/arch/arm/kernel/iwmmxt.S +++ b/arch/arm/kernel/iwmmxt.S @@ -94,9 +94,7 @@ ENTRY(iwmmxt_task_enable) mov r2, r2 @ cpwait bl concan_save -#ifdef CONFIG_PREEMPT_COUNT get_thread_info r10 -#endif 4: dec_preempt_count r10, r3 ret r9 @ normal exit from exception --- a/arch/arm/mach-ep93xx/crunch-bits.S +++ b/arch/arm/mach-ep93xx/crunch-bits.S @@ -191,9 +191,7 @@ ENTRY(crunch_task_enable) cfldr64 mvdx15, [r0, #CRUNCH_MVDX15] 1: -#ifdef CONFIG_PREEMPT_COUNT get_thread_info r10 -#endif 2: dec_preempt_count r10, r3 ret lr _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: linux-xtensa@linux-xtensa.org --- arch/xtensa/kernel/entry.S | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/xtensa/kernel/entry.S +++ b/arch/xtensa/kernel/entry.S @@ -819,7 +819,7 @@ ENTRY(debug_exception) * preemption if we have HW breakpoints to preserve DEBUGCAUSE.DBNUM * meaning. */ -#if defined(CONFIG_PREEMPT_COUNT) && defined(CONFIG_HAVE_HW_BREAKPOINT) +#ifdef CONFIG_HAVE_HW_BREAKPOINT GET_THREAD_INFO(a2, a1) l32i a3, a2, TI_PRE_COUNT addi a3, a3, 1 _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: intel-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org --- drivers/gpu/drm/i915/Kconfig.debug | 1 - drivers/gpu/drm/i915/i915_utils.h | 3 +-- 2 files changed, 1 insertion(+), 3 deletions(-) --- a/drivers/gpu/drm/i915/Kconfig.debug +++ b/drivers/gpu/drm/i915/Kconfig.debug @@ -20,7 +20,6 @@ config DRM_I915_DEBUG bool "Enable additional driver debugging" depends on DRM_I915 select DEBUG_FS - select PREEMPT_COUNT select I2C_CHARDEV select STACKDEPOT select DRM_DP_AUX_CHARDEV --- a/drivers/gpu/drm/i915/i915_utils.h +++ b/drivers/gpu/drm/i915/i915_utils.h @@ -337,8 +337,7 @@ wait_remaining_ms_from_jiffies(unsigned (Wmax)) #define wait_for(COND, MS) _wait_for((COND), (MS) * 1000, 10, 1000) -/* If CONFIG_PREEMPT_COUNT is disabled, in_atomic() always reports false. */ -#if defined(CONFIG_DRM_I915_DEBUG) && defined(CONFIG_PREEMPT_COUNT) +#ifdef CONFIG_DRM_I915_DEBUG # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) WARN_ON_ONCE((ATOMIC) && !in_atomic()) #else # define _WAIT_FOR_ATOMIC_CHECK(ATOMIC) do { } while (0) _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be removed. Cleanup the leftovers before doing so. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Cc: rcu@vger.kernel.org Cc: linux-kselftest@vger.kernel.org --- tools/testing/selftests/rcutorture/configs/rcu/SRCU-t | 1 - tools/testing/selftests/rcutorture/configs/rcu/SRCU-u | 1 - tools/testing/selftests/rcutorture/configs/rcu/TINY01 | 1 - tools/testing/selftests/rcutorture/doc/TINY_RCU.txt | 5 ++--- tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt | 1 - tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h | 1 - 6 files changed, 2 insertions(+), 8 deletions(-) --- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-t @@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n CONFIG_DEBUG_ATOMIC_SLEEP=y -#CHECK#CONFIG_PREEMPT_COUNT=y --- a/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u +++ b/tools/testing/selftests/rcutorture/configs/rcu/SRCU-u @@ -7,4 +7,3 @@ CONFIG_RCU_TRACE=n CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_PREEMPT_COUNT=n --- a/tools/testing/selftests/rcutorture/configs/rcu/TINY01 +++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY01 @@ -10,4 +10,3 @@ CONFIG_RCU_TRACE=n #CHECK#CONFIG_RCU_STALL_COMMON=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_DEBUG_OBJECTS_RCU_HEAD=n -CONFIG_PREEMPT_COUNT=n --- a/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt +++ b/tools/testing/selftests/rcutorture/doc/TINY_RCU.txt @@ -3,11 +3,10 @@ This document gives a brief rationale fo Kconfig Parameters: -CONFIG_DEBUG_LOCK_ALLOC -- Do all three and none of the three. -CONFIG_PREEMPT_COUNT +CONFIG_DEBUG_LOCK_ALLOC -- Do both and none of the two. CONFIG_RCU_TRACE -The theory here is that randconfig testing will hit the other six possible +The theory here is that randconfig testing will hit the other two possible combinations of these parameters. --- a/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt +++ b/tools/testing/selftests/rcutorture/doc/TREE_RCU-kconfig.txt @@ -43,7 +43,6 @@ CONFIG_64BIT Used only to check CONFIG_RCU_FANOUT value, inspection suffices. -CONFIG_PREEMPT_COUNT CONFIG_PREEMPT_RCU Redundant with CONFIG_PREEMPT, ignore. --- a/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h +++ b/tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/config.h @@ -8,7 +8,6 @@ #undef CONFIG_HOTPLUG_CPU #undef CONFIG_MODULES #undef CONFIG_NO_HZ_FULL_SYSIDLE -#undef CONFIG_PREEMPT_COUNT #undef CONFIG_PREEMPT_RCU #undef CONFIG_PROVE_RCU #undef CONFIG_RCU_NOCB_CPU _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
All conditionals and irritations are gone. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/Kconfig.preempt | 3 --- 1 file changed, 3 deletions(-) --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -74,8 +74,5 @@ config PREEMPT_RT endchoice -config PREEMPT_COUNT - def_bool y - config PREEMPTION bool _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, 14 Sep 2020 22:42:09 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > 21 files changed, 23 insertions(+), 92 deletions(-) This alone makes it look promising, and hopefully acceptable by Linus :-) -- Steve _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Recently merged code does: > > gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC; > > Looks obviously correct, except for the fact that preemptible() is > unconditionally false for CONFIF_PREEMPT_COUNT=n, i.e. all allocations in > that code use GFP_ATOMIC on such kernels. I don't think this is a good reason to entirely get rid of the no-preempt thing. The above is just garbage. It's bogus. You can't do it. Blaming the no-preempt code for this bug is extremely unfair, imho. And the no-preempt code does help make for much better code generation for simple spinlocks. Where is that horribly buggy recent code? It's not in that exact format, certainly, since 'grep' doesn't find it. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14 2020 at 13:59, Linus Torvalds wrote: > On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> Recently merged code does: >> >> gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC; >> >> Looks obviously correct, except for the fact that preemptible() is >> unconditionally false for CONFIF_PREEMPT_COUNT=n, i.e. all allocations in >> that code use GFP_ATOMIC on such kernels. > > I don't think this is a good reason to entirely get rid of the > no-preempt thing. I did not say that this is a good reason. It just illustrates the problem. > The above is just garbage. It's bogus. You can't do it. > > Blaming the no-preempt code for this bug is extremely unfair, imho. I'm not blaming the no-preempt code. I'm blaming inconsistency and there is no real good argument for inconsistent behaviour, TBH. > And the no-preempt code does help make for much better code generation > for simple spinlocks. Yes it does generate better code, but I tried hard to spot a difference in various metrics exposed by perf. It's all in the noise and I only can spot a difference when the actual preemption check after the decrement which still depends on CONFIG_PREEMPT is in place, but that's not the case for PREEMPT_NONE or PREEMPT_VOLUNTARY kernels where the decrement is just a decrement w/o any conditional behind it. > Where is that horribly buggy recent code? It's not in that exact > format, certainly, since 'grep' doesn't find it. Bah, that was stuff in next which got dropped again. But just look at any check which uses preemptible(), especially those which check !preemptible(): In the X86 #GP handler we have: /* * To be potentially processing a kprobe fault and to trust the result * from kprobe_running(), we have to be non-preemptible. */ if (!preemptible() && kprobe_running() && kprobe_fault_handler(regs, X86_TRAP_GP)) goto exit; and a similar check in the S390 code in kprobe_exceptions_notify(). That all magically 'works' because that code might have been actually tested with lockdep enabled which enforces PREEMPT_COUNT... The SG code has some interesting usage as well: if (miter->__flags & SG_MITER_ATOMIC) { WARN_ON_ONCE(preemptible()); kunmap_atomic(miter->addr); How is that WARN_ON_ONCE() supposed to catch anything? Especially as calling code does: flags = SG_MITER_TO_SG; if (!preemptible()) flags |= SG_MITER_ATOMIC; which is equally useless on kernels which have PREEMPT_COUNT=n. There are bugs which are related to in_atomic() or other in_***() usage all over the place as well. Inconsistency at the core level is a clear recipe for disaster and at some point we have to bite the bullet and accept that consistency is more important than the non measurable 3 cycles? Thanks, tglx _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 2:55 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > Yes it does generate better code, but I tried hard to spot a difference > in various metrics exposed by perf. It's all in the noise and I only > can spot a difference when the actual preemption check after the > decrement I'm somewhat more worried about the small-device case. That said, the diffstat certainly has its very clear charm, and I do agree that it makes things simpler. I'm just not convinced people should ever EVER do things like that "if (preemptible())" garbage. It sounds like somebody is doing seriously bad things. The chacha20poly1305 code does look fundamentally broken. But no, the fix is not to make "preemptible" work with spinlocks, the fix is to not *do* that kind of broken things. Those things would be broken even if you changed the semantics of preemptible. There's no way that it's valid to say "use this debug flag to decide if we should do atomic allocations or not". It smells like "I got a debug failure, so I'm papering it over by checking the thing the debug code checks for". The debug check is to catch the obvious bad cases. It's not the _only_ bad cases, so copying the debug check test is just completely wrong. Ard and Herbert added to participants: see chacha20poly1305_crypt_sg_inplace(), which does flags = SG_MITER_TO_SG; if (!preemptible()) flags |= SG_MITER_ATOMIC; introduced in commit d95312a3ccc0 ("crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine"). You *fundamentally* cannot do that. Seriously. It's completely wrong. Pick one or the other, or have the caller *tell* you what the context is. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 3:24 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ard and Herbert added to participants: see > chacha20poly1305_crypt_sg_inplace(), which does > > flags = SG_MITER_TO_SG; > if (!preemptible()) > flags |= SG_MITER_ATOMIC; > > introduced in commit d95312a3ccc0 ("crypto: lib/chacha20poly1305 - > reimplement crypt_from_sg() routine"). As far as I can tell, the only reason for this all is to try to use "kmap()" rather than "kmap_atomic()". And kmap() actually has the much more complex "might_sleep()" tests, and apparently the "preemptible()" check wasn't even the proper full debug check, it was just a complete hack to catch the one that triggered. From a quick look, that code should probably just get rid of SG_MITER_ATOMIC entirely, and alwayse use kmap_atomic(). kmap_atomic() is actually the faster and proper interface to use anyway (never mind that any of this matters on any sane hardware). The old kmap() and kunmap() interfaces should generally be avoided like the plague - yes, they allow sleeping in the middle and that is sometimes required, but if you don't need that, you should never ever use them. We used to have a very nasty kmap_atomic() that required people to be very careful and know exactly which atomic entry to use, and that was admitedly quite nasty. So it _looks_ like this code started using kmap() - probably back when kmap_atomic() was so cumbersome to use - and was then converted (conditionally) to kmap_atomic() rather than just changed whole-sale. Is there actually something that wants to use those sg_miter functions and sleep? Because if there is, that choice should come from the outside, not from inside lib/scatterlist.c trying to make some bad guess based on the wrong thing entirely. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 03:37:49PM -0700, Linus Torvalds wrote: > > So it _looks_ like this code started using kmap() - probably back when > kmap_atomic() was so cumbersome to use - and was then converted > (conditionally) to kmap_atomic() rather than just changed whole-sale. > Is there actually something that wants to use those sg_miter functions > and sleep? I dug up the old zinc patch submissions and this wasn't present at all in the original. The original zinc code used blkcipher_walk which unconditinoally does kmap_atomic. So it's only the SG miter conversion that introduced this change, which appears to be a simple oversight (I think Ard was working on latency issues at that time, perhaps he was worried about keeping preemption off unnecessarily). ---8<--- There is no reason for the chacha20poly1305 SG miter code to use kmap instead of kmap_atomic as the critical section doesn't sleep anyway. So we can simply get rid of the preemptible check and set SG_MITER_ATOMIC unconditionally. Even if we need to reenable preemption to lower latency we should be doing that by interrupting the SG miter walk rather than using kmap. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c index 431e04280332..5850f3b87359 100644 --- a/lib/crypto/chacha20poly1305.c +++ b/lib/crypto/chacha20poly1305.c @@ -251,9 +251,7 @@ bool chacha20poly1305_crypt_sg_inplace(struct scatterlist *src, poly1305_update(&poly1305_state, pad0, 0x10 - (ad_len & 0xf)); } - flags = SG_MITER_TO_SG; - if (!preemptible()) - flags |= SG_MITER_ATOMIC; + flags = SG_MITER_TO_SG | SG_MITER_ATOMIC; sg_miter_start(&miter, src, sg_nents(src), flags); -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, 15 Sep 2020 at 01:43, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, Sep 14, 2020 at 3:24 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Ard and Herbert added to participants: see > > chacha20poly1305_crypt_sg_inplace(), which does > > > > flags = SG_MITER_TO_SG; > > if (!preemptible()) > > flags |= SG_MITER_ATOMIC; > > > > introduced in commit d95312a3ccc0 ("crypto: lib/chacha20poly1305 - > > reimplement crypt_from_sg() routine"). > > As far as I can tell, the only reason for this all is to try to use > "kmap()" rather than "kmap_atomic()". > > And kmap() actually has the much more complex "might_sleep()" tests, > and apparently the "preemptible()" check wasn't even the proper full > debug check, it was just a complete hack to catch the one that > triggered. > This was not driven by a failing check. The documentation of kmap_atomic() states the following: * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap * gives a more generic (and caching) interface. But kmap_atomic can * be used in IRQ contexts, so in some (very limited) cases we need * it. so if this is no longer accurate, perhaps we should fix it? But another reason I tried to avoid kmap_atomic() is that it disables preemption unconditionally, even on 64-bit architectures where HIGHMEM is irrelevant. So using kmap_atomic() here means that the bulk of WireGuard packet encryption runs with preemption disabled, essentially for legacy reasons. > From a quick look, that code should probably just get rid of > SG_MITER_ATOMIC entirely, and alwayse use kmap_atomic(). > > kmap_atomic() is actually the faster and proper interface to use > anyway (never mind that any of this matters on any sane hardware). The > old kmap() and kunmap() interfaces should generally be avoided like > the plague - yes, they allow sleeping in the middle and that is > sometimes required, but if you don't need that, you should never ever > use them. > > We used to have a very nasty kmap_atomic() that required people to be > very careful and know exactly which atomic entry to use, and that was > admitedly quite nasty. > > So it _looks_ like this code started using kmap() - probably back when > kmap_atomic() was so cumbersome to use - and was then converted > (conditionally) to kmap_atomic() rather than just changed whole-sale. > Is there actually something that wants to use those sg_miter functions > and sleep? > > Because if there is, that choice should come from the outside, not > from inside lib/scatterlist.c trying to make some bad guess based on > the wrong thing entirely. > > Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15, 2020 at 09:20:59AM +0300, Ard Biesheuvel wrote: > > The documentation of kmap_atomic() states the following: > > * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap > * gives a more generic (and caching) interface. But kmap_atomic can > * be used in IRQ contexts, so in some (very limited) cases we need > * it. > > so if this is no longer accurate, perhaps we should fix it? This hasn't been accurate for at least ten years :) > But another reason I tried to avoid kmap_atomic() is that it disables > preemption unconditionally, even on 64-bit architectures where HIGHMEM > is irrelevant. So using kmap_atomic() here means that the bulk of > WireGuard packet encryption runs with preemption disabled, essentially > for legacy reasons. Agreed. We should definitely fix that. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 11:24 PM Herbert Xu <herbert@gondor.apana.org.au> wrote: > > On Tue, Sep 15, 2020 at 09:20:59AM +0300, Ard Biesheuvel wrote: > > > > The documentation of kmap_atomic() states the following: > > > > * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap > > * gives a more generic (and caching) interface. But kmap_atomic can > > * be used in IRQ contexts, so in some (very limited) cases we need > > * it. > > > > so if this is no longer accurate, perhaps we should fix it? > > This hasn't been accurate for at least ten years :) Yeah, that used to be true a long long time ago, but the comment is very stale. > > But another reason I tried to avoid kmap_atomic() is that it disables > > preemption unconditionally, even on 64-bit architectures where HIGHMEM > > is irrelevant. So using kmap_atomic() here means that the bulk of > > WireGuard packet encryption runs with preemption disabled, essentially > > for legacy reasons. > > Agreed. We should definitely fix that. Well, honestly, one big reason for that is debugging. The *semantics* of the kmap_atomic() is in the name - you can't sleep in between it and the kunmap_atomic(). On any sane architecture, kmap_atomic() ends up being a no-op from an implementation standpoint, and sleeping would work just fine. But we very much want to make sure that people don't then write code that doesn't work on the bad old 32-bit machines where it really needs that sequence to be safe from preemption. So it's mostly a debug thing. I say "mostly", because there might be small other details too, like shared code, and perhaps even a couple of users out in the wild that depend on the pagefault_disable() inherent in the current kmap_atomic(), who knows.. So no, the preemption disabling isn't inherent in the operation itself. But it does have some argument for it. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 10:42:15PM +0200, Thomas Gleixner wrote: > CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be > removed. Cleanup the leftovers before doing so. > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > --- > include/linux/bit_spinlock.h | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > --- a/include/linux/bit_spinlock.h > +++ b/include/linux/bit_spinlock.h > @@ -90,10 +90,8 @@ static inline int bit_spin_is_locked(int > { > #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) > return test_bit(bitnum, addr); > -#elif defined CONFIG_PREEMPT_COUNT > - return preempt_count(); > #else > - return 1; > + return preempt_count(); > #endif Acked-by: Will Deacon <will@kernel.org> Will _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 10:42:13PM +0200, Thomas Gleixner wrote: > CONFIG_PREEMPT_COUNT is now unconditionally enabled and will be > removed. Cleanup the leftovers before doing so. > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@kernel.org> > Cc: Will Deacon <will@kernel.org> > --- > include/linux/lockdep.h | 6 ++---- > lib/Kconfig.debug | 1 - > 2 files changed, 2 insertions(+), 5 deletions(-) > > --- a/include/linux/lockdep.h > +++ b/include/linux/lockdep.h > @@ -585,16 +585,14 @@ do { \ > > #define lockdep_assert_preemption_enabled() \ > do { \ > - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ > - debug_locks && \ > + WARN_ON_ONCE(debug_locks && \ > (preempt_count() != 0 || \ > !raw_cpu_read(hardirqs_enabled))); \ > } while (0) > > #define lockdep_assert_preemption_disabled() \ > do { \ > - WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_COUNT) && \ > - debug_locks && \ > + WARN_ON_ONCE(debug_locks && \ > (preempt_count() == 0 && \ > raw_cpu_read(hardirqs_enabled))); \ > } while (0) > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -1161,7 +1161,6 @@ config PROVE_LOCKING > select DEBUG_RWSEMS > select DEBUG_WW_MUTEX_SLOWPATH > select DEBUG_LOCK_ALLOC > - select PREEMPT_COUNT > select TRACE_IRQFLAGS > default n > help Acked-by: Will Deacon <will@kernel.org> Will _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 01:59:15PM -0700, Linus Torvalds wrote: > On Mon, Sep 14, 2020 at 1:45 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > Recently merged code does: > > > > gfp = preemptible() ? GFP_KERNEL : GFP_ATOMIC; > > > > Looks obviously correct, except for the fact that preemptible() is > > unconditionally false for CONFIF_PREEMPT_COUNT=n, i.e. all allocations in > > that code use GFP_ATOMIC on such kernels. > > I don't think this is a good reason to entirely get rid of the no-preempt thing. > > The above is just garbage. It's bogus. You can't do it. > > Blaming the no-preempt code for this bug is extremely unfair, imho. > > And the no-preempt code does help make for much better code generation > for simple spinlocks. > > Where is that horribly buggy recent code? It's not in that exact > format, certainly, since 'grep' doesn't find it. It would be convenient for that "gfp =" code to work, as this would allow better cache locality while invoking RCU callbacks, and would further provide better robustness to callback floods. The full story is quite long, but here are alternatives have not yet been proven to be abject failures: 1. Use workqueues to do the allocations in a clean context. While waiting for the allocations, the callbacks are queued in the old cache-busting manner. This functions correctly, but in the meantime (which on busy systems can be some time) the cache locality and robustness are lost. 2. Provide the ability to allocate memory in raw atomic context. This is extremely effective, especially when used in combination with #1 above, but as you might suspect, the MM guys don't like it much. In contrast, with Thomas's patch series, call_rcu() and kvfree_rcu() could just look at preemptible() to see whether or not it was safe to allocate memory, even in !PREEMPT kernels -- and in the common case, it almost always would be safe. It is quite possible that this approach would work in isolation, or failing that, that adding #1 above would do the trick. I understand that this is all very hand-wavy, and I do apologize for that. If you really want the full sad story with performance numbers and the works, let me know! Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15, 2020 at 12:24 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > Alternatively we just make highmem a bit more expensive by making these > maps preemptible. RT is doing this for a long time and it's not that > horrible. Ack. In fact, I've wanted to start just removing kmap support entirely. At some point it's not so much about "I have an old machine that wants HIGHMEM" but about "I have an old CPU, and I'll just run an old kernel". It's not that 32-bit is irrelevant, it's that 32-bit with large amounts of memory is irrelevant. Last time this was discussed, iirc the main issue was some questionable old ARM chips that were still very common in embedded environments, even with large memory. But we could definitely start de-emphasizing HIGHMEM. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > OTOH, having a working 'preemptible()' or maybe better named > 'can_schedule()' check makes tons of sense to make decisions about > allocation modes or other things. No. I think that those kinds of decisions about actual behavior are always simply fundamentally wrong. Note that this is very different from having warnings about invalid use. THAT is correct. It may not warn in all configurations, but that doesn't matter: what matters is that it warns in common enough configurations that developers will catch it. So having a warning in "might_sleep()" that doesn't always trigger, because you have a limited configuration that can't even detect the situation, that's fine and dandy and intentional. But having code like if (can_schedule()) .. do something different .. is fundamentally complete and utter garbage. It's one thing if you test for "am I in hardware interrupt context". Those tests aren't great either, but at least they make sense. But a driver - or some library routine - making a difference based on some nebulous "can I schedule" is fundamentally and basically WRONG. If some code changes behavior, it needs to be explicit to the *caller* of that code. So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) do_something_atomic()" is pure shite. And I am not IN THE LEAST interested in trying to help people doing pure shite. We need to fix them. Like the crypto code is getting fixed. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15 2020 at 10:35, Linus Torvalds wrote: > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> >> OTOH, having a working 'preemptible()' or maybe better named >> 'can_schedule()' check makes tons of sense to make decisions about >> allocation modes or other things. > > No. I think that those kinds of decisions about actual behavior are > always simply fundamentally wrong. > > Note that this is very different from having warnings about invalid > use. THAT is correct. It may not warn in all configurations, but that > doesn't matter: what matters is that it warns in common enough > configurations that developers will catch it. You wish. I just found a 7 year old bug in a 10G network driver which surely would have been found if people would enable debug configs and not just run the crap on their PREEMPT_NONE, all debug off kernel. And that driver is not subject to bitrot, it gets regular bug fixes from people who seem to care (distro folks). > So having a warning in "might_sleep()" that doesn't always trigger, > because you have a limited configuration that can't even detect the > situation, that's fine and dandy and intentional. and lets people get away with their crap. > But having code like > > if (can_schedule()) > .. do something different .. > > is fundamentally complete and utter garbage. > > It's one thing if you test for "am I in hardware interrupt context". > Those tests aren't great either, but at least they make sense. They make sense in limited situations like exception handlers and such which really have to know from which context an exception was raised. But with the above reasoning such checks do not make sense in any other general code. 'in hard interrupt context' is just another context where you can't do stuff which you can do when in preemptible task context. Most tests are way broader than a single context. in_interrupt() is true for hard interrupt, soft interrupt delivery and all BH disabled contexts, which is completely ill defined. > But a driver - or some library routine - making a difference based on > some nebulous "can I schedule" is fundamentally and basically WRONG. > > If some code changes behavior, it needs to be explicit to the *caller* > of that code. I'm fine with that, but then we have to be consequent and ban _all_ of these and not just declare can_schedule() to be a bad one. Thanks, tglx _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > OTOH, having a working 'preemptible()' or maybe better named > > 'can_schedule()' check makes tons of sense to make decisions about > > allocation modes or other things. > > No. I think that those kinds of decisions about actual behavior are > always simply fundamentally wrong. > > Note that this is very different from having warnings about invalid > use. THAT is correct. It may not warn in all configurations, but that > doesn't matter: what matters is that it warns in common enough > configurations that developers will catch it. > > So having a warning in "might_sleep()" that doesn't always trigger, > because you have a limited configuration that can't even detect the > situation, that's fine and dandy and intentional. > > But having code like > > if (can_schedule()) > .. do something different .. > > is fundamentally complete and utter garbage. > > It's one thing if you test for "am I in hardware interrupt context". > Those tests aren't great either, but at least they make sense. > > But a driver - or some library routine - making a difference based on > some nebulous "can I schedule" is fundamentally and basically WRONG. > > If some code changes behavior, it needs to be explicit to the *caller* > of that code. > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > do_something_atomic()" is pure shite. > > And I am not IN THE LEAST interested in trying to help people doing > pure shite. We need to fix them. Like the crypto code is getting > fixed. Just figured I'll throw my +1 in from reading too many (gpu) drivers. Code that tries to cleverly adjust its behaviour depending upon the context it's running in is harder to understand and blows up in more interesting ways. We still have drm_can_sleep() and it's mostly just used for debug code, and I've largely ended up just deleting everything that used it because when you're driver is blowing up the last thing you want is to realize your debug code and output can't be relied upon. Or worse, that the only Oops you have is the one in the debug code, because the real one scrolled away - the original idea behind drm_can_sleep was to make all the modeset code work automagically both in normal ioctl/kworker context and in the panic handlers or kgdb callbacks. Wishful thinking at best. Also at least for me that extends to everything, e.g. I much prefer explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for locks shared with interrupt handlers, since the former two gives me clear information from which contexts such function can be called. Other end is the memalloc_no*_save/restore functions, where I recently made a real big fool of myself because I didn't realize how much that impacts everything that's run within - suddenly "GFP_KERNEL for small stuff never fails" is wrong everywhere. It's all great for debugging and sanity checks (and we run with all that stuff enabled in our CI), but really semantic changes depending upon magic context checks freak my out :-) -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > 'can_schedule()' check makes tons of sense to make decisions about > > > allocation modes or other things. > > > > No. I think that those kinds of decisions about actual behavior are > > always simply fundamentally wrong. > > > > Note that this is very different from having warnings about invalid > > use. THAT is correct. It may not warn in all configurations, but that > > doesn't matter: what matters is that it warns in common enough > > configurations that developers will catch it. > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > because you have a limited configuration that can't even detect the > > situation, that's fine and dandy and intentional. > > > > But having code like > > > > if (can_schedule()) > > .. do something different .. > > > > is fundamentally complete and utter garbage. > > > > It's one thing if you test for "am I in hardware interrupt context". > > Those tests aren't great either, but at least they make sense. > > > > But a driver - or some library routine - making a difference based on > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > If some code changes behavior, it needs to be explicit to the *caller* > > of that code. > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > do_something_atomic()" is pure shite. > > > > And I am not IN THE LEAST interested in trying to help people doing > > pure shite. We need to fix them. Like the crypto code is getting > > fixed. > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > Code that tries to cleverly adjust its behaviour depending upon the > context it's running in is harder to understand and blows up in more > interesting ways. We still have drm_can_sleep() and it's mostly just > used for debug code, and I've largely ended up just deleting > everything that used it because when you're driver is blowing up the > last thing you want is to realize your debug code and output can't be > relied upon. Or worse, that the only Oops you have is the one in the > debug code, because the real one scrolled away - the original idea > behind drm_can_sleep was to make all the modeset code work > automagically both in normal ioctl/kworker context and in the panic > handlers or kgdb callbacks. Wishful thinking at best. > > Also at least for me that extends to everything, e.g. I much prefer > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > locks shared with interrupt handlers, since the former two gives me > clear information from which contexts such function can be called. > Other end is the memalloc_no*_save/restore functions, where I recently > made a real big fool of myself because I didn't realize how much that > impacts everything that's run within - suddenly "GFP_KERNEL for small > stuff never fails" is wrong everywhere. > > It's all great for debugging and sanity checks (and we run with all > that stuff enabled in our CI), but really semantic changes depending > upon magic context checks freak my out :-) All fair, but some of us need to write code that must handle being invoked from a wide variety of contexts. Now perhaps you like the idea of call_rcu() for schedulable contexts, call_rcu_nosched() when preemption is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, call_rcu_raw_atomic() from contexts where (for example) raw spinlocks are held, and so on. However, from what I can see, most people instead consistently prefer that the RCU API instead be consolidated. Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() needs to be able to allocate memory occasionally. It can do that when invoked from some contexts, but not when invoked from others. Right now, in !PREEMPT kernels, it cannot tell, and must either do things to the memory allocators that some of the MM hate or must unnecessarily invoke workqueues. Thomas's patches would allow the code to just allocate in the common case when these primitives are invoked from contexts where allocation is permitted. If we want to restrict access to the can_schedule() or whatever primitive, fine and good. We can add a check to checkpatch.pl, for example. Maybe we can go back to the old brlock approach of requiring certain people's review for each addition to the kernel. But there really are use cases that it would greatly help. Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 8:29 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > All fair, but some of us need to write code that must handle being > invoked from a wide variety of contexts. Note that I think that core functionality is different from random drivers. Of course core code can (and will) look at things like if (in_interrupt()) .. schedule work asynchronously .. because core code ends up being called from odd places, and code like that is expected to have understanding of the rules it plays with. But something like RCU is a very different beast from some "walk the scatter-gather list" code. RCU does its work in the background, and works with lots of different things. And it's so core and used everywhere that it knows about these things. I mean, we literally have special code explicitly to let RCU know "we entered kernel context now". But something like a driver list walking thing should not be doing different things behind peoples back depending on whether they hold spinlocks or not. It should either just work regardless, or there should be a flag (or special interface) for the "you're being called in a crtitical region". Because dynamically changing behavior really is very confusing. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 15, 2020 at 12:57 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > You wish. I just found a 7 year old bug in a 10G network driver which > surely would have been found if people would enable debug configs and > not just run the crap on their PREEMPT_NONE, all debug off kernel. And > that driver is not subject to bitrot, it gets regular bug fixes from > people who seem to care (distro folks). That driver clearly cannot be very well maintained. All the distro kernels have the basic debug checks in place, afaik. Is it some wonderful "enterprise hardware" garbage again that only gets used in special data centers? Becasue the "enterprise" people really are special. Very much in the "short bus" special kind of way. The fact that they have fooled so much of the industry into thinking that they are the competent and serious people is a disgrace. Linus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, Sep 14, 2020 at 11:55:24PM +0200, Thomas Gleixner wrote: > But just look at any check which uses preemptible(), especially those > which check !preemptible(): hmm. +++ b/include/linux/preempt.h @@ -180,7 +180,9 @@ do { \ #define preempt_enable_no_resched() sched_preempt_enable_no_resched() +#ifndef MODULE #define preemptible() (preempt_count() == 0 && !irqs_disabled()) +#endif #ifdef CONFIG_PREEMPTION #define preempt_enable() \ $ git grep -w preemptible drivers (slightly trimmed by hand to remove, eg, comments) drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); drivers/firmware/arm_sdei.c: WARN_ON(preemptible()); drivers/firmware/efi/efi-pstore.c: preemptible(), record->size, record->psi->buf); drivers/irqchip/irq-gic-v4.c: WARN_ON(preemptible()); drivers/irqchip/irq-gic-v4.c: WARN_ON(preemptible()); drivers/scsi/hisi_sas/hisi_sas_main.c: if (!preemptible()) drivers/xen/time.c: BUG_ON(preemptible()); That only looks like two drivers that need more than WARNectomies. Although maybe rcu_read_load_sched_held() or rcu_read_lock_any_held() might get called from a module ... _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > allocation modes or other things. > > > > > > No. I think that those kinds of decisions about actual behavior are > > > always simply fundamentally wrong. > > > > > > Note that this is very different from having warnings about invalid > > > use. THAT is correct. It may not warn in all configurations, but that > > > doesn't matter: what matters is that it warns in common enough > > > configurations that developers will catch it. > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > because you have a limited configuration that can't even detect the > > > situation, that's fine and dandy and intentional. > > > > > > But having code like > > > > > > if (can_schedule()) > > > .. do something different .. > > > > > > is fundamentally complete and utter garbage. > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > Those tests aren't great either, but at least they make sense. > > > > > > But a driver - or some library routine - making a difference based on > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > of that code. > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > do_something_atomic()" is pure shite. > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > pure shite. We need to fix them. Like the crypto code is getting > > > fixed. > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > Code that tries to cleverly adjust its behaviour depending upon the > > context it's running in is harder to understand and blows up in more > > interesting ways. We still have drm_can_sleep() and it's mostly just > > used for debug code, and I've largely ended up just deleting > > everything that used it because when you're driver is blowing up the > > last thing you want is to realize your debug code and output can't be > > relied upon. Or worse, that the only Oops you have is the one in the > > debug code, because the real one scrolled away - the original idea > > behind drm_can_sleep was to make all the modeset code work > > automagically both in normal ioctl/kworker context and in the panic > > handlers or kgdb callbacks. Wishful thinking at best. > > > > Also at least for me that extends to everything, e.g. I much prefer > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > locks shared with interrupt handlers, since the former two gives me > > clear information from which contexts such function can be called. > > Other end is the memalloc_no*_save/restore functions, where I recently > > made a real big fool of myself because I didn't realize how much that > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > stuff never fails" is wrong everywhere. > > > > It's all great for debugging and sanity checks (and we run with all > > that stuff enabled in our CI), but really semantic changes depending > > upon magic context checks freak my out :-) > > All fair, but some of us need to write code that must handle being > invoked from a wide variety of contexts. Now perhaps you like the idea of > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > are held, and so on. However, from what I can see, most people instead > consistently prefer that the RCU API instead be consolidated. > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > needs to be able to allocate memory occasionally. It can do that when > invoked from some contexts, but not when invoked from others. Right now, > in !PREEMPT kernels, it cannot tell, and must either do things to the > memory allocators that some of the MM hate or must unnecessarily invoke > workqueues. Thomas's patches would allow the code to just allocate in > the common case when these primitives are invoked from contexts where > allocation is permitted. > > If we want to restrict access to the can_schedule() or whatever primitive, > fine and good. We can add a check to checkpatch.pl, for example. Maybe > we can go back to the old brlock approach of requiring certain people's > review for each addition to the kernel. > > But there really are use cases that it would greatly help. We can deadlock in random fun places if random stuff we're calling suddenly starts allocating. Sometimes. Maybe once in a blue moon, to make it extra fun to reproduce. Maybe most driver subsystems are less brittle, but gpu drivers definitely need to know about the details for exactly this example. And yes gpu drivers use rcu for freeing dma_fence structures, and that tends to happen in code that we only recently figured out should really not allocate memory. I think minimally you need to throw in an unconditional fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs with full debugging knows what might happen. It's kinda like might_sleep, but a lot more specific. might_sleep() alone is not enough, because in the specific code paths I'm thinking of (and created special lockdep annotations for just recently) sleeping is allowed, but any memory allocations with GFP_RECLAIM set are no-go. Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 11:32:00AM -0700, Linus Torvalds wrote: > On Wed, Sep 16, 2020 at 8:29 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > All fair, but some of us need to write code that must handle being > > invoked from a wide variety of contexts. > > Note that I think that core functionality is different from random drivers. > > Of course core code can (and will) look at things like > > if (in_interrupt()) > .. schedule work asynchronously .. > > because core code ends up being called from odd places, and code like > that is expected to have understanding of the rules it plays with. > > But something like RCU is a very different beast from some "walk the > scatter-gather list" code. > > RCU does its work in the background, and works with lots of different > things. And it's so core and used everywhere that it knows about these > things. I mean, we literally have special code explicitly to let RCU > know "we entered kernel context now". > > But something like a driver list walking thing should not be doing > different things behind peoples back depending on whether they hold > spinlocks or not. It should either just work regardless, or there > should be a flag (or special interface) for the "you're being called > in a crtitical region". > > Because dynamically changing behavior really is very confusing. Whew! I feel much better now. ;-) Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 08:23:52PM +0100, Matthew Wilcox wrote: > On Mon, Sep 14, 2020 at 11:55:24PM +0200, Thomas Gleixner wrote: > > But just look at any check which uses preemptible(), especially those > > which check !preemptible(): > > hmm. > > +++ b/include/linux/preempt.h > @@ -180,7 +180,9 @@ do { \ > > #define preempt_enable_no_resched() sched_preempt_enable_no_resched() > > +#ifndef MODULE > #define preemptible() (preempt_count() == 0 && !irqs_disabled()) > +#endif > > #ifdef CONFIG_PREEMPTION > #define preempt_enable() \ > > > $ git grep -w preemptible drivers > (slightly trimmed by hand to remove, eg, comments) > drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); > drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); > drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); > drivers/firmware/arm_sdei.c: WARN_ON_ONCE(preemptible()); > drivers/firmware/arm_sdei.c: WARN_ON(preemptible()); > drivers/firmware/efi/efi-pstore.c: preemptible(), record->size, record->psi->buf); > drivers/irqchip/irq-gic-v4.c: WARN_ON(preemptible()); > drivers/irqchip/irq-gic-v4.c: WARN_ON(preemptible()); > drivers/scsi/hisi_sas/hisi_sas_main.c: if (!preemptible()) > drivers/xen/time.c: BUG_ON(preemptible()); > > That only looks like two drivers that need more than WARNectomies. I could easily imagine someone thinking that these did something in CONFIG_PREEMPT_NONE=y kernels. In fact, I could easily imagine myself making that mistake. :-/ > Although maybe rcu_read_load_sched_held() or rcu_read_lock_any_held() > might get called from a module ... But yes, from the rcutorture module for certain and also from any other RCU-using module that includes the usual RCU debug checks. Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote: > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > > allocation modes or other things. > > > > > > > > No. I think that those kinds of decisions about actual behavior are > > > > always simply fundamentally wrong. > > > > > > > > Note that this is very different from having warnings about invalid > > > > use. THAT is correct. It may not warn in all configurations, but that > > > > doesn't matter: what matters is that it warns in common enough > > > > configurations that developers will catch it. > > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > > because you have a limited configuration that can't even detect the > > > > situation, that's fine and dandy and intentional. > > > > > > > > But having code like > > > > > > > > if (can_schedule()) > > > > .. do something different .. > > > > > > > > is fundamentally complete and utter garbage. > > > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > > Those tests aren't great either, but at least they make sense. > > > > > > > > But a driver - or some library routine - making a difference based on > > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > > of that code. > > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > > do_something_atomic()" is pure shite. > > > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > > pure shite. We need to fix them. Like the crypto code is getting > > > > fixed. > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > > Code that tries to cleverly adjust its behaviour depending upon the > > > context it's running in is harder to understand and blows up in more > > > interesting ways. We still have drm_can_sleep() and it's mostly just > > > used for debug code, and I've largely ended up just deleting > > > everything that used it because when you're driver is blowing up the > > > last thing you want is to realize your debug code and output can't be > > > relied upon. Or worse, that the only Oops you have is the one in the > > > debug code, because the real one scrolled away - the original idea > > > behind drm_can_sleep was to make all the modeset code work > > > automagically both in normal ioctl/kworker context and in the panic > > > handlers or kgdb callbacks. Wishful thinking at best. > > > > > > Also at least for me that extends to everything, e.g. I much prefer > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > > locks shared with interrupt handlers, since the former two gives me > > > clear information from which contexts such function can be called. > > > Other end is the memalloc_no*_save/restore functions, where I recently > > > made a real big fool of myself because I didn't realize how much that > > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > > stuff never fails" is wrong everywhere. > > > > > > It's all great for debugging and sanity checks (and we run with all > > > that stuff enabled in our CI), but really semantic changes depending > > > upon magic context checks freak my out :-) > > > > All fair, but some of us need to write code that must handle being > > invoked from a wide variety of contexts. Now perhaps you like the idea of > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > > are held, and so on. However, from what I can see, most people instead > > consistently prefer that the RCU API instead be consolidated. > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > > needs to be able to allocate memory occasionally. It can do that when > > invoked from some contexts, but not when invoked from others. Right now, > > in !PREEMPT kernels, it cannot tell, and must either do things to the > > memory allocators that some of the MM hate or must unnecessarily invoke > > workqueues. Thomas's patches would allow the code to just allocate in > > the common case when these primitives are invoked from contexts where > > allocation is permitted. > > > > If we want to restrict access to the can_schedule() or whatever primitive, > > fine and good. We can add a check to checkpatch.pl, for example. Maybe > > we can go back to the old brlock approach of requiring certain people's > > review for each addition to the kernel. > > > > But there really are use cases that it would greatly help. > > We can deadlock in random fun places if random stuff we're calling > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to > make it extra fun to reproduce. Maybe most driver subsystems are less > brittle, but gpu drivers definitely need to know about the details for > exactly this example. And yes gpu drivers use rcu for freeing > dma_fence structures, and that tends to happen in code that we only > recently figured out should really not allocate memory. > > I think minimally you need to throw in an unconditional > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs > with full debugging knows what might happen. It's kinda like > might_sleep, but a lot more specific. might_sleep() alone is not > enough, because in the specific code paths I'm thinking of (and > created special lockdep annotations for just recently) sleeping is > allowed, but any memory allocations with GFP_RECLAIM set are no-go. Completely agreed! Any allocation on any free path must be handled -extremely- carefully. To that end... First, there is always a fallback in case the allocation fails. Which might have performance or corner-case robustness issues, but which will at least allow forward progress. Second, we consulted with a number of MM experts to arrive at appropriate GFP_* flags (and their patience is greatly appreciated). Third, the paths that can allocate will do so about one time of 500, so any issues should be spotted sooner rather than later. So you are quite right to be concerned, but I believe we will be doing the right things. And based on his previous track record, I am also quite certain that Mr. Murphy will be on hand to provide me any additional education that I might require. Finally, I have noted down your point about fs_reclaim_acquire() and fs_reclaim_release(). Whether or not they prove to be needed, I do appreciate your calling them to my attention. Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote: > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > > > allocation modes or other things. > > > > > > > > > > No. I think that those kinds of decisions about actual behavior are > > > > > always simply fundamentally wrong. > > > > > > > > > > Note that this is very different from having warnings about invalid > > > > > use. THAT is correct. It may not warn in all configurations, but that > > > > > doesn't matter: what matters is that it warns in common enough > > > > > configurations that developers will catch it. > > > > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > > > because you have a limited configuration that can't even detect the > > > > > situation, that's fine and dandy and intentional. > > > > > > > > > > But having code like > > > > > > > > > > if (can_schedule()) > > > > > .. do something different .. > > > > > > > > > > is fundamentally complete and utter garbage. > > > > > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > > > Those tests aren't great either, but at least they make sense. > > > > > > > > > > But a driver - or some library routine - making a difference based on > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > > > of that code. > > > > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > > > do_something_atomic()" is pure shite. > > > > > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > > > pure shite. We need to fix them. Like the crypto code is getting > > > > > fixed. > > > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > > > Code that tries to cleverly adjust its behaviour depending upon the > > > > context it's running in is harder to understand and blows up in more > > > > interesting ways. We still have drm_can_sleep() and it's mostly just > > > > used for debug code, and I've largely ended up just deleting > > > > everything that used it because when you're driver is blowing up the > > > > last thing you want is to realize your debug code and output can't be > > > > relied upon. Or worse, that the only Oops you have is the one in the > > > > debug code, because the real one scrolled away - the original idea > > > > behind drm_can_sleep was to make all the modeset code work > > > > automagically both in normal ioctl/kworker context and in the panic > > > > handlers or kgdb callbacks. Wishful thinking at best. > > > > > > > > Also at least for me that extends to everything, e.g. I much prefer > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > > > locks shared with interrupt handlers, since the former two gives me > > > > clear information from which contexts such function can be called. > > > > Other end is the memalloc_no*_save/restore functions, where I recently > > > > made a real big fool of myself because I didn't realize how much that > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > > > stuff never fails" is wrong everywhere. > > > > > > > > It's all great for debugging and sanity checks (and we run with all > > > > that stuff enabled in our CI), but really semantic changes depending > > > > upon magic context checks freak my out :-) > > > > > > All fair, but some of us need to write code that must handle being > > > invoked from a wide variety of contexts. Now perhaps you like the idea of > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > > > are held, and so on. However, from what I can see, most people instead > > > consistently prefer that the RCU API instead be consolidated. > > > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > > > needs to be able to allocate memory occasionally. It can do that when > > > invoked from some contexts, but not when invoked from others. Right now, > > > in !PREEMPT kernels, it cannot tell, and must either do things to the > > > memory allocators that some of the MM hate or must unnecessarily invoke > > > workqueues. Thomas's patches would allow the code to just allocate in > > > the common case when these primitives are invoked from contexts where > > > allocation is permitted. > > > > > > If we want to restrict access to the can_schedule() or whatever primitive, > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe > > > we can go back to the old brlock approach of requiring certain people's > > > review for each addition to the kernel. > > > > > > But there really are use cases that it would greatly help. > > > > We can deadlock in random fun places if random stuff we're calling > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to > > make it extra fun to reproduce. Maybe most driver subsystems are less > > brittle, but gpu drivers definitely need to know about the details for > > exactly this example. And yes gpu drivers use rcu for freeing > > dma_fence structures, and that tends to happen in code that we only > > recently figured out should really not allocate memory. > > > > I think minimally you need to throw in an unconditional > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs > > with full debugging knows what might happen. It's kinda like > > might_sleep, but a lot more specific. might_sleep() alone is not > > enough, because in the specific code paths I'm thinking of (and > > created special lockdep annotations for just recently) sleeping is > > allowed, but any memory allocations with GFP_RECLAIM set are no-go. > > Completely agreed! Any allocation on any free path must be handled > -extremely- carefully. To that end... > > First, there is always a fallback in case the allocation fails. Which > might have performance or corner-case robustness issues, but which will > at least allow forward progress. Second, we consulted with a number of > MM experts to arrive at appropriate GFP_* flags (and their patience is > greatly appreciated). Third, the paths that can allocate will do so about > one time of 500, so any issues should be spotted sooner rather than later. > > So you are quite right to be concerned, but I believe we will be doing the > right things. And based on his previous track record, I am also quite > certain that Mr. Murphy will be on hand to provide me any additional > education that I might require. > > Finally, I have noted down your point about fs_reclaim_acquire() and > fs_reclaim_release(). Whether or not they prove to be needed, I do > appreciate your calling them to my attention. I just realized that since these dma_fence structs are refcounted and userspace can hold references (directly, it can pass them around behind file descriptors) we might never hit such a path until slightly unusual or evil userspace does something interesting. Do you have links to those patches? Some googling didn't turn up anything. I can then figure out whether it's better to risk not spotting issues with call_rcu vs slapping a memalloc_noio_save/restore around all these critical section which force-degrades any allocation to GFP_ATOMIC at most, but has the risk that we run into code that assumes "GFP_KERNEL never fails for small stuff" and has a decidedly less tested fallback path than rcu code. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote: > On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote: > > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > > > > allocation modes or other things. > > > > > > > > > > > > No. I think that those kinds of decisions about actual behavior are > > > > > > always simply fundamentally wrong. > > > > > > > > > > > > Note that this is very different from having warnings about invalid > > > > > > use. THAT is correct. It may not warn in all configurations, but that > > > > > > doesn't matter: what matters is that it warns in common enough > > > > > > configurations that developers will catch it. > > > > > > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > > > > because you have a limited configuration that can't even detect the > > > > > > situation, that's fine and dandy and intentional. > > > > > > > > > > > > But having code like > > > > > > > > > > > > if (can_schedule()) > > > > > > .. do something different .. > > > > > > > > > > > > is fundamentally complete and utter garbage. > > > > > > > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > > > > Those tests aren't great either, but at least they make sense. > > > > > > > > > > > > But a driver - or some library routine - making a difference based on > > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > > > > of that code. > > > > > > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > > > > do_something_atomic()" is pure shite. > > > > > > > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > > > > pure shite. We need to fix them. Like the crypto code is getting > > > > > > fixed. > > > > > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > > > > Code that tries to cleverly adjust its behaviour depending upon the > > > > > context it's running in is harder to understand and blows up in more > > > > > interesting ways. We still have drm_can_sleep() and it's mostly just > > > > > used for debug code, and I've largely ended up just deleting > > > > > everything that used it because when you're driver is blowing up the > > > > > last thing you want is to realize your debug code and output can't be > > > > > relied upon. Or worse, that the only Oops you have is the one in the > > > > > debug code, because the real one scrolled away - the original idea > > > > > behind drm_can_sleep was to make all the modeset code work > > > > > automagically both in normal ioctl/kworker context and in the panic > > > > > handlers or kgdb callbacks. Wishful thinking at best. > > > > > > > > > > Also at least for me that extends to everything, e.g. I much prefer > > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > > > > locks shared with interrupt handlers, since the former two gives me > > > > > clear information from which contexts such function can be called. > > > > > Other end is the memalloc_no*_save/restore functions, where I recently > > > > > made a real big fool of myself because I didn't realize how much that > > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > > > > stuff never fails" is wrong everywhere. > > > > > > > > > > It's all great for debugging and sanity checks (and we run with all > > > > > that stuff enabled in our CI), but really semantic changes depending > > > > > upon magic context checks freak my out :-) > > > > > > > > All fair, but some of us need to write code that must handle being > > > > invoked from a wide variety of contexts. Now perhaps you like the idea of > > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > > > > are held, and so on. However, from what I can see, most people instead > > > > consistently prefer that the RCU API instead be consolidated. > > > > > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > > > > needs to be able to allocate memory occasionally. It can do that when > > > > invoked from some contexts, but not when invoked from others. Right now, > > > > in !PREEMPT kernels, it cannot tell, and must either do things to the > > > > memory allocators that some of the MM hate or must unnecessarily invoke > > > > workqueues. Thomas's patches would allow the code to just allocate in > > > > the common case when these primitives are invoked from contexts where > > > > allocation is permitted. > > > > > > > > If we want to restrict access to the can_schedule() or whatever primitive, > > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe > > > > we can go back to the old brlock approach of requiring certain people's > > > > review for each addition to the kernel. > > > > > > > > But there really are use cases that it would greatly help. > > > > > > We can deadlock in random fun places if random stuff we're calling > > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to > > > make it extra fun to reproduce. Maybe most driver subsystems are less > > > brittle, but gpu drivers definitely need to know about the details for > > > exactly this example. And yes gpu drivers use rcu for freeing > > > dma_fence structures, and that tends to happen in code that we only > > > recently figured out should really not allocate memory. > > > > > > I think minimally you need to throw in an unconditional > > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs > > > with full debugging knows what might happen. It's kinda like > > > might_sleep, but a lot more specific. might_sleep() alone is not > > > enough, because in the specific code paths I'm thinking of (and > > > created special lockdep annotations for just recently) sleeping is > > > allowed, but any memory allocations with GFP_RECLAIM set are no-go. > > > > Completely agreed! Any allocation on any free path must be handled > > -extremely- carefully. To that end... > > > > First, there is always a fallback in case the allocation fails. Which > > might have performance or corner-case robustness issues, but which will > > at least allow forward progress. Second, we consulted with a number of > > MM experts to arrive at appropriate GFP_* flags (and their patience is > > greatly appreciated). Third, the paths that can allocate will do so about > > one time of 500, so any issues should be spotted sooner rather than later. > > > > So you are quite right to be concerned, but I believe we will be doing the > > right things. And based on his previous track record, I am also quite > > certain that Mr. Murphy will be on hand to provide me any additional > > education that I might require. > > > > Finally, I have noted down your point about fs_reclaim_acquire() and > > fs_reclaim_release(). Whether or not they prove to be needed, I do > > appreciate your calling them to my attention. > > I just realized that since these dma_fence structs are refcounted and > userspace can hold references (directly, it can pass them around > behind file descriptors) we might never hit such a path until slightly > unusual or evil userspace does something interesting. Do you have > links to those patches? Some googling didn't turn up anything. I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Here is the previous early draft version, which will change considerably for the next version: lore.kernel.org/lkml/20200809204354.20137-1-urezki@gmail.com This does kvfree_rcu(), but we expect to handle call_rcu() similarly. The version in preparation will use workqueues to do the allocation in a known-safe environment and also use lockless access to certain portions of the allocator caches (as noted earlier, this last is not much loved by some of the MM guys). Given Thomas's patch, we could with high probability allocate directly, perhaps even not needing memory-allocator modifications. Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the allocator to do anything that the calling context prohibits. So what types of bugs are you looking for? Where reclaim calls back into the driver or some such? Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed, 16 Sep 2020 at 21:32, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > But something like a driver list walking thing should not be doing > different things behind peoples back depending on whether they hold > spinlocks or not. It should either just work regardless, or there > should be a flag (or special interface) for the "you're being called > in a crtitical region". > > Because dynamically changing behavior really is very confusing. > By the same reasoning, I don't think a generic crypto library should be playing tricks with preemption en/disabling under the hood when iterating over some data that is all directly accessible via the linear map on the platforms that most people care about. And using kmap_atomic() unconditionally achieves exactly that. As I argued before, the fact that kmap_atomic() can be called from an atomic context, and the fact that its implementation on HIGHMEM platforms requires preemption to be disabled until the next kunmap() are two different things, and I don't agree with your assertion that the name kmap_atomic() implies the latter semantics. If we can avoid disabling preemption on HIGHMEM, as Thomas suggests, we surely don't need it on !HIGHMEM either, and given that kmap_atomic() is preferred today anyway, we can just merge the two implementations. Are there any existing debug features that could help us spot [ab]use of things like raw per-CPU data within kmap_atomic regions? Re your point about deprecating HIGHMEM: some work is underway on ARM to implement a 3.75/3.75 GB kernel/user split on recent LPAE capable hardware (which shouldn't suffer from the performance issues that plagued the 4/4 split on i686), and so hopefully, there is a path forward for ARM that does not rely on HIGHMEM as it does today. _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, Sep 17, 2020 at 12:39 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote: > > On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote: > > > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > > > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > > > > > allocation modes or other things. > > > > > > > > > > > > > > No. I think that those kinds of decisions about actual behavior are > > > > > > > always simply fundamentally wrong. > > > > > > > > > > > > > > Note that this is very different from having warnings about invalid > > > > > > > use. THAT is correct. It may not warn in all configurations, but that > > > > > > > doesn't matter: what matters is that it warns in common enough > > > > > > > configurations that developers will catch it. > > > > > > > > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > > > > > because you have a limited configuration that can't even detect the > > > > > > > situation, that's fine and dandy and intentional. > > > > > > > > > > > > > > But having code like > > > > > > > > > > > > > > if (can_schedule()) > > > > > > > .. do something different .. > > > > > > > > > > > > > > is fundamentally complete and utter garbage. > > > > > > > > > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > > > > > Those tests aren't great either, but at least they make sense. > > > > > > > > > > > > > > But a driver - or some library routine - making a difference based on > > > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > > > > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > > > > > of that code. > > > > > > > > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > > > > > do_something_atomic()" is pure shite. > > > > > > > > > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > > > > > pure shite. We need to fix them. Like the crypto code is getting > > > > > > > fixed. > > > > > > > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > > > > > Code that tries to cleverly adjust its behaviour depending upon the > > > > > > context it's running in is harder to understand and blows up in more > > > > > > interesting ways. We still have drm_can_sleep() and it's mostly just > > > > > > used for debug code, and I've largely ended up just deleting > > > > > > everything that used it because when you're driver is blowing up the > > > > > > last thing you want is to realize your debug code and output can't be > > > > > > relied upon. Or worse, that the only Oops you have is the one in the > > > > > > debug code, because the real one scrolled away - the original idea > > > > > > behind drm_can_sleep was to make all the modeset code work > > > > > > automagically both in normal ioctl/kworker context and in the panic > > > > > > handlers or kgdb callbacks. Wishful thinking at best. > > > > > > > > > > > > Also at least for me that extends to everything, e.g. I much prefer > > > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > > > > > locks shared with interrupt handlers, since the former two gives me > > > > > > clear information from which contexts such function can be called. > > > > > > Other end is the memalloc_no*_save/restore functions, where I recently > > > > > > made a real big fool of myself because I didn't realize how much that > > > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > > > > > stuff never fails" is wrong everywhere. > > > > > > > > > > > > It's all great for debugging and sanity checks (and we run with all > > > > > > that stuff enabled in our CI), but really semantic changes depending > > > > > > upon magic context checks freak my out :-) > > > > > > > > > > All fair, but some of us need to write code that must handle being > > > > > invoked from a wide variety of contexts. Now perhaps you like the idea of > > > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > > > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > > > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > > > > > are held, and so on. However, from what I can see, most people instead > > > > > consistently prefer that the RCU API instead be consolidated. > > > > > > > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > > > > > needs to be able to allocate memory occasionally. It can do that when > > > > > invoked from some contexts, but not when invoked from others. Right now, > > > > > in !PREEMPT kernels, it cannot tell, and must either do things to the > > > > > memory allocators that some of the MM hate or must unnecessarily invoke > > > > > workqueues. Thomas's patches would allow the code to just allocate in > > > > > the common case when these primitives are invoked from contexts where > > > > > allocation is permitted. > > > > > > > > > > If we want to restrict access to the can_schedule() or whatever primitive, > > > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe > > > > > we can go back to the old brlock approach of requiring certain people's > > > > > review for each addition to the kernel. > > > > > > > > > > But there really are use cases that it would greatly help. > > > > > > > > We can deadlock in random fun places if random stuff we're calling > > > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to > > > > make it extra fun to reproduce. Maybe most driver subsystems are less > > > > brittle, but gpu drivers definitely need to know about the details for > > > > exactly this example. And yes gpu drivers use rcu for freeing > > > > dma_fence structures, and that tends to happen in code that we only > > > > recently figured out should really not allocate memory. > > > > > > > > I think minimally you need to throw in an unconditional > > > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs > > > > with full debugging knows what might happen. It's kinda like > > > > might_sleep, but a lot more specific. might_sleep() alone is not > > > > enough, because in the specific code paths I'm thinking of (and > > > > created special lockdep annotations for just recently) sleeping is > > > > allowed, but any memory allocations with GFP_RECLAIM set are no-go. > > > > > > Completely agreed! Any allocation on any free path must be handled > > > -extremely- carefully. To that end... > > > > > > First, there is always a fallback in case the allocation fails. Which > > > might have performance or corner-case robustness issues, but which will > > > at least allow forward progress. Second, we consulted with a number of > > > MM experts to arrive at appropriate GFP_* flags (and their patience is > > > greatly appreciated). Third, the paths that can allocate will do so about > > > one time of 500, so any issues should be spotted sooner rather than later. > > > > > > So you are quite right to be concerned, but I believe we will be doing the > > > right things. And based on his previous track record, I am also quite > > > certain that Mr. Murphy will be on hand to provide me any additional > > > education that I might require. > > > > > > Finally, I have noted down your point about fs_reclaim_acquire() and > > > fs_reclaim_release(). Whether or not they prove to be needed, I do > > > appreciate your calling them to my attention. > > > > I just realized that since these dma_fence structs are refcounted and > > userspace can hold references (directly, it can pass them around > > behind file descriptors) we might never hit such a path until slightly > > unusual or evil userspace does something interesting. Do you have > > links to those patches? Some googling didn't turn up anything. I can > > then figure out whether it's better to risk not spotting issues with > > call_rcu vs slapping a memalloc_noio_save/restore around all these > > critical section which force-degrades any allocation to GFP_ATOMIC at > > most, but has the risk that we run into code that assumes "GFP_KERNEL > > never fails for small stuff" and has a decidedly less tested fallback > > path than rcu code. > > Here is the previous early draft version, which will change considerably > for the next version: > > lore.kernel.org/lkml/20200809204354.20137-1-urezki@gmail.com > > This does kvfree_rcu(), but we expect to handle call_rcu() similarly. > > The version in preparation will use workqueues to do the allocation in a > known-safe environment and also use lockless access to certain portions > of the allocator caches (as noted earlier, this last is not much loved > by some of the MM guys). Given Thomas's patch, we could with high > probability allocate directly, perhaps even not needing memory-allocator > modifications. > > Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the > allocator to do anything that the calling context prohibits. So what > types of bugs are you looking for? Where reclaim calls back into the > driver or some such? Yeah pretty much. It's a problem for gpu, fs, block drivers and really anything else that's remotely involved in memory reclaim somehow. Generally this is all handled explicitly by passing gfp_t flags down any call chain, but in some cases it's instead solved with the memalloc_no* functions. E.g. sunrpc uses that to make sure the network stack (which generally just assumes it can allocate memory) doesn't, to avoid recursions back into nfs/sunrpc. To my knowledge there's no way to check at runtime with which gfp flags you're allowed to allocate memory, a preemptible check is definitely not enough. Disabled preemption implies only GFP_ATOMIC is allowed (ignoring nmi and stuff like that), but the inverse is not true. So if you want the automagic in call_rcu I think either - we need to replace all explicit gfp flags with the context marking memalloc_no* across the entire kernel, or at least anywhere rcu might be used. - audit all callchains and make sure a call_rcu_noalloc is used anywhere there might be a problem. probably better to have a call_rcu_gfp with explicit gfp flags parameter, since generally that needs to be passed down. But at least to me the lockless magic in mm sounds a lot safer, since it contains the complexity and doesn't leak it out to callers of call_rcu. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, Sep 17, 2020 at 09:52:30AM +0200, Daniel Vetter wrote: > On Thu, Sep 17, 2020 at 12:39 AM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote: > > > On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote: > > > > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <paulmck@kernel.org> wrote: > > > > > > > > > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote: > > > > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds > > > > > > > <torvalds@linux-foundation.org> wrote: > > > > > > > > > > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > > > > > > > > > > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named > > > > > > > > > 'can_schedule()' check makes tons of sense to make decisions about > > > > > > > > > allocation modes or other things. > > > > > > > > > > > > > > > > No. I think that those kinds of decisions about actual behavior are > > > > > > > > always simply fundamentally wrong. > > > > > > > > > > > > > > > > Note that this is very different from having warnings about invalid > > > > > > > > use. THAT is correct. It may not warn in all configurations, but that > > > > > > > > doesn't matter: what matters is that it warns in common enough > > > > > > > > configurations that developers will catch it. > > > > > > > > > > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger, > > > > > > > > because you have a limited configuration that can't even detect the > > > > > > > > situation, that's fine and dandy and intentional. > > > > > > > > > > > > > > > > But having code like > > > > > > > > > > > > > > > > if (can_schedule()) > > > > > > > > .. do something different .. > > > > > > > > > > > > > > > > is fundamentally complete and utter garbage. > > > > > > > > > > > > > > > > It's one thing if you test for "am I in hardware interrupt context". > > > > > > > > Those tests aren't great either, but at least they make sense. > > > > > > > > > > > > > > > > But a driver - or some library routine - making a difference based on > > > > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG. > > > > > > > > > > > > > > > > If some code changes behavior, it needs to be explicit to the *caller* > > > > > > > > of that code. > > > > > > > > > > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule()) > > > > > > > > do_something_atomic()" is pure shite. > > > > > > > > > > > > > > > > And I am not IN THE LEAST interested in trying to help people doing > > > > > > > > pure shite. We need to fix them. Like the crypto code is getting > > > > > > > > fixed. > > > > > > > > > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers. > > > > > > > Code that tries to cleverly adjust its behaviour depending upon the > > > > > > > context it's running in is harder to understand and blows up in more > > > > > > > interesting ways. We still have drm_can_sleep() and it's mostly just > > > > > > > used for debug code, and I've largely ended up just deleting > > > > > > > everything that used it because when you're driver is blowing up the > > > > > > > last thing you want is to realize your debug code and output can't be > > > > > > > relied upon. Or worse, that the only Oops you have is the one in the > > > > > > > debug code, because the real one scrolled away - the original idea > > > > > > > behind drm_can_sleep was to make all the modeset code work > > > > > > > automagically both in normal ioctl/kworker context and in the panic > > > > > > > handlers or kgdb callbacks. Wishful thinking at best. > > > > > > > > > > > > > > Also at least for me that extends to everything, e.g. I much prefer > > > > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for > > > > > > > locks shared with interrupt handlers, since the former two gives me > > > > > > > clear information from which contexts such function can be called. > > > > > > > Other end is the memalloc_no*_save/restore functions, where I recently > > > > > > > made a real big fool of myself because I didn't realize how much that > > > > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small > > > > > > > stuff never fails" is wrong everywhere. > > > > > > > > > > > > > > It's all great for debugging and sanity checks (and we run with all > > > > > > > that stuff enabled in our CI), but really semantic changes depending > > > > > > > upon magic context checks freak my out :-) > > > > > > > > > > > > All fair, but some of us need to write code that must handle being > > > > > > invoked from a wide variety of contexts. Now perhaps you like the idea of > > > > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption > > > > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled, > > > > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks > > > > > > are held, and so on. However, from what I can see, most people instead > > > > > > consistently prefer that the RCU API instead be consolidated. > > > > > > > > > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu() > > > > > > needs to be able to allocate memory occasionally. It can do that when > > > > > > invoked from some contexts, but not when invoked from others. Right now, > > > > > > in !PREEMPT kernels, it cannot tell, and must either do things to the > > > > > > memory allocators that some of the MM hate or must unnecessarily invoke > > > > > > workqueues. Thomas's patches would allow the code to just allocate in > > > > > > the common case when these primitives are invoked from contexts where > > > > > > allocation is permitted. > > > > > > > > > > > > If we want to restrict access to the can_schedule() or whatever primitive, > > > > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe > > > > > > we can go back to the old brlock approach of requiring certain people's > > > > > > review for each addition to the kernel. > > > > > > > > > > > > But there really are use cases that it would greatly help. > > > > > > > > > > We can deadlock in random fun places if random stuff we're calling > > > > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to > > > > > make it extra fun to reproduce. Maybe most driver subsystems are less > > > > > brittle, but gpu drivers definitely need to know about the details for > > > > > exactly this example. And yes gpu drivers use rcu for freeing > > > > > dma_fence structures, and that tends to happen in code that we only > > > > > recently figured out should really not allocate memory. > > > > > > > > > > I think minimally you need to throw in an unconditional > > > > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs > > > > > with full debugging knows what might happen. It's kinda like > > > > > might_sleep, but a lot more specific. might_sleep() alone is not > > > > > enough, because in the specific code paths I'm thinking of (and > > > > > created special lockdep annotations for just recently) sleeping is > > > > > allowed, but any memory allocations with GFP_RECLAIM set are no-go. > > > > > > > > Completely agreed! Any allocation on any free path must be handled > > > > -extremely- carefully. To that end... > > > > > > > > First, there is always a fallback in case the allocation fails. Which > > > > might have performance or corner-case robustness issues, but which will > > > > at least allow forward progress. Second, we consulted with a number of > > > > MM experts to arrive at appropriate GFP_* flags (and their patience is > > > > greatly appreciated). Third, the paths that can allocate will do so about > > > > one time of 500, so any issues should be spotted sooner rather than later. > > > > > > > > So you are quite right to be concerned, but I believe we will be doing the > > > > right things. And based on his previous track record, I am also quite > > > > certain that Mr. Murphy will be on hand to provide me any additional > > > > education that I might require. > > > > > > > > Finally, I have noted down your point about fs_reclaim_acquire() and > > > > fs_reclaim_release(). Whether or not they prove to be needed, I do > > > > appreciate your calling them to my attention. > > > > > > I just realized that since these dma_fence structs are refcounted and > > > userspace can hold references (directly, it can pass them around > > > behind file descriptors) we might never hit such a path until slightly > > > unusual or evil userspace does something interesting. Do you have > > > links to those patches? Some googling didn't turn up anything. I can > > > then figure out whether it's better to risk not spotting issues with > > > call_rcu vs slapping a memalloc_noio_save/restore around all these > > > critical section which force-degrades any allocation to GFP_ATOMIC at > > > most, but has the risk that we run into code that assumes "GFP_KERNEL > > > never fails for small stuff" and has a decidedly less tested fallback > > > path than rcu code. > > > > Here is the previous early draft version, which will change considerably > > for the next version: > > > > lore.kernel.org/lkml/20200809204354.20137-1-urezki@gmail.com > > > > This does kvfree_rcu(), but we expect to handle call_rcu() similarly. > > > > The version in preparation will use workqueues to do the allocation in a > > known-safe environment and also use lockless access to certain portions > > of the allocator caches (as noted earlier, this last is not much loved > > by some of the MM guys). Given Thomas's patch, we could with high > > probability allocate directly, perhaps even not needing memory-allocator > > modifications. > > > > Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the > > allocator to do anything that the calling context prohibits. So what > > types of bugs are you looking for? Where reclaim calls back into the > > driver or some such? > > Yeah pretty much. It's a problem for gpu, fs, block drivers and really > anything else that's remotely involved in memory reclaim somehow. > Generally this is all handled explicitly by passing gfp_t flags down > any call chain, but in some cases it's instead solved with the > memalloc_no* functions. E.g. sunrpc uses that to make sure the network > stack (which generally just assumes it can allocate memory) doesn't, > to avoid recursions back into nfs/sunrpc. To my knowledge there's no > way to check at runtime with which gfp flags you're allowed to > allocate memory, a preemptible check is definitely not enough. > Disabled preemption implies only GFP_ATOMIC is allowed (ignoring nmi > and stuff like that), but the inverse is not true. Thank you for the confirmation! > So if you want the automagic in call_rcu I think either > - we need to replace all explicit gfp flags with the context marking > memalloc_no* across the entire kernel, or at least anywhere rcu might > be used. > - audit all callchains and make sure a call_rcu_noalloc is used > anywhere there might be a problem. probably better to have a > call_rcu_gfp with explicit gfp flags parameter, since generally that > needs to be passed down. > > But at least to me the lockless magic in mm sounds a lot safer, since > it contains the complexity and doesn't leak it out to callers of > call_rcu. Agreed, I greatly prefer Peter Zijlstra's lockless-allocation patch myself. In the meantime, it looks like we will start by causing the allocation to happen in a safe environment. That may have issues with delays, but is at least something that can be done entirely within the confines of RCU. Thanx, Paul _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at did you mean memalloc_noreclaim_* here? > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Even if the above then please note that memalloc_noreclaim_* or PF_MEMALLOC should be used with an extreme care. Essentially only for internal memory reclaimers. It grants access to _all_ the available memory so any abuse can be detrimental to the overall system operation. Allocation failure in this mode means that we are out of memory and any code relying on such an allocation has to carefuly consider failure. This is not a random allocation mode. -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at did you mean memalloc_noreclaim_* here? > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Even if the above then please note that memalloc_noreclaim_* or PF_MEMALLOC should be used with an extreme care. Essentially only for internal memory reclaimers. It grants access to _all_ the available memory so any abuse can be detrimental to the overall system operation. Allocation failure in this mode means that we are out of memory and any code relying on such an allocation has to carefuly consider failure. This is not a random allocation mode. -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at did you mean memalloc_noreclaim_* here? > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Even if the above then please note that memalloc_noreclaim_* or PF_MEMALLOC should be used with an extreme care. Essentially only for internal memory reclaimers. It grants access to _all_ the available memory so any abuse can be detrimental to the overall system operation. Allocation failure in this mode means that we are out of memory and any code relying on such an allocation has to carefuly consider failure. This is not a random allocation mode. -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at did you mean memalloc_noreclaim_* here? > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Even if the above then please note that memalloc_noreclaim_* or PF_MEMALLOC should be used with an extreme care. Essentially only for internal memory reclaimers. It grants access to _all_ the available memory so any abuse can be detrimental to the overall system operation. Allocation failure in this mode means that we are out of memory and any code relying on such an allocation has to carefuly consider failure. This is not a random allocation mode. -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > I can > then figure out whether it's better to risk not spotting issues with > call_rcu vs slapping a memalloc_noio_save/restore around all these > critical section which force-degrades any allocation to GFP_ATOMIC at did you mean memalloc_noreclaim_* here? > most, but has the risk that we run into code that assumes "GFP_KERNEL > never fails for small stuff" and has a decidedly less tested fallback > path than rcu code. Even if the above then please note that memalloc_noreclaim_* or PF_MEMALLOC should be used with an extreme care. Essentially only for internal memory reclaimers. It grants access to _all_ the available memory so any abuse can be detrimental to the overall system operation. Allocation failure in this mode means that we are out of memory and any code relying on such an allocation has to carefuly consider failure. This is not a random allocation mode. -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote: > On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > > I can > > then figure out whether it's better to risk not spotting issues with > > call_rcu vs slapping a memalloc_noio_save/restore around all these > > critical section which force-degrades any allocation to GFP_ATOMIC at > > did you mean memalloc_noreclaim_* here? Yeah I picked the wrong one of that family of functions. > > most, but has the risk that we run into code that assumes "GFP_KERNEL > > never fails for small stuff" and has a decidedly less tested fallback > > path than rcu code. > > Even if the above then please note that memalloc_noreclaim_* or > PF_MEMALLOC should be used with an extreme care. Essentially only for > internal memory reclaimers. It grants access to _all_ the available > memory so any abuse can be detrimental to the overall system operation. > Allocation failure in this mode means that we are out of memory and any > code relying on such an allocation has to carefuly consider failure. > This is not a random allocation mode. Agreed, that's why I don't like having these kind of automagic critical sections. It's a bit a shotgun approach. Paul said that the code would handle failures, but the problem is that it applies everywhere. Anyway my understanding is that call_rcu will be reworked and gain a pile of tricks so that these problems for the callchains leading to call_rcu all disappear. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Tue 29-09-20 11:00:03, Daniel Vetter wrote: > On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote: > > On Wed 16-09-20 23:43:02, Daniel Vetter wrote: > > > I can > > > then figure out whether it's better to risk not spotting issues with > > > call_rcu vs slapping a memalloc_noio_save/restore around all these > > > critical section which force-degrades any allocation to GFP_ATOMIC at > > > > did you mean memalloc_noreclaim_* here? > > Yeah I picked the wrong one of that family of functions. > > > > most, but has the risk that we run into code that assumes "GFP_KERNEL > > > never fails for small stuff" and has a decidedly less tested fallback > > > path than rcu code. > > > > Even if the above then please note that memalloc_noreclaim_* or > > PF_MEMALLOC should be used with an extreme care. Essentially only for > > internal memory reclaimers. It grants access to _all_ the available > > memory so any abuse can be detrimental to the overall system operation. > > Allocation failure in this mode means that we are out of memory and any > > code relying on such an allocation has to carefuly consider failure. > > This is not a random allocation mode. > > Agreed, that's why I don't like having these kind of automagic critical > sections. It's a bit a shotgun approach. Paul said that the code would > handle failures, but the problem is that it applies everywhere. Ohh, in the ideal world we wouldn't need anything like that. But then the reality fires: * PF_MEMALLOC (resp memalloc_noreclaim_* for that matter) is primarily used to make sure that allocations from inside the memory reclaim - yeah that happens - will not recurse. * PF_MEMALLOC_NO{FS,IO} (resp memalloc_no{fs,io}*) are used to mark no fs/io reclaim recursion critical sections because controling that for each allocation inside fs transaction (or other sensitive) or IO contexts turned out to be unmaintainable and people simply fallen into using NOFS/NOIO unconditionally which is causing reclaim imbalance problems. * PF_MEMALLOC_NOCMA (resp memalloc_nocma*) is used for long term pinning when CMA pages cannot be pinned because that would break the CMA guarantees. Communicating this to all potential allocations during pinning is simply unfeasible. So you are absolutely right that these critical sections with side effects on all allocations are far from ideal from the API point of view but they are mostly mirroring a demand for functionality which is _practically_ impossible to achieve with our current code base. Not that we couldn't get back to drawing board and come up with a saner thing and rework the world... -- Michal Hocko SUSE Labs _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel