* [PATCH v5 0/3] x86/bus_lock: Enable bus lock detection
@ 2021-03-13  5:49 Fenghua Yu
  2021-03-13  5:49 ` [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for " Fenghua Yu
  ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Fenghua Yu @ 2021-03-13  5:49 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
    Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar
Cc: linux-kernel, x86, Fenghua Yu

A bus lock [1] is acquired through either a split locked access to
writeback (WB) memory or any locked access to non-WB memory. This is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores.

Although a split lock can be detected by the #AC trap, that trap is
triggered before the instruction acquires the bus lock. This makes it
difficult to mitigate the bus lock (e.g. throttle the user application).

Some CPUs have the ability to notify the kernel by a #DB trap after a
user instruction acquires a bus lock and is executed. This allows the
kernel to enforce user application throttling or mitigations.

#DB for bus lock detect fixes issues in #AC for split lock detect:

1) It's architectural ... just need to look at one CPUID bit to know it
   exists.
2) The IA32_DEBUGCTL MSR, which reports bus lock in #DB, is per-thread.
   So each process or guest can have different behavior.
3) It has support for VMM/guests (new VMEXIT codes, etc).
4) It detects not only split locks but also bus locks from non-WB memory.

Hardware only generates #DB for bus lock detect when CPL > 0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.

Use the existing kernel command line parameter "split_lock_detect=" to
handle #DB for bus lock, with an additional option "ratelimit:N" to set
a bus lock rate limit for a user.
[1] Intel Instruction Set Extensions Chapter 9:
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

Change Log:
v5: Address all comments from Thomas:
- In the cover letter, update the latest ISE link to include the #DB for
  bus lock spec.
- In patch 1, add commit message for breakpoint and bus lock on the same
  instruction.
- In patch 2, change warn to #AC if both #AC and #DB are supported,
  remove sld and bld variables, remove bus lock checking in
  handle_bus_lock(), etc.
- In patch 3 and 4, remove bld_ratelimit < HZ/2 check and define
  bld_ratelimit only for Intel CPUs.
- Merge patch 2 and 3 into one patch for handling warn, fatal, and
  ratelimit.

v4 is here:
https://lore.kernel.org/lkml/20201124205245.4164633-2-fenghua.yu@intel.com/

v4:
- Fix a ratelimit wording issue in the doc (Randy).
- Patch 4 is acked by Randy (Randy).

v3:
- Enable Bus Lock Detection when fatal to handle bus lock from non-WB
  (PeterZ).
- Add Acked-by: PeterZ in patch 2.

v2:
- Send SIGBUS in fatal case for bus lock #DB (PeterZ).

v1:
- Check bus lock bit by its positive polarity (Xiaoyao).
- Fix a few wording issues in the documentation (Randy).

[RFC v3 can be found at: https://lore.kernel.org/patchwork/cover/1329943/]

RFC v3:
- Remove DR6_RESERVED change (PeterZ).
- Simplify the documentation (Randy).

RFC v2:
- Architecture changed based on feedback from Thomas and PeterZ. #DB is
  no longer generated for bus lock in ring 0.
- Split the one single patch into four patches.
[RFC v1 can be found at: https://lore.kernel.org/lkml/1595021700-68460-1-git-send-email-fenghua.yu@intel.com/] Fenghua Yu (3): x86/cpufeatures: Enumerate #DB for bus lock detection x86/bus_lock: Handle #DB for bus lock Documentation/admin-guide: Change doc for split_lock_detect parameter .../admin-guide/kernel-parameters.txt | 30 +++- arch/x86/include/asm/cpu.h | 10 +- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/uapi/asm/debugreg.h | 1 + arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/cpu/intel.c | 148 +++++++++++++++--- arch/x86/kernel/traps.c | 7 + include/linux/sched/user.h | 5 +- kernel/user.c | 13 ++ 10 files changed, 187 insertions(+), 31 deletions(-) -- 2.30.2 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for bus lock detection
  2021-03-13  5:49 [PATCH v5 0/3] x86/bus_lock: Enable bus lock detection Fenghua Yu
@ 2021-03-13  5:49 ` Fenghua Yu
  2021-03-19 20:35   ` Thomas Gleixner
  2021-03-13  5:49 ` [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock Fenghua Yu
  2021-03-13  5:49 ` [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter Fenghua Yu
  2 siblings, 1 reply; 18+ messages in thread
From: Fenghua Yu @ 2021-03-13  5:49 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
    Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar
Cc: linux-kernel, x86, Fenghua Yu

A bus lock is acquired though either split locked access to
writeback (WB) memory or any locked access to non-WB memory. This is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores.

Some CPUs have ability to notify the kernel by an #DB trap after a user
instruction acquires a bus lock and is executed. This allows the kernel
to enforce user application throttling or mitigations. Both breakpoint
and bus lock can trigger the #DB trap in the same instruction and the
ordering of handling them is the kernel #DB handler's choice.

The CPU feature flag to be shown in /proc/cpuinfo will be
"bus_lock_detect".

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
Change Log:
v5:
- Add "Both breakpoint and bus lock can trigger an #DB trap..." in the
  commit message (Thomas).
arch/x86/include/asm/cpufeatures.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index cc96e26d69f7..faec3d92d09b 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -354,6 +354,7 @@ #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */ #define X86_FEATURE_LA57 (16*32+16) /* 5-level page tables */ #define X86_FEATURE_RDPID (16*32+22) /* RDPID instruction */ +#define X86_FEATURE_BUS_LOCK_DETECT (16*32+24) /* Bus Lock detect */ #define X86_FEATURE_CLDEMOTE (16*32+25) /* CLDEMOTE instruction */ #define X86_FEATURE_MOVDIRI (16*32+27) /* MOVDIRI instruction */ #define X86_FEATURE_MOVDIR64B (16*32+28) /* MOVDIR64B instruction */ -- 2.30.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for bus lock detection
  2021-03-13  5:49 ` [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for " Fenghua Yu
@ 2021-03-19 20:35   ` Thomas Gleixner
  2021-03-19 21:00     ` Fenghua Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 20:35 UTC (permalink / raw)
To: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
    Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar
Cc: linux-kernel, x86, Fenghua Yu

On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote:
> A bus lock is acquired though either split locked access to

s/though/through/
either a

> writeback (WB) memory or any locked access to non-WB memory. This is
> typically >1000 cycles slower than an atomic operation within a cache
> line. It also disrupts performance on other cores.
>
> Some CPUs have ability to notify the kernel by an #DB trap after a user

the ability

> instruction acquires a bus lock and is executed. This allows the kernel
> to enforce user application throttling or mitigations. Both breakpoint
> and bus lock can trigger the #DB trap in the same instruction and the
> ordering of handling them is the kernel #DB handler's choice.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for bus lock detection
  2021-03-19 20:35   ` Thomas Gleixner
@ 2021-03-19 21:00     ` Fenghua Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Fenghua Yu @ 2021-03-19 21:00 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck,
    Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86

On Fri, Mar 19, 2021 at 09:35:39PM +0100, Thomas Gleixner wrote:
> On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote:
> > A bus lock is acquired though either split locked access to
>
> s/though/through/
> either a
>
> > Some CPUs have ability to notify the kernel by an #DB trap after a user
>
> the ability

Thank you for your review, Thomas! Will fix the issues.

-Fenghua

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
  2021-03-13  5:49 [PATCH v5 0/3] x86/bus_lock: Enable bus lock detection Fenghua Yu
  2021-03-13  5:49 ` [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for " Fenghua Yu
@ 2021-03-13  5:49 ` Fenghua Yu
  2021-03-19 21:30   ` Thomas Gleixner
  2021-03-13  5:49 ` [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter Fenghua Yu
  2 siblings, 1 reply; 18+ messages in thread
From: Fenghua Yu @ 2021-03-13  5:49 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
    Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar
Cc: linux-kernel, x86, Fenghua Yu

Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.

#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.

split_lock_detect=
		#AC for split lock		#DB for bus lock

off		Do nothing			Do nothing

warn		Kernel OOPs.			Warn once per task and
		Warn once per task and		continues to run.
		disable future checking.
		When both features are
		supported, warn in #AC.

fatal		Kernel OOPs.			Send SIGBUS to user.
		Send SIGBUS to user.
		When both features are
		supported, fatal in #AC.

ratelimit:N	Do nothing			Limit bus lock rate to N
						per second for the
						current non-root user.

Default option is "warn".

Hardware only generates #DB for bus lock detect when CPL > 0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled,
so there is no need to handle #DB for bus locks detected in the kernel.

#DB for bus lock is enabled by bus lock detection bit 2 in the DEBUGCTL
MSR, while #AC for split lock is enabled by split lock detection bit 29
in the TEST_CTRL MSR.

Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.

Delivery of #DB for a bus lock in userspace clears DR6[11]. To avoid
confusion when identifying the #DB, the #DB handler sets the bit back to
1 before returning to the interrupted task.

Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
---
Change Log:
v5: Address all comments from Thomas:
- Merge patch 2 and patch 3 into one patch so all "split_lock_detect="
  options are processed in one patch.
- Change warn to #AC if both #AC and #DB are supported.
- Remove sld and bld variables and use boot_cpu_has() to check bus
  lock/split lock support.
- Remove bus lock checking in handle_bus_lock().
- Remove bld_ratelimit < HZ/2 check.
- Add rate limit handling comment in bus lock #DB.
- Define bld_ratelimit only for Intel CPUs.
v3:
- Enable Bus Lock Detection when fatal to handle bus lock from non-WB
  (PeterZ).
v2:
- Send SIGBUS in fatal case for bus lock #DB (PeterZ).
v1:
- Check bus lock bit by its positive polarity (Xiaoyao).
RFC v3:
- Remove DR6_RESERVED change (PeterZ).
arch/x86/include/asm/cpu.h | 10 +- arch/x86/include/asm/msr-index.h | 1 + arch/x86/include/uapi/asm/debugreg.h | 1 + arch/x86/kernel/cpu/common.c | 2 +- arch/x86/kernel/cpu/intel.c | 148 +++++++++++++++++++++++---- arch/x86/kernel/traps.c | 7 ++ include/linux/sched/user.h | 5 +- kernel/user.c | 13 +++ 8 files changed, 162 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h index da78ccbd493b..991de5f2a09c 100644 --- a/arch/x86/include/asm/cpu.h +++ b/arch/x86/include/asm/cpu.h @@ -41,12 +41,14 @@ unsigned int x86_family(unsigned int sig); unsigned int x86_model(unsigned int sig); unsigned int x86_stepping(unsigned int sig); #ifdef CONFIG_CPU_SUP_INTEL -extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c); +extern void __init sld_setup(struct cpuinfo_x86 *c); extern void switch_to_sld(unsigned long tifn); extern bool handle_user_split_lock(struct pt_regs *regs, long error_code); extern bool handle_guest_split_lock(unsigned long ip); +extern void handle_bus_lock(struct pt_regs *regs); +extern int bld_ratelimit; #else -static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {} +static inline void __init sld_setup(struct cpuinfo_x86 *c) {} static inline void switch_to_sld(unsigned long tifn) {} static inline bool handle_user_split_lock(struct pt_regs *regs, long error_code) { @@ -57,6 +59,10 @@ static inline bool handle_guest_split_lock(unsigned long ip) { return false; } + +static inline void handle_bus_lock(struct pt_regs *regs) +{ +} #endif #ifdef CONFIG_IA32_FEAT_CTL void init_ia32_feat_ctl(struct cpuinfo_x86 *c); diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h index 546d6ecf0a35..558485965f21 100644 --- a/arch/x86/include/asm/msr-index.h +++ b/arch/x86/include/asm/msr-index.h @@ -265,6 +265,7 @@ #define DEBUGCTLMSR_LBR (1UL << 0) /* last branch recording */ #define DEBUGCTLMSR_BTF_SHIFT 1 #define DEBUGCTLMSR_BTF (1UL << 1) /* single-step on branches */ 
+#define DEBUGCTLMSR_BUS_LOCK_DETECT (1UL << 2) /* Bus lock detect */ #define DEBUGCTLMSR_TR (1UL << 6) #define DEBUGCTLMSR_BTS (1UL << 7) #define DEBUGCTLMSR_BTINT (1UL << 8) diff --git a/arch/x86/include/uapi/asm/debugreg.h b/arch/x86/include/uapi/asm/debugreg.h index d95d080b30e3..0007ba077c0c 100644 --- a/arch/x86/include/uapi/asm/debugreg.h +++ b/arch/x86/include/uapi/asm/debugreg.h @@ -24,6 +24,7 @@ #define DR_TRAP3 (0x8) /* db3 */ #define DR_TRAP_BITS (DR_TRAP0|DR_TRAP1|DR_TRAP2|DR_TRAP3) +#define DR_BUS_LOCK (0x800) /* bus_lock */ #define DR_STEP (0x4000) /* single-step */ #define DR_SWITCH (0x8000) /* task switch */ diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index ab640abe26b6..1a4e260b9027 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1330,7 +1330,7 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c) cpu_set_bug_bits(c); - cpu_set_core_cap_bits(c); + sld_setup(c); fpu__init_system(c); diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 0e422a544835..2bd680c0c9ae 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -10,6 +10,9 @@ #include <linux/thread_info.h> #include <linux/init.h> #include <linux/uaccess.h> +#include <linux/cred.h> +#include <linux/delay.h> +#include <linux/sched/user.h> #include <asm/cpufeature.h> #include <asm/msr.h> @@ -41,12 +44,13 @@ enum split_lock_detect_state { sld_off = 0, sld_warn, sld_fatal, + sld_ratelimit, }; /* - * Default to sld_off because most systems do not support split lock detection - * split_lock_setup() will switch this to sld_warn on systems that support - * split lock detect, unless there is a command line override. + * Default to sld_off because most systems do not support split lock detection. + * sld_state_setup() will switch this to sld_warn on systems that support + * split lock/bus lock detect, unless there is a command line override. 
*/ static enum split_lock_detect_state sld_state __ro_after_init = sld_off; static u64 msr_test_ctrl_cache __ro_after_init; @@ -603,6 +607,7 @@ static void init_intel_misc_features(struct cpuinfo_x86 *c) } static void split_lock_init(void); +static void bus_lock_init(void); static void init_intel(struct cpuinfo_x86 *c) { @@ -720,6 +725,7 @@ static void init_intel(struct cpuinfo_x86 *c) tsx_disable(); split_lock_init(); + bus_lock_init(); intel_init_thermal(c); } @@ -995,13 +1001,24 @@ static const struct { { "off", sld_off }, { "warn", sld_warn }, { "fatal", sld_fatal }, + { "ratelimit:", sld_ratelimit }, }; static inline bool match_option(const char *arg, int arglen, const char *opt) { - int len = strlen(opt); - return len == arglen && !strncmp(arg, opt, len); + int len = strlen(opt), ratelimit; + + if (strncmp(arg, opt, len)) + return false; + + if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) { + bld_ratelimit = ratelimit; + + return true; + } + + return len == arglen; } static bool split_lock_verify_msr(bool on) @@ -1020,16 +1037,15 @@ static bool split_lock_verify_msr(bool on) return ctrl == tmp; } -static void __init split_lock_setup(void) +static void __init sld_state_setup(void) { enum split_lock_detect_state state = sld_warn; char arg[20]; int i, ret; - if (!split_lock_verify_msr(false)) { - pr_info("MSR access failed: Disabled\n"); + if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && + !boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) return; - } ret = cmdline_find_option(boot_command_line, "split_lock_detect", arg, sizeof(arg)); @@ -1041,17 +1057,14 @@ static void __init split_lock_setup(void) } } } + sld_state = state; +} - switch (state) { - case sld_off: - pr_info("disabled\n"); +static void __init __split_lock_setup(void) +{ + if (!split_lock_verify_msr(false)) { + pr_info("MSR access failed: Disabled\n"); return; - case sld_warn: - pr_info("warning about user-space split_locks\n"); - break; - case sld_fatal: - pr_info("sending SIGBUS on 
user-space split_locks\n"); - break; } rdmsrl(MSR_TEST_CTRL, msr_test_ctrl_cache); @@ -1061,7 +1074,9 @@ static void __init split_lock_setup(void) return; } - sld_state = state; + /* Restore the MSR to its cached value. */ + wrmsrl(MSR_TEST_CTRL, msr_test_ctrl_cache); + setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT); } @@ -1082,6 +1097,15 @@ static void sld_update_msr(bool on) static void split_lock_init(void) { + /* + * #DB for bus lock handles ratelimit and #AC for split lock is + * disabled. + */ + if (sld_state == sld_ratelimit) { + split_lock_verify_msr(false); + return; + } + if (cpu_model_supports_sld) split_lock_verify_msr(sld_state != sld_off); } @@ -1118,6 +1142,29 @@ bool handle_guest_split_lock(unsigned long ip) } EXPORT_SYMBOL_GPL(handle_guest_split_lock); +static void bus_lock_init(void) +{ + u64 val; + + /* + * Warn and fatal are handled by #AC for split lock if #AC for + * split lock is supported. + */ + if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT) || + (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && + (sld_state == sld_warn || sld_state == sld_fatal)) || + sld_state == sld_off) + return; + + /* + * Enable #DB for bus lock. All bus locks are handled in #DB except + * split locks are handled in #AC in fatal case. + */ + rdmsrl(MSR_IA32_DEBUGCTLMSR, val); + val |= DEBUGCTLMSR_BUS_LOCK_DETECT; + wrmsrl(MSR_IA32_DEBUGCTLMSR, val); +} + bool handle_user_split_lock(struct pt_regs *regs, long error_code) { if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal) @@ -1126,6 +1173,26 @@ bool handle_user_split_lock(struct pt_regs *regs, long error_code) return true; } +void handle_bus_lock(struct pt_regs *regs) +{ + switch (sld_state) { + case sld_off: + break; + case sld_warn: + pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n", + current->comm, current->pid, regs->ip); + break; + case sld_fatal: + force_sig_fault(SIGBUS, BUS_ADRALN, NULL); + break; + case sld_ratelimit: + /* Enforce no more than bld_ratelimit bus locks/sec. 
*/ + while (!__ratelimit(&get_current_user()->bld_ratelimit)) + msleep(1000 / bld_ratelimit); + break; + } +} + /* * This function is called only when switching between tasks with * different split-lock detection modes. It sets the MSR for the @@ -1166,7 +1233,7 @@ static const struct x86_cpu_id split_lock_cpu_ids[] __initconst = { {} }; -void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) +static void __init split_lock_setup(struct cpuinfo_x86 *c) { const struct x86_cpu_id *m; u64 ia32_core_caps; @@ -1193,5 +1260,44 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) } cpu_model_supports_sld = true; - split_lock_setup(); + __split_lock_setup(); +} + +static void sld_state_show(void) +{ + if (!boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT) && + !boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) + return; + + switch (sld_state) { + case sld_off: + pr_info("disabled\n"); + break; + case sld_warn: + if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) + pr_info("#AC: crashing the kernel on kernel split_locks and warning on user-space split_locks\n"); + else if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) + pr_info("#DB: warning on user-space bus_locks\n"); + break; + case sld_fatal: + if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT)) { + pr_info("#AC: crashing the kernel on kernel split_locks and sending SIGBUS on user-space split_locks\n"); + } else if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) { + pr_info("#DB: sending SIGBUS on user-space bus_locks%s\n", + boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) ? 
+ " from non-WB" : ""); + } + break; + case sld_ratelimit: + if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT)) + pr_info("#DB: setting rate limit to %d/sec per user on non-root user-space bus_locks\n", bld_ratelimit); + break; + } +} + +void __init sld_setup(struct cpuinfo_x86 *c) +{ + split_lock_setup(c); + sld_state_setup(); + sld_state_show(); } diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 7f5aec758f0e..21f885a1609d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -979,6 +979,13 @@ static __always_inline void exc_debug_user(struct pt_regs *regs, goto out_irq; } + /* + * Handle bus lock. #DB for bus lock can only be triggered from + * userspace. The bus lock bit was flipped to positive polarity. + */ + if (dr6 & DR_BUS_LOCK) + handle_bus_lock(regs); + /* Add the virtual_dr6 bits for signals. */ dr6 |= current->thread.virtual_dr6; if (dr6 & (DR_STEP | DR_TRAP_BITS) || icebp) diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h index a8ec3b6093fc..af8ec609910e 100644 --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -40,8 +40,11 @@ struct user_struct { atomic_t nr_watches; /* The number of watches this user currently has */ #endif - /* Miscellaneous per-user rate limit */ + /* Miscellaneous per-user rate limits */ struct ratelimit_state ratelimit; +#ifdef CONFIG_CPU_SUP_INTEL + struct ratelimit_state bld_ratelimit; +#endif }; extern int uids_sysfs_init(void); diff --git a/kernel/user.c b/kernel/user.c index a2478cddf536..654aefb7ce8f 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -103,6 +103,9 @@ struct user_struct root_user = { .locked_shm = 0, .uid = GLOBAL_ROOT_UID, .ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0), +#ifdef CONFIG_CPU_SUP_INTEL + .bld_ratelimit = RATELIMIT_STATE_INIT(root_user.bld_ratelimit, 0, 0), +#endif }; /* @@ -172,6 +175,11 @@ void free_uid(struct user_struct *up) free_user(up, flags); } +#ifdef CONFIG_CPU_SUP_INTEL +/* Some Intel CPUs may set this 
for rate-limited bus locks. */ +int bld_ratelimit; +#endif + struct user_struct *alloc_uid(kuid_t uid) { struct hlist_head *hashent = uidhashentry(uid); @@ -190,6 +198,11 @@ struct user_struct *alloc_uid(kuid_t uid) refcount_set(&new->__count, 1); ratelimit_state_init(&new->ratelimit, HZ, 100); ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE); +#ifdef CONFIG_CPU_SUP_INTEL + ratelimit_state_init(&new->bld_ratelimit, HZ, bld_ratelimit); + ratelimit_set_flags(&new->bld_ratelimit, + RATELIMIT_MSG_ON_RELEASE); +#endif /* * Before adding this, check whether we raced -- 2.30.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
  2021-03-13  5:49 ` [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock Fenghua Yu
@ 2021-03-19 21:30   ` Thomas Gleixner
  2021-03-19 21:50     ` Luck, Tony
  2021-03-19 22:19     ` Fenghua Yu
  0 siblings, 2 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-19 21:30 UTC (permalink / raw)
To: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
    Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar
Cc: linux-kernel, x86, Fenghua Yu

On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote:
> Change Log:
> v5:
> Address all comments from Thomas:
> - Merge patch 2 and patch 3 into one patch so all "split_lock_detect="
>   options are processed in one patch.

What? I certainly did not request that. I said:

  "Why is this seperate and an all in one thing? patch 2/4 changes the
   parameter first and 3/4 adds a new option...."

which means we want documentation for patch 2 and documentation for
patch 3. The ratelimit thing is clearly extra functionality on top of
that buslock muck. Next time I'll write it out.

> +	if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) {
> +		bld_ratelimit = ratelimit;

So any rate up to INTMAX/s is valid here, right?

> +	case sld_ratelimit:
> +		/* Enforce no more than bld_ratelimit bus locks/sec. */
> +		while (!__ratelimit(&get_current_user()->bld_ratelimit))
> +			msleep(1000 / bld_ratelimit);

which is cute, because msleep() will always sleep until the next jiffie
increment happens.

What's not so cute here is the fact that get_current_user() takes a
reference on current's UID on every invocation, but nothing ever calls
free_uid(). I missed that last time over the way more obvious HZ
division.
> +++ b/kernel/user.c > @@ -103,6 +103,9 @@ struct user_struct root_user = { > .locked_shm = 0, > .uid = GLOBAL_ROOT_UID, > .ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0), > +#ifdef CONFIG_CPU_SUP_INTEL > + .bld_ratelimit = RATELIMIT_STATE_INIT(root_user.bld_ratelimit, 0, 0), > +#endif > }; > > /* > @@ -172,6 +175,11 @@ void free_uid(struct user_struct *up) > free_user(up, flags); > } > > +#ifdef CONFIG_CPU_SUP_INTEL > +/* Some Intel CPUs may set this for rate-limited bus locks. */ > +int bld_ratelimit; > +#endif Of course this variable is still required to be in the core kernel code because? While you decided to munge this all together, you obviously ignored the following review comment: "It also lacks the information that the ratelimiting is per UID and not per task and why this was chosen to be per UID..." There is still no reasoning neither in the changelog nor in the cover letter nor in a reply to my review. So let me repeat my question and make it more explicit: What is the justifucation for making this rate limit per UID and not per task, per process or systemwide? > struct user_struct *alloc_uid(kuid_t uid) > { > struct hlist_head *hashent = uidhashentry(uid); > @@ -190,6 +198,11 @@ struct user_struct *alloc_uid(kuid_t uid) > refcount_set(&new->__count, 1); > ratelimit_state_init(&new->ratelimit, HZ, 100); > ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE); > +#ifdef CONFIG_CPU_SUP_INTEL > + ratelimit_state_init(&new->bld_ratelimit, HZ, bld_ratelimit); > + ratelimit_set_flags(&new->bld_ratelimit, > + RATELIMIT_MSG_ON_RELEASE); > +#endif If this has a proper justification for being per user and having to add 40 bytes per UID for something which is mostly unused then there are definitely better ways to do that than slapping #ifdefs into architecture agnostic core code. 
So if you instead of munging the code patches had split the documentation, then I could apply the first 3 patches and we would only have to sort out the ratelimiting muck. Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
  2021-03-19 21:30   ` Thomas Gleixner
@ 2021-03-19 21:50     ` Luck, Tony
  2021-03-20  1:01       ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Luck, Tony @ 2021-03-19 21:50 UTC (permalink / raw)
To: Thomas Gleixner, Yu, Fenghua, Ingo Molnar, Borislav Petkov,
    Peter Zijlstra, Randy Dunlap, Li, Xiaoyao, Shankar, Ravi V
Cc: linux-kernel, x86, Yu, Fenghua

> What is the justifucation for making this rate limit per UID and not
> per task, per process or systemwide?

The concern is that a malicious user is running a workload that loops
obtaining the buslock. This brings the whole system to its knees.

Limiting per task doesn't help. The user can just fork(2) a whole bunch
of tasks for a distributed buslock attack.

Systemwide might be an interesting alternative. The downside would be
accidental rate limiting of non-malicious tasks that happen to grab a
bus lock periodically but in the same window with other buslocks from
other users.

Do you think that's a risk worth taking to make the code simpler?

-Tony

^ permalink raw reply	[flat|nested] 18+ messages in thread
* RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock
  2021-03-19 21:50     ` Luck, Tony
@ 2021-03-20  1:01       ` Thomas Gleixner
  2021-03-20 13:57         ` Thomas Gleixner
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Gleixner @ 2021-03-20  1:01 UTC (permalink / raw)
To: Luck, Tony, Yu, Fenghua, Ingo Molnar, Borislav Petkov,
    Peter Zijlstra, Randy Dunlap, Li, Xiaoyao, Shankar, Ravi V
Cc: linux-kernel, x86, Yu, Fenghua

On Fri, Mar 19 2021 at 21:50, Tony Luck wrote:
>> What is the justifucation for making this rate limit per UID and not
>> per task, per process or systemwide?
>
> The concern is that a malicious user is running a workload that loops
> obtaining the buslock. This brings the whole system to its knees.
>
> Limiting per task doesn't help. The user can just fork(2) a whole bunch
> of tasks for a distributed buslock attack.

Fair enough.

> Systemwide might be an interesting alternative. The downside would be
> accidental rate limiting of non-malicious tasks that happen to grab a
> bus lock periodically but in the same window with other buslocks from
> other users.
>
> Do you think that's a risk worth taking to make the code simpler?

I'd consider it low risk, but I just looked for the usage of the
existing ratelimit in struct user and the related commit. Now it dawns
on me where you are coming from.

So that seems to become a pattern ... so the uncompiled thing below
might solve that. Yes, it makes the efivars thingy slower, but do we
care? We neither care about efivars performance nor about the buslock
performance. But I pretty much care about not having to stare at code
which gets the fundamental refcounting wrong.
Thanks, tglx --- fs/efivarfs/Kconfig | 1 + fs/efivarfs/file.c | 11 +++++++++-- include/linux/ratelimit.h | 1 + include/linux/ratelimit_uid.h | 26 ++++++++++++++++++++++++++ include/linux/sched/user.h | 4 ++-- kernel/user.c | 7 ++++--- lib/Kconfig | 3 +++ lib/ratelimit.c | 41 +++++++++++++++++++++++++++++++++++++++++ 8 files changed, 87 insertions(+), 7 deletions(-) --- a/fs/efivarfs/Kconfig +++ b/fs/efivarfs/Kconfig @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only config EFIVAR_FS tristate "EFI Variable filesystem" + select UID_RATELIMIT depends on EFI default m help --- a/fs/efivarfs/file.c +++ b/fs/efivarfs/file.c @@ -63,6 +63,14 @@ static ssize_t efivarfs_file_write(struc return bytes; } +static const struct uid_ratelimit_cfg efivars_rl = { + .which = UID_RATELIMIT_EFIVARS, + .interval = HZ, + .burst = 100, + .flags = RATELIMIT_MSG_ON_RELEASE, + .delay = 50, +}; + static ssize_t efivarfs_file_read(struct file *file, char __user *userbuf, size_t count, loff_t *ppos) { @@ -73,8 +81,7 @@ static ssize_t efivarfs_file_read(struct ssize_t size = 0; int err; - while (!__ratelimit(&file->f_cred->user->ratelimit)) - msleep(50); + uid_ratelimit(&efivars_rl); err = efivar_entry_size(var, &datasize); --- a/include/linux/ratelimit.h +++ b/include/linux/ratelimit.h @@ -3,6 +3,7 @@ #define _LINUX_RATELIMIT_H #include <linux/ratelimit_types.h> +#include <linux/ratelimit_uid.h> #include <linux/sched.h> #include <linux/spinlock.h> --- /dev/null +++ b/include/linux/ratelimit_uid.h @@ -0,0 +1,26 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_RATELIMIT_UID_H +#define _LINUX_RATELIMIT_UID_H + +/* Per UID ratelimits */ +enum uid_ratelimits { +#ifdef CONFIG_EFIVAR_FS + UID_RATELIMIT_EFIVARS, +#endif + UID_RATELIMIT_MAX, +}; + +#define UID_RATELIMIT_NODELAY ULONG_MAX + +struct uid_ratelimit_cfg { + enum uid_ratelimits which; + int interval; + int burst; + unsigned long flags; + unsigned long delay; +}; + +extern int __uid_ratelimit(const struct uid_ratelimit_cfg *cfg, 
const void *func); +#define uid_ratelimit(cfg) __uid_ratelimit(cfg, __func__) + +#endif --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -40,8 +40,8 @@ struct user_struct { atomic_t nr_watches; /* The number of watches this user currently has */ #endif - /* Miscellaneous per-user rate limit */ - struct ratelimit_state ratelimit; + /* Miscellaneous per-user rate limits storage */ + struct ratelimit_state *ratelimits[UID_RATELIMIT_MAX]; }; extern int uids_sysfs_init(void); --- a/kernel/user.c +++ b/kernel/user.c @@ -102,7 +102,6 @@ struct user_struct root_user = { .sigpending = ATOMIC_INIT(0), .locked_shm = 0, .uid = GLOBAL_ROOT_UID, - .ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0), }; /* @@ -139,8 +138,12 @@ static struct user_struct *uid_hash_find static void free_user(struct user_struct *up, unsigned long flags) __releases(&uidhash_lock) { + unsigned int i; + uid_hash_remove(up); spin_unlock_irqrestore(&uidhash_lock, flags); + for (i = 0; i < UID_RATELIMIT_MAX; i++) + kfree(up->ratelimits[i]); kmem_cache_free(uid_cachep, up); } @@ -188,8 +191,6 @@ struct user_struct *alloc_uid(kuid_t uid new->uid = uid; refcount_set(&new->__count, 1); - ratelimit_state_init(&new->ratelimit, HZ, 100); - ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE); /* * Before adding this, check whether we raced --- a/lib/Kconfig +++ b/lib/Kconfig @@ -701,3 +701,6 @@ config GENERIC_LIB_DEVMEM_IS_ALLOWED config PLDMFW bool default n + +config UID_RATELIMIT + bool --- a/lib/ratelimit.c +++ b/lib/ratelimit.c @@ -11,6 +11,9 @@ #include <linux/ratelimit.h> #include <linux/jiffies.h> #include <linux/export.h> +#include <linux/delay.h> +#include <linux/slab.h> +#include <linux/cred.h> /* * __ratelimit - rate limiting @@ -68,3 +71,41 @@ int ___ratelimit(struct ratelimit_state return ret; } EXPORT_SYMBOL(___ratelimit); + +#ifdef CONFIG_UID_RATELIMIT +int __uid_ratelimit(const struct uid_ratelimit_cfg *cfg, const void *func) +{ + struct user_struct *up = 
get_current_user(); + unsigned int which = cfg->which; + struct ratelimit_state *rs; + int ret = 0; + + if (WARN_ON_ONCE(which >= UID_RATELIMIT_MAX)) + goto out; + + rs = up->ratelimits[which]; + if (!rs) { + /* ratelimit_state_init() does a memset(,0,) */ + rs = kmalloc(sizeof(*rs), GFP_KERNEL); + if (!rs) + goto out; + ratelimit_state_init(rs, cfg->interval, cfg->burst); + ratelimit_set_flags(rs, cfg->flags); + if (cmpxchg(&up->ratelimits[which], NULL, rs) != NULL) { + kfree(rs); + rs = up->ratelimits[which]; + } + } + + while (1) { + ret = ___ratelimit(rs, func); + if (ret || cfg->delay == UID_RATELIMIT_NODELAY) + break; + msleep(cfg->delay); + } +out: + free_uid(up); + return ret; +} +EXPORT_SYMBOL_GPL(__uid_ratelimit); +#endif ^ permalink raw reply [flat|nested] 18+ messages in thread
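The allocation race in __uid_ratelimit() above uses a classic lock-free lazy-init pattern: allocate, try to install with cmpxchg(), and free the local copy if another CPU won. A rough userspace sketch of the same pattern, with C11 atomics standing in for the kernel's cmpxchg() (the names and the int payload are illustrative only, not kernel API):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* One lazily-populated slot, as in up->ratelimits[which]. */
static _Atomic(int *) slot;

/* Return the slot's object, allocating it on first use.
 * If two callers race, exactly one allocation survives. */
static int *get_slot(void)
{
	int *cur = atomic_load(&slot);

	if (!cur) {
		int *fresh = malloc(sizeof(*fresh));
		int *expected = NULL;

		if (!fresh)
			return NULL;
		*fresh = 42;	/* stand-in for ratelimit_state_init() */
		if (atomic_compare_exchange_strong(&slot, &expected, fresh)) {
			cur = fresh;	/* we won the race */
		} else {
			free(fresh);	/* loser frees its copy ... */
			cur = expected;	/* ... and uses the winner's */
		}
	}
	return cur;
}
```

Note that atomic_compare_exchange_strong() writes the winner's pointer back into `expected` on failure, which is why the losing path can pick it up directly, just as the kernel code re-reads up->ratelimits[which] after a failed cmpxchg().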
* RE: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-03-20 1:01 ` Thomas Gleixner @ 2021-03-20 13:57 ` Thomas Gleixner 2021-04-03 0:50 ` Fenghua Yu 0 siblings, 1 reply; 18+ messages in thread From: Thomas Gleixner @ 2021-03-20 13:57 UTC (permalink / raw) To: Luck, Tony, Yu, Fenghua, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Randy Dunlap, Li, Xiaoyao, Shankar, Ravi V Cc: linux-kernel, x86, Yu, Fenghua On Sat, Mar 20 2021 at 02:01, Thomas Gleixner wrote: > On Fri, Mar 19 2021 at 21:50, Tony Luck wrote: >>> What is the justification for making this rate limit per UID and not >>> per task, per process or systemwide? >> >> The concern is that a malicious user is running a workload that loops >> obtaining the buslock. This brings the whole system to its knees. >> >> Limiting per task doesn't help. The user can just fork(2) a whole bunch >> of tasks for a distributed buslock attack. > > Fair enough. > >> Systemwide might be an interesting alternative. Downside would be accidental >> rate limiting of non-malicious tasks that happen to grab a bus lock periodically >> but in the same window with other buslocks from other users. >> >> Do you think that's a risk worth taking to make the code simpler? > > I'd consider it low risk, but I just looked for the usage of the > existing ratelimit in struct user and the related commit. Now it dawns > on me where you are coming from. So after getting real numbers myself, I have more thoughts on this. Setting a reasonable per-user limit might be hard when you want to protect e.g. against an orchestrated effort by several users (containers...). If each of them stays under the limit, which is easy enough to figure out, then you still end up with significant accumulated damage. So systemwide might not be the worst option after all. The question is how widespread bus locks are in existing applications. I haven't found any on a dozen machines with random variants of workloads so far according to perf ... -e sq_misc.split_lock.
What's the actual scenario in the real world where a buslock access might be legitimate? And what's the advice, recommendation for a system administrator how to analyze the situation and what kind of parameter to set? I tried to get answers from Documentation/x86/buslock.rst, but .... Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-03-20 13:57 ` Thomas Gleixner @ 2021-04-03 0:50 ` Fenghua Yu 0 siblings, 0 replies; 18+ messages in thread From: Fenghua Yu @ 2021-04-03 0:50 UTC (permalink / raw) To: Thomas Gleixner Cc: Luck, Tony, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Randy Dunlap, Li, Xiaoyao, Shankar, Ravi V, linux-kernel, x86 Hi, Thomas, On Sat, Mar 20, 2021 at 02:57:52PM +0100, Thomas Gleixner wrote: > On Sat, Mar 20 2021 at 02:01, Thomas Gleixner wrote: > > > On Fri, Mar 19 2021 at 21:50, Tony Luck wrote: > >>> What is the justification for making this rate limit per UID and not > >>> per task, per process or systemwide? > >> > >> The concern is that a malicious user is running a workload that loops > >> obtaining the buslock. This brings the whole system to its knees. > >> > >> Limiting per task doesn't help. The user can just fork(2) a whole bunch > >> of tasks for a distributed buslock attack. > > > > Fair enough. > > > >> Systemwide might be an interesting alternative. Downside would be accidental > >> rate limiting of non-malicious tasks that happen to grab a bus lock periodically > >> but in the same window with other buslocks from other users. > >> > >> Do you think that's a risk worth taking to make the code simpler? > > > > I'd consider it low risk, but I just looked for the usage of the > > existing ratelimit in struct user and the related commit. Now it dawns > > on me where you are coming from. > > So after getting real numbers myself, I have more thoughts on > this. Setting a reasonable per-user limit might be hard when you want to > protect e.g. against an orchestrated effort by several users > (containers...). If each of them stays under the limit, which is easy > enough to figure out, then you still end up with significant accumulated > damage. > > So systemwide might not be the worst option after all. Indeed. > > The question is how widespread bus locks are in existing applications.
> I haven't found any on a dozen machines with random variants of > workloads so far according to perf ... -e sq_misc.split_lock. We have been running various tests widely inside Intel (and also outside) after enabling split lock and captured a few split lock issues in firmware, kernel, drivers, and apps. As you know, we have submitted a few patches to fix the split lock issues in the kernel and drivers (e.g. split lock in bit ops) and fixed a few split lock issues in firmware. But so far I'm not aware of any split lock issues in user space. I guess compilers do a good job of cache line alignment to avoid this issue. But inline asm in user apps can easily hit this issue (on purpose). > > What's the actual scenario in the real world where a buslock access > might be legitimate? I did a simple experiment: looping on a split locked instruction on one core in one user can slow down a "dd" command running on another core in another user by 7 times. A malicious user can do similar things to slow down the whole system performance, right? > > And what's the advice, recommendation for a system administrator how to > analyze the situation and what kind of parameter to set? > > I tried to get answers from Documentation/x86/buslock.rst, but .... Can I change the sleep code in handle_bus_lock() to the following? while (!__ratelimit(&global_bld_ratelimit)) usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit); Maybe the system-wide bus lock ratelimit can be set to a default value of 1,000,000/s, which is also the max ratelimit value. The max sleep in the kernel is 1 usec, which means the max bld_ratelimit can be up to 1,000,000. If the system administrator thinks bus locks are less tolerable and wants to throttle bus locks further, bld_ratelimit can be set to a smaller number. The smallest bld_ratelimit is 1.
When I gradually decrease the bld_ratelimit value, I can see that fewer bus locks are issued per second systemwide, and the "dd" command or other memory benchmarks are less impacted by the bus locks. If this works, I will have the buslock.rst doc explain the situation and how to set the parameter. Thanks. -Fenghua ^ permalink raw reply [flat|nested] 18+ messages in thread
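For reference, ratelimit_state_init(rs, interval, burst) arms a budget of `burst` events per `interval` jiffies, and ___ratelimit() succeeds while that budget lasts. A much-simplified userspace model of those semantics (an approximation for illustration, not the kernel implementation):

```c
/* Simplified model of struct ratelimit_state / ___ratelimit():
 * allow up to `burst` events per `interval` ticks. */
struct rl {
	long interval;	/* window length in ticks (jiffies) */
	long burst;	/* events allowed per window */
	long begin;	/* start of the current window */
	long events;	/* events seen in the current window */
};

/* Return 1 if an event at time `now` is within budget, 0 if throttled. */
static int rl_allow(struct rl *rs, long now)
{
	if (!rs->interval)
		return 1;			/* interval 0: no limiting */
	if (now - rs->begin >= rs->interval) {	/* window expired: reset */
		rs->begin = now;
		rs->events = 0;
	}
	if (rs->events < rs->burst) {
		rs->events++;
		return 1;
	}
	return 0;	/* over budget: the caller sleeps, as proposed above */
}
```

This is why "ratelimit to N/s" maps naturally onto an (interval = HZ, burst = N) state, and why the throttling loop sleeps whenever the call fails.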
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-03-19 21:30 ` Thomas Gleixner 2021-03-19 21:50 ` Luck, Tony @ 2021-03-19 22:19 ` Fenghua Yu 2021-03-20 12:42 ` Thomas Gleixner 1 sibling, 1 reply; 18+ messages in thread From: Fenghua Yu @ 2021-03-19 22:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 Hi, Thomas, On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote: > On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote: > > Change Log: > > v5: > > Address all comments from Thomas: > > - Merge patch 2 and patch 3 into one patch so all "split_lock_detect=" > > options are processed in one patch. > > What? I certainly did not request that. I said: > > "Why is this separate and an all in one thing? patch 2/4 changes the > parameter first and 3/4 adds a new option...." > > which means we want documentation for patch 2 and documentation for > patch 3? > > The ratelimit thing is clearly an extra functionality on top of that > buslock muck. > > Next time I write it out.. Sorry for misunderstanding your comments. I will split the document patch into two: one for patch 2 (warn and fatal) and one for patch 3 (ratelimit). > > > + if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) { > > + bld_ratelimit = ratelimit; > > So any rate up to INTMAX/s is valid here, right? Yes. I don't see a smaller limitation than INTMAX/s. Is that right? > > > + case sld_ratelimit: > > + /* Enforce no more than bld_ratelimit bus locks/sec. */ > > + while (!__ratelimit(&get_current_user()->bld_ratelimit)) > > + msleep(1000 / bld_ratelimit); > > which is cute because msleep() will always sleep until the next jiffy > increment happens. > > What's not so cute here is the fact that get_current_user() takes a > reference on current's UID on every invocation, but nothing ever calls > free_uid(). > I missed that last time over the way more obvious HZ division.
I will call free_uid(). > > > +++ b/kernel/user.c > > @@ -103,6 +103,9 @@ struct user_struct root_user = { > > .locked_shm = 0, > > .uid = GLOBAL_ROOT_UID, > > .ratelimit = RATELIMIT_STATE_INIT(root_user.ratelimit, 0, 0), > > +#ifdef CONFIG_CPU_SUP_INTEL > > + .bld_ratelimit = RATELIMIT_STATE_INIT(root_user.bld_ratelimit, 0, 0), > > +#endif > > }; > > > > /* > > @@ -172,6 +175,11 @@ void free_uid(struct user_struct *up) > > free_user(up, flags); > > } > > > > +#ifdef CONFIG_CPU_SUP_INTEL > > +/* Some Intel CPUs may set this for rate-limited bus locks. */ > > +int bld_ratelimit; > > +#endif > > Of course this variable is still required to be in the core kernel code > because? > > While you decided to munge this all together, you obviously ignored the > following review comment: > > "It also lacks the information that the ratelimiting is per UID > and not per task and why this was chosen to be per UID..." > > There is still no reasoning, neither in the changelog nor in the cover > letter nor in a reply to my review. > > So let me repeat my question and make it more explicit: > > What is the justification for making this rate limit per UID and not > per task, per process or systemwide? Tony just now answered the justification. If that's OK, I will add the answer in the commit message.
> > > struct user_struct *alloc_uid(kuid_t uid) > > { > > struct hlist_head *hashent = uidhashentry(uid); > > @@ -190,6 +198,11 @@ struct user_struct *alloc_uid(kuid_t uid) > > refcount_set(&new->__count, 1); > > ratelimit_state_init(&new->ratelimit, HZ, 100); > > ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE); > > +#ifdef CONFIG_CPU_SUP_INTEL > > + ratelimit_state_init(&new->bld_ratelimit, HZ, bld_ratelimit); > > + ratelimit_set_flags(&new->bld_ratelimit, > > + RATELIMIT_MSG_ON_RELEASE); > > +#endif > > If this has a proper justification for being per user and having to add > 40 bytes per UID for something which is mostly unused then there are > definitely better ways to do that than slapping #ifdefs into > architecture agnostic core code. > > So if you instead of munging the code patches had split the > documentation, then I could apply the first 3 patches and we would only > have to sort out the ratelimiting muck. If I split this whole patch set into two patch sets: 1. Three patches in the first patch set: the enumeration patch, the warn and fatal patch, and the documentation patch. 2. Two patches in the second patch set: the ratelimit patch and the documentation patch. Then I will send the two patch sets separately, you will accept them one by one. Is that OK? Or should I still send the 5 patches in one patch set so you will pick up the first 3 patches and then the next 2 patches separately? Thanks. -Fenghua ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-03-19 22:19 ` Fenghua Yu @ 2021-03-20 12:42 ` Thomas Gleixner 2021-04-03 1:04 ` Fenghua Yu 0 siblings, 1 reply; 18+ messages in thread From: Thomas Gleixner @ 2021-03-20 12:42 UTC (permalink / raw) To: Fenghua Yu Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote: > On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote: >> > + if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) { >> > + bld_ratelimit = ratelimit; >> >> So any rate up to INTMAX/s is valid here, right? > > Yes. I don't see a smaller limitation than INTMAX/s. Is that right? That's a given, but what's the point of limits in that range? A buslock access locks up the system for X cycles. So the total amount of allowable damage in cycles per second is: limit * stall_cycles_per_bus_lock ergo the time (in seconds) which the system is locked up is: limit * stall_cycles_per_bus_lock / cpufreq Which means for ~INTMAX/2 on a 2 GHz CPU: 2 * 10^9 * $CYCLES / 2 * 10^9 = $CYCLES seconds Assuming the inflicted damage is only 1 cycle, then #LOCK is pretty much permanently on if there are enough threads. Sure #DB will slow them down, but it still does not make any sense at all, especially as the damage is certainly greater than a single cycle. And because the changelogs and the docs are void of numbers, I just got real numbers myself. With a single thread doing a 'lock inc *mem' across a cache line boundary, the workload which I measured with perf stat goes from: 5,940,985,091 instructions # 0.88 insn per cycle 2.780950806 seconds time elapsed 0.998480000 seconds user 4.202137000 seconds sys to 7,467,979,504 instructions # 0.10 insn per cycle 5.110795917 seconds time elapsed 7.123499000 seconds user 37.266852000 seconds sys The buslock injection rate is ~250k per second.
Even if I ratelimit the locked inc by a delay loop of ~5000 cycles, which is probably more than what the #DB will cost, then this single task still impacts the workload significantly: 6,496,994,537 instructions # 0.39 insn per cycle 3.043275473 seconds time elapsed 1.899852000 seconds user 8.957088000 seconds sys The buslock injection rate is down to ~150k per second in this case. And even with throttling the injection rate further down to 25k per second, the impact on the workload is still significant in the 10% range. And of course the documentation of the ratelimit parameter explains all of this in great detail so the administrator has a trivial job to tune that, right? >> > + case sld_ratelimit: >> > + /* Enforce no more than bld_ratelimit bus locks/sec. */ >> > + while (!__ratelimit(&get_current_user()->bld_ratelimit)) >> > + msleep(1000 / bld_ratelimit); For any ratelimit > 1000 this will loop up to 1000 times with CONFIG_HZ=1000. Assume that the buslock producer has tons of threads which all end up here pretty soon; then you launch a mass wakeup in the worst case every jiffy. Are you sure that the cure is better than the disease? > If I split this whole patch set into two patch sets: > 1. Three patches in the first patch set: the enumeration patch, the warn > and fatal patch, and the documentation patch. > 2. Two patches in the second patch set: the ratelimit patch and the > documentation patch. > > Then I will send the two patch sets separately, you will accept them one > by one. Is that OK? That's obviously the right thing to do because #1 should be ready and we can sort out #2 separately. See the conversation with Tony. Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
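For anyone who wants to reproduce such measurements, a split-locked access can be provoked from userspace roughly as below. The misaligned cast is deliberate and is undefined behavior by the C standard; on x86 the compiler still emits a lock-prefixed read-modify-write on the unaligned address, which is exactly the bus-locking case, and under split_lock_detect=fatal such a process would receive SIGBUS (the function name and iteration count are illustrative):

```c
#include <stdint.h>
#include <stdlib.h>

/* Run `n` locked increments on a 4-byte counter placed so that it
 * straddles a 64-byte cache line boundary; return the final value. */
static uint32_t hammer_split_lock(int n)
{
	unsigned char *buf = aligned_alloc(64, 128);
	volatile uint32_t *ctr;
	uint32_t final;
	int i;

	if (!buf)
		return 0;
#if defined(__x86_64__) || defined(__i386__)
	/* bytes 62..65 of a 64-byte-aligned buffer: the counter crosses
	 * the cache line boundary, so 'lock xadd' becomes a bus lock. */
	ctr = (volatile uint32_t *)(buf + 62);
#else
	/* other ISAs fault on misaligned atomics; fall back to aligned */
	ctr = (volatile uint32_t *)(buf + 64);
#endif
	*ctr = 0;
	for (i = 0; i < n; i++)
		__atomic_fetch_add(ctr, 1, __ATOMIC_SEQ_CST);
	final = *ctr;
	free(buf);
	return final;
}
```

With split_lock_detect=warn or off this merely runs slowly (each iteration stalls the bus), which is what makes the perf numbers above so lopsided.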
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-03-20 12:42 ` Thomas Gleixner @ 2021-04-03 1:04 ` Fenghua Yu 2021-04-12 7:15 ` Thomas Gleixner 0 siblings, 1 reply; 18+ messages in thread From: Fenghua Yu @ 2021-04-03 1:04 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 Hi, Thomas, On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote: > On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote: > > On Fri, Mar 19, 2021 at 10:30:50PM +0100, Thomas Gleixner wrote: > >> > + if (sscanf(arg, "ratelimit:%d", &ratelimit) == 1 && ratelimit > 0) { > >> > + bld_ratelimit = ratelimit; > >> > >> So any rate up to INTMAX/s is valid here, right? > > > > Yes. I don't see smaller limitation than INTMX/s. Is that right? > > That's a given, but what's the point of limits in that range? > > A buslock access locks up the system for X cycles. So the total amount > of allowable damage in cycles per second is: > > limit * stall_cycles_per_bus_lock > > ergo the time (in seconds) which the system is locked up is: > > limit * stall_cycles_per_bus_lock / cpufreq > > Which means for ~INTMAX/2 on a 2 GHz CPU: > > 2 * 10^9 * $CYCLES / 2 * 10^9 = $CYCLES seconds > > Assumed the inflicted damage is only 1 cycle then #LOCK is pretty much > permanently on if there are enough threads. Sure #DB will slow them > down, but it still does not make any sense at all especially as the > damage is certainly greater than a single cycle. > > And because the changelogs and the docs are void of numbers I just got > real numbers myself. 
> > With a single thread doing a 'lock inc *mem' accross a cache line > boundary the workload which I measured with perf stat goes from: > > 5,940,985,091 instructions # 0.88 insn per cycle > 2.780950806 seconds time elapsed > 0.998480000 seconds user > 4.202137000 seconds sys > to > > 7,467,979,504 instructions # 0.10 insn per cycle > 5.110795917 seconds time elapsed > 7.123499000 seconds user > 37.266852000 seconds sys > > The buslock injection rate is ~250k per second. > > Even if I ratelimit the locked inc by a delay loop of ~5000 cycles > which is probably more than what the #DB will cost then this single task > still impacts the workload significantly: > > 6,496,994,537 instructions # 0.39 insn per cycle > 3.043275473 seconds time elapsed > 1.899852000 seconds user > 8.957088000 seconds sys > > The buslock injection rate is down to ~150k per second in this case. > > And even with throttling the injection rate further down to 25k per > second the impact on the workload is still significant in the 10% range. Thank you for your insight! So I can change the ratelimit to system wide and call usleep_range() to sleep: while (!__ratelimit(&global_bld_ratelimit)) usleep_range(1000000 / bld_ratelimit, 1000000 / bld_ratelimit); The max bld_ratelimit is 1000,000/s because the max sleeping time is 1 usec. The min bld_ratelimit is 1/s. > > And of course the documentation of the ratelimit parameter explains all > of this in great detail so the administrator has a trivial job to tune > that, right? I will explain how to tune the parameter in buslock.rst doc. > > >> > + case sld_ratelimit: > >> > + /* Enforce no more than bld_ratelimit bus locks/sec. */ > >> > + while (!__ratelimit(&get_current_user()->bld_ratelimit)) > >> > + msleep(1000 / bld_ratelimit); > > For any ratelimit > 1000 this will loop up to 1000 times with > CONFIG_HZ=1000. 
> > Assume that the buslock producer has tons of threads which all end up > here pretty soon then you launch a mass wakeup in the worst case every > jiffy. Are you sure that the cure is better than the disease? If usleep_range() is used to sleep, the threads will not all sleep and wake up at the same time, right? Maybe I can use msleep() for msec sleeps (smaller bld_ratelimit) and usleep_range() for usec sleeps (bigger bld_ratelimit)? Even if there is a mass wakeup, throttling the threads can avoid the system-wide performance degradation (e.g. a 7x slower dd command for another user). Is that a good justification for throttling the threads? > > > If I split this whole patch set into two patch sets: > > 1. Three patches in the first patch set: the enumeration patch, the warn > > and fatal patch, and the documentation patch. > > 2. Two patches in the second patch set: the ratelimit patch and the > > documentation patch. > > > > Then I will send the two patch sets separately, you will accept them one > > by one. Is that OK? > > That's obviously the right thing to do because #1 should be ready and we > can sort out #2 separately. See the conversation with Tony. Thank you for picking up the first patch set! -Fenghua ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-04-03 1:04 ` Fenghua Yu @ 2021-04-12 7:15 ` Thomas Gleixner 2021-04-13 23:40 ` Fenghua Yu 0 siblings, 1 reply; 18+ messages in thread From: Thomas Gleixner @ 2021-04-12 7:15 UTC (permalink / raw) To: Fenghua Yu Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 On Sat, Apr 03 2021 at 01:04, Fenghua Yu wrote: > On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote: >> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote: >> And even with throttling the injection rate further down to 25k per >> second the impact on the workload is still significant in the 10% range. > > Thank you for your insight! > > So I can change the ratelimit to system wide and call usleep_range() > to sleep: > while (!__ratelimit(&global_bld_ratelimit)) > usleep_range(1000000 / bld_ratelimit, > 1000000 / bld_ratelimit); > > The max bld_ratelimit is 1000,000/s because the max sleeping time is 1 > usec. Maximum sleep time is 1usec? > The min bld_ratelimit is 1/s. Again. This does not make sense at all. 1Mio bus lock events per second are way beyond the point where the machine does anything else than being stuck in buslocks. Aside of that why are you trying to make this throttling in any way accurate? It does not matter at all, really. Limit reached, put it to sleep for some time and be done with it. No point in trying to be clever for no value. Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-04-12 7:15 ` Thomas Gleixner @ 2021-04-13 23:40 ` Fenghua Yu 2021-04-14 9:41 ` Thomas Gleixner 0 siblings, 1 reply; 18+ messages in thread From: Fenghua Yu @ 2021-04-13 23:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 Hi, Thomas, On Mon, Apr 12, 2021 at 09:15:08AM +0200, Thomas Gleixner wrote: > On Sat, Apr 03 2021 at 01:04, Fenghua Yu wrote: > > On Sat, Mar 20, 2021 at 01:42:52PM +0100, Thomas Gleixner wrote: > >> On Fri, Mar 19 2021 at 22:19, Fenghua Yu wrote: > >> And even with throttling the injection rate further down to 25k per > >> second the impact on the workload is still significant in the 10% range. > > So I can change the ratelimit to system wide and call usleep_range() > > to sleep: > > while (!__ratelimit(&global_bld_ratelimit)) > > usleep_range(1000000 / bld_ratelimit, > > 1000000 / bld_ratelimit); > > > > The max bld_ratelimit is 1000,000/s because the max sleeping time is 1 > > usec. > > Maximum sleep time is 1usec? > > > The min bld_ratelimit is 1/s. > > Again. This does not make sense at all. 1Mio bus lock events per second > are way beyond the point where the machine does anything else than being > stuck in buslocks. > > Aside of that why are you trying to make this throttling in any way > accurate? It does not matter at all, really. Limit reached, put it to > sleep for some time and be done with it. No point in trying to be clever > for no value. Is it OK to set bld_ratelimit between 1 and 1,000 bus locks/sec for bld_ratelimit? Can I do the throttling like this? /* Enforce no more than bld_ratelimit bus locks/sec. */ while (!__ratelimit(&global_bld_ratelimit)) msleep(10); On one machine, if bld_ratelimit=1,000, that's about 5msec for a busy bus lock loop, i.e. 
bus is locked for about 5msec and then the process sleeps for 10msec and thus won't generate any bus lock. "dd" command running on other cores doesn't have noticeable degradation with bld_ratelimit=1,000. Thanks. -Fenghua ^ permalink raw reply [flat|nested] 18+ messages in thread
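The damage formula quoted earlier in the thread (limit * stall_cycles_per_bus_lock / cpufreq) gives a quick sanity check for candidate limits. As a throwaway sketch, with a made-up 2000-cycle stall cost (an assumption, not a measurement):

```c
/* Rough fraction of each second the bus is stalled, per the estimate
 * limit * stall_cycles_per_bus_lock / cpufreq. All inputs illustrative. */
static double stall_fraction(double locks_per_sec, double stall_cycles,
			     double cpu_hz)
{
	return locks_per_sec * stall_cycles / cpu_hz;
}
```

With those assumptions, the unthrottled ~250k locks/s measured earlier would stall about 25% of a 2 GHz machine, while a limit of 1000/s would stall about 0.1%, which matches the observation that "dd" shows no noticeable degradation at that setting.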
* Re: [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock 2021-04-13 23:40 ` Fenghua Yu @ 2021-04-14 9:41 ` Thomas Gleixner 0 siblings, 0 replies; 18+ messages in thread From: Thomas Gleixner @ 2021-04-14 9:41 UTC (permalink / raw) To: Fenghua Yu Cc: Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar, linux-kernel, x86 Fenghua, On Tue, Apr 13 2021 at 23:40, Fenghua Yu wrote: > On Mon, Apr 12, 2021 at 09:15:08AM +0200, Thomas Gleixner wrote: >> Aside of that why are you trying to make this throttling in any way >> accurate? It does not matter at all, really. Limit reached, put it to >> sleep for some time and be done with it. No point in trying to be clever >> for no value. > > Is it OK to set bld_ratelimit between 1 and 1,000 bus locks/sec for > bld_ratelimit? > > Can I do the throttling like this? > > /* Enforce no more than bld_ratelimit bus locks/sec. */ > while (!__ratelimit(&global_bld_ratelimit)) > msleep(10); > > On one machine, if bld_ratelimit=1,000, that's about 5msec for a busy > bus lock loop, i.e. bus is locked for about 5msec and then the process > sleeps for 10msec and thus won't generate any bus lock. > "dd" command running on other cores doesn't have noticeable degradation > with bld_ratelimit=1,000. Something like this makes sense. Add some rationale for the upper limit you finally end up with so sysadmins can make informed decisions. Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter 2021-03-13 5:49 [PATCH v5 0/3] x86/bus_lock: Enable bus lock detection Fenghua Yu 2021-03-13 5:49 ` [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for " Fenghua Yu 2021-03-13 5:49 ` [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock Fenghua Yu @ 2021-03-13 5:49 ` Fenghua Yu 2021-03-19 21:35 ` Thomas Gleixner 2 siblings, 1 reply; 18+ messages in thread From: Fenghua Yu @ 2021-03-13 5:49 UTC (permalink / raw) To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li , Ravi V Shankar Cc: linux-kernel, x86, Fenghua Yu Since #DB for bus lock detect changes the split_lock_detect parameter, update the documentation for the changes. Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> --- Change Log: v5: - Remove N < HZ/2 check info in the doc (Thomas). v4: - Fix a ratelimit wording issue in the doc (Randy). - Patch 4 is acked by Randy (Randy). v3: - Enable Bus Lock Detection when fatal to handle bus lock from non-WB (PeterZ). v1: - Fix a few wording issues (Randy). RFC v2: - Simplify the documentation (Randy). .../admin-guide/kernel-parameters.txt | 30 +++++++++++++++---- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..16b2e1c45d04 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5100,27 +5100,45 @@ spia_peddr= split_lock_detect= - [X86] Enable split lock detection + [X86] Enable split lock detection or bus lock detection When enabled (and if hardware support is present), atomic instructions that access data across cache line - boundaries will result in an alignment check exception. 
+ boundaries will result in an alignment check exception + for split lock detection or a debug exception for + bus lock detection. off - not enabled - warn - the kernel will emit rate limited warnings + warn - the kernel will emit rate-limited warnings about applications triggering the #AC - exception. This mode is the default on CPUs - that supports split lock detection. + exception or the #DB exception. This mode is + the default on CPUs that support split lock + detection or bus lock detection. Default + behavior is by #DB if both features are + enabled in hardware. fatal - the kernel will send SIGBUS to applications - that trigger the #AC exception. + that trigger the #AC exception or the #DB + exception. If both features are enabled in + hardware, split lock triggers #AC and bus + lock from non-WB triggers #DB. + + ratelimit:N - + Set rate limit to N bus locks per second + for bus lock detection. 0 < N. + Only applied to non-root users. + + N/A for split lock detection. If an #AC exception is hit in the kernel or in firmware (i.e. not while executing in user mode) the kernel will oops in either "warn" or "fatal" mode. + #DB exception for bus lock is triggered only when + CPL > 0. + srbds= [X86,INTEL] Control the Special Register Buffer Data Sampling (SRBDS) mitigation. -- 2.30.2 ^ permalink raw reply related [flat|nested] 18+ messages in thread
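As a concrete, illustrative example of the documented syntax, booting with bus lock detection in rate-limited mode at 1000 bus locks per second (the value is an example, not a recommendation) would mean appending to the kernel command line:

```
split_lock_detect=ratelimit:1000
```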
* Re: [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter 2021-03-13 5:49 ` [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter Fenghua Yu @ 2021-03-19 21:35 ` Thomas Gleixner 0 siblings, 0 replies; 18+ messages in thread From: Thomas Gleixner @ 2021-03-19 21:35 UTC (permalink / raw) To: Fenghua Yu, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Tony Luck, Randy Dunlap, Xiaoyao Li, Ravi V Shankar Cc: linux-kernel, x86, Fenghua Yu On Sat, Mar 13 2021 at 05:49, Fenghua Yu wrote: > + ratelimit:N - > + Set rate limit to N bus locks per second > + for bus lock detection. 0 < N. > + Only applied to non-root users. Of course this does not mention that the limit is per UID and it neither tells what the action is to rate limit the bus lock attempts. The rate limit saturation which is induced by msleep() is not mentioned either. Really useful for administrators. Thanks, tglx ^ permalink raw reply [flat|nested] 18+ messages in thread
Thread overview: 18+ messages -- 2021-03-13 5:49 [PATCH v5 0/3] x86/bus_lock: Enable bus lock detection Fenghua Yu 2021-03-13 5:49 ` [PATCH v5 1/3] x86/cpufeatures: Enumerate #DB for " Fenghua Yu 2021-03-19 20:35 ` Thomas Gleixner 2021-03-19 21:00 ` Fenghua Yu 2021-03-13 5:49 ` [PATCH v5 2/3] x86/bus_lock: Handle #DB for bus lock Fenghua Yu 2021-03-19 21:30 ` Thomas Gleixner 2021-03-19 21:50 ` Luck, Tony 2021-03-20 1:01 ` Thomas Gleixner 2021-03-20 13:57 ` Thomas Gleixner 2021-04-03 0:50 ` Fenghua Yu 2021-03-19 22:19 ` Fenghua Yu 2021-03-20 12:42 ` Thomas Gleixner 2021-04-03 1:04 ` Fenghua Yu 2021-04-12 7:15 ` Thomas Gleixner 2021-04-13 23:40 ` Fenghua Yu 2021-04-14 9:41 ` Thomas Gleixner 2021-03-13 5:49 ` [PATCH v5 3/3] Documentation/admin-guide: Change doc for split_lock_detect parameter Fenghua Yu 2021-03-19 21:35 ` Thomas Gleixner