* [MODERATED] [PATCH v5 00/27] MDSv5 19
@ 2019-01-19  0:50 Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
                   ` (28 more replies)
  0 siblings, 29 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Here's a new version of flushing CPU buffers for group 4.

This mainly covers single thread, not SMT (except for the idle case).

I lumped all the issues together under the Microarchitectural Data
Sampling (MDS) name because they need the same mitigations,
and it doesn't seem worth duplicating the sysfs files and bug entries.

This version drops support for software sequences, and also
does VERW unconditionally unless disabled.

This version implements Linus' suggestion to only clear the CPU
buffer when needed. The patch kit is now a lot more complicated:
different subsystems determine if they might touch other users'
or sensitive data, and schedule a cpu clear on the next kernel exit.

Generally, process context doesn't clear (unless it handles
cryptographic data or a context switch happens), and interrupt
context schedules a clear. There are some exceptions to these
rules; the resulting pattern is sketched below.
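
For illustration, the end-to-end pattern is roughly (simplified
sketch; the real code is in the individual patches):

	/* in a subsystem that touched another user's data: */
	lazy_clear_cpu();			/* sets TIF_CLEAR_CPU */

	/* on the next exit to user space: */
	if (cached_flags & _TIF_CLEAR_CPU) {
		clear_thread_flag(TIF_CLEAR_CPU);
		clear_cpu();			/* VERW based buffer clear */
	}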

For details on the security model see the Documentation/clearcpu.txt
file. In my tests the number of clears is much lower now.

For most benchmarks we tried, the difference is now at noise
level. ebizzy and loopback apache both show about 1.7%
degradation.

The patch kit makes various assumptions about how kernel code behaves.
I did some auditing, but wasn't able to cover everything.
Please double check the assumptions laid out in the document.

Likely many more interrupt and timer handlers (and tasklets
and irq poll handlers) could be whitelisted as not needing a clear,
but for now I only did a fairly minimal set that I could test.

For some of the whitelisted code, especially the networking and
block softirqs, as well as the eBPF mitigation, additional auditing
to confirm that no rules are violated would be useful.

Some notes:
- Against 5.0-rc1

Changes against previous versions:
- Remove software sequences
- Make VERW unconditional
- Improved documentation
- Some other minor changes

Changes against previous versions:
- By default now flushes only when needed
- Define security model
- New administrator document
- Added mds=verw and mds=full
- Renamed mds_disable to mds=off
- KVM virtualization much improved
- Too many others to list. Most things different now.

Changes against previous versions:
- Now idle clears too to avoid an extra SMT leakage
- Don't do any workarounds for MDS_NO
- Various small changes, see individual patches

Andi Kleen (27):
  x86/speculation/mds: Add basic bug infrastructure for MDS
  x86/speculation/mds: Add mds=off
  x86/speculation/mds: Support clearing CPU data on kernel exit
  x86/speculation/mds: Support mds=full
  x86/speculation/mds: Support mds=full for NMIs
  x86/speculation/mds: Clear CPU buffers on entering idle
  x86/speculation/mds: Add sysfs reporting
  x86/speculation/mds: Export MD_CLEAR CPUID to KVM guests.
  mds: Add documentation for clear cpu usage
  mds: Add preliminary administrator documentation
  x86/speculation/mds: Introduce lazy_clear_cpu
  x86/speculation/mds: Schedule cpu clear on context switch
  x86/speculation/mds: Add tracing for clear_cpu
  mds: Force clear cpu on kernel preemption
  mds: Schedule cpu clear for memzero_explicit and kzfree
  mds: Mark interrupts clear cpu, unless opted-out
  mds: Clear cpu on all timers, unless the timer opts-out
  mds: Clear CPU on tasklets, unless opted-out
  mds: Clear CPU on irq poll, unless opted-out
  mds: Clear cpu for string io/memcpy_*io in interrupts
  mds: Schedule clear cpu in swiotlb
  mds: Instrument skb functions to clear cpu automatically
  mds: Opt out tcp tasklet to not touch user data
  mds: mark kernel/* timers safe as not touching user data
  mds: Mark AHCI interrupt as not needing cpu clear
  mds: Mark ACPI interrupt as not needing cpu clear
  mds: Mitigate BPF

 .../ABI/testing/sysfs-devices-system-cpu      |   1 +
 .../admin-guide/kernel-parameters.txt         |   8 +
 Documentation/admin-guide/mds.rst             | 108 +++++++++++
 Documentation/clearcpu.txt                    | 172 ++++++++++++++++++
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/common.c                       |  13 +-
 arch/x86/include/asm/clearbpf.h               |  29 +++
 arch/x86/include/asm/clearcpu.h               |  82 +++++++++
 arch/x86/include/asm/cpufeatures.h            |   3 +
 arch/x86/include/asm/io.h                     |   3 +
 arch/x86/include/asm/msr-index.h              |   1 +
 arch/x86/include/asm/thread_info.h            |   2 +
 arch/x86/include/asm/trace/clearcpu.h         |  27 +++
 arch/x86/kernel/acpi/cstate.c                 |   2 +
 arch/x86/kernel/cpu/bugs.c                    |  48 +++++
 arch/x86/kernel/cpu/common.c                  |  13 ++
 arch/x86/kernel/kvm.c                         |   3 +
 arch/x86/kernel/nmi.c                         |   6 +-
 arch/x86/kernel/process.c                     |   5 +
 arch/x86/kernel/process.h                     |  25 +++
 arch/x86/kernel/smpboot.c                     |   3 +
 arch/x86/kvm/cpuid.c                          |   3 +-
 drivers/acpi/acpi_pad.c                       |   2 +
 drivers/acpi/osl.c                            |   3 +-
 drivers/acpi/processor_idle.c                 |   3 +
 drivers/ata/ahci.c                            |   2 +-
 drivers/ata/ahci.h                            |   2 +
 drivers/ata/libahci.c                         |  40 ++--
 drivers/base/cpu.c                            |   8 +
 drivers/idle/intel_idle.c                     |   5 +
 include/asm-generic/io.h                      |   3 +
 include/linux/clearcpu.h                      |  36 ++++
 include/linux/filter.h                        |  21 ++-
 include/linux/hrtimer.h                       |   4 +
 include/linux/interrupt.h                     |  18 +-
 include/linux/irq_poll.h                      |   2 +
 include/linux/skbuff.h                        |   2 +
 include/linux/timer.h                         |   9 +-
 kernel/bpf/core.c                             |   2 +
 kernel/dma/swiotlb.c                          |   2 +
 kernel/events/core.c                          |   6 +-
 kernel/fork.c                                 |   3 +-
 kernel/futex.c                                |   6 +-
 kernel/irq/handle.c                           |   8 +
 kernel/irq/manage.c                           |   1 +
 kernel/sched/core.c                           |  14 +-
 kernel/sched/deadline.c                       |   6 +-
 kernel/sched/fair.c                           |   6 +-
 kernel/sched/idle.c                           |   3 +-
 kernel/sched/rt.c                             |   3 +-
 kernel/softirq.c                              |  25 ++-
 kernel/time/alarmtimer.c                      |   2 +-
 kernel/time/hrtimer.c                         |  11 +-
 kernel/time/posix-timers.c                    |   6 +-
 kernel/time/sched_clock.c                     |   3 +-
 kernel/time/tick-sched.c                      |   6 +-
 kernel/time/timer.c                           |   8 +
 kernel/watchdog.c                             |   3 +-
 lib/irq_poll.c                                |  18 +-
 lib/string.c                                  |   6 +
 mm/slab_common.c                              |   5 +-
 net/core/skbuff.c                             |  26 +++
 net/ipv4/tcp_output.c                         |   5 +-
 64 files changed, 843 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/admin-guide/mds.rst
 create mode 100644 Documentation/clearcpu.txt
 create mode 100644 arch/x86/include/asm/clearbpf.h
 create mode 100644 arch/x86/include/asm/clearcpu.h
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h
 create mode 100644 include/linux/clearcpu.h

-- 
2.17.2


* [MODERATED] [PATCH v5 01/27] MDSv5 26
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:17   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 12:46   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 02/27] MDSv5 14 Andi Kleen
                   ` (27 subsequent siblings)
  28 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add basic bug infrastructure
 for MDS

MDS (Microarchitectural Data Sampling) is a side channel
attack on internal buffers in Intel CPUs.

MDS consists of multiple sub-vulnerabilities:
Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
The first leaks store data, the second leaks load and sometimes
store data, and the third leaks load data.

They all have the same mitigations for single thread, so we lump them all
together as a single MDS issue.

This patch adds the basic infrastructure to detect if the current
CPU is affected by MDS, and if so, sets the X86_BUG_MDS bug bit.
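
Later patches gate their work on this bit in the usual way; a
minimal sketch:

	if (static_cpu_has(X86_BUG_MDS))
		lazy_clear_cpu();	/* lazy_clear_cpu() comes later in this series */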

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

Don't check RDCL_NO
---
 arch/x86/include/asm/cpufeatures.h |  2 ++
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/common.c       | 13 +++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..233ca598826f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MD_CLEAR		(18*32+10) /* Flush state on VERW */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -381,5 +382,6 @@
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..3e486d9d6e6c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -77,6 +77,7 @@
 						    * attack, so no Speculative Store Bypass
 						    * control required.
 						    */
+#define ARCH_CAP_MDS_NO			(1 << 5)   /* No Microarchitectural data sampling */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			(1 << 0)   /*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..bac5a3a38f0d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -998,6 +998,14 @@ static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
 	{}
 };
 
+static const __initconst struct x86_cpu_id cpu_no_mds[] = {
+	/* in addition to cpu_no_speculation */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_X	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_PLUS	},
+	{}
+};
+
 static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
@@ -1019,6 +1027,11 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	if (ia32_cap & ARCH_CAP_IBRS_ALL)
 		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
 
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    !x86_match_cpu(cpu_no_mds)) &&
+	    !(ia32_cap & ARCH_CAP_MDS_NO))
+		setup_force_cpu_bug(X86_BUG_MDS);
+
 	if (x86_match_cpu(cpu_no_meltdown))
 		return;
 
-- 
2.17.2


* [MODERATED] [PATCH v5 02/27] MDSv5 14
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:20   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 12:51   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
                   ` (26 subsequent siblings)
  28 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add mds=off

Normally we execute VERW to clear the CPU buffers unconditionally on kernel
exits that might have touched sensitive data. Add a new flag to disable VERW usage.
This is intended for systems that only run trusted code and don't
want the performance impact of the extra clearing.

This patch just sets the flag; the actual implementation is in later patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Also force mds=off for MDS_NO
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 arch/x86/include/asm/cpufeatures.h              |  1 +
 arch/x86/kernel/cpu/bugs.c                      | 10 ++++++++++
 3 files changed, 14 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b799bcf67d7b..9c967d0caeca 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2357,6 +2357,9 @@
 			Format: <first>,<last>
 			Specifies range of consoles to be captured by the MDA.
 
+	mds=off		[X86, Intel]
+			Disable workarounds for Micro-architectural Data Sampling.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 233ca598826f..09347c6a8901 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -221,6 +221,7 @@
 #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
 #define X86_FEATURE_L1TF_PTEINV		( 7*32+29) /* "" L1TF workaround PTE inversion */
 #define X86_FEATURE_IBRS_ENHANCED	( 7*32+30) /* Enhanced IBRS */
+#define X86_FEATURE_NO_VERW		( 7*32+31) /* "" No VERW for MDS on kernel exit */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 1de0f4170178..2fd8faa7e23a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -37,6 +37,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -101,6 +102,8 @@ void __init check_bugs(void)
 
 	l1tf_select_mitigation();
 
+	mds_select_mitigation();
+
 #ifdef CONFIG_X86_32
 	/*
 	 * Check whether we are able to run this kernel safely on SMP.
@@ -1058,6 +1061,13 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static void mds_select_mitigation(void)
+{
+	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
+		!boot_cpu_has_bug(X86_BUG_MDS))
+		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
+}
+
 #ifdef CONFIG_SYSFS
 
 #define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
-- 
2.17.2


* [MODERATED] [PATCH v5 03/27] MDSv5 16
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 02/27] MDSv5 14 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:23   ` [MODERATED] " Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 04/27] MDSv5 15 Andi Kleen
                   ` (25 subsequent siblings)
  28 siblings, 3 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Support clearing CPU data on
 kernel exit

Add infrastructure for clearing CPU data on kernel exit.

Instead of clearing unconditionally, we support clearing lazily:
a kernel subsystem that touched sensitive data sets the new
TIF_CLEAR_CPU flag.

We handle TIF_CLEAR_CPU on kernel exit, similar to
other kernel exit action flags.

The flushing is provided by new microcode, as a side
effect of the otherwise unused VERW instruction.

So far this patch doesn't do anything by itself; it relies on
later patches to set TIF_CLEAR_CPU.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c            |  8 +++++++-
 arch/x86/include/asm/clearcpu.h    | 23 +++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  2 ++
 3 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/clearcpu.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..924f8dab2068 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -29,6 +29,7 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
+#include <asm/clearcpu.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
 
@@ -132,7 +133,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 }
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_CLEAR_CPU |\
 	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
@@ -170,6 +171,11 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_CLEAR_CPU) {
+			clear_thread_flag(TIF_CLEAR_CPU);
+			clear_cpu();
+		}
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
new file mode 100644
index 000000000000..530ef619ac1b
--- /dev/null
+++ b/arch/x86/include/asm/clearcpu.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARCPU_H
+#define _ASM_CLEARCPU_H 1
+
+#include <linux/jump_label.h>
+#include <linux/sched/smt.h>
+#include <asm/alternative.h>
+#include <linux/thread_info.h>
+
+/*
+ * Clear CPU buffers to avoid side channels.
+ * We use microcode as a side effect of the obsolete VERW instruction
+ */
+
+static inline void clear_cpu(void)
+{
+	unsigned kernel_ds = __KERNEL_DS;
+	/* Has to be the memory form, don't modify to use a register */
+	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
+		[kernelds] "m" (kernel_ds));
+}
+
+#endif
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..0c1e3d71018e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_CLEAR_CPU		23	/* clear CPU on kernel exit */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
+#define _TIF_CLEAR_CPU		(1 << TIF_CLEAR_CPU)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-- 
2.17.2


* [MODERATED] [PATCH v5 04/27] MDSv5 15
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (2 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:33   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 12:59   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
                   ` (24 subsequent siblings)
  28 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Support mds=full

Add a new command line option to support unconditional flushing
on each kernel exit. This is not enabled by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Don't enable mds=full for MDS_NO because it will be a nop.
---
 Documentation/admin-guide/kernel-parameters.txt | 5 +++++
 arch/x86/entry/common.c                         | 7 ++++++-
 arch/x86/include/asm/clearcpu.h                 | 2 ++
 arch/x86/kernel/cpu/bugs.c                      | 5 +++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9c967d0caeca..5f5a8808c475 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2360,6 +2360,11 @@
 	mds=off		[X86, Intel]
 			Disable workarounds for Micro-architectural Data Sampling.
 
+	mds=full	[X86, Intel]
+			Always flush cpu buffers when exiting kernel for MDS.
+			Normally the kernel decides dynamically when flushing is
+			needed or not.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 924f8dab2068..66c08e1d493a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
-			clear_cpu();
+			/* Don't do it twice if forced */
+			if (!static_key_enabled(&force_cpu_clear))
+				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
@@ -217,6 +219,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
+	if (static_key_enabled(&force_cpu_clear))
+		clear_cpu();
+
 	user_enter_irqoff();
 }
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 530ef619ac1b..3b8ee76b9c07 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -20,4 +20,6 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
+
 #endif
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 2fd8faa7e23a..ce0e367753ff 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1061,11 +1061,16 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
 		!boot_cpu_has_bug(X86_BUG_MDS))
 		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
+	if (cmdline_find_option_bool(boot_command_line, "mds=full") &&
+		boot_cpu_has_bug(X86_BUG_MDS))
+		static_branch_enable(&force_cpu_clear);
 }
 
 #ifdef CONFIG_SYSFS
-- 
2.17.2


* [MODERATED] [PATCH v5 05/27] MDSv5 21
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (3 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 04/27] MDSv5 15 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:35   ` [MODERATED] " Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 06/27] MDSv5 18 Andi Kleen
                   ` (23 subsequent siblings)
  28 siblings, 3 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

NMIs don't go through the normal exit code when exiting
to user space. Normally we consider NMIs not sensitive anyway,
but they need special handling with mds=full.
So add an explicit check to do_nmi() to clear the CPU when
mds=full is set.

Suggested-by: Josh Poimboeuf
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/nmi.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 18bc9b51ac9b..eb6e39238d1d 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -494,7 +494,7 @@ do_nmi(struct pt_regs *regs, long error_code)
 {
 	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
 		this_cpu_write(nmi_state, NMI_LATCHED);
-		return;
+		goto out;
 	}
 	this_cpu_write(nmi_state, NMI_EXECUTING);
 	this_cpu_write(nmi_cr2, read_cr2());
@@ -533,6 +533,10 @@ do_nmi(struct pt_regs *regs, long error_code)
 		write_cr2(this_cpu_read(nmi_cr2));
 	if (this_cpu_dec_return(nmi_state))
 		goto nmi_restart;
+
+out:
+	if (static_key_enabled(&force_cpu_clear))
+		clear_cpu();
 }
 NOKPROBE_SYMBOL(do_nmi);
 
-- 
2.17.2


* [MODERATED] [PATCH v5 06/27] MDSv5 18
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (4 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-21 22:41   ` [MODERATED] " Josh Poimboeuf
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Clear CPU buffers on entering
 idle

When entering idle, the internal state of the current CPU might
become visible to the thread sibling, because the CPU "frees" some
internal resources.

To ensure there is no MDS leakage, always clear the CPU state
before doing any idling. We only do this if SMT is enabled,
as otherwise no leakage is possible.

This is not needed for idle poll, because polling does not release
any shared resources.
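
The expected pattern at each idle entry is then (sketch):

	local_irq_disable();
	...
	clear_cpu_idle();	/* clears buffers; no-op unless SMT is active */
	mwait_idle_with_hints(eax, ecx);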

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Clarify comments. Some minor placement updates.
---
 arch/x86/include/asm/clearcpu.h | 20 ++++++++++++++++++++
 arch/x86/kernel/acpi/cstate.c   |  2 ++
 arch/x86/kernel/kvm.c           |  3 +++
 arch/x86/kernel/process.c       |  5 +++++
 arch/x86/kernel/smpboot.c       |  3 +++
 drivers/acpi/acpi_pad.c         |  2 ++
 drivers/acpi/processor_idle.c   |  3 +++
 drivers/idle/intel_idle.c       |  5 +++++
 8 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 3b8ee76b9c07..dc3d04da6779 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -20,6 +20,26 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+/*
+ * Clear CPU buffers before going idle, so that no state is leaked to SMT
+ * siblings taking over thread resources.
+ * Out of line to avoid include hell.
+ *
+ * Assumes that interrupts are disabled and only get reenabled
+ * before idle, otherwise the data from a racing interrupt might not
+ * get cleared. There are some callers who violate this,
+ * but they are only used in unattackable cases, like CPU
+ * offlining.
+ */
+
+static inline void clear_cpu_idle(void)
+{
+	if (sched_smt_active()) {
+		clear_thread_flag(TIF_CLEAR_CPU);
+		clear_cpu();
+	}
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #endif
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 158ad1483c43..48adea5afacf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
 #include <acpi/processor.h>
 #include <asm/mwait.h>
 #include <asm/special_insns.h>
+#include <asm/clearcpu.h>
 
 /*
  * Initialize bm_flags based on the CPU cache properties
@@ -157,6 +158,7 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
+	clear_cpu_idle();
 	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ba4bfb7f6a36..c9206ad40a5b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -159,6 +159,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
 			/*
 			 * We cannot reschedule. So halt.
 			 */
+			clear_cpu_idle();
 			native_safe_halt();
 			local_irq_disable();
 		}
@@ -785,6 +786,8 @@ static void kvm_wait(u8 *ptr, u8 val)
 	if (READ_ONCE(*ptr) != val)
 		goto out;
 
+	clear_cpu_idle();
+
 	/*
 	 * halt until it's our turn and kicked. Note that we do safe halt
 	 * for irq enabled case to avoid hang when lock info is overwritten
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 90ae0ca51083..9d9f2d2b209d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
 #include <asm/prctl.h>
 #include <asm/spec-ctrl.h>
 #include <asm/proto.h>
+#include <asm/clearcpu.h>
 
 #include "process.h"
 
@@ -589,6 +590,8 @@ void stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
+	clear_cpu_idle();
+
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
 	 * for performing a successful kexec when going from SME inactive
@@ -675,6 +678,8 @@ static __cpuidle void mwait_idle(void)
 			mb(); /* quirk */
 		}
 
+		clear_cpu_idle();
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__sti_mwait(0, 0);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ccd1f2a8e557..c7fff6b09253 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -81,6 +81,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
+#include <asm/clearcpu.h>
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -1635,6 +1636,7 @@ static inline void mwait_play_dead(void)
 	wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		/*
 		 * The CLFLUSH is a workaround for erratum AAI65 for
 		 * the Xeon 7400 series.  It's not clear it is actually
@@ -1662,6 +1664,7 @@ void hlt_play_dead(void)
 		wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		native_halt();
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index a47676a55b84..2dcbc38d0880 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <asm/mwait.h>
+#include <asm/clearcpu.h>
 #include <xen/xen.h>
 
 #define ACPI_PROCESSOR_AGGREGATOR_CLASS	"acpi_pad"
@@ -175,6 +176,7 @@ static int power_saving_thread(void *data)
 			tick_broadcast_enable();
 			tick_broadcast_enter();
 			stop_critical_timings();
+			clear_cpu_idle();
 
 			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b2131c4ea124..b4406ca1dfd7 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -33,6 +33,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <acpi/processor.h>
+#include <asm/clearcpu.h>
 
 /*
  * Include the apic definitions for x86 to have the APIC timer related defines
@@ -121,6 +122,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
 static void __cpuidle acpi_safe_halt(void)
 {
 	if (!tif_need_resched()) {
+		clear_cpu_idle();
 		safe_halt();
 		local_irq_disable();
 	}
@@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 
 	ACPI_FLUSH_CPU_CACHE();
 
+	clear_cpu_idle();
 	while (1) {
 
 		if (cx->entry_method == ACPI_CSTATE_HALT)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 8b5d85c91e9d..ddaa7603d53a 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,6 +65,7 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/clearcpu.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
 
@@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 		}
 	}
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	if (!static_cpu_has(X86_FEATURE_ARAT) && tick)
@@ -953,6 +956,8 @@ static void intel_idle_s2idle(struct cpuidle_device *dev,
 	unsigned long ecx = 1; /* break on interrupt flag */
 	unsigned long eax = flg2MWAIT(drv->states[index].flags);
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 }
 
-- 
2.17.2


* [MODERATED] [PATCH v5 07/27] MDSv5 0
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (5 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 06/27] MDSv5 18 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:39   ` [MODERATED] " Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 08/27] MDSv5 13 Andi Kleen
                   ` (21 subsequent siblings)
  28 siblings, 3 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add sysfs reporting

Report the MDS mitigation state in the sysfs vulnerabilities directory.
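
For example:

	$ cat /sys/devices/system/cpu/vulnerabilities/mds
	Mitigation: microcode

The possible strings, per cpu_show_common() below, are
"Mitigation: microcode", "Mitigation: microcode, HT vulnerable",
"Vulnerable", and "Not affected" on unaffected CPUs.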

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../ABI/testing/sysfs-devices-system-cpu         |  1 +
 arch/x86/kernel/cpu/bugs.c                       | 16 ++++++++++++++++
 drivers/base/cpu.c                               |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9605dbd4b5b5..2db5c3407fd6 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -484,6 +484,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
+		/sys/devices/system/cpu/vulnerabilities/mds
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index ce0e367753ff..715ab147f3e6 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1176,6 +1176,16 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
 			return l1tf_show_state(buf);
 		break;
+
+	case X86_BUG_MDS:
+		/* Assumes the hypervisor exposed HT state to us if in a guest */
+		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: microcode\n");
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+		}
+		return sprintf(buf, "Vulnerable\n");
+
 	default:
 		break;
 	}
@@ -1207,4 +1217,10 @@ ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *b
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
 }
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
+
 #endif
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index eb9443d5bae1..2fd6ca1021c2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_spectre_v2.attr,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
+	&dev_attr_mds.attr,
 	NULL
 };
 
-- 
2.17.2


* [MODERATED] [PATCH v5 08/27] MDSv5 13
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (6 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:40   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Export MD_CLEAR CPUID to KVM
 guests.

Export the MD_CLEAR CPUID bit, set by new microcode, to KVM guests,
to signal that VERW implements the clear cpu side effect.

Also requires corresponding qemu patches.
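
A guest kernel then sees CPUID.(EAX=7,ECX=0):EDX[10] set and can
check it the same way as on bare metal (sketch):

	if (boot_cpu_has(X86_FEATURE_MD_CLEAR))
		pr_info("VERW based MDS buffer clearing available\n");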

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bbffa6c54697..d61272f50aed 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,7 +409,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
-		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP);
+		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
+		F(MD_CLEAR);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
2.17.2


* [MODERATED] [PATCH v5 09/27] MDSv5 23
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (7 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 08/27] MDSv5 13 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:56   ` [MODERATED] " Konrad Rzeszutek Wilk
                     ` (2 more replies)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 10/27] MDSv5 7 Andi Kleen
                   ` (19 subsequent siblings)
  28 siblings, 3 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Including the theory, and some guidelines for subsystem/driver
maintainers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Some edits/clarification (Tim Chen)
---
 Documentation/clearcpu.txt | 172 +++++++++++++++++++++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 Documentation/clearcpu.txt

diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..eba22c28b680
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,172 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which might later be sampled through side channels.
+For more details see CVE-2018-12126, CVE-2018-12130, CVE-2018-12127.
+
+This can be avoided by explicitly clearing the CPU state.
+
+We attempt to avoid leaking data between different processes,
+and also some sensitive data, like cryptographic data, to
+user space.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of this document discusses (2).
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+Kernel data is sensitive when it involves cryptographic keys.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+When you touch user-supplied data of *other* processes in system call
+context, add lazy_clear_cpu().
+
+For the cases below we care only about data from other processes.
+Touching non-cryptographic data of the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt does not touch user data directly, consider marking
+it with IRQF_NO_USER.
+
+When your tasklet does not touch user data directly, consider marking
+it with TASKLET_NO_USER using tasklet_init_flags() or
+DECLARE_TASKLET*_NOUSER.
+
+When your timer does not touch user data, mark it with TIMER_NO_USER.
+If it is a hrtimer, mark it with HRTIMER_MODE_NO_USER.
+
+When your irq poll handler does not touch user data, mark it
+with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
+
+For networking code, make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that is not ensured add lazy_clear_cpu or
+lazy_clear_cpu_interrupt. When the non skb data access is only in a
+hardware interrupt controlled by the driver, it can rely on not
+setting IRQF_NO_USER for that interrupt.
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree.
+
+If your RCU callback touches user data add lazy_clear_cpu().
+
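+As an illustration, a (hypothetical) driver whose handlers only touch
+DMA descriptors and private statistics could opt out like this:
+
+	/* interrupt handler does not touch user data */
+	err = request_irq(irq, mydrv_irq, IRQF_NO_USER, "mydrv", dev);
+
+	/* timer callback only updates driver statistics */
+	timer_setup(&mydrv->timer, mydrv_timeout, TIMER_NO_USER);
+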
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which currently means only x86. But it might be worth being
+prepared in case other architectures become affected too.
+
+Implementation details/assumptions
+----------------------------------
+
+If a system call touches data of its own process, CPU state does not
+need to be cleared, because the process already has access to it.
+
+On context switch we schedule a clear, unless the switch stays within
+the same process. We also clear after any context switch from a
+kernel thread.
+
+Cryptographic keys inside the kernel should be protected.
+We assume they use kzfree() or memzero_explicit() to clear
+state, so these functions trigger a cpu clear.
+
+Hard interrupts, tasklets and timers, which can run asynchronously, are
+assumed to touch random user data, unless they have been audited and
+marked with the NO_USER flags.
+
+Most interrupt handlers for modern devices should not touch
+user data, because they rely on DMA and only manipulate
+pointers. This needs auditing to confirm though.
+
+For softirqs we assume that if they touch user data they use
+lazy_clear_cpu()/lazy_clear_cpu_interrupt() as needed.
+Networking is handled through skb_* below.
+Timers, tasklets and IRQ poll are handled through the NO_USER opt-outs.
+
+Scheduler softirq is assumed to not touch user data.
+
+Block softirq done callbacks are assumed to not touch user data.
+
+For networking code, any skb functions that are likely
+touching non-header packet data schedule a cpu clear at the next
+kernel exit. This includes skb_copy and related, skb_put/push, and
+the checksum functions. We assume that any networking code touching
+packet data uses these functions.
+
+[In principle packet data should be encrypted for the wire anyway,
+but we still try to avoid leaking it.]
+
+Some IO related functions, like string PIO and memcpy_from/to_io, or
+the software PCI DMA bounce function (swiotlb), which touch data,
+schedule a buffer clear.
+
+We assume NMI/machine check code does not touch other
+processes' data.
+
+Any buffer clearing is done lazily on the next kernel exit, so it
+can be triggered in fast paths.
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process, the process itself should
+take care of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+We assume BPF execution does not touch other users' data, so it does
+not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent any exploits written in BPF using side channels to read
+data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a cpu clear is
+already scheduled (which means, for example, there was a context
+switch or crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user, and check that the running BPF program was loaded by
+the same user, so even if data leaked it would not cross privilege
+boundaries.
+
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that nearly all do.
+
+This could be further optimized by batching clears for
+many similar eBPF executions in a row (e.g. for packet
+processing). This would require ensuring that no sensitive
+data is touched in between the eBPF executions, and also
+that all eBPF programs are set up by the same uid.
+We could add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM, we clear the CPU buffers to avoid any
+leakage to the guest. Normally this is done implicitly as part of
+the L1TF mitigation, and relies on that being enabled. It also uses
+the "fast exit" optimization that only clears if an interrupt or
+context switch happened.
-- 
2.17.2


* [MODERATED] [PATCH v5 10/27] MDSv5 7
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (8 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 11/27] MDSv5 2 Andi Kleen
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add a Documentation file for administrators that describes MDS on a
high level.

So far not covering SMT.

Needs updates later for public URLs of supporting documentation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/mds.rst | 108 ++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 Documentation/admin-guide/mds.rst

diff --git a/Documentation/admin-guide/mds.rst b/Documentation/admin-guide/mds.rst
new file mode 100644
index 000000000000..1f3021d20953
--- /dev/null
+++ b/Documentation/admin-guide/mds.rst
@@ -0,0 +1,108 @@
+MDS - Microarchitectural Data Sampling
+=======================================
+
+Microarchitectural Data Sampling is a side channel vulnerability that
+allows an attacker to sample data that was used earlier during
+program execution. Internal buffers in the CPU may keep old data
+for some limited time, which can then later be determined by an attacker
+with side channel analysis. MDS can be used to occasionally observe
+some values accessed earlier, but it cannot be used to observe values
+not recently touched by other code running on the same core.
+
+It is difficult to target particular data on a system using MDS,
+but attackers may be able to infer secrets by collecting
+and analyzing large amounts of data. MDS does not modify
+memory.
+
+MDS consists of multiple sub-vulnerabilities:
+Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
+The first leaks store data, the second leaks load and sometimes
+store data, and the third leaks load data.
+
+The effects and mitigations are similar for all three, so the Linux
+kernel handles and reports them all as a single vulnerability called
+MDS. This also reduces the number of acronyms in use.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors.
+Not all CPUs are affected by all of the sub-vulnerabilities;
+however, the kernel always handles them the same way.
+
+The vulnerability is not present in
+
+    - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+The kernel will automatically detect future CPUs with hardware
+mitigations for these issues and disable any workarounds.
+
+The kernel reports whether the current CPU is vulnerable, and any
+mitigations used, in
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+Kernel mitigation
+-----------------
+
+By default, the kernel automatically ensures no data leakage between
+different processes, or between kernel threads and interrupt handlers
+and user processes, or from any cryptographic code in the kernel.
+
+It does not isolate kernel code that only touches data of the
+current process.  If protecting such kernel code is desired,
+mds=full can be specified.
+
+The mitigation is automatically enabled, but can be further controlled
+with the command line options documented below.
+
+The mitigation relies on microcode support and thus requires
+updated microcode.
+
+The microcode should be loaded at early boot using the initrd. Hot
+updating microcode will not enable the mitigations.
+
+Virtual machine mitigation
+--------------------------
+
+The mitigation is enabled by default and controlled by the same options
+as L1TF cache clearing. See l1tf.rst for more details. In the default
+setting MDS for leaking data out of the guest into other processes
+will be mitigated.
+
+Kernel command line options
+---------------------------
+
+Normally the kernel selects reasonable defaults and no special configuration
+is needed. The default behavior can be overridden by the mds= kernel
+command line options.
+
+These options can be specified in the boot loader. Any changes require a reboot.
+
+When the system only runs trusted code, MDS mitigation can be disabled with
+mds=off as a performance optimization.
+
+   - mds=off      Disable workarounds if the CPU is not affected.
+
+By default the kernel only clears CPU data after execution
+that is known or likely to have touched user data of other processes,
+or cryptographic data. This relies on code audits done in the
+mainline Linux kernel. When running large unaudited out-of-tree code
+or binary drivers, which might violate these constraints, it is possible
+to use mds=full to always flush the CPU data on each kernel exit.
+
+   - mds=full     Always clear cpu state on exiting from kernel.
+
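+For example, adding the option to the kernel command line in the
+boot loader configuration (illustrative GRUB entry):
+
+	linux /boot/vmlinuz-5.0.0 root=/dev/sda1 ro mds=full
+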
+TBD describe SMT
+
+References
+----------
+
+For more details on the kernel-internal implementation of the MDS mitigations,
+please see Documentation/clearcpu.txt
+
+TBD Add URL for Intel white paper
+
+TBD add reference to microcodes
-- 
2.17.2


* [MODERATED] [PATCH v5 11/27] MDSv5 2
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (9 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 10/27] MDSv5 7 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22 13:11   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 12/27] MDSv5 6 Andi Kleen
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add basic infrastructure for code to request CPU buffer clearing
on the next kernel exit.

We have two functions: lazy_clear_cpu() to request clearing,
and lazy_clear_cpu_interrupt() to request clearing when running
in an interrupt.

Non-architecture-specific code can include linux/clearcpu.h
and use lazy_clear_cpu() / lazy_clear_cpu_interrupt(). On x86
we provide low level implementations that set the TIF_CLEAR_CPU
bit.
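
A typical (hypothetical) use in an interrupt handler that copies
packet payloads, which may belong to another process:

	static irqreturn_t mydrv_irq(int irq, void *data)
	{
		struct mydrv *drv = data;

		/* payload may contain another process' data */
		memcpy(drv->buf, drv->rx_payload, drv->rx_len);
		lazy_clear_cpu_interrupt();
		return IRQ_HANDLED;
	}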

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/Kconfig                    |  3 +++
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/clearcpu.h |  5 +++++
 include/linux/clearcpu.h        | 36 +++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)
 create mode 100644 include/linux/clearcpu.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..e6b7bf9174aa 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,9 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_HAS_CLEAR_CPU
+	def_bool n
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 15af091611e2..ee05fe6821eb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -84,6 +84,7 @@ config X86
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_CLEAR_CPU
 	select BUILDTIME_EXTABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index dc3d04da6779..3be0194f48dc 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -40,6 +40,11 @@ static inline void clear_cpu_idle(void)
 	}
 }
 
+static inline void lazy_clear_cpu(void)
+{
+	set_thread_flag(TIF_CLEAR_CPU);
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #endif
diff --git a/include/linux/clearcpu.h b/include/linux/clearcpu.h
new file mode 100644
index 000000000000..63a6952b46fa
--- /dev/null
+++ b/include/linux/clearcpu.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CLEARCPU_H
+#define _LINUX_CLEARCPU_H 1
+
+#include <linux/preempt.h>
+
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearcpu.h>
+#else
+static inline void lazy_clear_cpu(void)
+{
+}
+#endif
+
+/*
+ * Use this function when potentially touching (reading or writing)
+ * user data in an interrupt. In this case schedule to clear the
+ * CPU buffers on kernel exit to avoid any potential side channels.
+ *
+ * If not in an interrupt we assume the touched data belongs to the
+ * current process and doesn't need to be cleared.
+ *
+ * This version is for code that might be in an interrupt.
+ * If you know for sure you're in interrupt context, call
+ * lazy_clear_cpu directly.
+ *
+ * lazy_clear_cpu is reasonably cheap (just sets a bit) and
+ * can be used in fast paths.
+ */
+static inline void lazy_clear_cpu_interrupt(void)
+{
+	if (in_interrupt())
+		lazy_clear_cpu();
+}
+
+#endif
-- 
2.17.2


* [MODERATED] [PATCH v5 12/27] MDSv5 6
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (10 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 11/27] MDSv5 2 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22 14:01   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 13/27] MDSv5 17 Andi Kleen
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Schedule cpu clear on context
 switch

On context switch we need to schedule a cpu clear on the next
kernel exit when:

- we're switching between different processes
- we're switching from a kernel thread

For kernel threads, such as workqueues, we assume they might hold
sensitive (another user's or cryptographic) data.

The code hooks into the generic context switch, not
the mm context switch, because the mm context switch
doesn't handle the kernel thread case.

This also transfers the clear cpu bit to the next task.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>

---

v2: Clear on idle too.
---
 arch/x86/kernel/process.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 320ab978fb1f..f8c0b484a329 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -2,6 +2,7 @@
 //
 // Code shared between 32 and 64 bit
 
+#include <linux/clearcpu.h>
 #include <asm/spec-ctrl.h>
 
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
@@ -29,6 +30,30 @@ static inline void switch_to_extra(struct task_struct *prev,
 		}
 	}
 
+	/*
+	 * When we switch to a different process, or we switch
+	 * from a kernel thread, clear the CPU buffers on next kernel exit.
+	 *
+	 * This has to be here because switch_mm doesn't get
+	 * called in the kernel thread case.
+	 *
+	 * We flush when switching from idle too because idle
+	 * might inherit some leaked data from the SMT sibling.
+	 * This could be optimized for the SMT off case.
+	 */
+	if (static_cpu_has(X86_BUG_MDS)) {
+		if (next->mm != prev->mm || prev->mm == NULL)
+			lazy_clear_cpu();
+		/*
+		 * Also transfer the clearcpu flag from the previous task.
+		 * Can be done non atomically because interrupts are off.
+		 */
+		task_thread_info(next)->status |=
+			task_thread_info(prev)->status & _TIF_CLEAR_CPU;
+		task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;
+	}
+
 	/*
 	 * __switch_to_xtra() handles debug registers, i/o bitmaps,
 	 * speculation mitigations etc.
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 13/27] MDSv5 17
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (11 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 12/27] MDSv5 6 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 14/27] MDSv5 3 Andi Kleen
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add tracing for clear_cpu

Add trace points for clear_cpu and lazy_clear_cpu. This is useful
for debugging and performance testing.

The trace points have to be partially out of line to avoid
include loops, but the fast path jump labels are inlined.

The idle case cannot be traced because trace points cannot
be used from the idle loop (RCU is not watching there).
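
For reference, the new events show up under events/clearcpu/ in
tracefs (e.g. /sys/kernel/debug/tracing) and can be enabled by
writing 1 to the per-event enable files, like any other trace
points.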

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h       | 36 +++++++++++++++++++++++++--
 arch/x86/include/asm/trace/clearcpu.h | 27 ++++++++++++++++++++
 arch/x86/kernel/cpu/bugs.c            | 17 +++++++++++++
 3 files changed, 78 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 3be0194f48dc..74a2fa052dd5 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -7,12 +7,35 @@
 #include <asm/alternative.h>
 #include <linux/thread_info.h>
 
+/*
+ * We cannot include the trace point header directly here
+ * because that leads to include loops with other trace point
+ * files pulling this one in. Instead define the static
+ * key manually, which NOPs out the fast path; the actual
+ * tracing is done out of line.
+ */
+#ifdef CONFIG_TRACEPOINTS
+#include <asm/atomic.h>
+#include <linux/tracepoint-defs.h>
+
+extern struct tracepoint __tracepoint_clear_cpu;
+extern struct tracepoint __tracepoint_lazy_clear_cpu;
+#define cc_tracepoint_active(t) static_key_false(&(t).key)
+
+extern void do_trace_clear_cpu(void);
+extern void do_trace_lazy_clear_cpu(void);
+#else
+#define cc_tracepoint_active(t) false
+static inline void do_trace_clear_cpu(void) {}
+static inline void do_trace_lazy_clear_cpu(void) {}
+#endif
+
 /*
  * Clear CPU buffers to avoid side channels.
  * Microcode clears the buffers as a side effect of the obsolete VERW instruction.
  */
 
-static inline void clear_cpu(void)
+static inline void __clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use a register */
@@ -20,6 +43,13 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+static inline void clear_cpu(void)
+{
+	if (cc_tracepoint_active(__tracepoint_clear_cpu))
+		do_trace_clear_cpu();
+	__clear_cpu();
+}
+
 /*
  * Clear CPU buffers before going idle, so that no state is leaked to SMT
  * siblings taking over thread resources.
@@ -36,12 +66,14 @@ static inline void clear_cpu_idle(void)
 {
 	if (sched_smt_active()) {
 		clear_thread_flag(TIF_CLEAR_CPU);
-		clear_cpu();
+		__clear_cpu();
 	}
 }
 
 static inline void lazy_clear_cpu(void)
 {
+	if (cc_tracepoint_active(__tracepoint_lazy_clear_cpu))
+		do_trace_lazy_clear_cpu();
 	set_thread_flag(TIF_CLEAR_CPU);
 }
 
diff --git a/arch/x86/include/asm/trace/clearcpu.h b/arch/x86/include/asm/trace/clearcpu.h
new file mode 100644
index 000000000000..e742b5cd8ee9
--- /dev/null
+++ b/arch/x86/include/asm/trace/clearcpu.h
@@ -0,0 +1,27 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM clearcpu
+
+#if !defined(_TRACE_CLEARCPU_H) || defined(TRACE_HEADER_MULTI_READ)
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(clear_cpu,
+		    TP_PROTO(int dummy),
+		    TP_ARGS(dummy),
+		    TP_STRUCT__entry(__field(int, dummy)),
+		    TP_fast_assign(),
+		    TP_printk("%d", __entry->dummy));
+
+DEFINE_EVENT(clear_cpu, clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+DEFINE_EVENT(clear_cpu, lazy_clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+
+#define _TRACE_CLEARCPU_H
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE clearcpu
+#endif /* _TRACE_CLEARCPU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 715ab147f3e6..e80fba5d121a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1061,6 +1061,23 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/clearcpu.h>
+
+void do_trace_clear_cpu(void)
+{
+	trace_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(clear_cpu);
+
+void do_trace_lazy_clear_cpu(void)
+{
+	trace_lazy_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_lazy_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(lazy_clear_cpu);
+
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
 
 static void mds_select_mitigation(void)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 14/27] MDSv5 3
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (12 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 13/27] MDSv5 17 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:48   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 15:58   ` Thomas Gleixner
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 15/27] MDSv5 1 Andi Kleen
                   ` (14 subsequent siblings)
  28 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Force clear cpu on kernel preemption

When the kernel is preempted we need to force a cpu clear,
because the preemption might happen before the code has had
a chance to set TIF_CLEAR_CPU.

We cannot rely on kernel code setting the flag before
touching sensitive data: the flag may be set implicitly,
as in memzero_explicit, which is always called after the
sensitive data has been touched.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/sched/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a674c7db2f29..b04918e9115c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11,6 +11,8 @@
 
 #include <linux/kcov.h>
 
+#include <linux/clearcpu.h>
+
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
 
@@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
+	/*
+	 * For kernel preemption we need to force a cpu clear
+	 * because it could happen before the code has a chance
+	 * to set TIF_CLEAR_CPU.
+	 */
+	lazy_clear_cpu();
+
 	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 15/27] MDSv5 1
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (13 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 14/27] MDSv5 3 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:48   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 16/27] MDSv5 10 Andi Kleen
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Schedule cpu clear for memzero_explicit and
 kzfree

Assume that any code using these functions is sensitive and shouldn't
leak any data.

This handles clearing for key data used in the kernel.
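
For example, a crypto context teardown typically ends up covered
automatically (hypothetical function, not from this patch):

	static void mycrypto_free_ctx(struct mycrypto_ctx *ctx)
	{
		/*
		 * Zeroing the key also schedules a CPU buffer clear
		 * on the next kernel exit via lazy_clear_cpu().
		 */
		memzero_explicit(ctx->key, sizeof(ctx->key));
		kfree(ctx);
	}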

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 lib/string.c     | 6 ++++++
 mm/slab_common.c | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/string.c b/lib/string.c
index 38e4ca08e757..9ce59dd86541 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -28,6 +28,7 @@
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <linux/clearcpu.h>
 
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
@@ -715,12 +716,17 @@ EXPORT_SYMBOL(memset);
  * necessary, memzero_explicit() should be used instead in
  * order to prevent the compiler from optimising away zeroing.
  *
+ * As a side effect this may also trigger an extra clearing
+ * of CPU state before the next kernel exit to avoid
+ * side channels.
+ *
  * memzero_explicit() doesn't need an arch-specific version as
  * it just invokes the one of memset() implicitly.
  */
 void memzero_explicit(void *s, size_t count)
 {
 	memset(s, 0, count);
+	lazy_clear_cpu();
 	barrier_data(s);
 }
 EXPORT_SYMBOL(memzero_explicit);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 81732d05e74a..7b5e2e1318a2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1576,6 +1576,9 @@ EXPORT_SYMBOL(krealloc);
  * Note: this function zeroes the whole allocated buffer which can be a good
  * deal bigger than the requested buffer size passed to kmalloc(). So be
  * careful when using this function in performance sensitive code.
+ *
+ * As a side effect this may also schedule a clear of CPU state
+ * before the next kernel exit to avoid side channels.
  */
 void kzfree(const void *p)
 {
@@ -1585,7 +1588,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 16/27] MDSv5 10
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (14 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 15/27] MDSv5 1 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  4:54   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22  7:33   ` Greg KH
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 17/27] MDSv5 9 Andi Kleen
                   ` (12 subsequent siblings)
  28 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Mark interrupts clear cpu, unless opted-out

Interrupts might touch user data from other processes
in any context.

By default we clear the CPU on the next kernel exit.

Add a new IRQF_NO_USER interrupt flag. When the flag is not
set, interrupt execution schedules a cpu clear on the next
kernel exit.

This allows interrupt handlers to opt out of the extra clearing
overhead, while staying safe by default.

Over time, as more interrupt handlers are audited, they can set
the opt-out flag.
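
An audited handler that never touches user data could then opt
out at registration time (hypothetical device, not part of this
patch):

	err = request_irq(irq, mydev_irq_handler,
			  IRQF_SHARED | IRQF_NO_USER, "mydev", dev);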

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 2 ++
 kernel/irq/handle.c       | 8 ++++++++
 kernel/irq/manage.c       | 1 +
 3 files changed, 11 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..291b7fee3afe 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,7 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_USER	- Interrupt does not touch user data
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +75,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_USER		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e5910938ce2b 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/clearcpu.h>
 
 #include <trace/events/irq.h>
 
@@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 		res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
 
+		/*
+		 * We aren't sure if the interrupt handler did or did not
+		 * touch user data. Schedule a cpu clear just in case.
+		 */
+		if (!(action->flags & IRQF_NO_USER))
+			lazy_clear_cpu();
+
 		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pF enabled interrupts\n",
 			      irq, action->handler))
 			local_irq_disable();
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index a4888ce4667a..3f0c99240638 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1793,6 +1793,7 @@ EXPORT_SYMBOL(free_irq);
  *
  *	IRQF_SHARED		Interrupt is shared
  *	IRQF_TRIGGER_*		Specify active edge(s) or level
+ *	IRQF_NO_USER		Does not touch user data.
  *
  */
 int request_threaded_irq(unsigned int irq, irq_handler_t handler,
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 17/27] MDSv5 9
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (15 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 16/27] MDSv5 10 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 18/27] MDSv5 8 Andi Kleen
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Clear cpu on all timers, unless the timer
 opts-out

By default we assume timers might touch user data and schedule
a cpu clear on next kernel exit.

Support an opt-out where timer and hrtimer handlers can declare
that they don't touch any user data.

Note this takes one bit away from the timer wheel index field,
but it seems fewer wheel levels are needed anyway, so that
should be ok.
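
An audited timer could then opt out at init time, e.g.
(hypothetical callbacks, not part of this patch):

	/* hrtimer whose handler never touches user data */
	hrtimer_init(&mydev->timer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_REL | HRTIMER_MODE_NO_USER);

	/* timer_list equivalent using the new flag */
	timer_setup(&mydev->wheel_timer, mydev_timer_fn, TIMER_NO_USER);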

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/hrtimer.h | 4 ++++
 include/linux/timer.h   | 9 ++++++---
 kernel/time/hrtimer.c   | 5 +++++
 kernel/time/timer.c     | 8 ++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 2e8957eac4d4..b32c76919f78 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -32,6 +32,7 @@ struct hrtimer_cpu_base;
  *				  when starting the timer)
  * HRTIMER_MODE_SOFT		- Timer callback function will be executed in
  *				  soft irq context
+ * HRTIMER_MODE_NO_USER		- Handler does not touch user data.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -48,6 +49,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
 	HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,
 
+	HRTIMER_MODE_NO_USER	= 0x08,
 };
 
 /*
@@ -101,6 +103,7 @@ enum hrtimer_restart {
  * @state:	state information (See bit values above)
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
+ * @no_user:	function does not touch user data.
  *
  * The hrtimer structure must be initialized by hrtimer_init()
  */
@@ -112,6 +115,7 @@ struct hrtimer {
 	u8				state;
 	u8				is_rel;
 	u8				is_soft;
+	u8				no_user;
 };
 
 /**
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 7b066fd38248..222e72432be3 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -56,10 +56,13 @@ struct timer_list {
 #define TIMER_DEFERRABLE	0x00080000
 #define TIMER_PINNED		0x00100000
 #define TIMER_IRQSAFE		0x00200000
-#define TIMER_ARRAYSHIFT	22
-#define TIMER_ARRAYMASK		0xFFC00000
+#define TIMER_NO_USER		0x00400000
+#define TIMER_ARRAYSHIFT	23
+#define TIMER_ARRAYMASK		0xFF800000
 
-#define TIMER_TRACE_FLAGMASK	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
+#define TIMER_TRACE_FLAGMASK	\
+	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | \
+	 TIMER_NO_USER)
 
 #define __TIMER_INITIALIZER(_function, _flags) {		\
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f5cfa1b73d6f..e2c8776ba2a4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,6 +42,7 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 
@@ -1276,6 +1277,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		clock_id = CLOCK_MONOTONIC;
 
 	base += hrtimer_clockid_to_base(clock_id);
+	timer->no_user = !!(mode & HRTIMER_MODE_NO_USER);
 	timer->is_soft = softtimer;
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
@@ -1390,6 +1392,9 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	trace_hrtimer_expire_exit(timer);
 	raw_spin_lock_irq(&cpu_base->lock);
 
+	if (!timer->no_user)
+		lazy_clear_cpu();
+
 	/*
 	 * Note: We clear the running state after enqueue_hrtimer and
 	 * we do not reprogram the event hardware. Happens either in
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 444156debfa0..e6ab6986ffc8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -43,6 +43,7 @@
 #include <linux/sched/debug.h>
 #include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -1338,6 +1339,13 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(struct timer_list
 		 */
 		preempt_count_set(count);
 	}
+
+	/*
+	 * The timer might have touched user data. Schedule
+	 * a cpu clear on the next kernel exit.
+	 */
+	if (!(timer->flags & TIMER_NO_USER))
+		lazy_clear_cpu();
 }
 
 static void expire_timers(struct timer_base *base, struct hlist_head *head)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 18/27] MDSv5 8
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (16 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 17/27] MDSv5 9 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  5:07   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 19/27] MDSv5 12 Andi Kleen
                   ` (10 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

By default we assume tasklets might touch user data and schedule
a cpu clear on next kernel exit.

Add new interfaces to allow audited tasklets to opt out.
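
An audited tasklet could then be declared or initialized like
this (hypothetical names, not part of this patch):

	DECLARE_TASKLET_NOUSER(mydev_tasklet, mydev_tasklet_fn, 0);

	/* or at runtime: */
	tasklet_init_flags(&mydev->tasklet, mydev_tasklet_fn,
			   (unsigned long)mydev, TASKLET_NO_USER);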

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 16 +++++++++++++++-
 kernel/softirq.c          | 25 +++++++++++++++++++------
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 291b7fee3afe..81b852fb5ecf 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -571,11 +571,22 @@ struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 #define DECLARE_TASKLET_DISABLED(name, func, data) \
 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
 
+#define DECLARE_TASKLET_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(0), func, data }
+
+#define DECLARE_TASKLET_DISABLED_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(1), func, data }
 
 enum
 {
 	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
-	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
+	TASKLET_STATE_RUN,	/* Tasklet is running (SMP only) */
+
+	/*
+	 * Set this flag when the tasklet is known to not touch user data,
+	 * so it doesn't need extra CPU state clearing.
+	 */
+	TASKLET_NO_USER		= 1 << 5,
 };
 
 #ifdef CONFIG_SMP
@@ -639,6 +650,9 @@ extern void tasklet_kill(struct tasklet_struct *t);
 extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
+extern void tasklet_init_flags(struct tasklet_struct *t,
+			 void (*func)(unsigned long), unsigned long data,
+			 unsigned flags);
 
 struct tasklet_hrtimer {
 	struct hrtimer		timer;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d28813306b2c..fdd4e3be3db7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/clearcpu.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -522,6 +523,8 @@ static void tasklet_action_common(struct softirq_action *a,
 					BUG();
 				t->func(t->data);
 				tasklet_unlock(t);
+				if (!(t->state & TASKLET_NO_USER))
+					lazy_clear_cpu();
 				continue;
 			}
 			tasklet_unlock(t);
@@ -546,15 +549,23 @@ static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
 }
 
-void tasklet_init(struct tasklet_struct *t,
-		  void (*func)(unsigned long), unsigned long data)
+void tasklet_init_flags(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data,
+		  unsigned flags)
 {
 	t->next = NULL;
-	t->state = 0;
+	t->state = flags;
 	atomic_set(&t->count, 0);
 	t->func = func;
 	t->data = data;
 }
+EXPORT_SYMBOL(tasklet_init_flags);
+
+void tasklet_init(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data)
+{
+	tasklet_init_flags(t, func, data, 0);
+}
 EXPORT_SYMBOL(tasklet_init);
 
 void tasklet_kill(struct tasklet_struct *t)
@@ -609,7 +620,8 @@ static void __tasklet_hrtimer_trampoline(unsigned long data)
  * @ttimer:	 tasklet_hrtimer which is initialized
  * @function:	 hrtimer callback function which gets called from softirq context
  * @which_clock: clock id (CLOCK_MONOTONIC/CLOCK_REALTIME)
- * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL)
+ * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL),
+ *		 HRTIMER_MODE_NO_USER
  */
 void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 			  enum hrtimer_restart (*function)(struct hrtimer *),
@@ -617,8 +629,9 @@ void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 {
 	hrtimer_init(&ttimer->timer, which_clock, mode);
 	ttimer->timer.function = __hrtimer_tasklet_trampoline;
-	tasklet_init(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
-		     (unsigned long)ttimer);
+	tasklet_init_flags(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
+		     (unsigned long)ttimer,
+		     (mode & HRTIMER_MODE_NO_USER) ? TASKLET_NO_USER : 0);
 	ttimer->function = function;
 }
 EXPORT_SYMBOL_GPL(tasklet_hrtimer_init);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 19/27] MDSv5 12
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (17 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 18/27] MDSv5 8 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  5:09   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 20/27] MDSv5 27 Andi Kleen
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

By default we assume that irq poll handlers running in the irq poll
softirq might touch user data and we schedule a cpu clear on next
kernel exit.

Add interfaces for audited handlers to declare that they are safe.
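
An audited driver would then initialize its irq_poll instance
like this (hypothetical poll function, not part of this patch):

	irq_poll_init_flags(&mydev->iop, budget, mydev_poll,
			    IRQ_POLL_F_NO_USER);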

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/irq_poll.h |  2 ++
 lib/irq_poll.c           | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..5f13582f1b8e 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -15,6 +15,8 @@ struct irq_poll {
 enum {
 	IRQ_POLL_F_SCHED	= 0,
 	IRQ_POLL_F_DISABLE	= 1,
+
+	IRQ_POLL_F_NO_USER	= 1<<4,
 };
 
 extern void irq_poll_sched(struct irq_poll *);
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..cb19431f53ec 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -11,6 +11,7 @@
 #include <linux/cpu.h>
 #include <linux/irq_poll.h>
 #include <linux/delay.h>
+#include <linux/clearcpu.h>
 
 static unsigned int irq_poll_budget __read_mostly = 256;
 
@@ -111,6 +112,9 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
+		if (!(iop->state & IRQ_POLL_F_NO_USER))
+			lazy_clear_cpu();
+
 		local_irq_disable();
 
 		/*
@@ -168,21 +172,31 @@ void irq_poll_enable(struct irq_poll *iop)
 EXPORT_SYMBOL(irq_poll_enable);
 
 /**
- * irq_poll_init - Initialize this @iop
+ * irq_poll_init_flags - Initialize this @iop
  * @iop:      The parent iopoll structure
  * @weight:   The default weight (or command completion budget)
  * @poll_fn:  The handler to invoke
+ * @flags:    IRQ_POLL_F_NO_USER if callback does not touch user data.
  *
  * Description:
  *     Initialize and enable this irq_poll structure.
  **/
-void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+void irq_poll_init_flags(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn,
+			 int flags)
 {
 	memset(iop, 0, sizeof(*iop));
 	INIT_LIST_HEAD(&iop->list);
 	iop->weight = weight;
 	iop->poll = poll_fn;
+	iop->state = flags;
 }
+EXPORT_SYMBOL(irq_poll_init_flags);
+
+void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+{
+	irq_poll_init_flags(iop, weight, poll_fn, 0);
+}
 EXPORT_SYMBOL(irq_poll_init);
 
 static int irq_poll_cpu_dead(unsigned int cpu)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 20/27] MDSv5 27
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (18 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 19/27] MDSv5 12 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 21/27] MDSv5 20 Andi Kleen
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Schedule a cpu clear on the next kernel exit for string PIO
or memcpy_fromio/memcpy_toio calls when they are called in
interrupts.

The PIO case is likely already covered, because old drivers
won't have opted their interrupt handlers out of clearing,
but let's do it just to be sure.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/io.h | 3 +++
 include/asm-generic/io.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 686247db3106..19e2208eaa94 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/clearcpu.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -321,6 +322,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 			     : "+S"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }									\
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
@@ -337,6 +339,7 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 			     : "+D"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }
 
 BUILDIO(b, b, char)
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index d356f802945a..cf58bceea042 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -14,6 +14,7 @@
 #include <asm/page.h> /* I/O is all done through memory accesses */
 #include <linux/string.h> /* for memset() and memcpy() */
 #include <linux/types.h>
+#include <linux/clearcpu.h>
 
 #ifdef CONFIG_GENERIC_IOMAP
 #include <asm-generic/iomap.h>
@@ -1115,6 +1116,7 @@ static inline void memcpy_fromio(void *buffer,
 				 size_t size)
 {
 	memcpy(buffer, __io_virt(addr), size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
@@ -1132,6 +1134,7 @@ static inline void memcpy_toio(volatile void __iomem *addr, const void *buffer,
 			       size_t size)
 {
 	memcpy(__io_virt(addr), buffer, size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 21/27] MDSv5 20
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (19 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 20/27] MDSv5 27 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  5:11   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 22/27] MDSv5 24 Andi Kleen
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Schedule clear cpu in swiotlb

Schedule a cpu clear on next kernel exit for swiotlb running
in interrupt context, since it touches user data with the CPU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/dma/swiotlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d6361776dc5c..e11ff1e45a4c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -34,6 +34,7 @@
 #include <linux/scatterlist.h>
 #include <linux/mem_encrypt.h>
 #include <linux/set_memory.h>
+#include <linux/clearcpu.h>
 
 #include <asm/io.h>
 #include <asm/dma.h>
@@ -420,6 +421,7 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 	} else {
 		memcpy(phys_to_virt(orig_addr), vaddr, size);
 	}
+	lazy_clear_cpu_interrupt();
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 22/27] MDSv5 24
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (20 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 21/27] MDSv5 20 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-21 21:24   ` [MODERATED] " Linus Torvalds
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 23/27] MDSv5 22 Andi Kleen
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Instrument some strategic skbuff functions that either touch
packet data directly, or are likely followed by a user data
touch such as a memcpy, to schedule a cpu clear on the next
kernel exit. This is only done inside interrupts; outside of
them we assume only the current process's data is touched.

In principle network data should be encrypted anyway,
but it's better not to leak it.

This provides protection for the network softirq.

Needs more auditing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 93f56fddd92a..5e147afa07e4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <linux/clearcpu.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -2093,6 +2094,7 @@ static inline void *__skb_put(struct sk_buff *skb, unsigned int len)
 	SKB_LINEAR_ASSERT(skb);
 	skb->tail += len;
 	skb->len  += len;
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37317ffec146..eda9ef0ff63d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1189,6 +1189,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (!num_frags)
 		goto release;
 
+	/* Likely to copy user data */
+	lazy_clear_cpu_interrupt();
+
 	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
@@ -1353,6 +1356,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 	if (!n)
 		return NULL;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	/* Set the data pointer */
 	skb_reserve(n, headerlen);
 	/* Set the tail pointer and length */
@@ -1588,6 +1594,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	if (!n)
 		return NULL;
 
+	/* May copy user data */
+	lazy_clear_cpu_interrupt();
+
 	skb_reserve(n, newheadroom);
 
 	/* Set the tail pointer and length */
@@ -1676,6 +1685,8 @@ EXPORT_SYMBOL(__skb_pad);
 
 void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len)
 {
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	if (tail != skb) {
 		skb->data_len += len;
 		skb->len += len;
@@ -1701,6 +1712,8 @@ void *skb_put(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->tail > skb->end))
 		skb_over_panic(skb, len, __builtin_return_address(0));
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 EXPORT_SYMBOL(skb_put);
@@ -1720,6 +1733,7 @@ void *skb_push(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->data < skb->head))
 		skb_under_panic(skb, len, __builtin_return_address(0));
+	/* No clear cpu, assume this is only header data */
 	return skb->data;
 }
 EXPORT_SYMBOL(skb_push);
@@ -2026,6 +2040,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2387,6 +2404,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2467,6 +2487,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Checksum header. */
 	if (copy > 0) {
 		if (copy > len)
@@ -2559,6 +2582,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Copy header. */
 	if (copy > 0) {
 		if (copy > len)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 23/27] MDSv5 22
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (21 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 22/27] MDSv5 24 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 24/27] MDSv5 5 Andi Kleen
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Opt out tcp tasklet to not touch user data

Mark the tcp tasklet as not needing an implicit cpu clear.
If one is needed it will be triggered by the skb_*
hooks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 730bc44dbad9..06bc635a54ca 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -903,9 +903,10 @@ void __init tcp_tasklet_init(void)
 		struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
 
 		INIT_LIST_HEAD(&tsq->head);
-		tasklet_init(&tsq->tasklet,
+		tasklet_init_flags(&tsq->tasklet,
 			     tcp_tasklet_func,
-			     (unsigned long)tsq);
+			     (unsigned long)tsq,
+			     TASKLET_NO_USER);
 	}
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 24/27] MDSv5 5
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (22 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 23/27] MDSv5 22 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-21 21:20   ` [MODERATED] " Linus Torvalds
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 25/27] MDSv5 4 Andi Kleen
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Mark kernel/* timers as not touching user
 data

Some preliminary auditing of kernel/* shows no timers touch
other processes' user data. Mark all the timers in kernel/*
as not needing an implicit cpu clear.

More auditing here would be useful.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/events/core.c       | 6 ++++--
 kernel/fork.c              | 3 ++-
 kernel/futex.c             | 6 +++---
 kernel/sched/core.c        | 5 +++--
 kernel/sched/deadline.c    | 6 ++++--
 kernel/sched/fair.c        | 6 ++++--
 kernel/sched/idle.c        | 3 ++-
 kernel/sched/rt.c          | 3 ++-
 kernel/time/alarmtimer.c   | 2 +-
 kernel/time/hrtimer.c      | 6 +++---
 kernel/time/posix-timers.c | 6 ++++--
 kernel/time/sched_clock.c  | 3 ++-
 kernel/time/tick-sched.c   | 6 ++++--
 kernel/watchdog.c          | 3 ++-
 14 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cd13a30f732..5d9a4ed0cf58 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1102,7 +1102,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
 	raw_spin_lock_init(&cpuctx->hrtimer_lock);
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -9202,7 +9203,8 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hwc->hrtimer.function = perf_swevent_hrtimer;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e6f0e0..acb2626e40a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1542,7 +1542,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 #ifdef CONFIG_POSIX_TIMERS
 	INIT_LIST_HEAD(&sig->posix_timers);
-	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sig->real_timer.function = it_real_fn;
 #endif
 
diff --git a/kernel/futex.c b/kernel/futex.c
index be3bff2315ff..4ac7a412f04b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2691,7 +2691,7 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
@@ -2792,7 +2792,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires(&to->timer, *time);
 	}
@@ -3192,7 +3192,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b04918e9115c..6ca60c91cf30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -302,7 +302,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 */
 	delay = max_t(u64, delay, 10000LL);
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
-		      HRTIMER_MODE_REL_PINNED);
+		      HRTIMER_MODE_REL_PINNED|HRTIMER_MODE_NO_USER);
 }
 #endif /* CONFIG_SMP */
 
@@ -316,7 +316,8 @@ static void hrtick_rq_init(struct rq *rq)
 	rq->hrtick_csd.info = rq;
 #endif
 
-	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rq->hrtick_timer.function = hrtick;
 }
 #else	/* CONFIG_SCHED_HRTICK */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fb8b7b5d745d..dce637e0b3bd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1054,7 +1054,8 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = dl_task_timer;
 }
 
@@ -1293,7 +1294,8 @@ void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = inactive_task_timer;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50aa2aba69bd..b8cb9aad6b74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4889,9 +4889,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
-	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->distribute_running = 0;
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f5516bae0c1b..6a4cc46d8c4b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -330,7 +330,8 @@ void play_idle(unsigned long duration_ms)
 	cpuidle_use_deepest_state(true);
 
 	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC,
+			      HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	it.timer.function = idle_inject_timer_fn;
 	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e4f398ad9e73..24b90b260682 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -46,7 +46,8 @@ void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
 	raw_spin_lock_init(&rt_b->rt_runtime_lock);
 
 	hrtimer_init(&rt_b->rt_period_timer,
-			CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+			CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rt_b->rt_period_timer.function = sched_rt_period_timer;
 }
 
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 2c97e8c2d29f..f2efd9b5d0b7 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -344,7 +344,7 @@ void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		enum alarmtimer_restart (*function)(struct alarm *, ktime_t))
 {
 	hrtimer_init(&alarm->timer, alarm_bases[type].base_clockid,
-		     HRTIMER_MODE_ABS);
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	__alarm_init(alarm, type, function);
 }
 EXPORT_SYMBOL_GPL(alarm_init);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e2c8776ba2a4..58beefd3543a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1713,7 +1713,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	int ret;
 
 	hrtimer_init_on_stack(&t.timer, restart->nanosleep.clockid,
-				HRTIMER_MODE_ABS);
+				HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
 
 	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
@@ -1733,7 +1733,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
 	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
-	hrtimer_init_on_stack(&t.timer, clockid, mode);
+	hrtimer_init_on_stack(&t.timer, clockid, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, timespec64_to_ktime(*rqtp), slack);
 	ret = do_nanosleep(&t, mode);
 	if (ret != -ERESTART_RESTARTBLOCK)
@@ -1932,7 +1932,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
 		return -EINTR;
 	}
 
-	hrtimer_init_on_stack(&t.timer, clock_id, mode);
+	hrtimer_init_on_stack(&t.timer, clock_id, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, *expires, delta);
 
 	hrtimer_init_sleeper(&t, current);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0e84bb72a3da..0faf661cb4c8 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -464,7 +464,8 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 
 static int common_timer_create(struct k_itimer *new_timer)
 {
-	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock, 0);
+	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock,
+		HRTIMER_MODE_NO_USER);
 	return 0;
 }
 
@@ -789,7 +790,8 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
 	if (timr->it_clock == CLOCK_REALTIME)
 		timr->kclock = absolute ? &clock_realtime : &clock_monotonic;
 
-	hrtimer_init(&timr->it.real.timer, timr->it_clock, mode);
+	hrtimer_init(&timr->it.real.timer, timr->it_clock,
+		     mode|HRTIMER_MODE_NO_USER);
 	timr->it.real.timer.function = posix_timer_fn;
 
 	if (!absolute)
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 094b82ca95e5..e0a59ed9199f 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -249,7 +249,8 @@ void __init generic_sched_clock_init(void)
 	 * Start the timer to keep sched_clock() properly updated and
 	 * sets the initial epoch.
 	 */
-	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sched_clock_timer.function = sched_clock_poll;
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6fa52cd6df0b..b95f6f1e7bc3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1205,7 +1205,8 @@ static void tick_nohz_switch_to_nohz(void)
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	/* Get the next period */
 	next = tick_init_jiffy_update();
 
@@ -1302,7 +1303,8 @@ void tick_setup_sched_timer(void)
 	/*
 	 * Emulate tick processing via per-CPU hrtimers:
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	ts->sched_timer.function = tick_sched_timer;
 
 	/* Get the next period (per-CPU) */
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 977918d5d350..d3c9da0a4fce 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -483,7 +483,8 @@ static void watchdog_enable(unsigned int cpu)
 	 * Start the timer first to prevent the NMI watchdog triggering
 	 * before the timer has a chance to fire.
 	 */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hrtimer->function = watchdog_timer_fn;
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 25/27] MDSv5 4
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (23 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 24/27] MDSv5 5 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-22  5:15   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 26/27] MDSv5 11 Andi Kleen
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

AHCI interrupt handlers never touch user data with the CPU.

This is mainly to get the number of clears down on my test system.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/ata/ahci.c    |  2 +-
 drivers/ata/ahci.h    |  2 ++
 drivers/ata/libahci.c | 40 ++++++++++++++++++++++++----------------
 3 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 021ce46e2e57..1455ad89d2f9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1865,7 +1865,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	pci_set_master(pdev);
 
-	rc = ahci_host_activate(host, &ahci_sht);
+	rc = ahci_host_activate_irqflags(host, &ahci_sht, IRQF_NO_USER);
 	if (rc)
 		return rc;
 
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index 8810475f307a..093ea1856307 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -432,6 +432,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 int ahci_reset_em(struct ata_host *host);
 void ahci_print_info(struct ata_host *host, const char *scc_s);
 int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht);
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags);
 void ahci_error_handler(struct ata_port *ap);
 u32 ahci_handle_port_intr(struct ata_host *host, u32 irq_masked);
 
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index b5f57c69c487..b32664c7d8a1 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -2548,7 +2548,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 EXPORT_SYMBOL_GPL(ahci_set_em_messages);
 
 static int ahci_host_activate_multi_irqs(struct ata_host *host,
-					 struct scsi_host_template *sht)
+					 struct scsi_host_template *sht,
+					 int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int i, rc;
@@ -2571,7 +2572,7 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 		}
 
 		rc = devm_request_irq(host->dev, irq, ahci_multi_irqs_intr_hard,
-				0, pp->irq_desc, host->ports[i]);
+				irqflags, pp->irq_desc, host->ports[i]);
 
 		if (rc)
 			return rc;
@@ -2581,18 +2582,8 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 	return ata_host_register(host, sht);
 }
 
-/**
- *	ahci_host_activate - start AHCI host, request IRQs and register it
- *	@host: target ATA host
- *	@sht: scsi_host_template to use when registering the host
- *
- *	LOCKING:
- *	Inherited from calling layer (may sleep).
- *
- *	RETURNS:
- *	0 on success, -errno otherwise.
- */
-int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int irq = hpriv->irq;
@@ -2608,15 +2599,32 @@ int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
 			return -EIO;
 		}
 
-		rc = ahci_host_activate_multi_irqs(host, sht);
+		rc = ahci_host_activate_multi_irqs(host, sht, irqflags);
 	} else {
 		rc = ata_host_activate(host, irq, hpriv->irq_handler,
-				       IRQF_SHARED, sht);
+				       irqflags|IRQF_SHARED, sht);
 	}
 
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(ahci_host_activate_irqflags);
+
+/**
+ *	ahci_host_activate - start AHCI host, request IRQs and register it
+ *	@host: target ATA host
+ *	@sht: scsi_host_template to use when registering the host
+ *
+ *	LOCKING:
+ *	Inherited from calling layer (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, -errno otherwise.
+ */
+int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+{
+	return ahci_host_activate_irqflags(host, sht, 0);
+}
 EXPORT_SYMBOL_GPL(ahci_host_activate);
 
 MODULE_AUTHOR("Jeff Garzik");
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 26/27] MDSv5 11
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (24 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 25/27] MDSv5 4 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 27/27] MDSv5 25 Andi Kleen
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

ACPI doesn't touch any user data, so it doesn't need a cpu clear.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/acpi/osl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index f29e427d0d1d..f31064134b37 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,7 +572,8 @@ acpi_os_install_interrupt_handler(u32 gsi, acpi_osd_handler handler,
 
 	acpi_irq_handler = handler;
 	acpi_irq_context = context;
-	if (request_irq(irq, acpi_irq, IRQF_SHARED, "acpi", acpi_irq)) {
+	if (request_irq(irq, acpi_irq, IRQF_SHARED|IRQF_NO_USER,
+				"acpi", acpi_irq)) {
 		printk(KERN_ERR PREFIX "SCI (IRQ%d) allocation failed\n", irq);
 		acpi_irq_handler = NULL;
 		return AE_NOT_ACQUIRED;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] [PATCH v5 27/27] MDSv5 25
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (25 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 26/27] MDSv5 11 Andi Kleen
@ 2019-01-19  0:50 ` Andi Kleen
  2019-01-21 21:18 ` [MODERATED] Re: [PATCH v5 00/27] MDSv5 19 Linus Torvalds
  2019-01-28 11:34 ` Thomas Gleixner
  28 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-19  0:50 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

BPF allows the user to run untrusted code in the kernel.

Normally MDS would allow some information leakage, either
from other processes or sensitive kernel code, to the user
controlled BPF code. We cannot rule out that BPF code contains
an MDS exploit, and it is difficult to pattern match for one.

The patch adds a limited number of cpu clears
before BPF execution to make eBPF execution safe.

We assume BPF execution does not touch other users' data, so it
does not need to schedule a clear for itself.

For eBPF programs loaded by a privileged user we never clear.

When the BPF program was loaded unprivileged, clear the CPU
before the BPF execution, depending on the context it is running in:

We only do this when running in an interrupt, or if a cpu clear is
already scheduled (which means, for example, there was a context
switch or a crypto operation before).

In process context we check if the current process context
has the same userns+euid as the process that created the BPF program.
This handles the common seccomp filter case without
any extra clears, but still adds clears when e.g. a socket
filter runs on a socket inherited by a process with a different user id.

We also always clear when an earlier kernel subsystem scheduled
a clear, e.g. after a context switch or running crypto code.

Technically we would only need to do this if the BPF program
contains conditional branches and loads dominated by them, but
let's assume that nearly all do.

For example, when running chromium with seccomp filters I see
only 15-18% of all sandbox system calls trigger a clear; most
are likely caused by context switches.

Unprivileged eBPF usage in interrupts currently always clears.

This could be further optimized by allowing callers that do
a lot of individual BPF runs, and are sure they don't touch
other users' data (which is not accessible to the eBPF program
anyway) in between, to do the clear only once at the beginning.
We can add such optimizations later based on profile data.
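
For illustration, such a batched caller might look roughly like this
(hypothetical sketch, not part of this patch; it assumes the bpf_prog
fields added below and that nothing in the loop touches other users'
data):

	static unsigned int bpf_prog_run_batch(const struct bpf_prog *bp,
					       void **ctxs, int n)
	{
		unsigned int ret = 0;
		int i;

		if (!bp->priv)
			arch_bpf_prepare_nonpriv(bp->uid);	/* clear once */
		for (i = 0; i < n; i++)		/* no per-run clears */
			ret = bp->bpf_func(ctxs[i], bp->insnsi);
		return ret;
	}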

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearbpf.h | 29 +++++++++++++++++++++++++++++
 include/linux/filter.h          | 21 +++++++++++++++++++--
 kernel/bpf/core.c               |  2 ++
 3 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/clearbpf.h

diff --git a/arch/x86/include/asm/clearbpf.h b/arch/x86/include/asm/clearbpf.h
new file mode 100644
index 000000000000..dc1756722b48
--- /dev/null
+++ b/arch/x86/include/asm/clearbpf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARBPF_H
+#define _ASM_CLEARBPF_H 1
+
+#include <linux/clearcpu.h>
+#include <linux/cred.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * When the BPF program was loaded unprivileged, clear the CPU
+ * to prevent any exploits written in BPF using side channels to read
+ * data leaked from other kernel code. In some cases, like
+ * process context with the same uid, we can avoid it.
+ *
+ * See Documentation/clearcpu.txt for more details.
+ */
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+	if (!static_cpu_has(X86_BUG_MDS))
+		return;
+	if (in_interrupt() ||
+		test_thread_flag(TIF_CLEAR_CPU) ||
+		!uid_eq(current_euid(), uid)) {
+		clear_cpu();
+		clear_thread_flag(TIF_CLEAR_CPU);
+	}
+}
+
+#endif
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ad106d845b22..b32547b4bd92 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,12 +20,21 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/clearcpu.h>
 
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
 #include <uapi/linux/bpf.h>
 
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearbpf.h>
+#else
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+}
+#endif
+
 struct sk_buff;
 struct sock;
 struct seccomp_data;
@@ -490,7 +499,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				priv:1;		/* Was loaded privileged */
+	kuid_t			uid;		/* Original uid who created it */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -513,7 +524,13 @@ struct sk_filter {
 	struct bpf_prog	*prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+static inline unsigned _bpf_prog_run(const struct bpf_prog *bp, const void *ctx)
+{
+	if (!bp->priv)
+		arch_bpf_prepare_nonpriv(bp->uid);
+	return bp->bpf_func(ctx, bp->insnsi);
+}
+#define BPF_PROG_RUN(filter, ctx) _bpf_prog_run(filter, ctx)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f908b9356025..67d845229d46 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -99,6 +99,8 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 	fp->aux = aux;
 	fp->aux->prog = fp;
 	fp->jit_requested = ebpf_jit_enabled();
+	fp->priv = !!capable(CAP_SYS_ADMIN);
+	fp->uid = current_euid();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (26 preceding siblings ...)
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 27/27] MDSv5 25 Andi Kleen
@ 2019-01-21 21:18 ` Linus Torvalds
  2019-01-22  1:14   ` Andi Kleen
  2019-01-28 11:34 ` Thomas Gleixner
  28 siblings, 1 reply; 105+ messages in thread
From: Linus Torvalds @ 2019-01-21 21:18 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 8:54 AM speck for Andi Kleen
<speck@linutronix.de> wrote:
>
>   mds: Mark interrupts clear cpu, unless opted-out
>   mds: Clear cpu on all timers, unless the timer opts-out
>   mds: Clear CPU on tasklets, unless opted-out
>   mds: Clear CPU on irq poll, unless opted-out

I do wonder if this should just be opt-in instead of opt-out?

Just what is the attack vector, and what's the interrupt/timer data
that is so sensitive? It strikes me as a really hard thing to try to
fish out of the buffers later on.

                 Linus

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 24/27] MDSv5 5
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 24/27] MDSv5 5 Andi Kleen
@ 2019-01-21 21:20   ` Linus Torvalds
  0 siblings, 0 replies; 105+ messages in thread
From: Linus Torvalds @ 2019-01-21 21:20 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 8:58 AM speck for Andi Kleen
<speck@linutronix.de> wrote:
>
> Some preliminary auditing of kernel/* shows no timers touch
> other processes' user data. Mark all the timers in kernel/*
> as not needed an implicit cpu clear.

So this is an example of the  whole "the default seems wrong". Having
most (all?) timers basically come to the conclusion that they don't
care, and now need to set some flag to say so..

I think it's the people who touch sensitive stuff that should set the
flag. Exactly because they presumably know they are touching sensitive
stuff.

                Linus

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 22/27] MDSv5 24
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 22/27] MDSv5 24 Andi Kleen
@ 2019-01-21 21:24   ` Linus Torvalds
  2019-01-22  1:22     ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Linus Torvalds @ 2019-01-21 21:24 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 8:57 AM speck for Andi Kleen
<speck@linutronix.de> wrote:
>
> Instrument some strategic skbuff functions that either touch
> packet data directly, or are likely followed by a user
> data touch like a memcpy, to schedule a cpu clear on next
> kernel exit.

I think this is crazy.

We're marking things as "clear cpu state" for when we touch data that
WAS VISIBLE ON THE NETWORK!

That makes no sense to me.  Plus is likely hurts exactly the kinds of
loads that people don't want to hurt.

               Linus

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 06/27] MDSv5 18
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 06/27] MDSv5 18 Andi Kleen
@ 2019-01-21 22:41   ` Josh Poimboeuf
  2019-01-22  1:16     ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Josh Poimboeuf @ 2019-01-21 22:41 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:21PM -0800, speck for Andi Kleen wrote:
> --- a/arch/x86/include/asm/clearcpu.h
> +++ b/arch/x86/include/asm/clearcpu.h
> @@ -20,6 +20,26 @@ static inline void clear_cpu(void)
>  		[kernelds] "m" (kernel_ds));
>  }
>  
> +/*
> + * Clear CPU buffers before going idle, so that no state is leaked to SMT
> + * siblings taking over thread resources.
> + * Out of line to avoid include hell.
> + *
> + * Assumes that interrupts are disabled and only get reenabled
> + * before idle, otherwise the data from a racing interrupt might not
> + * get cleared. There are some callers who violate this,
> + * but they are only used in unattackable cases, like CPU
> + * offlining.
> + */
> +
> +static inline void clear_cpu_idle(void)
> +{
> +	if (sched_smt_active()) {
> +		clear_thread_flag(TIF_CLEAR_CPU);
> +		clear_cpu();
> +	}
> +}
> +
>  DECLARE_STATIC_KEY_FALSE(force_cpu_clear);

This causes an error with CONFIG_ACPI_PROCESSOR_AGGREGATOR:

  ERROR: "sched_smt_present" [drivers/acpi/acpi_pad.ko] undefined!

because sched_smt_present isn't exported.

> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index b2131c4ea124..b4406ca1dfd7 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -33,6 +33,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/cpu.h>
>  #include <acpi/processor.h>
> +#include <asm/clearcpu.h>

This should be s/asm/linux/ because this code can be used by non-x86
arches.

-- 
Josh

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-01-21 21:18 ` [MODERATED] Re: [PATCH v5 00/27] MDSv5 19 Linus Torvalds
@ 2019-01-22  1:14   ` Andi Kleen
  2019-01-22  7:38     ` Greg KH
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-22  1:14 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 10:18:05AM +1300, speck for Linus Torvalds wrote:
> On Tue, Jan 22, 2019 at 8:54 AM speck for Andi Kleen
> <speck@linutronix.de> wrote:
> >
> >   mds: Mark interrupts clear cpu, unless opted-out
> >   mds: Clear cpu on all timers, unless the timer opts-out
> >   mds: Clear CPU on tasklets, unless opted-out
> >   mds: Clear CPU on irq poll, unless opted-out
> 
> I do wonder if this should just be opt-in instead of opt-out?
> 
> Just what is the attack vector, and what's the interrupt/timer data

The attack vector is usually a copy. The string instructions can leave
data in places that are not overwritten by normal integer code.

> that is so sensitive? It strikes me as a really hard thing to try to

Part of it was being conservative. I don't really know from what
context the data a timer processes came from. But yes, most of the
time it's just some metadata, like pointers. But if it copies
user data then that could well be leaked.
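
For illustration, a hypothetical timer callback of that kind might look
like this (made-up driver names; lazy_clear_cpu() is the helper from
this series). Under the current opt-out scheme it simply must not be
marked NO_USER; in an opt-in scheme it would have to request the clear
itself:

	static void example_timer_fn(struct timer_list *t)
	{
		struct example_dev *dev = from_timer(dev, t, timer);

		/* rep movs can leave user bytes in CPU buffers */
		memcpy(dev->bounce, dev->user_buf, dev->len);

		/* so make sure a clear runs on the next kernel exit */
		lazy_clear_cpu();
	}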

The rationale for the opt-out was that it's difficult to audit
all the driver code.

I can audit kernel/* and yes for that part opt-out would make more sense.

But by being conservative, having simple rules, and not making
too many assumptions about unaudited code, we can make a good
case that the default policy is safe enough for nearly everyone,
so they don't need mds=full.

I'm open to other proposals:

In theory we could have a different default for different 
directories with some Makefile trickery, but that might be confusing?

Or we could try to find some semi-automated way to audit copies
in timers in drivers/* and do opt-in, but that's likely a substantial
project. It would also need to be repeated for backports and out of
tree code.

Thoughts?

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 06/27] MDSv5 18
  2019-01-21 22:41   ` [MODERATED] " Josh Poimboeuf
@ 2019-01-22  1:16     ` Andi Kleen
  0 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-22  1:16 UTC (permalink / raw)
  To: speck

On Mon, Jan 21, 2019 at 04:41:50PM -0600, speck for Josh Poimboeuf wrote:
> On Fri, Jan 18, 2019 at 04:50:21PM -0800, speck for Andi Kleen wrote:
> > --- a/arch/x86/include/asm/clearcpu.h
> > +++ b/arch/x86/include/asm/clearcpu.h
> > @@ -20,6 +20,26 @@ static inline void clear_cpu(void)
> >  		[kernelds] "m" (kernel_ds));
> >  }
> >  
> > +/*
> > + * Clear CPU buffers before going idle, so that no state is leaked to SMT
> > + * siblings taking over thread resources.
> > + * Out of line to avoid include hell.
> > + *
> > + * Assumes that interrupts are disabled and only get reenabled
> > + * before idle, otherwise the data from a racing interrupt might not
> > + * get cleared. There are some callers who violate this,
> > + * but they are only used in unattackable cases, like CPU
> > + * offlining.
> > + */
> > +
> > +static inline void clear_cpu_idle(void)
> > +{
> > +	if (sched_smt_active()) {
> > +		clear_thread_flag(TIF_CLEAR_CPU);
> > +		clear_cpu();
> > +	}
> > +}
> > +
> >  DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
> 
> This causes an error with CONFIG_ACPI_PROCESSOR_AGGREGATOR:
> 
>   ERROR: "sched_smt_present" [drivers/acpi/acpi_pad.ko] undefined!
> 
> because sched_smt_present isn't exported.

Yes, it's a regression from the previous version. Just re-add the hunk below.
I'll do so in the next version.

> 
> > diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> > index b2131c4ea124..b4406ca1dfd7 100644
> > --- a/drivers/acpi/processor_idle.c
> > +++ b/drivers/acpi/processor_idle.c
> > @@ -33,6 +33,7 @@
> >  #include <linux/cpuidle.h>
> >  #include <linux/cpu.h>
> >  #include <acpi/processor.h>
> > +#include <asm/clearcpu.h>
> 
> This should be s/asm/linux/ because this code can be used by non-x86
> arches.

Ok.

-Andi


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8cb9aad6b74..b9d2a617b105 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5982,6 +5982,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL(sched_smt_present);
 
 static inline void set_idle_cores(int cpu, int val)
 {

^ permalink raw reply related	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 22/27] MDSv5 24
  2019-01-21 21:24   ` [MODERATED] " Linus Torvalds
@ 2019-01-22  1:22     ` Andi Kleen
  2019-01-22 16:09       ` Thomas Gleixner
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-22  1:22 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 10:24:46AM +1300, speck for Linus Torvalds wrote:
> On Tue, Jan 22, 2019 at 8:57 AM speck for Andi Kleen
> <speck@linutronix.de> wrote:
> >
> > Instrument some strategic skbuff functions that either touch
> > packet data directly, or are likely followed by a user
> > data touch like a memcpy, to schedule a cpu clear on next
> > kernel exit.
> 
> I think this is crazy.
> 
> We're marking things as "clear cpu state" for when we touch data that
> WAS VISIBLE ON THE NETWORK!

Well there's loopback too and it should be encrypted, but yes. 

There could still be a reasonable expectation that different users
of the network are isolated.

We could drop it, but I fear it would encourage more use of mds=full.

Or perhaps do something different for loopback? Likely more complicated,
but possible.
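
For illustration, a loopback special case might look roughly like this
(hypothetical sketch; IFF_LOOPBACK and the sk_buff fields are as in
mainline, lazy_clear_cpu() is the helper from this series):

	static inline void skb_lazy_clear_cpu(const struct sk_buff *skb)
	{
		/*
		 * Data that was visible on the wire is not secret, but
		 * loopback traffic can carry another local user's data,
		 * so only schedule a clear for that case.
		 */
		if (skb->dev && (skb->dev->flags & IFF_LOOPBACK))
			lazy_clear_cpu();
	}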
 
> That makes no sense to me.  Plus is likely hurts exactly the kinds of
> loads that people don't want to hurt.

I'm not sure about that, actually.

A normal network server doing TCP/UDP shouldn't trigger it much,
because all the interesting operations on data are in process context,
and the workloads that do trigger it, like firewalling, are likely not
running much ring 3 code, so there won't be many clears either.
I haven't done any experiments to verify that though.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 01/27] MDSv5 26
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
@ 2019-01-22  4:17   ` Konrad Rzeszutek Wilk
  2019-01-22 12:46   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:17 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:16PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add basic bug infrastructure
>  for MDS
> 
> MDS is micro architectural data sampling, which is a side channel
> attack on internal buffers in Intel CPUs.
> 
> MDS consists of multiple sub-vulnerabilities:
> Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
> Microarchitectual Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
> Microarchitectual Load Port Data (MLPDS) (CVE-2018-12127),
> with the first leaking store data, and the second loads and sometimes
> store data, and the third load data.
> 
> They all have the same mitigations for single thread, so we lump them all
> together as a single MDS issue.
> 
> This patch adds the basic infrastructure to detect if the current
> CPU is affected by MDS, and if yes set the right BUG bits.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 02/27] MDSv5 14
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 02/27] MDSv5 14 Andi Kleen
@ 2019-01-22  4:20   ` Konrad Rzeszutek Wilk
  2019-01-22 12:51   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:20 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:17PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add mds=off
> 
> Normally we execute VERW for clearing the cpu unconditionally on kernel exits
> that might have touched sensitive. Add a new flag to disable VERW usage.

s/sensitive/sensitive data/

> This is intended for systems that only run trusted code and don't
> want the performance impact of the extra clearing.

"And it is set if the CPU exposes MDS_NO (no need for mitigations) as well."
> 
> This just sets the flag, actual implementation is in future patches.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> 
> ---
> 
> v2: Also force mds=off for MDS_NO
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>  arch/x86/include/asm/cpufeatures.h              |  1 +
>  arch/x86/kernel/cpu/bugs.c                      | 10 ++++++++++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index b799bcf67d7b..9c967d0caeca 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2357,6 +2357,9 @@
>  			Format: <first>,<last>
>  			Specifies range of consoles to be captured by the MDA.
>  
> +	mds=off		[X86, Intel]
> +			Disable workarounds for Micro-architectural Data Sampling.
> +
>  	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
>  			Amount of memory to be used when the kernel is not able
>  			to see the whole system memory or for test.
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 233ca598826f..09347c6a8901 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -221,6 +221,7 @@
>  #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
>  #define X86_FEATURE_L1TF_PTEINV		( 7*32+29) /* "" L1TF workaround PTE inversion */
>  #define X86_FEATURE_IBRS_ENHANCED	( 7*32+30) /* Enhanced IBRS */
> +#define X86_FEATURE_NO_VERW		( 7*32+31) /* "" No VERW for MDS on kernel exit */
>  
>  /* Virtualization flags: Linux defined, word 8 */
>  #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 1de0f4170178..2fd8faa7e23a 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -37,6 +37,7 @@
>  static void __init spectre_v2_select_mitigation(void);
>  static void __init ssb_select_mitigation(void);
>  static void __init l1tf_select_mitigation(void);
> +static void __init mds_select_mitigation(void);
>  
>  /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
>  u64 x86_spec_ctrl_base;
> @@ -101,6 +102,8 @@ void __init check_bugs(void)
>  
>  	l1tf_select_mitigation();
>  
> +	mds_select_mitigation();
> +
>  #ifdef CONFIG_X86_32
>  	/*
>  	 * Check whether we are able to run this kernel safely on SMP.
> @@ -1058,6 +1061,13 @@ early_param("l1tf", l1tf_cmdline);
>  
>  #undef pr_fmt
>  
> +static void mds_select_mitigation(void)
> +{
> +	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
> +		!boot_cpu_has_bug(X86_BUG_MDS))
> +		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
> +}
> +
>  #ifdef CONFIG_SYSFS
>  
>  #define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 03/27] MDSv5 16
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
@ 2019-01-22  4:23   ` Konrad Rzeszutek Wilk
  2019-01-22 12:55   ` Thomas Gleixner
  2019-01-27 21:58   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:23 UTC (permalink / raw)
  To: speck

> index 000000000000..530ef619ac1b
> --- /dev/null
> +++ b/arch/x86/include/asm/clearcpu.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_CLEARCPU_H
> +#define _ASM_CLEARCPU_H 1
> +
> +#include <linux/jump_label.h>
> +#include <linux/sched/smt.h>
> +#include <asm/alternative.h>
> +#include <linux/thread_info.h>

This being a new file .. any chance these can be sorted?

Either way:

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 04/27] MDSv5 15
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 04/27] MDSv5 15 Andi Kleen
@ 2019-01-22  4:33   ` Konrad Rzeszutek Wilk
  2019-01-22 12:59   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:33 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:19PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Support mds=full
> 
> Support a new command line option to support unconditional flushing
> on each kernel exit. This is not enabled by default.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> 
> ---
> 
> v2: Don't enable mds=full for MDS_NO because it will be a nop.

> ---
>  Documentation/admin-guide/kernel-parameters.txt | 5 +++++
>  arch/x86/entry/common.c                         | 7 ++++++-
>  arch/x86/include/asm/clearcpu.h                 | 2 ++
>  arch/x86/kernel/cpu/bugs.c                      | 5 +++++
>  4 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 9c967d0caeca..5f5a8808c475 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2360,6 +2360,11 @@
>  	mds=off		[X86, Intel]
>  			Disable workarounds for Micro-architectural Data Sampling.
>  
> +	mds=full	[X86, Intel]
> +			Always flush cpu buffers when exiting kernel for MDS.

.. which implies that the microcode must be loaded. But right now you could do
'mds=full' on a machine _without_ the microcode and it would just do 'verw'.

And that unpatched 'verw' would most certainly _not_ flush CPU buffers. See
below in mds_select_mitigation.

> +			Normally the kernel decides dynamically when flushing is
> +			needed or not.

Can you follow the same standard as 'ssbd' and 'l1tf' - which is that this
turns into 'mds=[off,full]' and then each option has an explanation, please?

> +
>  	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
>  			Amount of memory to be used when the kernel is not able
>  			to see the whole system memory or for test.
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 924f8dab2068..66c08e1d493a 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
>  
>  		if (cached_flags & _TIF_CLEAR_CPU) {
>  			clear_thread_flag(TIF_CLEAR_CPU);
> -			clear_cpu();
> +			/* Don't do it twice if forced */
> +			if (!static_key_enabled(&force_cpu_clear))
> +				clear_cpu();
>  		}
>  
>  		/* Disable IRQs and retry */
> @@ -217,6 +219,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>  	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
>  #endif
>  
> +	if (static_key_enabled(&force_cpu_clear))
> +		clear_cpu();
> +
>  	user_enter_irqoff();
>  }
>  
> diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
> index 530ef619ac1b..3b8ee76b9c07 100644
> --- a/arch/x86/include/asm/clearcpu.h
> +++ b/arch/x86/include/asm/clearcpu.h
> @@ -20,4 +20,6 @@ static inline void clear_cpu(void)
>  		[kernelds] "m" (kernel_ds));
>  }
>  
> +DECLARE_STATIC_KEY_FALSE(force_cpu_clear);

'force_cpu_clear' sounds quite vague - as in, in three months I will not
remember the name of this. Perhaps 'force_verw'? Or 'force_mds_verw'?


> +
>  #endif
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 2fd8faa7e23a..ce0e367753ff 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1061,11 +1061,16 @@ early_param("l1tf", l1tf_cmdline);
>  
>  #undef pr_fmt
>  
> +DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
> +
>  static void mds_select_mitigation(void)
>  {
>  	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
>  		!boot_cpu_has_bug(X86_BUG_MDS))
>  		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
> +	if (cmdline_find_option_bool(boot_command_line, "mds=full") &&
> +		boot_cpu_has_bug(X86_BUG_MDS))
> +		static_branch_enable(&force_cpu_clear);

The 'mds=full' option can be used on machines without the new microcode, and
it sets MDS (twice) and also does 'VERW' without any benefit.

Why not make this:

if (!boot_cpu_has_bug(X86_BUG_MDS)) {
	setup_force_cpu_cap(X86_FEATURE_NO_VERW);
	return;
} else {
	if (cmdline_find_option_bool(boot_command_line, "mds=off"))
		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
		static_branch_enable(&force_cpu_clear);
}

?


>  }
>  
>  #ifdef CONFIG_SYSFS
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 05/27] MDSv5 21
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
@ 2019-01-22  4:35   ` Konrad Rzeszutek Wilk
  2019-01-22 13:01   ` Thomas Gleixner
  2019-02-21 12:06   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:35 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:20PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Support mds=full for NMIs
> 
> NMIs don't go through the normal exit code when exiting
> to user space. Normally we consider NMIs not sensitive anyways,
> but they need special handling with mds=full.
> So add an explicit check to do_nmi to clear the CPU with mds=full

s/to do_nmi/in do_nmi/
> 
> Suggested-by: Josh Poimboeuf

His email got lost?

> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/nmi.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 18bc9b51ac9b..eb6e39238d1d 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -494,7 +494,7 @@ do_nmi(struct pt_regs *regs, long error_code)
>  {
>  	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
>  		this_cpu_write(nmi_state, NMI_LATCHED);
> -		return;
> +		goto out;
>  	}
>  	this_cpu_write(nmi_state, NMI_EXECUTING);
>  	this_cpu_write(nmi_cr2, read_cr2());
> @@ -533,6 +533,10 @@ do_nmi(struct pt_regs *regs, long error_code)
>  		write_cr2(this_cpu_read(nmi_cr2));
>  	if (this_cpu_dec_return(nmi_state))
>  		goto nmi_restart;
> +
> +out:
> +	if (static_key_enabled(&force_cpu_clear))
> +		clear_cpu();
>  }
>  NOKPROBE_SYMBOL(do_nmi);
>  
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 07/27] MDSv5 0
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
@ 2019-01-22  4:39   ` Konrad Rzeszutek Wilk
  2019-01-27 22:09   ` Thomas Gleixner
  2019-02-13 22:26   ` [MODERATED] " Tyler Hicks
  2 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:39 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:22PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add sysfs reporting
> 
> Report mds mitigation state in sysfs vulnerabilities.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 08/27] MDSv5 13
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 08/27] MDSv5 13 Andi Kleen
@ 2019-01-22  4:40   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:40 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:23PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Export MD_CLEAR CPUID to KVM
>  guests.
> 
> Export the MD_CLEAR CPUID set by new microcode to signal
> that VERW implements the clear cpu side effect to KVM guests.
> 
> Also requires corresponding qemu patches
> 
> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  arch/x86/kvm/cpuid.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index bbffa6c54697..d61272f50aed 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -409,7 +409,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  	/* cpuid 7.0.edx*/
>  	const u32 kvm_cpuid_7_0_edx_x86_features =
>  		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> -		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP);
> +		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
> +		F(MD_CLEAR);
>  
>  	/* all calls to cpuid_count() should be made on the same cpu */
>  	get_cpu();
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 14/27] MDSv5 3 Andi Kleen
@ 2019-01-22  4:48   ` Konrad Rzeszutek Wilk
  2019-01-22 15:58   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:48 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:29PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Force clear cpu on kernel preemption
> 
> When the kernel is preempted we need to force a cpu clear,
> because the preemption might happen before the code
> has a chance to set TIF_CPU_CLEAR later.
> 
> We cannot rely on kernel code setting the flag before
> touching sensitive data: the flag setting could
> be implicit, like in memzero_explicit, which is always
> called later.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  kernel/sched/core.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a674c7db2f29..b04918e9115c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11,6 +11,8 @@
>  
>  #include <linux/kcov.h>
>  
> +#include <linux/clearcpu.h>
> +
>  #include <asm/switch_to.h>
>  #include <asm/tlb.h>
>  
> @@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
>  	if (likely(!preemptible()))
>  		return;
>  
> +	/*
> +	 * For kernel preemption we need to force a cpu clear
> +	 * because it could happen before the code has a chance
> +	 * to set TIF_CLEAR_CPU.
> +	 */
> +	lazy_clear_cpu();
> +
>  	preempt_schedule_common();
>  }
>  NOKPROBE_SYMBOL(preempt_schedule);
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 15/27] MDSv5 1
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 15/27] MDSv5 1 Andi Kleen
@ 2019-01-22  4:48   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:48 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:30PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Schedule cpu clear for memzero_explicit and
>  kzfree
> 
> Assume that any code using these functions is sensitive and shouldn't
> leak any data.
> 
> This handles clearing for key data used in the kernel.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

> ---
>  lib/string.c     | 6 ++++++
>  mm/slab_common.c | 5 ++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/string.c b/lib/string.c
> index 38e4ca08e757..9ce59dd86541 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -28,6 +28,7 @@
>  #include <linux/bug.h>
>  #include <linux/errno.h>
>  #include <linux/slab.h>
> +#include <linux/clearcpu.h>
>  
>  #include <asm/byteorder.h>
>  #include <asm/word-at-a-time.h>
> @@ -715,12 +716,17 @@ EXPORT_SYMBOL(memset);
>   * necessary, memzero_explicit() should be used instead in
>   * order to prevent the compiler from optimising away zeroing.
>   *
> + * As a side effect this may also trigger extra cleaning
> + * of CPU state before the next kernel exit to avoid
> + * side channels.
> + *
>   * memzero_explicit() doesn't need an arch-specific version as
>   * it just invokes the one of memset() implicitly.
>   */
>  void memzero_explicit(void *s, size_t count)
>  {
>  	memset(s, 0, count);
> +	lazy_clear_cpu();
>  	barrier_data(s);
>  }
>  EXPORT_SYMBOL(memzero_explicit);
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 81732d05e74a..7b5e2e1318a2 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1576,6 +1576,9 @@ EXPORT_SYMBOL(krealloc);
>   * Note: this function zeroes the whole allocated buffer which can be a good
>   * deal bigger than the requested buffer size passed to kmalloc(). So be
>   * careful when using this function in performance sensitive code.
> + *
> + * As a side effect this may also clear CPU state later before the
> + * next kernel exit to avoid side channels.
>   */
>  void kzfree(const void *p)
>  {
> @@ -1585,7 +1588,7 @@ void kzfree(const void *p)
>  	if (unlikely(ZERO_OR_NULL_PTR(mem)))
>  		return;
>  	ks = ksize(mem);
> -	memset(mem, 0, ks);
> +	memzero_explicit(mem, ks);
>  	kfree(mem);
>  }
>  EXPORT_SYMBOL(kzfree);
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 16/27] MDSv5 10
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 16/27] MDSv5 10 Andi Kleen
@ 2019-01-22  4:54   ` Konrad Rzeszutek Wilk
  2019-01-22  7:33   ` Greg KH
  1 sibling, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:54 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:31PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Mark interrupts clear cpu, unless opted-out
> 
> Interrupts might touch user data from other processes
> in any context.
> 
> By default we clear the CPU on the next kernel exit.
> 
> Add a new IRQ_F_NO_USER interrupt flag. When the flag
> is not set on interrupt execution we clear the cpu state on

s/we clear the cpu state/we flush the CPU's MDS state/ ?

'cpu state' implies (at least to me) everything - even the caches.
But that is not what we do - we do our lazy CPU flushing.
> next kernel exit.
> 
> This allows interrupts to opt-out from the extra clearing
> overhead, but is safe by default.

s/but is safe by default/if they are sanitized and carry no user data./ ?

> 
> Over time as more interrupt code is audited it can set the opt-out.

s/it can set the opt-out/we can opt-out various code/ ?

Either way:
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  include/linux/interrupt.h | 2 ++
>  kernel/irq/handle.c       | 8 ++++++++
>  kernel/irq/manage.c       | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..291b7fee3afe 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -61,6 +61,7 @@
>   *                interrupt handler after suspending interrupts. For system
>   *                wakeup devices users need to implement wakeup detection in
>   *                their interrupt handlers.
> + * IRQF_NO_USER	- Interrupt does not touch user data
>   */
>  #define IRQF_SHARED		0x00000080
>  #define IRQF_PROBE_SHARED	0x00000100
> @@ -74,6 +75,7 @@
>  #define IRQF_NO_THREAD		0x00010000
>  #define IRQF_EARLY_RESUME	0x00020000
>  #define IRQF_COND_SUSPEND	0x00040000
> +#define IRQF_NO_USER		0x00080000
>  
>  #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
>  
> diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
> index 38554bc35375..e5910938ce2b 100644
> --- a/kernel/irq/handle.c
> +++ b/kernel/irq/handle.c
> @@ -13,6 +13,7 @@
>  #include <linux/sched.h>
>  #include <linux/interrupt.h>
>  #include <linux/kernel_stat.h>
> +#include <linux/clearcpu.h>
>  
>  #include <trace/events/irq.h>
>  
> @@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
>  		res = action->handler(irq, action->dev_id);
>  		trace_irq_handler_exit(irq, action, res);
>  
> +		/*
> +		 * We aren't sure if the interrupt handler did or did not
> +		 * touch user data. Schedule a cpu clear just in case.
> +		 */
> +		if (!(action->flags & IRQF_NO_USER))
> +			lazy_clear_cpu();
> +
>  		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pF enabled interrupts\n",
>  			      irq, action->handler))
>  			local_irq_disable();
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index a4888ce4667a..3f0c99240638 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1793,6 +1793,7 @@ EXPORT_SYMBOL(free_irq);
>   *
>   *	IRQF_SHARED		Interrupt is shared
>   *	IRQF_TRIGGER_*		Specify active edge(s) or level
> + *	IRQF_NOUSER		Does not touch user data.
>   *
>   */
>  int request_threaded_irq(unsigned int irq, irq_handler_t handler,
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 09/27] MDSv5 23
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
@ 2019-01-22  4:56   ` Konrad Rzeszutek Wilk
  2019-01-22  7:26   ` Greg KH
  2019-01-22 13:07   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  4:56 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:24PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Add documentation for clear cpu usage
> 
> Including the theory, and some guide lines for subsystem/driver
> maintainers.

I think you should move this to be among the last patches in the set.

The reason being that when I was reading this patchset I was looking for
IRQF_NO_USER and IRQ_POLL_F_NO_USER in the preceding patches, but they
were declared much later.

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 18/27] MDSv5 8
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 18/27] MDSv5 8 Andi Kleen
@ 2019-01-22  5:07   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  5:07 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:33PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Clear CPU on tasklets, unless opted-out
> 
> By default we assume tasklets might touch user data and schedule
> a cpu clear on next kernel exit.
> 
> Add new interfaces to allow audited tasklets to opt-out.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  include/linux/interrupt.h | 16 +++++++++++++++-
>  kernel/softirq.c          | 25 +++++++++++++++++++------
>  2 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index 291b7fee3afe..81b852fb5ecf 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -571,11 +571,22 @@ struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
>  #define DECLARE_TASKLET_DISABLED(name, func, data) \
>  struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
>  
> +#define DECLARE_TASKLET_NOUSER(name, func, data) \
> +struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(0), func, data }
> +
> +#define DECLARE_TASKLET_DISABLED_NOUSER(name, func, data) \
> +struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(1), func, data }
>  
>  enum
>  {
>  	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
> -	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
> +	TASKLET_STATE_RUN,	/* Tasklet is running (SMP only) */

I think it would be worth converting these to explicit bit values, i.e.
1 << 0, 1 << 1, and so on.
> +
> +	/*
> +	 * Set this flag when the tasklet is known to not touch user data,
> +	 * so doesn't need extra CPU state clearing.
> +	 */
> +	TASKLET_NO_USER		= 1 << 5,
>  };
>  
>  #ifdef CONFIG_SMP
> @@ -639,6 +650,9 @@ extern void tasklet_kill(struct tasklet_struct *t);
>  extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
>  extern void tasklet_init(struct tasklet_struct *t,
>  			 void (*func)(unsigned long), unsigned long data);
> +extern void tasklet_init_flags(struct tasklet_struct *t,
> +			 void (*func)(unsigned long), unsigned long data,
> +			 unsigned flags);
>  
>  struct tasklet_hrtimer {
>  	struct hrtimer		timer;
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index d28813306b2c..fdd4e3be3db7 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -26,6 +26,7 @@
>  #include <linux/smpboot.h>
>  #include <linux/tick.h>
>  #include <linux/irq.h>
> +#include <linux/clearcpu.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/irq.h>
> @@ -522,6 +523,8 @@ static void tasklet_action_common(struct softirq_action *a,
>  					BUG();
>  				t->func(t->data);
>  				tasklet_unlock(t);
> +				if (!(t->state & TASKLET_NO_USER))
> +					lazy_clear_cpu();
>  				continue;
>  			}
>  			tasklet_unlock(t);
> @@ -546,15 +549,23 @@ static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
>  	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
>  }
>  
> -void tasklet_init(struct tasklet_struct *t,
> -		  void (*func)(unsigned long), unsigned long data)
> +void tasklet_init_flags(struct tasklet_struct *t,
> +		  void (*func)(unsigned long), unsigned long data,
> +		  unsigned flags)
>  {
>  	t->next = NULL;
> -	t->state = 0;
> +	t->state = flags;

So say we have another customer setting state to TASKLET_NO_USER. That means
the check in ksoftirqd_running:

92         return tsk && (tsk->state == TASK_RUNNING);

will always return false. That is, state will have both 1<<5 and 1<<1 set,
but the check above is for 1<<1 only.

76         if (tsk && tsk->state != TASK_RUNNING)
 77                 wake_up_process(tsk);

Also, wakeup_softirqd may call wake_up_process multiple times.

Could you modify those functions please?

>  	atomic_set(&t->count, 0);
>  	t->func = func;
>  	t->data = data;
>  }
> +EXPORT_SYMBOL(tasklet_init_flags);
> +
> +void tasklet_init(struct tasklet_struct *t,
> +		  void (*func)(unsigned long), unsigned long data)
> +{
> +	tasklet_init_flags(t, func, data, 0);
> +}
>  EXPORT_SYMBOL(tasklet_init);
>  
>  void tasklet_kill(struct tasklet_struct *t)
> @@ -609,7 +620,8 @@ static void __tasklet_hrtimer_trampoline(unsigned long data)
>   * @ttimer:	 tasklet_hrtimer which is initialized
>   * @function:	 hrtimer callback function which gets called from softirq context
>   * @which_clock: clock id (CLOCK_MONOTONIC/CLOCK_REALTIME)
> - * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL)
> + * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL),
> + *		 HRTIMER_MODE_NO_USER
>   */
>  void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
>  			  enum hrtimer_restart (*function)(struct hrtimer *),
> @@ -617,8 +629,9 @@ void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
>  {
>  	hrtimer_init(&ttimer->timer, which_clock, mode);
>  	ttimer->timer.function = __hrtimer_tasklet_trampoline;
> -	tasklet_init(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
> -		     (unsigned long)ttimer);
> +	tasklet_init_flags(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
> +		     (unsigned long)ttimer,
> +		     (mode & HRTIMER_MODE_NO_USER) ? TASKLET_NO_USER : 0);
>  	ttimer->function = function;
>  }
>  EXPORT_SYMBOL_GPL(tasklet_hrtimer_init);
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 19/27] MDSv5 12
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 19/27] MDSv5 12 Andi Kleen
@ 2019-01-22  5:09   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  5:09 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:34PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Clear CPU on irq poll, unless opted-out
> 
> By default we assume that irq poll handlers running in the irq poll
> softirq might touch user data and we schedule a cpu clear on next
> kernel exit.
> 
> Add interfaces for audited handlers to declare that they are safe.
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  include/linux/irq_poll.h |  2 ++
>  lib/irq_poll.c           | 18 ++++++++++++++++--
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
> index 16aaeccb65cb..5f13582f1b8e 100644
> --- a/include/linux/irq_poll.h
> +++ b/include/linux/irq_poll.h
> @@ -15,6 +15,8 @@ struct irq_poll {
>  enum {
>  	IRQ_POLL_F_SCHED	= 0,
>  	IRQ_POLL_F_DISABLE	= 1,
> +
> +	IRQ_POLL_F_NO_USER	= 1<<4,
>  };
>  
>  extern void irq_poll_sched(struct irq_poll *);
> diff --git a/lib/irq_poll.c b/lib/irq_poll.c
> index 86a709954f5a..cb19431f53ec 100644
> --- a/lib/irq_poll.c
> +++ b/lib/irq_poll.c
> @@ -11,6 +11,7 @@
>  #include <linux/cpu.h>
>  #include <linux/irq_poll.h>
>  #include <linux/delay.h>
> +#include <linux/clearcpu.h>
>  
>  static unsigned int irq_poll_budget __read_mostly = 256;
>  
> @@ -111,6 +112,9 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>  
>  		budget -= work;
>  
> +		if (!(iop->state & IRQ_POLL_F_NO_USER))
> +			lazy_clear_cpu();
> +
>  		local_irq_disable();
>  
>  		/*
> @@ -168,21 +172,31 @@ void irq_poll_enable(struct irq_poll *iop)
>  EXPORT_SYMBOL(irq_poll_enable);
>  
>  /**
> - * irq_poll_init - Initialize this @iop
> + * irq_poll_init_flags - Initialize this @iop
>   * @iop:      The parent iopoll structure
>   * @weight:   The default weight (or command completion budget)
>   * @poll_fn:  The handler to invoke
> + * @flags:    IRQ_POLL_F_NO_USER if callback does not touch user data.
>   *
>   * Description:
>   *     Initialize and enable this irq_poll structure.
>   **/
> -void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
> +void irq_poll_init_flags(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn,
> +			 int flags)
>  {
>  	memset(iop, 0, sizeof(*iop));
>  	INIT_LIST_HEAD(&iop->list);
>  	iop->weight = weight;
>  	iop->poll = poll_fn;
> +	iop->state = flags;
>  }
> +EXPORT_SYMBOL(irq_poll_init_flags);
> +
> +void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
> +{
> +	return irq_poll_init_flags(iop, weight, poll_fn, 0);
> +}
> +
>  EXPORT_SYMBOL(irq_poll_init);
>  
>  static int irq_poll_cpu_dead(unsigned int cpu)
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 21/27] MDSv5 20
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 21/27] MDSv5 20 Andi Kleen
@ 2019-01-22  5:11   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  5:11 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:36PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Schedule clear cpu in swiotlb
> 
> Schedule a cpu clear on next kernel exit for swiotlb running
> in interrupt context, since it touches user data with the CPU.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Thank you!
> ---
>  kernel/dma/swiotlb.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index d6361776dc5c..e11ff1e45a4c 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -34,6 +34,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/mem_encrypt.h>
>  #include <linux/set_memory.h>
> +#include <linux/clearcpu.h>
>  
>  #include <asm/io.h>
>  #include <asm/dma.h>
> @@ -420,6 +421,7 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
>  	} else {
>  		memcpy(phys_to_virt(orig_addr), vaddr, size);
>  	}
> +	lazy_clear_cpu_interrupt();
>  }
>  
>  phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 25/27] MDSv5 4
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 25/27] MDSv5 4 Andi Kleen
@ 2019-01-22  5:15   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 105+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-22  5:15 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:40PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Mark AHCI interrupt as not needing cpu clear
> 
> AHCI interrupt handlers never touch user data with the CPU.

Can you expand a bit? Asking this as folks will follow this as an example
of auditing an interrupt handler - and this commit does not have much
data on how you came to this conclusion. It would be helpful for
other developers to know what you looked for - especially, say, if a
bus ran you over!

Thank you.
> 
> Just to get the number of clears down on my test system.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  drivers/ata/ahci.c    |  2 +-
>  drivers/ata/ahci.h    |  2 ++
>  drivers/ata/libahci.c | 40 ++++++++++++++++++++++++----------------
>  3 files changed, 27 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index 021ce46e2e57..1455ad89d2f9 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1865,7 +1865,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>  
>  	pci_set_master(pdev);
>  
> -	rc = ahci_host_activate(host, &ahci_sht);
> +	rc = ahci_host_activate_irqflags(host, &ahci_sht, IRQF_NO_USER);
>  	if (rc)
>  		return rc;
>  
> diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
> index 8810475f307a..093ea1856307 100644
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -432,6 +432,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
>  int ahci_reset_em(struct ata_host *host);
>  void ahci_print_info(struct ata_host *host, const char *scc_s);
>  int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht);
> +int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
> +				int irqflags);
>  void ahci_error_handler(struct ata_port *ap);
>  u32 ahci_handle_port_intr(struct ata_host *host, u32 irq_masked);
>  
> diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
> index b5f57c69c487..b32664c7d8a1 100644
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -2548,7 +2548,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
>  EXPORT_SYMBOL_GPL(ahci_set_em_messages);
>  
>  static int ahci_host_activate_multi_irqs(struct ata_host *host,
> -					 struct scsi_host_template *sht)
> +					 struct scsi_host_template *sht,
> +					 int irqflags)
>  {
>  	struct ahci_host_priv *hpriv = host->private_data;
>  	int i, rc;
> @@ -2571,7 +2572,7 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
>  		}
>  
>  		rc = devm_request_irq(host->dev, irq, ahci_multi_irqs_intr_hard,
> -				0, pp->irq_desc, host->ports[i]);
> +				irqflags, pp->irq_desc, host->ports[i]);
>  
>  		if (rc)
>  			return rc;
> @@ -2581,18 +2582,8 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
>  	return ata_host_register(host, sht);
>  }
>  
> -/**
> - *	ahci_host_activate - start AHCI host, request IRQs and register it
> - *	@host: target ATA host
> - *	@sht: scsi_host_template to use when registering the host
> - *
> - *	LOCKING:
> - *	Inherited from calling layer (may sleep).
> - *
> - *	RETURNS:
> - *	0 on success, -errno otherwise.
> - */
> -int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
> +int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
> +				int irqflags)
>  {
>  	struct ahci_host_priv *hpriv = host->private_data;
>  	int irq = hpriv->irq;
> @@ -2608,15 +2599,32 @@ int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
>  			return -EIO;
>  		}
>  
> -		rc = ahci_host_activate_multi_irqs(host, sht);
> +		rc = ahci_host_activate_multi_irqs(host, sht, irqflags);
>  	} else {
>  		rc = ata_host_activate(host, irq, hpriv->irq_handler,
> -				       IRQF_SHARED, sht);
> +				       irqflags|IRQF_SHARED, sht);
>  	}
>  
>  
>  	return rc;
>  }
> +EXPORT_SYMBOL_GPL(ahci_host_activate_irqflags);
> +
> +/**
> + *	ahci_host_activate - start AHCI host, request IRQs and register it
> + *	@host: target ATA host
> + *	@sht: scsi_host_template to use when registering the host
> + *
> + *	LOCKING:
> + *	Inherited from calling layer (may sleep).
> + *
> + *	RETURNS:
> + *	0 on success, -errno otherwise.
> + */
> +int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
> +{
> +	return ahci_host_activate_irqflags(host, sht, 0);
> +}
>  EXPORT_SYMBOL_GPL(ahci_host_activate);
>  
>  MODULE_AUTHOR("Jeff Garzik");
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 09/27] MDSv5 23
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
  2019-01-22  4:56   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22  7:26   ` Greg KH
  2019-01-22 13:07   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-01-22  7:26 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:24PM -0800, speck for Andi Kleen wrote:
> +Hard interrupts, tasklets, timers which can run asynchronous are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.

Are you missing a "not" in that sentence somewhere?  As written, this
doesn't make much sense to me.

> +Most interrupt handlers for modern devices should not touch
> +user data, because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.

You don't ever really define what "user data" is in this file, to help
people in trying to audit this type of thing.

For example, are keystrokes "user data"?  Is block i/o?

thanks,

greg k-h


* [MODERATED] Re: [PATCH v5 16/27] MDSv5 10
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 16/27] MDSv5 10 Andi Kleen
  2019-01-22  4:54   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22  7:33   ` Greg KH
  1 sibling, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-01-22  7:33 UTC (permalink / raw)
  To: speck

On Fri, Jan 18, 2019 at 04:50:31PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Mark interrupts clear cpu, unless opted-out
> 
> Interrupts might touch user data from other processes
> in any context.
> 
> By default we clear the CPU on the next kernel exit.
> 
> Add a new IRQ_F_NO_USER interrupt flag. When the flag
> is not set on interrupt execution we clear the cpu state on
> next kernel exit.
> 
> This allows interrupts to opt-out from the extra clearing
> overhead, but is safe by default.
> 
> Over time as more interrupt code is audited it can set the opt-out.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  include/linux/interrupt.h | 2 ++
>  kernel/irq/handle.c       | 8 ++++++++
>  kernel/irq/manage.c       | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index c672f34235e7..291b7fee3afe 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -61,6 +61,7 @@
>   *                interrupt handler after suspending interrupts. For system
>   *                wakeup devices users need to implement wakeup detection in
>   *                their interrupt handlers.
> + * IRQF_NO_USER	- Interrupt does not touch user data
>   */
>  #define IRQF_SHARED		0x00000080
>  #define IRQF_PROBE_SHARED	0x00000100
> @@ -74,6 +75,7 @@
>  #define IRQF_NO_THREAD		0x00010000
>  #define IRQF_EARLY_RESUME	0x00020000
>  #define IRQF_COND_SUSPEND	0x00040000
> +#define IRQF_NO_USER		0x00080000

I know you want to be "safe", but I think Linus's comment about having
this be "opt in" is better from an auditing point of view in that you
should mark an IRQ as touching user data as you should know that you are
doing that.
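
I.e. an audited driver would then do something like (hypothetical
example, with the flag inverted into an IRQF_USER_DATA):

	/* this handler is known to copy packet payloads around */
	ret = devm_request_irq(dev, irq, foo_intr, IRQF_USER_DATA,
			       "foo", priv);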

> --- a/kernel/irq/handle.c
> +++ b/kernel/irq/handle.c
> @@ -13,6 +13,7 @@
>  #include <linux/sched.h>
>  #include <linux/interrupt.h>
>  #include <linux/kernel_stat.h>
> +#include <linux/clearcpu.h>
>  
>  #include <trace/events/irq.h>
>  
> @@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
>  		res = action->handler(irq, action->dev_id);
>  		trace_irq_handler_exit(irq, action, res);
>  
> +		/*
> +		 * We aren't sure if the interrupt handler did or did not
> +		 * touch user data. Schedule a cpu clear just in case.
> +		 */
> +		if (!(action->flags & IRQF_NO_USER))
> +			lazy_clear_cpu();

We should be sure.  Why can't we be sure?  We know what the irq did,
right, we wrote it :)

thanks,

greg k-h


* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-01-22  1:14   ` Andi Kleen
@ 2019-01-22  7:38     ` Greg KH
  0 siblings, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-01-22  7:38 UTC (permalink / raw)
  To: speck

On Mon, Jan 21, 2019 at 05:14:17PM -0800, speck for Andi Kleen wrote:
> The rationale for the opt-out was that it's hard to audit
> all the driver code.

To quote Dave Jones, "kernel programming is hard, let's go shopping..."

Come on, we know what the drivers do by the "type" they are.  And if
they are "generic" (i.e. USB host controllers), then we also know that,
right?  Ask for help if you don't know what the driver type is, we have
a bunch of people here who might just know :)

> I can audit kernel/* and yes for that part opt-out would make more sense.
> 
> But by being conservative, having simple rules, and not making
> too many assumptions about unaudited code, we can make
> a good case that the default policy is safe enough for nearly everyone,
> so they don't need mds=full.
> 
> I'm open to other proposals:
> 
> In theory we could have a different default for different 
> directories with some Makefile trickery, but that might be confusing?
> 
> Or we could try to find some semi-automated way to audit copies
> in timers in drivers/* and do opt-in, but that's likely a substantial project.

But it would be valuable information to know, right?  We are going to
have to determine this somehow eventually as these types of issues are
not going away.

> It would also need to be repeated for backports and out of tree.

Backports can deal with their own stuff, as can out-of-tree crap, if
they actually care.

thanks,

greg k-h


* Re: [PATCH v5 01/27] MDSv5 26
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
  2019-01-22  4:17   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 12:46   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 12:46 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:

> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add basic bug infrastructure
>  for MDS
> 
> MDS is microarchitectural data sampling, which is a side channel
> attack on internal buffers in Intel CPUs.
> 
> MDS consists of multiple sub-vulnerabilities:
> Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
> Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
> Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127),
> with the first leaking store data, the second leaking load and
> sometimes store data, and the third leaking load data.
> 
> They all have the same mitigations for single thread, so we lump them all
> together as a single MDS issue.
> 
> This patch adds the basic infrastructure to detect if the current
> CPU is affected by MDS, and if yes set the right BUG bits.

Please read and follow Documentation/process/submitting-patches.rst.

Especially this paragraph:

  Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
  instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to
  do frotz", as if you are giving orders to the codebase to change its
  behaviour.

I asked for this over and over and no, I'm not going to fixup your changelogs
again.
 
Thanks,

	tglx


* Re: [PATCH v5 02/27] MDSv5 14
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 02/27] MDSv5 14 Andi Kleen
  2019-01-22  4:20   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 12:51   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 12:51 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:

> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add mds=off
> 
> Normally we execute VERW for clearing the cpu unconditionally on kernel exits

So what's normally?

Nothing. This is patch 2 of the series and nothing does VERW
anywhere. Changelogs have to make sense on their own and not require
knowledge of patches further down the road.

Thanks,

	tglx


* Re: [PATCH v5 03/27] MDSv5 16
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
  2019-01-22  4:23   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 12:55   ` Thomas Gleixner
  2019-01-27 21:58   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 12:55 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
> new file mode 100644
> index 000000000000..530ef619ac1b
> --- /dev/null
> +++ b/arch/x86/include/asm/clearcpu.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_CLEARCPU_H
> +#define _ASM_CLEARCPU_H 1
> +
> +#include <linux/jump_label.h>
> +#include <linux/sched/smt.h>
> +#include <asm/alternative.h>
> +#include <linux/thread_info.h>

Is there a reason why this needs an extra header file?

> +
> +/*
> + * Clear CPU buffers to avoid side channels.
> + * We use microcode as a side effect of the obsolete VERW instruction
> + */
> +
> +static inline void clear_cpu(void)

clear_cpu is way too broad. Please choose a function name and also a TIF name
which makes it clear what this is about.


> +{
> +	unsigned kernel_ds = __KERNEL_DS;

Newline between variable declaration and code/comment.

> +	/* Has to be memory form, don't modify to use a register */
> +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> +		[kernelds] "m" (kernel_ds));

Please align the second line properly with the first line's first argument:

> +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> +			  [kernelds] "m" (kernel_ds));

Thanks,

	tglx


* Re: [PATCH v5 04/27] MDSv5 15
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 04/27] MDSv5 15 Andi Kleen
  2019-01-22  4:33   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 12:59   ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 12:59 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
>  	mds=off		[X86, Intel]
>  			Disable workarounds for Micro-architectural Data Sampling.
>  
> +	mds=full	[X86, Intel]
> +			Always flush cpu buffers when exiting kernel for MDS.
> +			Normally the kernel decides dynamically when flushing is
> +			needed or not.

Errm, no. All other mitigations use

      bug=		.....

and have the options documented there. Please stay consistent.

> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 924f8dab2068..66c08e1d493a 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
>  
>  		if (cached_flags & _TIF_CLEAR_CPU) {
>  			clear_thread_flag(TIF_CLEAR_CPU);
> -			clear_cpu();
> +			/* Don't do it twice if forced */
> +			if (!static_key_enabled(&force_cpu_clear))
> +				clear_cpu();

Wouldn't it be smarter not to set the TIF flag at all if forced mode is
enabled?

Thanks,

	tglx


* Re: [PATCH v5 05/27] MDSv5 21
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
  2019-01-22  4:35   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 13:01   ` Thomas Gleixner
  2019-02-21 12:06   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 13:01 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:

> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Support mds=full for NMIs
> 
> NMIs don't go through the normal exit code when exiting
> to user space. Normally we consider NMIs not sensitive anyways,
> but they need special handling with mds=full.
> So add an explicit check to do_nmi to clear the CPU with mds=full
> 
> Suggested-by: Josh Poimboeuf

I assume Josh has an email address


* Re: [PATCH v5 09/27] MDSv5 23
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
  2019-01-22  4:56   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22  7:26   ` Greg KH
@ 2019-01-22 13:07   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 13:07 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> ---
>  Documentation/clearcpu.txt | 172 +++++++++++++++++++++++++++++++++++++

Random choice of placement. There is Documentation/x86.


* Re: [PATCH v5 11/27] MDSv5 2
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 11/27] MDSv5 2 Andi Kleen
@ 2019-01-22 13:11   ` Thomas Gleixner
  0 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 13:11 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
>  
> +static inline void lazy_clear_cpu(void)
> +{
> +	set_thread_flag(TIF_CLEAR_CPU);

Setting this unconditionally is just wrong. Why does the return to user
path need to go into the slow path just to figure out that the BUG bit is not set
or the mitigation has been disabled on the command line?
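
Something like the below would avoid that (minimal sketch with the
names from this patchkit, untested):

	static inline void lazy_clear_cpu(void)
	{
		/* No MDS, or clearing forced anyway: don't bother with TIF */
		if (!static_cpu_has(X86_BUG_MDS) ||
		    static_key_enabled(&force_cpu_clear))
			return;
		set_thread_flag(TIF_CLEAR_CPU);
	}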

Thanks,

	tglx


* Re: [PATCH v5 12/27] MDSv5 6
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 12/27] MDSv5 6 Andi Kleen
@ 2019-01-22 14:01   ` Thomas Gleixner
  2019-01-22 15:42     ` Thomas Gleixner
  2019-01-22 18:01     ` [MODERATED] " Andi Kleen
  0 siblings, 2 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 14:01 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
>  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
> @@ -29,6 +30,30 @@ static inline void switch_to_extra(struct task_struct *prev,
>  		}
>  	}
>  
> +	/*
> +	 * When we switch to a different process, or we switch
> +	 * from a kernel thread, clear the CPU buffers on next kernel exit.
> +	 *
> +	 * This has to be here because switch_mm doesn't get
> +	 * called in the kernel thread case.
> +	 *
> +	 * We flush when switching from idle too because idle
> +	 * might inherit some leaked data from the SMT sibling.
> +	 * This could be optimized for the SMT off case.
> +	 */
> +	if (static_cpu_has(X86_BUG_MDS)) {

Again, why is this evaluated when the mitigation is turned off or forced?

> +		if (next->mm != prev->mm || prev->mm == NULL)
> +			lazy_clear_cpu();

This sets the bit even when switching between two kernel threads. Makes a
lot of sense ...

> +		/*
> +		 * Also transfer the clearcpu flag from the previous task.
> +		 * Can be done non atomically because interrupts are off.
> +		 */
> +		task_thread_info(next)->status |=
> +			task_thread_info(prev)->status & _TIF_CLEAR_CPU;
> +		task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;

What has thread_info->status to do with this? TIF flags are in
thread_info->flags. Brilliant stuff that.

I assume that Linus suggested the TIF flag to avoid yet another conditional
in the syscall path, which makes sense, but the above does not make sense
at all. That needs way more thought. 
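
If you want to transfer the bit, then operate on the real flags word,
e.g. with the existing task flag helpers (sketch only):

	if (test_and_clear_tsk_thread_flag(prev, TIF_CLEAR_CPU))
		set_tsk_thread_flag(next, TIF_CLEAR_CPU);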

Thanks,

	tglx


* Re: [PATCH v5 12/27] MDSv5 6
  2019-01-22 14:01   ` Thomas Gleixner
@ 2019-01-22 15:42     ` Thomas Gleixner
  2019-01-22 18:01     ` [MODERATED] " Andi Kleen
  1 sibling, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 15:42 UTC (permalink / raw)
  To: speck

On Tue, 22 Jan 2019, speck for Thomas Gleixner wrote:

> On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> >  void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
> > @@ -29,6 +30,30 @@ static inline void switch_to_extra(struct task_struct *prev,
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * When we switch to a different process, or we switch
> > +	 * from a kernel thread, clear the CPU buffers on next kernel exit.
> > +	 *
> > +	 * This has to be here because switch_mm doesn't get
> > +	 * called in the kernel thread case.

That's true, but enter_lazy_tlb() is called and there exists already an
indicator that it switched from a user space task to a kernel task:
cpu_tlbstate.is_lazy, which is evaluated in the next invocation of
switch_mm_irqs_off().

So the question is, whether something like this makes sense:

   - Have some indicator in cpu_tlbstate that switching is due

     cpu_tlbstate.tif_flags

     and use that TIF bit.

In the sys_exit() path do

   cached_flags = READ_ONCE(ti->flags);

   if (static_key_enabled(mds_cond_clear))
   	cached_flags |= READ_ONCE(cpu_tlbstate.tif_flags);

That's an extra read, but especially with PTI this is cache hot anyway and
the store of the flag is done in switch_mm_irqs_off(). Haven't thought it
through, but at first glance this looks simpler and makes the whole
thing stick to the CPU instead of playing games with transferring the
thread flag on every context switch.

Thanks,

	tglx


* Re: [PATCH v5 14/27] MDSv5 3
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 14/27] MDSv5 3 Andi Kleen
  2019-01-22  4:48   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-22 15:58   ` Thomas Gleixner
  2019-01-22 17:57     ` Thomas Gleixner
  1 sibling, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 15:58 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:

> From: Andi Kleen <ak@linux.intel.com>
> Subject:  mds: Force clear cpu on kernel preemption
> 
> When the kernel is preempted we need to force a cpu clear,
> because the preemption might happen before the code
> has a chance to set TIF_CPU_CLEAR later.
> 
> We cannot rely on kernel code setting the flag before
> touching sensitive data: the flag setting could
> be implicit, like in memzero_explicit, which is always
> called later.

That sentence doesn't parse at all.

> @@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
>  	if (likely(!preemptible()))
>  		return;
>  
> +	/*
> +	 * For kernel preemption we need to force a cpu clear
> +	 * because it could happen before the code has a chance
> +	 * to set TIF_CLEAR_CPU.
> +	 */
> +	lazy_clear_cpu();
> +

And looking at this makes it entirely clear that glueing everything to a
single thread flag and sprinkling cpu_clear() into randomly chosen
interfaces is just tinkering.

The requests to clear cpu state have semantically different reasons:

  1) Switching context

  2) Explicit knowledge of touching sensitive data

#1 is per CPU scheduling state

#2 needs to be tracked where sensitive data is touched and that's not a
   simple binary on/off. What you need for that is:

   start_touching_sensitive_data()

   stop_touching_sensitive_data()

   And then preemption can handle accordingly and the state is preserved on
   migration as well.
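
   A first cut could be as simple as (sketch; the sensitive_depth
   counter is made up for illustration):

	static inline void start_touching_sensitive_data(void)
	{
		current->sensitive_depth++;
	}

	static inline void stop_touching_sensitive_data(void)
	{
		/* leaving the outermost section: request a flush */
		if (!--current->sensitive_depth)
			lazy_clear_cpu();
	}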

Thanks,

	tglx


* Re: [PATCH v5 22/27] MDSv5 24
  2019-01-22  1:22     ` Andi Kleen
@ 2019-01-22 16:09       ` Thomas Gleixner
  2019-01-22 17:56         ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 16:09 UTC (permalink / raw)
  To: speck

On Mon, 21 Jan 2019, speck for Andi Kleen wrote:
> On Tue, Jan 22, 2019 at 10:24:46AM +1300, speck for Linus Torvalds wrote:
> > I think this is crazy.
> > 
> > We're marking things as "clear cpu state" for when we touch data that
> > WAS VISIBLE ON THE NETWORK!
> 
> Well there's loopback too and it should be encrypted, but yes. 
> 
> There could be still a reasonable expectation that different users
> of the network are isolated.
> 
> We could drop it, but I fear it would encourage more use of mds=full.

Well, looking at where you slap the conditionals into the code (timers,
hrtimers, interrupts, tasklets ...), with all of these marked unsafe by
default, I don't see how that's different from mds=full.

The only sane way IMO is to have mds=cond just handle context switching and
then have proper

    start_touching_sensitive_data()
    
    stop_touching_sensitive_data()

annotation in those places which actually can expose stuff and not the
other way round. That avoids _all_ the crazy add ons to timers,interrupts
etc. and reduces the thing to something sensible.

And the work to audit is exactly the same because to make mds=cond truly
different from mds=full you need to audit the world and some more as well.

Thanks,

	tglx


* [MODERATED] Re: [PATCH v5 22/27] MDSv5 24
  2019-01-22 16:09       ` Thomas Gleixner
@ 2019-01-22 17:56         ` Andi Kleen
  2019-01-22 18:56           ` Thomas Gleixner
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-22 17:56 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 05:09:05PM +0100, speck for Thomas Gleixner wrote:
> On Mon, 21 Jan 2019, speck for Andi Kleen wrote:
> > On Tue, Jan 22, 2019 at 10:24:46AM +1300, speck for Linus Torvalds wrote:
> > > I think this is crazy.
> > > 
> > > We're marking things as "clear cpu state" for when we touch data that
> > > WAS VISIBLE ON THE NETWORK!
> > 
> > Well there's loopback too and it should be encrypted, but yes. 
> > 
> > There could be still a reasonable expectation that different users
> > of the network are isolated.
> > 
> > We could drop it, but I fear it would encourage more use of mds=full.
> 
> Well, looking at where you slap the conditionals into the code (timers,
> hrtimers, interrupts, tasklets ...), with all of these marked unsafe by
> default, I don't see how that's different from mds=full.

At least in my limited testing the patch doesn't cause that
actually, even though it may be counterintuitive.

See the numbers for Chrome for example in the last EBPF patch. That's
a complex workload with many context switches, and it gets
a clear roughly every third syscall

We also see similar results in the benchmarks. For example
loopback apache has practically no overhead because everything
interesting happens in process context.

I think the reason is that most timers/tasklets/etc. are actually
fairly rare and don't really matter. 

I suspect the same is true for most interrupt handlers
too. Every driver that really cares about bandwidth performance
already has some form of interrupt mitigation or polling to limit interrupt
overhead, and just adding a few clears doesn't really matter.

So this only leaves some latency sensitive workloads,
which cannot mitigate interrupts, but the interrupt handlers for those can
be fixed over time based on profiling results. Overall I suspect it
will be only a small subset of the total number of drivers.

Of course that really needs to be validated with more testing.

> The only sane way IMO is to have mds=cond just handle context switching and
> then have proper

It's only sane if you have a good way to find and maintain all these places.

So far nobody has proposed a scalable way to do that. I personally
don't know how to do it. 

-Andi


* Re: [PATCH v5 14/27] MDSv5 3
  2019-01-22 15:58   ` Thomas Gleixner
@ 2019-01-22 17:57     ` Thomas Gleixner
  2019-01-23  1:35       ` [MODERATED] " Andi Kleen
  2019-02-16  2:00       ` [MODERATED] " Andi Kleen
  0 siblings, 2 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 17:57 UTC (permalink / raw)
  To: speck

On Tue, 22 Jan 2019, speck for Thomas Gleixner wrote:
> And looking at this makes it entirely clear that glueing everything to a
> single thread flag and sprinkling cpu_clear() into randomly chosen
> interfaces is just tinkering.
> 
> The requests to clear cpu state have semantically different reasons:
> 
>   1) Switching context
> 
>   2) Explicit knowledge of touching sensitive data
> 
> #1 is per CPU scheduling state
> 
> #2 needs to be tracked where sensitive data is touched and that's not a
>    simple binary on/off. What you need for that is:
> 
>    start_touching_sensitive_data()
> 
>    stop_touching_sensitive_data()
> 
>    And then preemtion can handle accordingly and the state is preserved on
>    migration as well.

Thought more about it while walking the dogs. It's even less complex
because it's simply strict per cpu state.

Let's look at the different scenarios:

  process A  ->  process A    Thread switch in same process,
                              do nothing

  process A  ->  process B    Switches mm -> set percpu flush

  process A  ->  kernel       Do nothing (rely on lazy mm)
  kernel     ->  process B    Switches mm -> set percpu flush
  kernel     ->  kernel       do nothing
  kernel     ->  process A    if lazy mm -> set percpu flush

  process A in syscall
    interrupt                 handler touches sensitive data
                              -> set percpu flush

    softinterrupt             handler touches sensitive data
                              -> set percpu flush

The percpu flush request is only cleared on sysexit, so once it is set it's
uninteresting what other stuff runs before the next sysexit.

That covers everything from preemption to migration and whatever.
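
In code that's just (sketch; the variable name is made up, the set side
sits in switch_mm_irqs_off() and the unaudited interrupt paths):

	DEFINE_PER_CPU(bool, cpu_clear_pending);

	/* request side */
	this_cpu_write(cpu_clear_pending, true);

	/* sysexit side */
	if (this_cpu_read(cpu_clear_pending)) {
		this_cpu_write(cpu_clear_pending, false);
		clear_cpu();
	}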

The only thing which is not covered are functions in syscall context which
touch sensitive data which does not belong to the process.

Do they actually exist? If so, then and only then you need the full
start/stop annotation and transfer the state on migration to be completely
correct.

But OTOH if the function entry sets the percpu flush request, then it's
hard to construct a scenario where the task gets migrated while in the
function and no mm related flush is issued on the target CPU after it gets
scheduled in there. I'm sure you can construct such a scenario, but is it
actually reliably controllable in real world deployments?

Thanks,

	tglx


* [MODERATED] Re: [PATCH v5 12/27] MDSv5 6
  2019-01-22 14:01   ` Thomas Gleixner
  2019-01-22 15:42     ` Thomas Gleixner
@ 2019-01-22 18:01     ` Andi Kleen
  1 sibling, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-22 18:01 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 03:01:21PM +0100, speck for Thomas Gleixner wrote:
> > +		if (next->mm != prev->mm || prev->mm == NULL)
> > +			lazy_clear_cpu();
> 
> This sets the bit even when switching between two kernel threads. Makes a
> lot of sense ...

Note the "lazy"

It will only do something on the next user exit, and it would
be set anyways on the context switch from kernel into that user thread.
So trying to avoid it wouldn't change anything.

> > +		/*
> > +		 * Also transfer the clearcpu flag from the previous task.
> > +		 * Can be done non atomically because interrupts are off.
> > +		 */
> > +		task_thread_info(next)->status |=
> > +			task_thread_info(prev)->status & _TIF_CLEAR_CPU;
> > +		task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;
> 
> What has thread_info->status to do with this? TIF flags are in
> thread_info->flags. Brilliant stuff that.
> 
> I assume that Linus suggested the TIF flag to avoid yet another conditional
> in the syscall path, which makes sense, but the above does not make sense
> at all. That needs way more thought. 

It makes sense to me. Please explain the problem better. 

-Andi


* Re: [PATCH v5 22/27] MDSv5 24
  2019-01-22 17:56         ` [MODERATED] " Andi Kleen
@ 2019-01-22 18:56           ` Thomas Gleixner
  2019-01-23  1:39             ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-22 18:56 UTC (permalink / raw)
  To: speck

On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> On Tue, Jan 22, 2019 at 05:09:05PM +0100, speck for Thomas Gleixner wrote:
> At least in my limited testing the patch doesn't cause that
> actually, even though it may be counterintuitive.
> 
> See the numbers for Chrome for example in the last EBPF patch. That's
> a complex workload with many context switches, and it gets
> a clear roughly every third syscall

I'd rather see numbers with the switch_to hackery actually using
thread_info::flags.

> So this only leaves some latency sensitive workloads,
> which cannot mitigate interrupts, but the interrupt handlers for those can
> be fixed over time based on profiling results. Overall I suspect it
> will be only a small subset of the total number of drivers.

Ah, and who is going to do that analysis and who is going to fix that?

> Of course that really needs to be validated with more testing.
> 
> > The only sane way IMO is to have mds=cond just handle context switching and
> > then have proper
> 
> It's only sane if you have a good way to find and maintain all these places.
> 
> So far nobody has proposed a scalable way to do that. I personally
> don't know how to do it. 

If you look at the interrupts which actually are related to data which
might be sensible, then you'll notice that most of them do absolutely
nothing except scheduling softirqs, worker threads etc. The actual data
handling happens there. If it's a worker thread, then the context switch
covers it.

Softirqs are a different story, but most of that is probably networking and
perhaps tasklets.

Just for the numbers:

   hrtimer_init():    170 instances
   timer_setup():    1129 instances
   DEFINE_TIMER():     46 instances
   tasklet_init():    361 instances
   request.*irq():   1462 instances

Now just for request_irq. Removing all threaded interrupt requests and only
looking at drivers reduces the number to 1118. Just going through the
matching files and removing the obvious non x86 and non interesting drivers
which obviously do not touch sensitive data, thermal, watchdog and lots of
others, reduces the number to less than 500. Remove the 185 instances in
drivers/net/ethernet, which all merely do napi_schedule(), and it shrinks
very fast to something which can easily be looked at.
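
The typical pattern in those ethernet drivers is (hypothetical driver,
but representative):

	static irqreturn_t foo_intr(int irq, void *dev_id)
	{
		struct foo_priv *priv = dev_id;

		/* ack/mask in the device, no packet data touched here */
		foo_ack_irq(priv);
		/* the actual data handling happens in NAPI softirq */
		napi_schedule(&priv->napi);
		return IRQ_HANDLED;
	}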

Thanks,

	tglx


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-22 17:57     ` Thomas Gleixner
@ 2019-01-23  1:35       ` Andi Kleen
  2019-01-23  9:27         ` Thomas Gleixner
  2019-02-16  2:00       ` [MODERATED] " Andi Kleen
  1 sibling, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-23  1:35 UTC (permalink / raw)
  To: speck

> Let's look at the different scenarios:
> 
>   process A  ->  process A    Thread switch in same process,
>                               do nothing
> 
>   process A  ->  process B    Switches mm -> set percpu flush
> 
>   process A  ->  kernel       Do nothing (rely on lazy mm)
>   kernel     ->  process B    Switches mm -> set percpu flush
>   kernel     ->  kernel       do nothing
>   kernel     ->  process A    if lazy mm -> set percpu flush
> 
>   process A in syscall
>     interrupt                 handler touches sensitive data
>                               -> set percpu flush
> 
>     softinterrupt             handler touches sensitive data
>                               -> set percpu flush

Ok the drawback of the percpu flush is that the system exit
code needs to check a new cache line. But it's probably not that bad.

> 
> The percpu flush request is only cleared on sysexit, so once it is set it's
> uninteresting what other stuff runs before the next sysexit.
> 
> That covers everything from preemption to migration and whatever.
> 
> The only thing which is not covered are functions in syscall context which
> touch sensitive data which does not belong to the process.

It's also cryptographic keys.

> 
> Do they actually exist? If so, then and only then you need the full

There is plenty crypto code in process context at least.

> start/stop annotation and transfer the state on migration to be completely
> correct.

Migration is covered in my patchkit currently by setting the flag
unconditionally on any kernel preemption.  With that, just a per cpu flag
is good enough; I don't think full annotation is needed.
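
FWIW the crypto side is covered implicitly already: the patchkit hooks
the key zeroing helpers, roughly (simplified):

	void memzero_explicit(void *s, size_t count)
	{
		memset(s, 0, count);
		barrier_data(s);
		/* key material may linger in CPU buffers */
		lazy_clear_cpu();
	}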

-Andi


* [MODERATED] Re: [PATCH v5 22/27] MDSv5 24
  2019-01-22 18:56           ` Thomas Gleixner
@ 2019-01-23  1:39             ` Andi Kleen
  2019-01-23  6:39               ` Greg KH
  2019-01-24  9:55               ` Thomas Gleixner
  0 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-23  1:39 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 07:56:20PM +0100, speck for Thomas Gleixner wrote:
> On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> > On Tue, Jan 22, 2019 at 05:09:05PM +0100, speck for Thomas Gleixner wrote:
> > At least in my limited testing the patch doesn't cause that
> > actually, even though it may be counterintuitive.
> > 
> > See the numbers for Chrome for example in the last EBPF patch. That's
> > a complex workload with many context switches, and it gets
> > a clear roughly every third syscall
> 
> I'd rather see numbers with the switch_to hackery actually using
> thread_info::flags.

Not sure what you mean here? It's using thread_info flags.

> 
> > So this only leaves some latency sensitive workloads,
> > which cannot mitigate interrupts, but the interrupt handlers for those can
> > be fixed over time based on profiling results. Overall I suspect it
> > will be only a small subset of the total number of drivers.
> 
> Ah, and who is going to do that analysis and who is going to fix that?

People who care about performance will notice it. If they don't
it's probably not that important.


-Andi


* [MODERATED] Re: [PATCH v5 22/27] MDSv5 24
  2019-01-23  1:39             ` [MODERATED] " Andi Kleen
@ 2019-01-23  6:39               ` Greg KH
  2019-01-24  9:55               ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-01-23  6:39 UTC (permalink / raw)
  To: speck

On Tue, Jan 22, 2019 at 05:39:23PM -0800, speck for Andi Kleen wrote:
> On Tue, Jan 22, 2019 at 07:56:20PM +0100, speck for Thomas Gleixner wrote:
> > > So this only leaves some latency sensitive workloads,
> > > which cannot mitigate interrupts, but the interrupt handlers for those can
> > > be fixed over time based on profiling results. Overall I suspect it
> > > will be only a small subset of the total number of drivers.
> > 
> > Ah, and who is going to do that analysis and who is going to fix that?
> 
> People who care about performance will notice it. If they don't
> it's probably not that important.

So you are going to knowingly slow everything down and wait for people
to notice and then have them fix it up afterward?

Come on, that's not ok, you know better than that, that's being lazy and
forcing others to do your work for you.

greg k-h


* Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23  1:35       ` [MODERATED] " Andi Kleen
@ 2019-01-23  9:27         ` Thomas Gleixner
  2019-01-23 16:02           ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-23  9:27 UTC (permalink / raw)
  To: speck

On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> > The only thing which is not covered are functions in syscall context which
> > touch sensitive data which does not belong to the process.
> 
> It's also cryptographic keys.
>
> > Do they actually exist? If so, then and only then you need the full
> 
> There is plenty crypto code in process context at least.

Sure, but the question is whether these keys belong to the process or
not. If they do, then what's the leak?

You really want to provide a proper analysis of what can be leaked in which
context. So far this is all handwaving, might, could and because of that
you just sprinkle random mitigation calls around.

The point is that paranoid mitigation is simply 'always invoke VERW'. The
conditional modes and that's what we have done for the other
vulnerabilities as well are handling the most obvious issues and leave some
documented holes. Trying to catch everything in cond mode is just adding a
lot of pointless crap all over the code base and will still fail to plug
all holes unless you do a full audit of all kernel code.

It would be interesting to see the following test results:

  1) MDS=off

  2) MDS=always

  3) MDS=cond

     Cover the context switch based on the mm scheme I proposed and set the
     per cpu verw request on all hard and soft interrupts.

Thanks,

	tglx


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23  9:27         ` Thomas Gleixner
@ 2019-01-23 16:02           ` Andi Kleen
  2019-01-23 22:40             ` Josh Poimboeuf
  2019-01-24 12:04             ` Thomas Gleixner
  0 siblings, 2 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-23 16:02 UTC (permalink / raw)
  To: speck

On Wed, Jan 23, 2019 at 10:27:36AM +0100, speck for Thomas Gleixner wrote:
> On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> > > The only thing which is not covered are functions in syscall context which
> > > touch sensitive data which does not belong to the process.
> > 
> > It's also cryptographic keys.
> >
> > > Do they actually exist? If so, then and only then you need the full
> > 
> > There is plenty crypto code in process context at least.
> 
> Sure, but the question is whether these keys belong to the process or
> not. If they do, then what's the leak?

They often do not. A standard case is file system (network or disk) 
keys. If you leaked your file system keys to every application
which can access something then all file system permissions
can be violated.

> 
> You really want to provide a proper analysis of what can be leaked in which
> context.

?!? I provided a full document with my security model
as part of the patchkit.

If you think something is missing in my model please comment
on it directly instead of just giving vague statements.

If the consensus is that partial mitigation is sufficient I can 
look into it, but I would like stronger statements on this
from more reviewers (including distribution vendors).

For example on Spectre we had the problem that some distributions
didn't trust the upstream solutions enough and ended up developing
their own stronger mitigations. I would like to avoid this problem
here, and have a consensus default solution that works
for everyone.

Based on the performance data I've seen I don't see any reason to do 
anything less secure than my current patch kit.

> you just sprinkle random mitigation calls around.
> 
> The point is that paranoid mitigation is simply 'always invoke VERW'. The
> conditional modes and that's what we have done for the other
> vulnerabilities as well are handling the most obvious issues and leave some
> documented holes. Trying to catch everything in cond mode is just adding a
> lot of pointless crap all over the code base and will still fail to plug
> all holes unless you do a full audit of all kernel code.
> 
> It would be interesting to see the following test results:

You mean performance test results? That's in 0/0

The worst case slowdown we've seen with the lazy flushing
is ~2% (on a benchmark that pushes data over loopback
in a tight loop). Most workloads see no change because
the existing lazy scheme works fairly well.

With full flushing the worst case seen is 8%, but again
that was a workload that does syscalls in a tight loop,
so more an outlier.

-Andi


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23 16:02           ` [MODERATED] " Andi Kleen
@ 2019-01-23 22:40             ` Josh Poimboeuf
  2019-01-23 22:57               ` Josh Poimboeuf
  2019-01-24  2:26               ` Andi Kleen
  2019-01-24 12:04             ` Thomas Gleixner
  1 sibling, 2 replies; 105+ messages in thread
From: Josh Poimboeuf @ 2019-01-23 22:40 UTC (permalink / raw)
  To: speck

On Wed, Jan 23, 2019 at 08:02:06AM -0800, speck for Andi Kleen wrote:
> > you just sprinkle random mitigation calls around.
> > 
> > The point is that paranoid mitigation is simply 'always invoke VERW'. The
> > conditional modes and that's what we have done for the other
> > vulnerabilities as well are handling the most obvious issues and leave some
> > documented holes. Trying to catch everything in cond mode is just adding a
> > lot of pointless crap all over the code base and will still fail to plug
> > all holes unless you do a full audit of all kernel code.
> > 
> > It would be interesting to see the following test results:
> 
> You mean performance test results? That's in 0/0
> 
> > The worst case slowdown we've seen with the lazy flushing
> is ~2% (on a benchmark that pushes data over loopback
> in a tight loop). Most workloads see no change because
> the existing lazy scheme works fairly well.
> 
> With full flushing the worst case seen is 8%, but again
> that was a workload that does syscalls in a tight loop,
> so more an outlier.

We're actually seeing some very large slowdowns -- much worse than 8%.

The early results are showing more like 130% slowdown in several cases
for mds=full.

The mds=auto tests are in progress.

I don't have much more detail than that at the moment, but I can try to
provide more details about the benchmarks when I get them.

It would be interesting to see if others are seeing similar results.
This is on a 4.18-based RHEL kernel.

-- 
Josh


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23 22:40             ` Josh Poimboeuf
@ 2019-01-23 22:57               ` Josh Poimboeuf
  2019-01-24  0:25                 ` Josh Poimboeuf
  2019-01-24  2:26               ` Andi Kleen
  1 sibling, 1 reply; 105+ messages in thread
From: Josh Poimboeuf @ 2019-01-23 22:57 UTC (permalink / raw)
  To: speck

On Wed, Jan 23, 2019 at 04:40:04PM -0600, Josh Poimboeuf wrote:
> On Wed, Jan 23, 2019 at 08:02:06AM -0800, speck for Andi Kleen wrote:
> > > you just sprinkle random mitigation calls around.
> > > 
> > > The point is that paranoid mitigation is simply 'always invoke VERW'. The
> > > conditional modes and that's what we have done for the other
> > > vulnerabilities as well are handling the most obvious issues and leave some
> > > documented holes. Trying to catch everything in cond mode is just adding a
> > > lot of pointless crap all over the code base and will still fail to plug
> > > all holes unless you do a full audit of all kernel code.
> > > 
> > > It would be interesting to see the following test results:
> > 
> > You mean performance test results? That's in 0/0
> > 
> > > The worst case slowdown we've seen with the lazy flushing
> > is ~2% (on a benchmark that pushes data over loopback
> > in a tight loop). Most workloads see no change because
> > the existing lazy scheme works fairly well.
> > 
> > With full flushing the worst case seen is 8%, but again
> > that was a workload that does syscalls in a tight loop,
> > so more an outlier.
> 
> We're actually seeing some very large slowdowns -- much worse than 8%.
> 
> The early results are showing more like 130% slowdown in several cases
> for mds=full.
> 
> The mds=auto tests are in progress.
> 
> I don't have much more detail than that at the moment, but I can try to
> provide more details about the benchmarks when I get them.
> 
> It would be interesting to see if others are seeing similar results.
> This is on a 4.18-based RHEL kernel.

So these 130+% slowdowns were microbenchmarks which call a syscall in a
tight loop.  I can provide a reproducer shortly.

The default (mds=auto?) results will be available soon; apparently they
are looking much better.

-- 
Josh


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23 22:57               ` Josh Poimboeuf
@ 2019-01-24  0:25                 ` Josh Poimboeuf
  0 siblings, 0 replies; 105+ messages in thread
From: Josh Poimboeuf @ 2019-01-24  0:25 UTC (permalink / raw)
  To: speck

On Wed, Jan 23, 2019 at 04:57:14PM -0600, Josh Poimboeuf wrote:
> On Wed, Jan 23, 2019 at 04:40:04PM -0600, Josh Poimboeuf wrote:
> > On Wed, Jan 23, 2019 at 08:02:06AM -0800, speck for Andi Kleen wrote:
> > > > you just sprinkle random mitigation calls around.
> > > > 
> > > > The point is that paranoid mitigation is simply 'always invoke VERW'. The
> > > > conditional modes and that's what we have done for the other
> > > > vulnerabilities as well are handling the most obvious issues and leave some
> > > > documented holes. Trying to catch everything in cond mode is just adding a
> > > > lot of pointless crap all over the code base and will still fail to plug
> > > > all holes unless you do a full audit of all kernel code.
> > > > 
> > > > It would be interesting to see the following test results:
> > > 
> > > You mean performance test results? That's in 0/0
> > > 
> > > > The worst case slowdown we've seen with the lazy flushing
> > > is ~2% (on a benchmark that pushes data over loopback
> > > in a tight loop). Most workloads see no change because
> > > the existing lazy scheme works fairly well.
> > > 
> > > With full flushing the worst case seen is 8%, but again
> > > that was a workload that does syscalls in a tight loop,
> > > so more an outlier.
> > 
> > We're actually seeing some very large slowdowns -- much worse than 8%.
> > 
> > The early results are showing more like 130% slowdown in several cases
> > for mds=full.
> > 
> > The mds=auto tests are in progress.
> > 
> > I don't have much more detail than that at the moment, but I can try to
> > provide more details about the benchmarks when I get them.
> > 
> > It would be interesting to see if others are seeing similar results.
> > This is on a 4.18-based RHEL kernel.
> 
> So these 130+% slowdowns were microbenchmarks which call a syscall in a
> tight loop.  I can provide a reproducer shortly.

Here is one of the reproducers which saw a 130% slowdown with mds=full.
It's just a basic write() microbenchmark.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>


/* gcc write.c -O */

static __inline__ u_int64_t start_clock(void);
static __inline__ u_int64_t stop_clock(void);

/*
 * Measure the time (machine cycles) it takes to write 1K buffers to /dev/null
 */
int main(int argc, char **argv)
{
    if (argc < 2) {
      printf("USAGE: %s K loop-iterations\n", argv[0]);
      return 1;
    }

    int i, iterations = 1000 * atoi(argv[1]);
    char buf[1024];

    int fd = open("/dev/null", O_WRONLY | O_APPEND);
    if (fd < 0)
        return 1;

    u_int64_t start_rdtsc = start_clock();

    for (i = 0; i < iterations; i++) {
       if (write(fd, buf, 1000) != 1000) {
          perror("write error");
          exit(1);
       }
    }

    u_int64_t stop_rdtsc = stop_clock();
    u_int64_t diff = stop_rdtsc - start_rdtsc;

    printf("TSC for %d write calls: %luK cycles.  Avg cycles per call: %lu\n",
            iterations,
            diff / 1000,
            diff / iterations);
    return 0;
}

static __inline__ u_int64_t start_clock(void) {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    // CPUID serializes, then RDTSC reads the time stamp counter.
    u_int32_t hi, lo;
    __asm__ __volatile__ (
        "CPUID\n\t"
        "RDTSC\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ((u_int64_t)lo) | (((u_int64_t)hi) << 32);
}

static __inline__ u_int64_t stop_clock(void) {
    // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
    // RDTSCP waits for prior instructions to retire, CPUID fences the read.
    u_int32_t hi, lo;
    __asm__ __volatile__(
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "CPUID\n\t": "=r" (hi), "=r" (lo)::
        "%rax", "%rbx", "%rcx", "%rdx");
    return ((u_int64_t)lo) | (((u_int64_t)hi) << 32);
}
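
For stable numbers it helps to pin this to one CPU, e.g.
"taskset -c 0 ./a.out 1000", and to compare the per-call cycle count
between mds=off and mds=full.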


* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23 22:40             ` Josh Poimboeuf
  2019-01-23 22:57               ` Josh Poimboeuf
@ 2019-01-24  2:26               ` Andi Kleen
  1 sibling, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-24  2:26 UTC (permalink / raw)
  To: speck

On Wed, Jan 23, 2019 at 04:40:04PM -0600, speck for Josh Poimboeuf wrote:
> On Wed, Jan 23, 2019 at 08:02:06AM -0800, speck for Andi Kleen wrote:
> > > you just sprinkle random mitigation calls around.
> > > 
> > > The point is that paranoid mitigation is simply 'always invoke VERW'. The
> > > conditional modes and that's what we have done for the other
> > > vulnerabilities as well are handling the most obvious issues and leave some
> > > documented holes. Trying to catch everything in cond mode is just adding a
> > > lot of pointless crap all over the code base and will still fail to plug
> > > all holes unless you do a full audit of all kernel code.
> > > 
> > > It would be interesting to see the following test results:
> > 
> > You mean performance test results? That's in 0/0
> > 
> > The worst case slowdown we've seen with the lazy flushing
> > is ~2% (on a benchmark that pushes data over loopback
> > in a tight loop). Most workloads see no change because
> > the existing lazy scheme works fairly well.
> > 
> > With full flushing the worst case seen is 8%, but again
> > that was a workload that does syscalls in a tight loop,
> > so more an outlier.
> 
> We're actually seeing some very large slowdowns -- much worse than 8%.
> 
> The early results are showing more like 130% slowdown in several cases
> for mds=full.

I'm talking about at least somewhat realistic workloads,
not "empty system call in a tight loop" type micro benchmark
scenarios.

I don't think it's very interesting to discuss those cases.

In our case the 8% was a loopback test with apache,
which is likely already somewhat unrealistic.


-Andi


* Re: [PATCH v5 22/27] MDSv5 24
  2019-01-23  1:39             ` [MODERATED] " Andi Kleen
  2019-01-23  6:39               ` Greg KH
@ 2019-01-24  9:55               ` Thomas Gleixner
  1 sibling, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-24  9:55 UTC (permalink / raw)
  To: speck

On Tue, 22 Jan 2019, speck for Andi Kleen wrote:

> On Tue, Jan 22, 2019 at 07:56:20PM +0100, speck for Thomas Gleixner wrote:
> > On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> > > On Tue, Jan 22, 2019 at 05:09:05PM +0100, speck for Thomas Gleixner wrote:
> > > At least in my limited testing the patch doesn't cause that
> > > actually, even though it may be counterintuitive.
> > > 
> > > See the numbers for Chrome for example in the last EBPF patch. That's
> > > a complex workload with many context switches, and it gets
> > > a clear roughly every third syscall
> > 
> > I'd rather see numbers with the switch_to hackery actually using
> > thread_info::flags.
> 
> Not sure what you mean here? It's using thread_info flags.

I told you before, but I'm happy to tell you once again:

> +             task_thread_info(next)->status |=
> +                     task_thread_info(prev)->status & _TIF_CLEAR_CPU;
> +             task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;

status != flags. IOW, the propagation logic is broken.

Thanks,

	tglx


 


* Re: [PATCH v5 14/27] MDSv5 3
  2019-01-23 16:02           ` [MODERATED] " Andi Kleen
  2019-01-23 22:40             ` Josh Poimboeuf
@ 2019-01-24 12:04             ` Thomas Gleixner
  2019-01-28  3:42               ` [MODERATED] " Andi Kleen
  1 sibling, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-24 12:04 UTC (permalink / raw)
  To: speck


On Wed, 23 Jan 2019, speck for Andi Kleen wrote:

> On Wed, Jan 23, 2019 at 10:27:36AM +0100, speck for Thomas Gleixner wrote:
> > On Tue, 22 Jan 2019, speck for Andi Kleen wrote:
> > > > The only thing which is not covered are functions in syscall context which
> > > > touch sensitive data which does not belong to the process.
> > > 
> > > It's also cryptographic keys.
> > >
> > > > Do they actually exist? If so, then and only then you need the full
> > > 
> > > There is plenty crypto code in process context at least.
> > 
> > Sure, but the question is whether these keys belong to the process or
> > not. If they do, then what's the leak?
> 
> They often do not. A standard case is file system (network or disk) 
> keys. If you leaked your file system keys to every application
> which can access something then all file system permissions
> can be violated.

This is unparseable.

> > You really want to provide a proper analysis of what can be leaked in which
> > context.
> 
> ?!? I provided a full document with my security model
> as part of the patchkit.

Lets look at that "model" then.

> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127

I assume this is meant to be the problem description. If so, then you could
have spared the first sentence and just pointed to the completely useless
CVE numbers.

This needs to have a concise and understandable description of the
problem. How does the data end up in the buffer in the first place? Random
code patterns or specific classes of opcodes which can be mapped to kernel
functionality?

You're neither explaining how this can be exploited nor in which context
the exploit happens. This is important because it defines the boundaries.

> +This can be avoided by explicitly clearing the CPU state.
> +
> +We attempt to avoid leaking data between different processes,
> +and also some sensitive data, like cryptographic data, to
> +user space.

Again, because there is no explanation of how data is leaked, this information
is pretty useless.

> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.
> +
> +User data is sensitive, but only for other processes.

Define 'User data'

> +Kernel data is sensitive when it involves cryptographic keys.

There is surely more than cryptographic keys. The kernel carries a lot of
clear text information which should not be accessible.

> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().

And how is anyone supposed to determine that? And what is 'user supplied
data of other processes'? Provide a clear description, preferably with an
example.
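
Something in this direction (the helper is made up, just to show what
such an example could look like):

	/* syscall path copying data owned by another task */
	ret = access_other_task_data(tsk, buf, len);
	/* CPU buffers may now hold another process' data */
	lazy_clear_cpu();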

> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
> +
> +Touching only pointers to user data is always allowed.
> +
> +When your interrupt does not touch user data directly, consider marking
> +it with IRQF_NO_USER.
> +
> +When your tasklet does not touch user data directly, consider marking
> +it with TASKLET_NO_USER using tasklet_init_flags/or
> +DECLARE_TASKLET*_NOUSER.
> +
> +When your timer does not touch user data mark it with TIMER_NO_USER.
> +If it is a hrtimer, mark it with HRTIMER_MODE_NO_USER.
> +
> +When your irq poll handler does not touch user data, mark it
> +with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
> +
> +For networking code, make sure to only touch user data through
> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or
> +lazy_clear_cpu_interrupt. When the non skb data access is only in a
> +hardware interrupt controlled by the driver, it can rely on not
> +setting IRQF_NO_USER for that interrupt.
> +
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree.
> +
> +If your RCU callback touches user data add lazy_clear_cpu().

So we have opt-in and opt-out. You nowhere explain WHY you think that RCU
and networking are generally safe and need individual annotations and
timers, tasklets, interrupts are considered unsafe by default. This is not
a model, this is just a collection of random implementation details without
any justification whatsoever.

> +Implementation details/assumptions
> +----------------------------------
> +
> +If a system call touches data of its own process, CPU state does not
> +need to be cleared, because it has already access to it.

Redundant information

> +
> +On context switching we clear data, unless the context switch is
> +inside a process. We also clear after any context switches from kernel
> +threads.

Why?

> +Cryptographic keys inside the kernel should be protected.
> +We assume they use kzfree() or memzero_explicit() to clear
> +state, so these functions trigger a cpu clear.

Assume?
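
(For reference, the pattern being assumed is presumably the usual crypto
cleanup idiom; a minimal sketch, where struct my_cipher_ctx is made up,
crypto_tfm_ctx() and memzero_explicit() are existing kernel helpers, and
the cpu-clear side effect is what this patch set bolts on:

	static void my_cipher_exit(struct crypto_tfm *tfm)
	{
		struct my_cipher_ctx *ctx = crypto_tfm_ctx(tfm);

		/* Zeroes the key; with this patch set it additionally
		   schedules a CPU buffer clear on the next kernel exit */
		memzero_explicit(ctx->key, sizeof(ctx->key));
	}

Nothing enforces that all key-handling code actually uses these helpers.)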

> +Hard interrupts, tasklets, timers which can run asynchronously are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.

And who is going to audit any of that? No, clearly not those who trip over
performance issues because there is no proper way to debug that in the
first place.

> +Most interrupt handlers for modern devices should not touch
> +user data, because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.

Oh well. It's not really hard to write up analysis scripts to filter out
which interrupt handlers are merely doing trivial stuff like
napi_schedule(), schedule_work() etc.

> +For softirqs we assume that if they touch user data they use
> +lazy_clear_cpu()/lazy_clear_interrupt() as needed.

Again, there is no justification WHY softirqs are assumed to be safe in
general.

> +Networking is handled through skb_* below.
> +Timer and Tasklets and IRQ poll are handled through opt-in.

opt-in of what? Opt-in to mitigation?

> +Scheduler softirq is assumed to not touch user data.

Assumed?

> +Block softirq done callbacks are assumed to not touch user data.

Ditto

> +For networking code, any skb functions that are likely
> +touching non header packet data schedule a clear cpu at next
> +kernel exit. This includes skb_copy and related, skb_put/push,
> +checksum functions.  We assume that any networking code touching
> +packet data uses these functions.

So who is going to audit all the networking code to figure out which code
paths end up touching packet data?

> +[In principle packet data should be encrypted anyways for the wire,
> +but we try to avoid leaking it anyways]

This is completely useless information and has absolutely no value for a
document which claims to be a security model.

> +Some IO related functions like string PIO and memcpy_from/to_io, or
> +the software pci dma bounce function, which touch data, schedule a
> +buffer clear.

Some functions? Again, which functions exactly and how should a developer
know which are safe to use?

> +We assume NMI/machine check code does not touch other
> +processes' data.

Yet another unexplained assumption.

> +Virtualization
> +--------------
> +
> +When entering a guest in KVM we clear to avoid any leakage to a guest.
> +Normally this is done implicitly as part of the L1TF mitigation.
> +It relies on this being enabled. It also uses the "fast exit"

It relies on this. !?! If that means that MDS depends on the L1TF mitigation
being enabled, then I have to ask whether all MDS affected CPUs are
affected by L1TF as well. But whatever that sentence means, it wants a
proper explanation.

So this 'security model' is a random unstructured collection of
information, which is based on unjustified assumptions and gives people
who would need to consult it when auditing code a clear recipe for
disaster.

A proper security model contains:

 1) A concise and understandable problem description.

 2) A precise definition of data which needs to be protected.

 3) A concise analysis of boundaries, e.g. execution contexts, and the
    effects of possible transitions between them.

 4) Per context analysis and conclusion with a proper justification for the
    decision.

 5) Per context guidance for developers, reviewers.

So let's just get back to the start of this mail:

> > > There is plenty of crypto code in process context at least.
> > 
> > Sure, but the question is whether these keys belong to the process or
> > not. If they do, then what's the leak?
> 
> They often do not. A standard case is file system (network or disk) 
> keys. If you leaked your file system keys to every application
> which can access something then all file system permissions
> can be violated.

The question was whether a pure

    userspace -> syscall -> dosomething() -> sysexit -> userspace

transition chain can expose data which needs to be protected.

So your answer is:

 > There is plenty of crypto code in process context at least

but of course w/o any proof that this is true and w/o any hint which
syscalls might be involved.

So you follow up on that with:

> ..... A standard case is file system (network or disk) keys.

As you are making these claims, you surely can tell which callchain (high
level view) can end up touching this without leaving the user thread
context.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 03/27] MDSv5 16
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
  2019-01-22  4:23   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 12:55   ` Thomas Gleixner
@ 2019-01-27 21:58   ` Thomas Gleixner
  2019-01-28  3:30     ` [MODERATED] " Andi Kleen
  2 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-27 21:58 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> +/*
> + * Clear CPU buffers to avoid side channels.
> + * We use microcode as a side effect of the obsolete VERW instruction
> + */
> +
> +static inline void clear_cpu(void)
> +{
> +	unsigned kernel_ds = __KERNEL_DS;
> +	/* Has to be memory form, don't modify to use a register */
> +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> +		[kernelds] "m" (kernel_ds));

This looks backwards. Why is this FEATURE_NO_VERW instead of
FEATURE_MDS_VERW (or such) ?

The point is that we generally enable functionality with features and not
the other way round.

There are 3 reasons why this feature is disabled:

  1) CPU is not affected

  2) It's disabled on the kernel command line

  3) VERW instruction is not providing mitigation


  #1 Is covered by whitelists and the NO_MDS bit

  #2 Is obvious

  #3 According to the patch set, there is a feature bit
     X86_FEATURE_MD_CLEAR. It's solely used in the sysfs reporting. Why is
     it not used in the mitigation selection?

     All other mitigation selector functions check whether a mitigation is
     available or not. So please make this consistent.
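
For comparison, the positive-logic variant would presumably look like this
(a sketch assuming a synthetic X86_FEATURE_MDS_VERW bit that the mitigation
selection sets; it just inverts the quoted code):

	static inline void clear_cpu(void)
	{
		unsigned kernel_ds = __KERNEL_DS;
		/* X86_FEATURE_MDS_VERW is hypothetical: this stays a NOP
		   unless the mitigation is enabled */
		alternative_input("", "verw %[kernelds]", X86_FEATURE_MDS_VERW,
			[kernelds] "m" (kernel_ds));
	}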

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 07/27] MDSv5 0
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
  2019-01-22  4:39   ` [MODERATED] " Konrad Rzeszutek Wilk
@ 2019-01-27 22:09   ` Thomas Gleixner
  2019-01-28  3:33     ` [MODERATED] " Andi Kleen
  2019-02-13 22:26   ` [MODERATED] " Tyler Hicks
  2 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-27 22:09 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> +
> +	case X86_BUG_MDS:
> +		/* Assumes Hypervisor exposed HT state to us if in guest */

That comment is relevant in which way? This is true for any feature bit
which is checked in a guest. FEATURE_MD_CLEAR is not special in any way.

> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +			if (cpu_smt_control != CPU_SMT_ENABLED)
> +				return sprintf(buf, "Mitigation: microcode\n");
> +			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
> +		}
> +		return sprintf(buf, "Vulnerable\n");

This is just wrong. If mds=off is given on the command line and the machine
has updated microcode then this still claims that the microcode provides
mitigation.

For heaven's sake, why can't you just follow the existing logic and stay
consistent with it?
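
Following the existing logic would presumably look like this (a sketch;
mds_mitigation and MDS_MITIGATION_OFF are assumed here to mirror the state
enums the other mitigations use, they are not in this patch set):

	case X86_BUG_MDS:
		/* mds_mitigation is a hypothetical mitigation state */
		if (mds_mitigation == MDS_MITIGATION_OFF ||
		    !boot_cpu_has(X86_FEATURE_MD_CLEAR))
			return sprintf(buf, "Vulnerable\n");
		if (cpu_smt_control == CPU_SMT_ENABLED)
			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
		return sprintf(buf, "Mitigation: microcode\n");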

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 03/27] MDSv5 16
  2019-01-27 21:58   ` Thomas Gleixner
@ 2019-01-28  3:30     ` Andi Kleen
  0 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-01-28  3:30 UTC (permalink / raw)
  To: speck

On Sun, Jan 27, 2019 at 10:58:38PM +0100, speck for Thomas Gleixner wrote:
> On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> > +/*
> > + * Clear CPU buffers to avoid side channels.
> > + * We use microcode as a side effect of the obsolete VERW instruction
> > + */
> > +
> > +static inline void clear_cpu(void)
> > +{
> > +	unsigned kernel_ds = __KERNEL_DS;
> > +	/* Has to be memory form, don't modify to use a register */
> > +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> > +		[kernelds] "m" (kernel_ds));
> 
> This looks backwards. Why is this FEATURE_NO_VERW instead of
> FEATURE_MDS_VERW (or such) ?

I had it that way originally in the earlier versions, but Linus proposed
to do it this way around for the unconditional VERW (see emails some
time back).

However I'm also thinking of changing it back to optimize
some of the other checks to avoid needing to check multiple
flags.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 07/27] MDSv5 0
  2019-01-27 22:09   ` Thomas Gleixner
@ 2019-01-28  3:33     ` Andi Kleen
  2019-01-28  8:29       ` Thomas Gleixner
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-28  3:33 UTC (permalink / raw)
  To: speck

On Sun, Jan 27, 2019 at 11:09:31PM +0100, speck for Thomas Gleixner wrote:
> On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> > +
> > +	case X86_BUG_MDS:
> > +		/* Assumes Hypervisor exposed HT state to us if in guest */
> 
> That comment is relevant in which way? This is true for any feature bit
> which is checked in a guest. FEATURE_MD_CLEAR is not special in any way.

It's different in that it refers to an underlying bug. Normally,
if something is not exposed it is just not used and doesn't matter.
But in this case that's not true: if HT is present but not exposed,
the code will report the wrong message.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-24 12:04             ` Thomas Gleixner
@ 2019-01-28  3:42               ` Andi Kleen
  2019-01-28  8:33                 ` Thomas Gleixner
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-01-28  3:42 UTC (permalink / raw)
  To: speck

> > +Most interrupt handlers for modern devices should not touch
> > +user data, because they rely on DMA and only manipulate
> > +pointers. This needs auditing to confirm though.
> 
> Oh well. It's not really hard to write up analysis scripts to filter out
> which interrupt handlers are merely doing trivial stuff like
> napi_schedule(), schedule_work() etc.

You keep saying that, but it's not what I see.

I've been going through interrupt handlers for the last few days,
and I haven't finished most of them yet. I find it certainly
not easy, because most of them are doing non-trivial stuff,
and in many cases even use significant indirection, so you have to go
through many files to find all the code and double check it.

But if your magic scripts can all do it then please just
send a list. This will save a lot of work for me.

Also please do it for tasklets and the other cases.

If you can't supply that list please stop spewing such nonsense.

I found some incredibly bad code so far, including a recursive
interrupt handler. At least for the ones I looked at, the
majority seem to be whitelistable. But it will take more
time to confirm them all; I haven't even started on tasklets yet.

Thanks,

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 07/27] MDSv5 0
  2019-01-28  3:33     ` [MODERATED] " Andi Kleen
@ 2019-01-28  8:29       ` Thomas Gleixner
  0 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-28  8:29 UTC (permalink / raw)
  To: speck

On Sun, 27 Jan 2019, speck for Andi Kleen wrote:

> On Sun, Jan 27, 2019 at 11:09:31PM +0100, speck for Thomas Gleixner wrote:
> > On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> > > +
> > > +	case X86_BUG_MDS:
> > > +		/* Assumes Hypervisor exposed HT state to us if in guest */
> > 
> > That comment is relevant in which way? This is true for any feature bit
> > which is checked in a guest. FEATURE_MD_CLEAR is not special in any way.
> 
> It's different in that it refers to an underlying bug. Normally,
> if something is not exposed it is just not used and doesn't matter.
> But in this case that's not true: if HT is present but not exposed,
> the code will report the wrong message.

The rationale makes sense, but that doesn't make the comment any
better.

	     Guest SMT=off	    Guest SMT=on

Host SMT=off    not vuln	    not vuln

Host SMT=on	vuln		    vuln

The guest topology configuration can be anything and has nothing to do with
the host state at all. Therefore you cannot make any assumptions unless the
hypervisor exposes the host state to the guest independently of the guest
configuration.

So the only sane information you can expose in the guest kernel is:

       Host SMT state unknown, potentially vulnerable
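
In code, that would presumably amount to something like (a sketch;
X86_FEATURE_HYPERVISOR is the existing "running as a guest" bit):

	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return sprintf(buf,
			       "Mitigation: microcode; host SMT state unknown\n");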

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 14/27] MDSv5 3
  2019-01-28  3:42               ` [MODERATED] " Andi Kleen
@ 2019-01-28  8:33                 ` Thomas Gleixner
  0 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-28  8:33 UTC (permalink / raw)
  To: speck


On Sun, 27 Jan 2019, speck for Andi Kleen wrote:

> > > +Most interrupt handlers for modern devices should not touch
> > > +user data, because they rely on DMA and only manipulate
> > > +pointers. This needs auditing to confirm though.
> > 
> > Oh well. It's not really hard to write up analysis scripts to filter out
> > which interrupt handlers are merely doing trivial stuff like
> > napi_schedule(), schedule_work() etc.
> 
> You keep saying that, but it's not what I see.

I did not say that scripting can do everything, but it can exclude a fair
portion and you can exclude a large portion of handlers simply because they
are never ever used on x86. The rest is of course manual inspection.

> I've been going through interrupt handlers for the last few days,
> and I haven't finished most of them yet. I find it certainly
> not easy, because most of them are doing non-trivial stuff,
> and in many cases even use significant indirection, so you have to go
> through many files to find all the code and double check it.
> 
> But if your magic scripts can all do it then please just
> send a list. This will save a lot of work for me.

I've done similar analysis before and no, I'm not writing the scripts this
time. I wasted enough time mopping up Intel-induced mess in the past 2
years.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 00/27] MDSv5 19
  2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
                   ` (27 preceding siblings ...)
  2019-01-21 21:18 ` [MODERATED] Re: [PATCH v5 00/27] MDSv5 19 Linus Torvalds
@ 2019-01-28 11:34 ` Thomas Gleixner
  2019-02-13 22:33   ` [MODERATED] " Tyler Hicks
  28 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-01-28 11:34 UTC (permalink / raw)
  To: speck

Andi,

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:

can you please split this into two patch series:

 1) The initial mitigation

 2) The conditional mode

#1 should come with the following patch split:

  1/n:
       - Add X86_BUG_MDS
       - Add X86_FEATURE_NO_MDS
       - The cpu_set_bug_bits() logic

  2/n:
       - Add sysfs reporting

  3/n:
       - Add X86_FEATURE_MD_CLEAR
       - Add the invocations in the exit paths

  4/n:
       - Add command line param (mds=on/auto/off)
       - Add mds_select_mitigation() where auto defaults to on
       - Update sysfs reporting

  5/n:
       - Add admin documentation similar to l1tf.rst

Please keep all that stuff consistent with the existing mitigations.

This should go first and be ready and backported ASAP so we're prepared for
any event. The cond mode goes on top of this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 07/27] MDSv5 0
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
  2019-01-22  4:39   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-27 22:09   ` Thomas Gleixner
@ 2019-02-13 22:26   ` Tyler Hicks
  2 siblings, 0 replies; 105+ messages in thread
From: Tyler Hicks @ 2019-02-13 22:26 UTC (permalink / raw)
  To: speck

On 2019-01-18 16:50:22, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Add sysfs reporting
> 
> Report mds mitigation state in sysfs vulnerabilities.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  .../ABI/testing/sysfs-devices-system-cpu         |  1 +
>  arch/x86/kernel/cpu/bugs.c                       | 16 ++++++++++++++++
>  drivers/base/cpu.c                               |  8 ++++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 9605dbd4b5b5..2db5c3407fd6 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -484,6 +484,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
>  		/sys/devices/system/cpu/vulnerabilities/spectre_v2
>  		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
>  		/sys/devices/system/cpu/vulnerabilities/l1tf
> +		/sys/devices/system/cpu/vulnerabilities/mds
>  Date:		January 2018
>  Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
>  Description:	Information about CPU vulnerabilities
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index ce0e367753ff..715ab147f3e6 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -1176,6 +1176,16 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
>  		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
>  			return l1tf_show_state(buf);
>  		break;
> +
> +	case X86_BUG_MDS:
> +		/* Assumes Hypervisor exposed HT state to us if in guest */
> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +			if (cpu_smt_control != CPU_SMT_ENABLED)
> +				return sprintf(buf, "Mitigation: microcode\n");
> +			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");

Existing user-facing messaging for the status of CPU vulnerability
mitigations use "SMT" rather than "HT". For example:

 $ cat /sys/devices/system/cpu/vulnerabilities/l1tf
 Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable

Also, this STIBP log message from update_stibp_strict():

	pr_info("Update user space SMT mitigation: STIBP %s\n",
		mask & SPEC_CTRL_STIBP ? "always-on" : "off");

I think it would be best to be consistent and use "SMT" in this patch
series, too.

Tyler

> +		}
> +		return sprintf(buf, "Vulnerable\n");
> +
>  	default:
>  		break;
>  	}
> @@ -1207,4 +1217,10 @@ ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *b
>  {
>  	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
>  }
> +
> +ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
> +}
> +
>  #endif
> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> index eb9443d5bae1..2fd6ca1021c2 100644
> --- a/drivers/base/cpu.c
> +++ b/drivers/base/cpu.c
> @@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
>  	return sprintf(buf, "Not affected\n");
>  }
>  
> +ssize_t __weak cpu_show_mds(struct device *dev,
> +			    struct device_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "Not affected\n");
> +}
> +
>  static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
>  static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
>  static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
>  static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
>  static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
> +static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
>  
>  static struct attribute *cpu_root_vulnerabilities_attrs[] = {
>  	&dev_attr_meltdown.attr,
> @@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
>  	&dev_attr_spectre_v2.attr,
>  	&dev_attr_spec_store_bypass.attr,
>  	&dev_attr_l1tf.attr,
> +	&dev_attr_mds.attr,
>  	NULL
>  };
>  
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-01-28 11:34 ` Thomas Gleixner
@ 2019-02-13 22:33   ` Tyler Hicks
  2019-02-14 13:09     ` Jiri Kosina
  0 siblings, 1 reply; 105+ messages in thread
From: Tyler Hicks @ 2019-02-13 22:33 UTC (permalink / raw)
  To: speck

On 2019-01-28 12:34:38, speck for Thomas Gleixner wrote:
> Andi,
> 
> On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> 
> can you please split this into two patch series:
> 
>  1) The initial mitigation
> 
>  2) The conditional mode

Andi, I know you've been busy with the PERF changes but I'm curious if
you plan to split up this series, as requested by Thomas? Intel is
starting to ask software vendors when they might have beta builds of the
mitigations available and my answer is going to depend on whether or not
this split is going to be available soon.

Additionally, I can't seem to find the mbox of MDSv5. If you already sent
it, could you point out the message ID?

Thanks!

Tyler

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-13 22:33   ` [MODERATED] " Tyler Hicks
@ 2019-02-14 13:09     ` Jiri Kosina
  2019-02-14 13:51       ` Greg KH
  2019-02-14 16:53       ` Andi Kleen
  0 siblings, 2 replies; 105+ messages in thread
From: Jiri Kosina @ 2019-02-14 13:09 UTC (permalink / raw)
  To: speck

On Wed, 13 Feb 2019, speck for Tyler Hicks wrote:

> Andi, I know you've been busy with the PERF changes but I'm curious if 
> you plan to split up this series, as requested by Thomas? Intel is 
> starting to ask software vendors when they might have beta builds of the 
> mitigations available and my answer is going to depend on whether or not 
> this split is going to be available soon.

"Me too" from SUSE. 

We're being pushed by Intel to test and validate ports to our
kernel branches, but we don't want to invest backporting effort into a
patchset that is not final.

Andi, could you please share your further plans, so that we could schedule 
accordingly?

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 13:09     ` Jiri Kosina
@ 2019-02-14 13:51       ` Greg KH
  2019-02-14 16:53       ` Andi Kleen
  1 sibling, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-02-14 13:51 UTC (permalink / raw)
  To: speck

On Thu, Feb 14, 2019 at 02:09:41PM +0100, speck for Jiri Kosina wrote:
> On Wed, 13 Feb 2019, speck for Tyler Hicks wrote:
> 
> > Andi, I know you've been busy with the PERF changes but I'm curious if 
> > you plan to split up this series, as requested by Thomas? Intel is 
> > starting to ask software vendors when they might have beta builds of the 
> > mitigations available and my answer is going to depend on whether or not 
> > this split is going to be available soon.
> 
> "Me too" from SUSE. 
> 
> We're being pushed by Intel to do testing and validating of ports to our 
> kernel branches, but we don't want to invest backporting effort into a 
> patchset that is not final.

I'm telling people that keep asking me about "backports to the stable
versions" that they should not do anything until we have a solid
upstream-mergeable version first.

This pressure from Intel has got to stop; it's wasting people's time.

greg k-h

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 13:09     ` Jiri Kosina
  2019-02-14 13:51       ` Greg KH
@ 2019-02-14 16:53       ` Andi Kleen
  2019-02-14 18:00         ` Greg KH
  1 sibling, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-14 16:53 UTC (permalink / raw)
  To: speck

> Andi, could you please share your further plans, so that we could schedule 
> accordingly?

The review feedback requiring full audits caused a lot of extra
work and turned the single-thread MDS patches from low-risk, small-scope
to high-risk, gigantic-scope.

I haven't seen any comments from anyone pushing back on this, so I assume
everyone understands these implications.

If you just want a short-term solution now, you can base it on the
original full option, which hasn't really changed in any significant way
since the original version, and is simple and straightforward.

We finished the audits on interrupt handlers now, and also on timers
and tasklets. There's still some additional work on other softirq handlers.

I did some work on the other feedback and some gleixnification, but that's
still a work in progress. Currently I expect a repost some time next week.

It will likely be in pieces, with the bulk of the driver changes
separated out.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 16:53       ` Andi Kleen
@ 2019-02-14 18:00         ` Greg KH
  2019-02-14 18:05           ` Andrew Cooper
  2019-02-14 18:33           ` Andi Kleen
  0 siblings, 2 replies; 105+ messages in thread
From: Greg KH @ 2019-02-14 18:00 UTC (permalink / raw)
  To: speck

On Thu, Feb 14, 2019 at 08:53:55AM -0800, speck for Andi Kleen wrote:
> > Andi, could you please share your further plans, so that we could schedule 
> > accordingly?
> 
> The review feedback requiring full audits caused a lot of extra
> work and turned the single-thread MDS patches from low-risk, small-scope
> to high-risk, gigantic-scope.

Wait, what?

I recall asking some questions about your patches, those questions
being ignored, and then Thomas asking much the same type of questions,
in much more detail, and that too being ignored.

> I haven't seen any comments from anyone pushing back on this, so I assume
> everyone understands these implications.

No, we don't (or at least I do not) understand any of the implications
here because the questions I asked were never answered.  So far you have
just "told us what you are going to do" and then never even backed that
up with reasons for why you are doing that.

> If you just want a short-term solution now, you can base it on the
> original full option, which hasn't really changed in any significant way
> since the original version, and is simple and straightforward.

It was soundly rejected by all of us because you never answered the
questions we asked about that "solution".  Why would anyone want to take
that?

Again, please tell everyone at Intel to stop telling vendors that they
need to take these patches into their trees.  That's pressure that is
trying to circumvent the normal (well semi-normal here) review process
and is doing nothing to help your situation out.

> We finished the audits on interrupt handlers now, and also on timers
> and tasklets. There's still some additional work on other softirq handlers.

We asked what we can do here to help out with that, by asking for
specific definitions of what you are looking for.  Again, that was not
answered, so don't think that you have to do all of this on your own.
This seems to be your choice :(

> I did some work on the other feedback and some gleixnification, but that's
> still a work in progress. Currently I expect a repost some time next week.

Can we get a date for when the issues are supposed to go public, so we can
see whether "next week" really matters for the distros/users?  For all I
know, the release date could be on Monday...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 18:00         ` Greg KH
@ 2019-02-14 18:05           ` Andrew Cooper
  2019-02-14 18:33           ` Andi Kleen
  1 sibling, 0 replies; 105+ messages in thread
From: Andrew Cooper @ 2019-02-14 18:05 UTC (permalink / raw)
  To: speck

On 14/02/2019 18:00, speck for Greg KH wrote:
> Can we get a date as to how this relates to when the issues are supposed
> to be going public so as to see if "next week" really matters for the
> distros/users?  For all I know, the release date could be on Monday...

The embargo date for all of this is May 14th (probably 10am Pacific time).

At least one of the researchers who reported it has a slot at the IEEE
symposium shortly thereafter.

~Andrew

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 18:00         ` Greg KH
  2019-02-14 18:05           ` Andrew Cooper
@ 2019-02-14 18:33           ` Andi Kleen
  2019-02-14 18:52             ` Greg KH
  1 sibling, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-14 18:33 UTC (permalink / raw)
  To: speck

> > everyone understands these implications.
> 
> No, we don't (or at least I do not), understand any of the implications
> here because the questions I asked were never answered.  So far you have

The original patches flushed on every kernel exit.
This was rejected as the default because it has some overhead
(we saw ~6% for a microbenchmark; RH saw more in a very unrealistic
microbenchmark).

The second (actually third) version didn't flush on every
kernel exit, but only on asynchronous events like non-whitelisted
interrupts/timers/tasklets.

The vast majority of the performance improvement came
from not flushing for process context system calls and exceptions.
At least in our benchmarks there was very little overhead
with that solution, with most kernel exits not flushing
anymore, and any penalties near the noise level.

But it relied on whitelisting asynchronous events
to avoid having to patch most of the tree.

The feedback, including from you, was that it would
still be unacceptable to slow down any interrupt handlers or timers
that don't touch user data.

So the requirement was that all interrupt handlers/timers/
tasklets in the tree needed to be audited, to only flush for the ones
that touch user data. That's what we've been
working on for the next version.

Obviously it is neither simple, nor coming soon, nor
easy to maintain.
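
For context, the core primitive of the conditional scheme is tiny (a
sketch; lazy_clear_cpu() is this patch set's name, while the per-cpu flag
name here is made up):

	static DEFINE_PER_CPU(bool, clear_cpu_on_exit);

	static inline void lazy_clear_cpu(void)
	{
		/* Checked, acted on and reset on the next kernel exit */
		this_cpu_write(clear_cpu_on_exit, true);
	}

All of the complexity is in deciding which code paths have to call it.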

> just "told us what you are going to do" and then never even backed that
> up with reasons for why you are doing that.

!?!? I wrote a long document with a full security model.

Did you read it? I don't remember any comments on that from you.

> Why would anyone want to take
> that?

Are you asking about the mds=full variant (like v2 of the patch series)?

It's really simple, straightforward, and has a very strictly defined,
easy security model that can be explained in a few sentences.

It's also very similar to what will eventually be in the tree
as the code path for mds=full; there haven't been any changes
to this at all.

Its only drawback is some performance penalty. Everything else
is vastly superior to any other solution.

If anyone just wants a simple fix soon, it's the obvious
choice.


-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 18:33           ` Andi Kleen
@ 2019-02-14 18:52             ` Greg KH
  2019-02-14 19:50               ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Greg KH @ 2019-02-14 18:52 UTC (permalink / raw)
  To: speck

On Thu, Feb 14, 2019 at 10:33:59AM -0800, speck for Andi Kleen wrote:
> > > everyone understands these implications.
> > 
> > No, we don't (or at least I do not), understand any of the implications
> > here because the questions I asked were never answered.  So far you have
> 
> The original patches flushed on every kernel exit.
> This was rejected as the default because it has some overhead
> (we saw ~6% for a microbenchmark; RH saw more in a very unrealistic
> microbenchmark).
> 
> The second (actually third) version didn't flush on every
> kernel exit, but only on asynchronous events like non-whitelisted
> interrupts/timers/tasklets.
> 
> The vast majority of the performance improvement came
> from not flushing for process context system calls and exceptions.
> At least in our benchmarks there was very little overhead
> with that solution, with most kernel exits not flushing
> anymore, and any penalties near the noise level.
> 
> But it relied on whitelisting asynchronous events
> to avoid having to patch most of the tree.
> 
> The feedback, including from you, was that it would
> still be unacceptable to slow down any interrupt handlers or timers
> that don't touch user data.

Of course that's not ok.

I also asked, as did Thomas, what "user data" is in this type of
situation.  Without that definition, it's been impossible for me
to propose anything here.

> > just "told us what you are going to do" and then never even backed that
> > up with reasons for why you are doing that.
> 
> !?!? I wrote long document with a full security model.
> 
> Did you read it? I don't remember any comments on that from you.

Thomas beat me to it; please go back and look at his questions.  Most of
those have yet to be answered, including my "simple" one above.

> > Why would anyone want to take
> > that?
> 
> Are you asking about the mds=full variant (like v2 of the patch series)?
> 
> It's really simple, straightforward, and has a very strictly defined,
> easy security model that can be explained in a few sentences.

And it's a non-viable solution.  Please stop pushing this.

> It's also very similar to what will eventually be in the tree
> as the code path for mds=full; there haven't been any changes
> to this at all.
> 
> Its only drawback is some performance penalty. Everything else
> is vastly superior to any other solution.

"only" is not ok.  I'm still getting yelled at for the spectre fixes and
how it slowed down people's workloads.  It's also easy to just run in UP
mode to solve all of these, right?  :)

greg k-h

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 18:52             ` Greg KH
@ 2019-02-14 19:50               ` Andi Kleen
  2019-02-15  7:06                 ` Greg KH
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-14 19:50 UTC (permalink / raw)
  To: speck

> Of course that's not ok.

Okay. That's easy to say.

No regression anywhere, not even for any ISA driver.

But then of course that turned the still relatively simple patchkit into
a gigantic boondoggle of all-tree driver audits and constant maintenance
and education of all driver maintainers.

And all to handle lots of drivers which likely nobody uses anyway,
or where, if they are used, the device is already so slow that the extra
overhead of a CPU flush is in the noise
(e.g. CPU flush overhead << PIO access).

That's somehow ok.

Anyway, it's water under the bridge at this point. We already
wasted the time doing all of this.

But of course it still has to be figured out whether any of this can
actually be backported in practice or maintained long term.

My suspicion is that practical backports will need some
form of whitelisting anyway.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-14 19:50               ` Andi Kleen
@ 2019-02-15  7:06                 ` Greg KH
  2019-02-15 13:06                   ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Greg KH @ 2019-02-15  7:06 UTC (permalink / raw)
  To: speck

On Thu, Feb 14, 2019 at 11:50:41AM -0800, speck for Andi Kleen wrote:
> > Of course that's not ok.
> 
> Okay. That's easy to say.
> 
> No regression anywhere, not even for any ISA driver.

Don't be silly, or melodramatic please.  Be realistic.

Slowing the whole kernel down for an ill-defined problem is not ok.  You
know that.

> But then of course that turned the still relatively simple patchkit into
> a gigantic boondoggle of all-tree driver audits and constant maintenance
> and education of all driver maintainers.

Again, we offered to help if you would just answer the questions that
were asked.

Best of luck,

greg k-h

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-15  7:06                 ` Greg KH
@ 2019-02-15 13:06                   ` Andi Kleen
  2019-02-19 12:12                     ` Greg KH
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-15 13:06 UTC (permalink / raw)
  To: speck

> Slowing the whole kernel down for an ill-defined problem is not ok.  You
> know that

What do you mean by ill-defined problem?

Are you unclear on the group 4 vulnerabilities? I assume you've seen
the disclosures, or is that not correct?

Have you read Documentation/clearcpu.txt? 

There will be some slowdown for group 4 mitigations. It's inevitable.
These changes don't come for free. We've put a lot of
work into minimizing it, but it won't be zero.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-01-22 17:57     ` Thomas Gleixner
  2019-01-23  1:35       ` [MODERATED] " Andi Kleen
@ 2019-02-16  2:00       ` Andi Kleen
  2019-02-16 10:32         ` Thomas Gleixner
  1 sibling, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-16  2:00 UTC (permalink / raw)
  To: speck

> Let's look at the different scenarios:
> 
>   process A    ->   process A		Thread switch in same process,
>   					do nothing
> 
>   process A    ->   process B		Switches mm -> set percpu flush
> 
>   process A    ->   kernel		Do nothing (rely on lazy mm)
>   kernel       ->   process B		Switches mm -> set percpu flush
>   kernel       ->   kernel		do nothing
>   kernel       ->   process A		if lazy mm -> set percpu flush

I revisited this now.

I think your proposal is to unconditionally set the per cpu 
flag in switch_mm.

But this doesn't handle the following case:

	process A -> kernel -> process A 

In this case we want to flush because of the kernel thread
(e.g. might contain crypto)
but switch_mm is never called. I'll stay with the previous
version for now.

I did some updates to the logic. It now uses per-cpu
state, and also clears when coming out of idle.

This is needed because the CPU coming out of idle
might inherit some SMT state from the sibling.

This actually simplifies the logic nicely,
so it's very straightforward now.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 14/27] MDSv5 3
  2019-02-16  2:00       ` [MODERATED] " Andi Kleen
@ 2019-02-16 10:32         ` Thomas Gleixner
  2019-02-16 16:58           ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Thomas Gleixner @ 2019-02-16 10:32 UTC (permalink / raw)
  To: speck

On Fri, 15 Feb 2019, speck for Andi Kleen wrote:

> > Let's look at the different scenarios:
> > 
> >   process A    ->   process A		Thread switch in same process,
> >   					do nothing
> > 
> >   process A    ->   process B		Switches mm -> set percpu flush
> > 
> >   process A    ->   kernel		Do nothing (rely on lazy mm)
> >   kernel       ->   process B		Switches mm -> set percpu flush
> >   kernel       ->   kernel		do nothing
> >   kernel       ->   process A		if lazy mm -> set percpu flush
> 
> I revisited this now.
> 
> I think your proposal is to unconditionally set the per cpu 
> flag in switch_mm.
> 
> But this doesn't handle the following case:
> 
> 	process A -> kernel -> process A 
> 
> In this case we want to flush because of the kernel thread
> (e.g. might contain crypto)
> but switch_mm is never called. I'll stay with the previous
> version for now.

What?

     process A --> kernel thread
      schedule()
	context_switch()
	   enter_lazy_tlb(oldmm, next)
	     this_cpu_write(cpu_tlbstate.is_lazy, true);
      ...

     kernel thread -> Process A
      schedule()
        context_switch()
	  switch_mm_irqs_off()
	    was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);

	    if (was_lazy)
	       this_cpu_write(flush_on_exit, true);

I explained that to you before:

> That's true, but enter_lazy_tlb() is called and there exists already an
> indicator that it switched from a user space task to a kernel task:
> cpu_tlbstate.is_lazy, which is evaluated in the next invocation of
> switch_mm_irqs_off().
> 
> So the question is, whether something like this makes sense:
> 
>    - Have some indicator in cpu_tlbstate that switching is due
> 
>      cpu_tlbstate.tif_flags
> 
>      and use that TIF bit.
> 
> In the sys_exit() path do
> 
>    cached_flags = READ_ONCE(ti->flags);
> 
>    if (static_key_enabled(mds_cond_clear))
>         cached_flags |= READ_ONCE(cpu_tlbstate.tif_flags);
> 
> That's an extra read, but especially with PTI this is cache hot anyway and
> the store of the flag is done in switch_mm_irqs_off(). Haven't thought it
> through, but on the first glance this looks simpler and makes the whole
> thing stick to the CPU instead of playing games with transferring the
> thread flag on every context switch.

Ergo, this can be completely done in switch_mm_irqs_off() and does not at
all require any of that propagation logic in switch_to().
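
Concretely, that would presumably reduce to a few lines called from
switch_mm_irqs_off() (a sketch; mds_flush_on_exit is a placeholder per-cpu
flag, not an existing symbol):

	static DEFINE_PER_CPU(bool, mds_flush_on_exit);

	static void mds_note_mm_switch(struct mm_struct *prev,
				       struct mm_struct *next)
	{
		bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);

		/* A -> B changes the mm; A -> kernel thread -> A keeps
		   the mm but left is_lazy set, so both schedule a clear */
		if (prev != next || was_lazy)
			this_cpu_write(mds_flush_on_exit, true);
	}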

Thanks,

	tglx

	       

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-02-16 10:32         ` Thomas Gleixner
@ 2019-02-16 16:58           ` Andi Kleen
  2019-02-16 17:12             ` Andi Kleen
  0 siblings, 1 reply; 105+ messages in thread
From: Andi Kleen @ 2019-02-16 16:58 UTC (permalink / raw)
  To: speck

> Ergo, this can be completely done in switch_mm_irqs_off() and does not at
> all require any of that propagation logic in switch_to().

And why is that a benefit?

OK, it's just moving one if with a single condition to another function,
turning it into another condition:

if (prev_mm != next_mm) 

->

if (this_cpu.lazy_tlb)

which depends on the quite hairy lazy TLB semantics.

I fail to see the benefit of your variant -- it changes
a simple, obvious if into a subtly complicated one; both
get executed on every context switch anyway, at
the same cost to the CPU. The subtly complicated if needs a lot
more comments, and will be much harder to understand
for anyone except you.

And worse, we wasted lots of time arguing about this.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 14/27] MDSv5 3
  2019-02-16 16:58           ` [MODERATED] " Andi Kleen
@ 2019-02-16 17:12             ` Andi Kleen
  0 siblings, 0 replies; 105+ messages in thread
From: Andi Kleen @ 2019-02-16 17:12 UTC (permalink / raw)
  To: speck

On Sat, Feb 16, 2019 at 08:58:28AM -0800, speck for Andi Kleen wrote:
> > Ergo, this can be completely done in switch_mm_irqs_off() and does not at
> > all require any of that propagation logic in switch_to().
> 
> And why is that a benefit?
> 
> OK, it's just moving one if with a single condition to another function,
> turning it into another condition:
> 
> if (prev_mm != next_mm) 

OK, it's actually: if (prev_mm != next_mm || prev_mm == NULL)

Oh well. I will make the change. But anyone else who has to understand
this later will not love you.

-Andi

^ permalink raw reply	[flat|nested] 105+ messages in thread

* [MODERATED] Re: [PATCH v5 00/27] MDSv5 19
  2019-02-15 13:06                   ` Andi Kleen
@ 2019-02-19 12:12                     ` Greg KH
  0 siblings, 0 replies; 105+ messages in thread
From: Greg KH @ 2019-02-19 12:12 UTC (permalink / raw)
  To: speck

On Fri, Feb 15, 2019 at 05:06:01AM -0800, speck for Andi Kleen wrote:
> > Slowing the whole kernel down for an ill-defined problem is not ok.  You
> > know that
> 
> What do you mean by ill defined problem? 
> 
> Are you unclear on the group 4 vulnerabilities? I assume you've seen
> the disclosures, or is that not correct?

I have seen some things, but odds are I have not seen everything.  Do
you have a pointer to where I, and everyone else here, should be looking
to read all of the relevant ones to make sure we actually are talking
about the same thing?

> Have you read Documentation/clearcpu.txt? 

Yes I did, and I responded to your patch with some simple questions
about that file on Jan 22:
	Subject: Re: [PATCH v5 09/27] MDSv5 23
	Message-ID: <20190122072652.GA7082@kroah.com>

but never got a response from you :(

Why I have to ask for a response again (which is what I did here) is
odd...

greg k-h

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [PATCH v5 05/27] MDSv5 21
  2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
  2019-01-22  4:35   ` [MODERATED] " Konrad Rzeszutek Wilk
  2019-01-22 13:01   ` Thomas Gleixner
@ 2019-02-21 12:06   ` Thomas Gleixner
  2 siblings, 0 replies; 105+ messages in thread
From: Thomas Gleixner @ 2019-02-21 12:06 UTC (permalink / raw)
  To: speck

On Fri, 18 Jan 2019, speck for Andi Kleen wrote:
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 18bc9b51ac9b..eb6e39238d1d 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -494,7 +494,7 @@ do_nmi(struct pt_regs *regs, long error_code)
>  {
>  	if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
>  		this_cpu_write(nmi_state, NMI_LATCHED);
> -		return;
> +		goto out;

This is not needed. It's a nested NMI, so it does not return to anything
other than the running one.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread

Thread overview: 105+ messages
2019-01-19  0:50 [MODERATED] [PATCH v5 00/27] MDSv5 19 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 01/27] MDSv5 26 Andi Kleen
2019-01-22  4:17   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 12:46   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 02/27] MDSv5 14 Andi Kleen
2019-01-22  4:20   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 12:51   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 03/27] MDSv5 16 Andi Kleen
2019-01-22  4:23   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 12:55   ` Thomas Gleixner
2019-01-27 21:58   ` Thomas Gleixner
2019-01-28  3:30     ` [MODERATED] " Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 04/27] MDSv5 15 Andi Kleen
2019-01-22  4:33   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 12:59   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 05/27] MDSv5 21 Andi Kleen
2019-01-22  4:35   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 13:01   ` Thomas Gleixner
2019-02-21 12:06   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 06/27] MDSv5 18 Andi Kleen
2019-01-21 22:41   ` [MODERATED] " Josh Poimboeuf
2019-01-22  1:16     ` Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 07/27] MDSv5 0 Andi Kleen
2019-01-22  4:39   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-27 22:09   ` Thomas Gleixner
2019-01-28  3:33     ` [MODERATED] " Andi Kleen
2019-01-28  8:29       ` Thomas Gleixner
2019-02-13 22:26   ` [MODERATED] " Tyler Hicks
2019-01-19  0:50 ` [MODERATED] [PATCH v5 08/27] MDSv5 13 Andi Kleen
2019-01-22  4:40   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 09/27] MDSv5 23 Andi Kleen
2019-01-22  4:56   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22  7:26   ` Greg KH
2019-01-22 13:07   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 10/27] MDSv5 7 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 11/27] MDSv5 2 Andi Kleen
2019-01-22 13:11   ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 12/27] MDSv5 6 Andi Kleen
2019-01-22 14:01   ` Thomas Gleixner
2019-01-22 15:42     ` Thomas Gleixner
2019-01-22 18:01     ` [MODERATED] " Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 13/27] MDSv5 17 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 14/27] MDSv5 3 Andi Kleen
2019-01-22  4:48   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22 15:58   ` Thomas Gleixner
2019-01-22 17:57     ` Thomas Gleixner
2019-01-23  1:35       ` [MODERATED] " Andi Kleen
2019-01-23  9:27         ` Thomas Gleixner
2019-01-23 16:02           ` [MODERATED] " Andi Kleen
2019-01-23 22:40             ` Josh Poimboeuf
2019-01-23 22:57               ` Josh Poimboeuf
2019-01-24  0:25                 ` Josh Poimboeuf
2019-01-24  2:26               ` Andi Kleen
2019-01-24 12:04             ` Thomas Gleixner
2019-01-28  3:42               ` [MODERATED] " Andi Kleen
2019-01-28  8:33                 ` Thomas Gleixner
2019-02-16  2:00       ` [MODERATED] " Andi Kleen
2019-02-16 10:32         ` Thomas Gleixner
2019-02-16 16:58           ` [MODERATED] " Andi Kleen
2019-02-16 17:12             ` Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 15/27] MDSv5 1 Andi Kleen
2019-01-22  4:48   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 16/27] MDSv5 10 Andi Kleen
2019-01-22  4:54   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-22  7:33   ` Greg KH
2019-01-19  0:50 ` [MODERATED] [PATCH v5 17/27] MDSv5 9 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 18/27] MDSv5 8 Andi Kleen
2019-01-22  5:07   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 19/27] MDSv5 12 Andi Kleen
2019-01-22  5:09   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 20/27] MDSv5 27 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 21/27] MDSv5 20 Andi Kleen
2019-01-22  5:11   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 22/27] MDSv5 24 Andi Kleen
2019-01-21 21:24   ` [MODERATED] " Linus Torvalds
2019-01-22  1:22     ` Andi Kleen
2019-01-22 16:09       ` Thomas Gleixner
2019-01-22 17:56         ` [MODERATED] " Andi Kleen
2019-01-22 18:56           ` Thomas Gleixner
2019-01-23  1:39             ` [MODERATED] " Andi Kleen
2019-01-23  6:39               ` Greg KH
2019-01-24  9:55               ` Thomas Gleixner
2019-01-19  0:50 ` [MODERATED] [PATCH v5 23/27] MDSv5 22 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 24/27] MDSv5 5 Andi Kleen
2019-01-21 21:20   ` [MODERATED] " Linus Torvalds
2019-01-19  0:50 ` [MODERATED] [PATCH v5 25/27] MDSv5 4 Andi Kleen
2019-01-22  5:15   ` [MODERATED] " Konrad Rzeszutek Wilk
2019-01-19  0:50 ` [MODERATED] [PATCH v5 26/27] MDSv5 11 Andi Kleen
2019-01-19  0:50 ` [MODERATED] [PATCH v5 27/27] MDSv5 25 Andi Kleen
2019-01-21 21:18 ` [MODERATED] Re: [PATCH v5 00/27] MDSv5 19 Linus Torvalds
2019-01-22  1:14   ` Andi Kleen
2019-01-22  7:38     ` Greg KH
2019-01-28 11:34 ` Thomas Gleixner
2019-02-13 22:33   ` [MODERATED] " Tyler Hicks
2019-02-14 13:09     ` Jiri Kosina
2019-02-14 13:51       ` Greg KH
2019-02-14 16:53       ` Andi Kleen
2019-02-14 18:00         ` Greg KH
2019-02-14 18:05           ` Andrew Cooper
2019-02-14 18:33           ` Andi Kleen
2019-02-14 18:52             ` Greg KH
2019-02-14 19:50               ` Andi Kleen
2019-02-15  7:06                 ` Greg KH
2019-02-15 13:06                   ` Andi Kleen
2019-02-19 12:12                     ` Greg KH
