* [MODERATED] [PATCH v3 00/32] MDSv3 12
@ 2018-12-21  0:27 Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 01/32] MDSv3 7 Andi Kleen
                   ` (35 more replies)
  0 siblings, 36 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Here's a new version of flushing CPU buffers for group 4.

This mainly covers single thread, not SMT (except for the idle case).

I lumped all the issues together under the Microarchitectural Data
Sampling (MDS) name because they need the same mitigations,
and it doesn't seem worth duplicating the sysfs files and bug entries.

This version implements Linus' suggestion to only clear the CPU
buffers when needed. The patch kit is now a lot more complicated:
different subsystems determine if they might touch other users'
or sensitive data and schedule a CPU clear on the next kernel exit.

Generally, process context doesn't clear (unless it is cryptographic
or does context switches), and interrupt context schedules a clear.
There are some exceptions to these rules.

For details on the security model see the Documentation/clearcpu.txt
file. In my tests the number of clears is much lower now.
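
As an illustration, this is roughly what the opt-in looks like (a
minimal sketch only; the real helper is introduced later in this
series as lazy_clear_cpu(), together with TIF_CLEAR_CPU):

	/* Sketch: request a CPU buffer clear on the next kernel exit */
	static inline void lazy_clear_cpu(void)
	{
		set_thread_flag(TIF_CLEAR_CPU);
	}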

For most benchmarks we tried, the difference is now in the noise.
ebizzy and loopback apache both show about 1.7%
degradation.

It makes various assumptions about how kernel code behaves.
I did some auditing, but wasn't able to cover everything.
Please double check the assumptions laid out in the document.

Likely a lot more interrupt and timer handlers (and tasklets
and irq poll handlers) could be whitelisted as not needing a clear,
but for now I only did a fairly minimal set that I could test.

For some of the whitelisted code, especially the networking and
block softirqs, as well as the eBPF mitigation, additional auditing
to confirm that no rules are violated would be useful.

I kept the support for software sequences because from what I'm hearing
some CPUs might need them. If that's not the case they can still
be removed.

VERW is not used unconditionally because that wouldn't allow reporting
the correct status in the vulnerabilities file, which I consider important.
Instead we now have an mds=verw option that can be set as needed,
but is reported explicitly in the mitigation status.
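
For example, with the microcode based mitigation active, the sysfs
file added later in this series reports (illustrative output):

  $ cat /sys/devices/system/cpu/vulnerabilities/mds
  Mitigation: microcode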

Some notes:
- Against 4.20-rc5
- There's a new (bogus) build time warning from objtool about unreachable code.

Changes against previous versions:
- By default now flushes only when needed
- Define security model
- New administrator document
- Added mds=verw and mds=full
- Renamed mds_disable to mds=off
- KVM virtualization much improved
- Too many others to list. Most things different now.

Andi Kleen (32):
  x86/speculation/mds: Add basic bug infrastructure for MDS
  x86/speculation/mds: Support clearing CPU data on kernel exit
  x86/speculation/mds: Support mds=full
  x86/speculation/mds: Clear CPU buffers on entering idle
  x86/speculation/mds: Add sysfs reporting
  x86/speculation/mds: Add software sequences for older CPUs.
  x86/speculation/mds: Support mds=full for NMIs
  x86/speculation/mds: Avoid NMI races with software sequences
  x86/speculation/mds: Call software sequences on KVM entry
  x86/speculation/mds: Clear buffers on NMI exit on 32bit kernels.
  x86/speculation/mds: Add mds=verw
  x86/speculation/mds: Export MB_CLEAR CPUID to KVM guests.
  x86/speculation/mds: Always clear when entering guest without MB_CLEAR
  mds: Add documentation for clear cpu usage
  mds: Add preliminary administrator documentation
  x86/speculation/mds: Introduce lazy_clear_cpu
  x86/speculation/mds: Schedule cpu clear on context switch
  x86/speculation/mds: Add tracing for clear_cpu
  mds: Force clear cpu on kernel preemption
  mds: Schedule cpu clear for memzero_explicit and kzfree
  mds: Mark interrupts clear cpu, unless opted-out
  mds: Clear cpu on all timers, unless the timer opts-out
  mds: Clear CPU on tasklets, unless opted-out
  mds: Clear CPU on irq poll, unless opted-out
  mds: Clear cpu for string io/memcpy_*io in interrupts
  mds: Schedule clear cpu in swiotlb
  mds: Instrument skb functions to clear cpu automatically
  mds: Opt out tcp tasklet to not touch user data
  mds: mark kernel/* timers safe as not touching user data
  mds: Mark AHCI interrupt as not needing cpu clear
  mds: Mark ACPI interrupt as not needing cpu clear
  mds: Mitigate BPF

 .../ABI/testing/sysfs-devices-system-cpu      |   1 +
 .../admin-guide/kernel-parameters.txt         |  29 +++
 Documentation/admin-guide/mds.rst             | 128 +++++++++++++
 Documentation/clearcpu.txt                    | 179 ++++++++++++++++++
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/common.c                       |  24 ++-
 arch/x86/entry/entry_32.S                     |   7 +
 arch/x86/entry/entry_64.S                     |  24 +++
 arch/x86/include/asm/clearbpf.h               |  29 +++
 arch/x86/include/asm/clearcpu.h               | 100 ++++++++++
 arch/x86/include/asm/cpufeatures.h            |   4 +
 arch/x86/include/asm/io.h                     |   3 +
 arch/x86/include/asm/msr-index.h              |   1 +
 arch/x86/include/asm/thread_info.h            |   2 +
 arch/x86/include/asm/trace/clearcpu.h         |  27 +++
 arch/x86/kernel/acpi/cstate.c                 |   2 +
 arch/x86/kernel/cpu/bugs.c                    | 108 +++++++++++
 arch/x86/kernel/cpu/common.c                  |  14 ++
 arch/x86/kernel/kvm.c                         |   3 +
 arch/x86/kernel/process.c                     |   5 +
 arch/x86/kernel/process.h                     |  27 +++
 arch/x86/kernel/smpboot.c                     |   3 +
 arch/x86/kvm/cpuid.c                          |   3 +-
 arch/x86/kvm/vmx.c                            |  23 ++-
 arch/x86/lib/Makefile                         |   1 +
 arch/x86/lib/clear_cpu.S                      | 104 ++++++++++
 drivers/acpi/acpi_pad.c                       |   2 +
 drivers/acpi/osl.c                            |   3 +-
 drivers/acpi/processor_idle.c                 |   3 +
 drivers/ata/ahci.c                            |   2 +-
 drivers/ata/ahci.h                            |   2 +
 drivers/ata/libahci.c                         |  40 ++--
 drivers/base/cpu.c                            |   8 +
 drivers/idle/intel_idle.c                     |   5 +
 include/asm-generic/io.h                      |   3 +
 include/linux/clearcpu.h                      |  36 ++++
 include/linux/filter.h                        |  21 +-
 include/linux/hrtimer.h                       |   4 +
 include/linux/interrupt.h                     |  18 +-
 include/linux/irq_poll.h                      |   2 +
 include/linux/skbuff.h                        |   2 +
 include/linux/timer.h                         |   9 +-
 kernel/bpf/core.c                             |   2 +
 kernel/dma/swiotlb.c                          |   2 +
 kernel/events/core.c                          |   6 +-
 kernel/fork.c                                 |   3 +-
 kernel/futex.c                                |   6 +-
 kernel/irq/handle.c                           |   8 +
 kernel/irq/manage.c                           |   1 +
 kernel/sched/core.c                           |  14 +-
 kernel/sched/deadline.c                       |   6 +-
 kernel/sched/fair.c                           |   7 +-
 kernel/sched/idle.c                           |   3 +-
 kernel/sched/rt.c                             |   3 +-
 kernel/softirq.c                              |  25 ++-
 kernel/time/alarmtimer.c                      |   2 +-
 kernel/time/hrtimer.c                         |  11 +-
 kernel/time/posix-timers.c                    |   6 +-
 kernel/time/sched_clock.c                     |   3 +-
 kernel/time/tick-sched.c                      |   6 +-
 kernel/time/timer.c                           |   8 +
 kernel/watchdog.c                             |   3 +-
 lib/irq_poll.c                                |  18 +-
 lib/string.c                                  |   6 +
 mm/slab_common.c                              |   5 +-
 net/core/skbuff.c                             |  26 +++
 net/ipv4/tcp_output.c                         |   5 +-
 68 files changed, 1138 insertions(+), 62 deletions(-)
 create mode 100644 Documentation/admin-guide/mds.rst
 create mode 100644 Documentation/clearcpu.txt
 create mode 100644 arch/x86/include/asm/clearbpf.h
 create mode 100644 arch/x86/include/asm/clearcpu.h
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h
 create mode 100644 arch/x86/lib/clear_cpu.S
 create mode 100644 include/linux/clearcpu.h

-- 
2.17.2

* [MODERATED] [PATCH v3 01/32] MDSv3 7
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2019-01-09 17:38   ` [MODERATED] " Konrad Rzeszutek Wilk
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 02/32] MDSv3 22 Andi Kleen
                   ` (34 subsequent siblings)
  35 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

MDS is Microarchitectural Data Sampling, a set of side channel
attacks on internal buffers in Intel CPUs. They all have
the same mitigations for the single threaded case, so we lump
them all together as a single MDS issue.

This addresses CVE-2018-12126, CVE-2018-12130 and CVE-2018-12127
for the single threaded case.

This patch adds the basic infrastructure to detect if the current
CPU is affected by MDS, and if so sets the right BUG bit.

We also provide a command line option "mds=off" to disable
any workarounds.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 arch/x86/include/asm/cpufeatures.h              |  2 ++
 arch/x86/include/asm/msr-index.h                |  1 +
 arch/x86/kernel/cpu/bugs.c                      | 10 ++++++++++
 arch/x86/kernel/cpu/common.c                    | 14 ++++++++++++++
 5 files changed, 30 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..f5c14b721eef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2341,6 +2341,9 @@
 			Format: <first>,<last>
 			Specifies range of consoles to be captured by the MDA.
 
+	mds=off		[X86, Intel]
+			Disable workarounds for Micro-architectural Data Sampling.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 28c4a502b419..93fab3a1e046 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -342,6 +342,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MB_CLEAR		(18*32+10) /* Flush state on VERW */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -379,5 +380,6 @@
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index c8f73efb4ece..303064a9a0a9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -77,6 +77,7 @@
 						    * attack, so no Speculative Store Bypass
 						    * control required.
 						    */
+#define ARCH_CAP_MDS_NO			(1 << 5)   /* No Microarchitectural data sampling */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			(1 << 0)   /*
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 500278f5308e..13eb623fe0b1 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -35,6 +35,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -99,6 +100,8 @@ void __init check_bugs(void)
 
 	l1tf_select_mitigation();
 
+	mds_select_mitigation();
+
 #ifdef CONFIG_X86_32
 	/*
 	 * Check whether we are able to run this kernel safely on SMP.
@@ -1041,6 +1044,14 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static void mds_select_mitigation(void)
+{
+	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
+	    !boot_cpu_has_bug(X86_BUG_MDS)) {
+		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
+	}
+}
+
 #ifdef CONFIG_SYSFS
 
 #define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ffb181f959d2..bebeb67015fc 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -998,6 +998,14 @@ static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
 	{}
 };
 
+static const __initconst struct x86_cpu_id cpu_no_mds[] = {
+	/* in addition to cpu_no_speculation */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_X	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_PLUS	},
+	{}
+};
+
 static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
@@ -1019,6 +1027,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	if (ia32_cap & ARCH_CAP_IBRS_ALL)
 		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
 
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    !x86_match_cpu(cpu_no_mds)) &&
+	    !(ia32_cap & ARCH_CAP_MDS_NO) &&
+	    !(ia32_cap & ARCH_CAP_RDCL_NO))
+		setup_force_cpu_bug(X86_BUG_MDS);
+
 	if (x86_match_cpu(cpu_no_meltdown))
 		return;
 
-- 
2.17.2

* [MODERATED] [PATCH v3 02/32] MDSv3 22
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 01/32] MDSv3 7 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 03/32] MDSv3 5 Andi Kleen
                   ` (33 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Support clearing CPU data on
 kernel exit

Add infrastructure for clearing CPU data on kernel exit.

Instead of clearing unconditionally, we support clearing
lazily when some kernel subsystem touches sensitive data
and sets the new TIF_CLEAR_CPU flag.

We handle TIF_CLEAR_CPU on kernel exit, similar to the
other kernel exit action flags.

The flushing is provided by new microcode as a new side
effect of the otherwise unused VERW instruction.

So far this patch doesn't do anything by itself; it relies
on later patches to set TIF_CLEAR_CPU.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c            |  8 +++++++-
 arch/x86/include/asm/clearcpu.h    | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  2 ++
 3 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/clearcpu.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..07cf8d32df67 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -29,6 +29,7 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
+#include <asm/clearcpu.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
 
@@ -132,7 +133,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 }
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_CLEAR_CPU |\
 	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
@@ -170,6 +171,11 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_CLEAR_CPU) {
+			clear_thread_flag(TIF_CLEAR_CPU);
+			clear_cpu();
+		}
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
new file mode 100644
index 000000000000..c45f0c28867e
--- /dev/null
+++ b/arch/x86/include/asm/clearcpu.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARCPU_H
+#define _ASM_CLEARCPU_H 1
+
+#include <linux/jump_label.h>
+#include <linux/sched/smt.h>
+#include <asm/alternative.h>
+#include <linux/thread_info.h>
+
+/*
+ * Clear CPU buffers to avoid side channels.
+ * We use either microcode (as a side effect of the obsolete
+ * "VERW" instruction), or special out of line clear sequences.
+ */
+
+static inline void clear_cpu(void)
+{
+	unsigned kernel_ds = __KERNEL_DS;
+	/* Has to be memory form, don't modify to use a register */
+	alternative_input("",
+		"verw %[kernelds]",
+		X86_FEATURE_MB_CLEAR,
+		[kernelds] "m" (kernel_ds));
+}
+
+#endif
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 82b73b75d67c..f50c05d5bc8c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_CLEAR_CPU		23	/* clear CPU on kernel exit */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
+#define _TIF_CLEAR_CPU		(1 << TIF_CLEAR_CPU)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-- 
2.17.2

* [MODERATED] [PATCH v3 03/32] MDSv3 5
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 01/32] MDSv3 7 Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 02/32] MDSv3 22 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 04/32] MDSv3 3 Andi Kleen
                   ` (32 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Support mds=full

Add a new command line option to force unconditional flushing
on each kernel exit. This is not enabled by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 5 +++++
 arch/x86/entry/common.c                         | 7 ++++++-
 arch/x86/include/asm/clearcpu.h                 | 2 ++
 arch/x86/kernel/cpu/bugs.c                      | 5 +++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f5c14b721eef..b764b4ebb1f8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2344,6 +2344,11 @@
 	mds=off		[X86, Intel]
 			Disable workarounds for Micro-architectural Data Sampling.
 
+	mds=full	[X86, Intel]
+			Always flush cpu buffers when exiting kernel for MDS.
+			Normally the kernel decides dynamically whether flushing
+			is needed or not.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 07cf8d32df67..6662444b33cf 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
-			clear_cpu();
+			/* Don't do it twice if forced */
+			if (!static_key_enabled(&force_cpu_clear))
+				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
@@ -217,6 +219,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
+	if (static_key_enabled(&force_cpu_clear))
+		clear_cpu();
+
 	user_enter_irqoff();
 }
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index c45f0c28867e..35fecc86e54f 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -23,4 +23,6 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
+
 #endif
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 13eb623fe0b1..5fbdf425a84a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1044,12 +1044,17 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
 	    !boot_cpu_has_bug(X86_BUG_MDS)) {
 		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
 	}
+
+	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
+		static_branch_enable(&force_cpu_clear);
 }
 
 #ifdef CONFIG_SYSFS
-- 
2.17.2

* [MODERATED] [PATCH v3 04/32] MDSv3 3
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (2 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 03/32] MDSv3 5 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 05/32] MDSv3 0 Andi Kleen
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

When entering idle, the internal state of the current CPU might
become visible to the thread sibling because the CPU "frees" some
internal resources.

To ensure there is no MDS leakage, always clear the CPU state
before doing any idling. We only do this if SMT is enabled,
as otherwise no leakage is possible.

This is not needed for idle poll because it does not share resources.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h | 19 +++++++++++++++++++
 arch/x86/kernel/acpi/cstate.c   |  2 ++
 arch/x86/kernel/kvm.c           |  3 +++
 arch/x86/kernel/process.c       |  5 +++++
 arch/x86/kernel/smpboot.c       |  3 +++
 drivers/acpi/acpi_pad.c         |  2 ++
 drivers/acpi/processor_idle.c   |  3 +++
 drivers/idle/intel_idle.c       |  5 +++++
 kernel/sched/fair.c             |  1 +
 9 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 35fecc86e54f..9e389c8a5679 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -23,6 +23,25 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+/*
+ * Clear CPU buffers before going idle, so that no state is leaked to SMT
+ * siblings taking over thread resources.
+ * Out of line to avoid include hell.
+ *
+ * Assumes that interrupts are disabled and only get re-enabled
+ * before idle, otherwise the data from a racing interrupt might not
+ * get cleared. There are some callers that violate this,
+ * but they are only used in unattackable cases.
+ */
+
+static inline void clear_cpu_idle(void)
+{
+	if (sched_smt_active()) {
+		clear_thread_flag(TIF_CLEAR_CPU);
+		clear_cpu();
+	}
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #endif
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 158ad1483c43..48adea5afacf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
 #include <acpi/processor.h>
 #include <asm/mwait.h>
 #include <asm/special_insns.h>
+#include <asm/clearcpu.h>
 
 /*
  * Initialize bm_flags based on the CPU cache properties
@@ -157,6 +158,7 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
+	clear_cpu_idle();
 	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ba4bfb7f6a36..c9206ad40a5b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -159,6 +159,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
 			/*
 			 * We cannot reschedule. So halt.
 			 */
+			clear_cpu_idle();
 			native_safe_halt();
 			local_irq_disable();
 		}
@@ -785,6 +786,8 @@ static void kvm_wait(u8 *ptr, u8 val)
 	if (READ_ONCE(*ptr) != val)
 		goto out;
 
+	clear_cpu_idle();
+
 	/*
 	 * halt until it's our turn and kicked. Note that we do safe halt
 	 * for irq enabled case to avoid hang when lock info is overwritten
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7d31192296a8..72c0fe5f69e0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,7 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 #include <asm/spec-ctrl.h>
+#include <asm/clearcpu.h>
 
 #include "process.h"
 
@@ -586,6 +587,8 @@ void stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
+	clear_cpu_idle();
+
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
 	 * for performing a successful kexec when going from SME inactive
@@ -672,6 +675,8 @@ static __cpuidle void mwait_idle(void)
 			mb(); /* quirk */
 		}
 
+		clear_cpu_idle();
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__sti_mwait(0, 0);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a9134d1910b9..4b873873476f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -81,6 +81,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
+#include <asm/clearcpu.h>
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -1635,6 +1636,7 @@ static inline void mwait_play_dead(void)
 	wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		/*
 		 * The CLFLUSH is a workaround for erratum AAI65 for
 		 * the Xeon 7400 series.  It's not clear it is actually
@@ -1662,6 +1664,7 @@ void hlt_play_dead(void)
 		wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		native_halt();
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index a47676a55b84..2dcbc38d0880 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <asm/mwait.h>
+#include <asm/clearcpu.h>
 #include <xen/xen.h>
 
 #define ACPI_PROCESSOR_AGGREGATOR_CLASS	"acpi_pad"
@@ -175,6 +176,7 @@ static int power_saving_thread(void *data)
 			tick_broadcast_enable();
 			tick_broadcast_enter();
 			stop_critical_timings();
+			clear_cpu_idle();
 
 			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b2131c4ea124..0342daa122fe 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -33,6 +33,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <acpi/processor.h>
+#include <asm/clearcpu.h>
 
 /*
  * Include the apic definitions for x86 to have the APIC timer related defines
@@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
  */
 static void __cpuidle acpi_safe_halt(void)
 {
+	clear_cpu_idle();
 	if (!tif_need_resched()) {
 		safe_halt();
 		local_irq_disable();
@@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 
 	ACPI_FLUSH_CPU_CACHE();
 
+	clear_cpu_idle();
 	while (1) {
 
 		if (cx->entry_method == ACPI_CSTATE_HALT)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 8b5d85c91e9d..ddaa7603d53a 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,6 +65,7 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/clearcpu.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
 
@@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 		}
 	}
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	if (!static_cpu_has(X86_FEATURE_ARAT) && tick)
@@ -953,6 +956,8 @@ static void intel_idle_s2idle(struct cpuidle_device *dev,
 	unsigned long ecx = 1; /* break on interrupt flag */
 	unsigned long eax = flg2MWAIT(drv->states[index].flags);
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..98e7f1e64a0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5935,6 +5935,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL(sched_smt_present);
 
 static inline void set_idle_cores(int cpu, int val)
 {
-- 
2.17.2

* [MODERATED] [PATCH v3 05/32] MDSv3 0
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (3 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 04/32] MDSv3 3 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 06/32] MDSv3 8 Andi Kleen
                   ` (30 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Report the MDS mitigation state in the sysfs vulnerabilities directory.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../ABI/testing/sysfs-devices-system-cpu         |  1 +
 arch/x86/kernel/cpu/bugs.c                       | 16 ++++++++++++++++
 drivers/base/cpu.c                               |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 73318225a368..02b7bb711214 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -477,6 +477,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
+		/sys/devices/system/cpu/vulnerabilities/mds
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5fbdf425a84a..a66e29a4c4f2 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1157,6 +1157,16 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
 			return l1tf_show_state(buf);
 		break;
+
+	case X86_BUG_MDS:
+		/* Assumes Hypervisor exposed HT state to us if in guest */
+		if (boot_cpu_has(X86_FEATURE_MB_CLEAR)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: microcode\n");
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+		}
+		return sprintf(buf, "Vulnerable\n");
+
 	default:
 		break;
 	}
@@ -1188,4 +1198,10 @@ ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *b
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
 }
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
+
 #endif
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index eb9443d5bae1..2fd6ca1021c2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_spectre_v2.attr,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
+	&dev_attr_mds.attr,
 	NULL
 };
 
-- 
2.17.2

* [MODERATED] [PATCH v3 06/32] MDSv3 8
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (4 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 05/32] MDSv3 0 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 07/32] MDSv3 21 Andi Kleen
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

On some older CPUs before Broadwell, clearing the CPU buffers with VERW
is not available, so we implement software sequences. These can then be
automatically patched in as needed.

Support the mitigation for Nehalem up to Broadwell. Broadwell strictly
doesn't need it because it should have the microcode update for VERW,
which is preferred. Some other CPUs may also not need it due to
microcode updates, but let's keep the sequences enabled for now.

There are two different sequences: one for Nehalem to IvyBridge,
and another for Haswell/Broadwell.

We add command line options to force the two different sequences,
so that it's possible to select the right (or less wrong) one in
VMs that don't report the correct CPU in CPUID. In normal
operation the kernel automatically selects the right
sequence based on the current CPU and its microcode update
status.
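
For example, a guest that is known to run on Haswell hardware, but
does not see the host model number in CPUID, could boot with
"mds=swclearhsw" on the kernel command line (see the
kernel-parameters.txt hunk below).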

Note to backporters: this patch requires eager FPU support.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../admin-guide/kernel-parameters.txt         |  15 +++
 arch/x86/include/asm/clearcpu.h               |   4 +-
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/kernel/cpu/bugs.c                    |  53 +++++++++
 arch/x86/lib/Makefile                         |   1 +
 arch/x86/lib/clear_cpu.S                      | 104 ++++++++++++++++++
 6 files changed, 178 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/lib/clear_cpu.S

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b764b4ebb1f8..5f8ac5270beb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2349,6 +2349,21 @@
 			Normally the kernel decides dynamically whether flushing
 			is needed or not.
 
+	mds=swclear	[X86, Intel]
+			Force using software sequence for clearing data that
+			could be exploited by Micro-architectural Data Sampling.
+			Normally automatically enabled when needed. This
+			option might be useful if running inside a virtual machine
+			that does not expose the correct model number. This
+			option requires a CPU with at least SSE support.
+
+	mds=swclearhsw	[X86, Intel]
+			Use Haswell/Broadwell specific sequence for clearing
+			data that could be exploited by Micro-architectural Data
+			Sampling. Normally automatically enabled when needed.
+			This option might be useful if running inside a virtual machine
+			that does not expose the correct model number.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 9e389c8a5679..4a570b3b0f5e 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -17,9 +17,11 @@ static inline void clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use an register */
-	alternative_input("",
+	alternative_input_2("",
 		"verw %[kernelds]",
 		X86_FEATURE_MB_CLEAR,
+		"call do_clear_cpu",
+		X86_BUG_MDS_CLEAR_CPU,
 		[kernelds] "m" (kernel_ds));
 }
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 93fab3a1e046..110759334c88 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -381,5 +381,7 @@
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
 #define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
+#define X86_BUG_MDS_CLEAR_CPU		X86_BUG(20) /* CPU needs call to clear_cpu on kernel exit/idle for MDS */
+#define X86_BUG_MDS_CLEAR_CPU_HSW	X86_BUG(21) /* CPU needs Haswell version of clear cpu */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index a66e29a4c4f2..faec1f0dd801 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -31,6 +31,7 @@
 #include <asm/intel-family.h>
 #include <asm/e820/api.h>
 #include <asm/hypervisor.h>
+#include <asm/cpu_device_id.h>
 
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
@@ -1044,15 +1045,61 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static const __initconst struct x86_cpu_id cpu_mds_clear_cpu[] = {
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_G	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_EP	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_EX	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE_EP	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE_EX	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_SANDYBRIDGE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_SANDYBRIDGE_X },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_IVYBRIDGE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_IVYBRIDGE_X	 },
+	{}
+};
+
+static const __initconst struct x86_cpu_id cpu_mds_clear_cpu_hsw[] = {
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_CORE	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_X	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_ULT	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_GT3E	    },
+
+	/* Have MB_CLEAR with microcode update, but list just in case: */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_CORE   },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_GT3E   },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_X	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_XEON_D },
+	{}
+};
+
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
 
+/* Export here to avoid warnings */
+extern __visible void do_clear_cpu(void);
+EXPORT_SYMBOL(do_clear_cpu);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
 	    !boot_cpu_has_bug(X86_BUG_MDS)) {
 		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
+		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU_HSW);
+		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
+		return;
 	}
 
+	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
+		x86_match_cpu(cpu_mds_clear_cpu)) ||
+		cmdline_find_option_bool(boot_command_line, "mds=swclear"))
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
+	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
+		x86_match_cpu(cpu_mds_clear_cpu_hsw)) ||
+		cmdline_find_option_bool(boot_command_line, "mds=swclearhsw")) {
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU_HSW);
+	}
 	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
 		static_branch_enable(&force_cpu_clear);
 }
@@ -1165,6 +1213,11 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 				return sprintf(buf, "Mitigation: microcode\n");
 			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
 		}
+		if (boot_cpu_has_bug(X86_BUG_MDS_CLEAR_CPU)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: software buffer clearing\n");
+			return sprintf(buf, "Mitigation: software buffer clearing, HT vulnerable\n");
+		}
 		return sprintf(buf, "Vulnerable\n");
 
 	default:
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25a972c61b0a..ce07225e53e1 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -28,6 +28,7 @@ lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 lib-$(CONFIG_FUNCTION_ERROR_INJECTION)	+= error-inject.o
 lib-$(CONFIG_RETPOLINE) += retpoline.o
+lib-y += clear_cpu.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
diff --git a/arch/x86/lib/clear_cpu.S b/arch/x86/lib/clear_cpu.S
new file mode 100644
index 000000000000..b619aca1449b
--- /dev/null
+++ b/arch/x86/lib/clear_cpu.S
@@ -0,0 +1,107 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * Clear internal CPU buffers on kernel boundaries.
+ *
+ * These sequences are somewhat fragile, please don't add
+ * or change instructions in the middle of the areas marked with
+ * start/end.
+ *
+ * Interrupts and NMIs are dealt with by re-clearing. We clear parts
+ * of the kernel stack, which has other advantages too.
+ *
+ * Save all registers to make it easier to use for callers.
+ *
+ * This sequence is for Nehalem-IvyBridge. For Haswell we jump
+ * to hsw_clear_cpu.
+ *
+ * These functions need to be called on a full stack, as they may
+ * use up to 1.5k of stack. They should also be called with
+ * interrupts disabled. NMIs etc. are handled by letting every
+ * NMI do its own clear sequence.
+ */
+ENTRY(ivb_clear_cpu)
+GLOBAL(do_clear_cpu)
+	/*
+	 * objtool complains about unreachable code here,
+	 * which appears to be spurious.
+	 */
+	ALTERNATIVE "", "jmp hsw_clear_cpu", X86_BUG_MDS_CLEAR_CPU_HSW
+	push %__ASM_REG(si)
+	push %__ASM_REG(di)
+	push %__ASM_REG(cx)
+	mov %_ASM_SP, %__ASM_REG(si)
+	sub  $32, %_ASM_SP
+	and  $-16,%_ASM_SP
+	/* Save the xmm registers the sequence clobbers */
+	movdqa %xmm0, (%_ASM_SP)
+	movdqa %xmm1, 16(%_ASM_SP)
+	sub  $672, %_ASM_SP
+	xorpd %xmm0,%xmm0
+	movdqa %xmm0, (%_ASM_SP)
+	mov %_ASM_SP, %__ASM_REG(di)
+	/* Clear sequence start */
+	movdqu %xmm0,(%__ASM_REG(di))
+	lfence
+	orpd (%__ASM_REG(di)), %xmm0
+	orpd (%__ASM_REG(di)), %xmm1
+	mfence
+	movl $40, %ecx
+	add  $32, %__ASM_REG(di)
+1:	movntdq %xmm0, (%__ASM_REG(di))
+	add  $16, %__ASM_REG(di)
+	decl %ecx
+	jnz  1b
+	mfence
+	/* Clear sequence end */
+	add  $672, %_ASM_SP
+	movdqu (%_ASM_SP), %xmm0
+	movdqu 16(%_ASM_SP), %xmm1
+	mov  %__ASM_REG(si),%_ASM_SP
+	pop %__ASM_REG(cx)
+	pop %__ASM_REG(di)
+	pop %__ASM_REG(si)
+	ret
+END(ivb_clear_cpu)
+
+/*
+ * Version for Haswell/Broadwell.
+ */
+ENTRY(hsw_clear_cpu)
+	push %__ASM_REG(si)
+	push %__ASM_REG(di)
+	push %__ASM_REG(cx)
+	push %__ASM_REG(ax)
+	mov  %_ASM_SP, %__ASM_REG(ax)
+	sub  $16, %_ASM_SP
+	and  $-16,%_ASM_SP
+	movdqa %xmm0, (%_ASM_SP)
+	sub  $1536,%_ASM_SP
+	/* Clear sequence start */
+	xorpd %xmm0,%xmm0
+	mov  %_ASM_SP, %__ASM_REG(si)
+	mov  %__ASM_REG(si), %__ASM_REG(di)
+	movl $40,%ecx
+1:	movntdq %xmm0, (%__ASM_REG(di))
+	add  $16, %__ASM_REG(di)
+	decl %ecx
+	jnz  1b
+	mfence
+	mov  %__ASM_REG(si), %__ASM_REG(di)
+	mov $1536, %ecx
+	rep movsb
+	lfence
+	/* Clear sequence end */
+	add $1536,%_ASM_SP
+	movdqa (%_ASM_SP), %xmm0
+	mov %__ASM_REG(ax),%_ASM_SP
+	pop %__ASM_REG(ax)
+	pop %__ASM_REG(cx)
+	pop %__ASM_REG(di)
+	pop %__ASM_REG(si)
+	ret
+END(hsw_clear_cpu)
-- 
2.17.2

* [MODERATED] [PATCH v3 07/32] MDSv3 21
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (5 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 06/32] MDSv3 8 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 08/32] MDSv3 15 Andi Kleen
                   ` (28 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

NMIs don't go through the C code when exiting to user space, so we
need to add an assembler CPU clear for this case. It is only used
with mds=full, because otherwise we assume NMIs don't touch other
users' or sensitive kernel data.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_64.S       | 12 ++++++++++++
 arch/x86/include/asm/clearcpu.h | 14 ++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..19b235ca2878 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -39,6 +39,7 @@
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
 #include <linux/err.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1403,6 +1404,17 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	/*
+	 * Clear only when force clearing was enabled. Otherwise
+	 * we assume NMI code is not sensitive.
+	 * If you don't have jump labels we always clear too.
+	 */
+#ifdef HAVE_JUMP_LABEL
+	STATIC_BRANCH_JMP l_yes=.Lno_clear_cpu key=force_cpu_clear, branch=1
+#endif
+	CLEAR_CPU
+.Lno_clear_cpu:
+
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
 	 * work, because we don't want to enable interrupts.
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 4a570b3b0f5e..cc03ca14140b 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_CLEARCPU_H
 #define _ASM_CLEARCPU_H 1
 
+#ifndef __ASSEMBLY__
+
 #include <linux/jump_label.h>
 #include <linux/sched/smt.h>
 #include <asm/alternative.h>
@@ -46,4 +48,16 @@ static inline void clear_cpu_idle(void)
 
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
+#else
+
+.macro CLEAR_CPU
+	/* Clear CPU buffers that could leak. VERW must use a memory operand. */
+	ALTERNATIVE_2 "", __stringify(push $__USER_DS ; verw (% _ASM_SP ) ; add $8, % _ASM_SP ),\
+		X86_FEATURE_MB_CLEAR, \
+		"call do_clear_cpu", \
+		X86_BUG_MDS_CLEAR_CPU
+.endm
+
+#endif
+
 #endif
-- 
2.17.2

* [MODERATED] [PATCH v3 08/32] MDSv3 15
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (6 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 07/32] MDSv3 21 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 09/32] MDSv3 10 Andi Kleen
                   ` (27 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

When we use a software sequence for clearing CPU buffers, an NMI
or similar interrupt could interrupt the clearing sequence.
In this case, make sure we really flush by always doing the extra
clearing on paranoid interrupt exit.

This is only needed for the software sequences, because VERW
is a single instruction that cannot be interrupted.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c   | 13 ++++++++++++-
 arch/x86/entry/entry_64.S | 12 ++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6662444b33cf..fd86f1e9e164 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -174,13 +174,24 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
 			/* Don't do it twice if forced */
-			if (!static_key_enabled(&force_cpu_clear))
+			if (!static_key_enabled(&force_cpu_clear) &&
+			    !static_cpu_has(X86_BUG_MDS_CLEAR_CPU))
 				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
+		/*
+		 * Software sequences can be interrupted, so we have
+		 * to run them with interrupts off. The NMI paths
+		 * make sure to always clear, even when returning
+		 * to the kernel.
+		 */
+		if (static_cpu_has(X86_BUG_MDS_CLEAR_CPU) &&
+			(cached_flags & _TIF_CLEAR_CPU))
+			clear_cpu();
+
 		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 19b235ca2878..4a41e2abd909 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1206,6 +1206,14 @@ ENTRY(paranoid_exit)
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
 	TRACE_IRQS_IRETQ_DEBUG
+	/*
+	 * Always do cpuclear in case we're racing with a MDS clear
+	 * software sequence on kernel exit.
+	 * Only needed if MB_CLEAR is not available, because VERW is atomic.
+	 */
+	ALTERNATIVE "", "jmp 1f", X86_FEATURE_MB_CLEAR
+	CLEAR_CPU
+1:
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 .Lparanoid_exit_restore:
@@ -1628,6 +1636,10 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	ALTERNATIVE "", "jmp 1f", X86_FEATURE_MB_CLEAR
+	CLEAR_CPU
+1:
+
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
-- 
2.17.2

* [MODERATED] [PATCH v3 09/32] MDSv3 10
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (7 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 08/32] MDSv3 15 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 10/32] MDSv3 11 Andi Kleen
                   ` (26 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

CPU buffers need to be cleared before entering a guest.
For VERW based CPU clearing we rely on the L1 cache flush for L1TF
doing it implicitly.

When using software sequences this is not done, so in this case
we need to call the software sequence explicitly.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 02edd9960e9d..82ec518811a0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -41,6 +41,7 @@
 
 #include <asm/asm.h>
 #include <asm/cpu.h>
+#include <asm/clearcpu.h>
 #include <asm/io.h>
 #include <asm/desc.h>
 #include <asm/vmx.h>
@@ -10680,6 +10681,15 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 
 	vcpu->stat.l1d_flush++;
 
+	/*
+	 * When the CPU has MB_CLEAR the CPU buffer flush is done implicitly
+	 * by the L1D_FLUSH below. But if software sequences are used
+	 * we need to call them explicitly.
+	 */
+	if (static_cpu_has(X86_BUG_MDS) &&
+	    !static_cpu_has(X86_FEATURE_MB_CLEAR))
+		clear_cpu();
+
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
 		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
 		return;
-- 
2.17.2

* [MODERATED] [PATCH v3 10/32] MDSv3 11
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (8 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 09/32] MDSv3 10 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 11/32] MDSv3 29 Andi Kleen
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

The main kernel exits on 32bit kernels are already handled by
earlier patches.

But for NMIs we need to clear in the assembler code, because
they could be returning into a software sequence, or may need
to do it because of mds=full.

Add an unconditional CPU clear on NMI exit for 32bit
for now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_32.S | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..0334e58e4720 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -45,6 +45,7 @@
 #include <asm/smap.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1446,6 +1447,12 @@ ENTRY(nmi)
 	movl	%ebx, %esp
 
 .Lnmi_return:
+	/*
+	 * Only needed when returning to kernel with sw sequences
+	 * or if it's forced. But for now do it unconditionally.
+	 */
+	CLEAR_CPU
+.Lno_clear_cpu:
 	CHECK_AND_APPLY_ESPFIX
 	RESTORE_ALL_NMI cr3_reg=%edi pop=4
 	jmp	.Lirq_return
-- 
2.17.2

* [MODERATED] [PATCH v3 11/32] MDSv3 29
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (9 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 10/32] MDSv3 11 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 12/32] MDSv3 19 Andi Kleen
                   ` (24 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add mds=verw

Some hypervisors might be unable to expose the new MB_CLEAR CPUID bit
to guests, even though they have an updated microcode that implements
MB_CLEAR/VERW.

We won't use VERW unconditionally because we need to know whether
it is implemented to correctly report the status in
/sys/devices/system/cpu/vulnerabilities/mds

However we should have a way to let guests on such hypervisors
enable VERW even if its CPUID bit is not visible.

Add a mds=verw option to force enable VERW buffer clearing.

When VERW is forced, the vulnerabilities file will report the
mitigation as enabled, but add "forced".
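
For example, in such a guest the file would then read (illustrative
output, matching the strings below):

  $ cat /sys/devices/system/cpu/vulnerabilities/mds
  Mitigation: microcode, forced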

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 arch/x86/kernel/cpu/bugs.c                      | 13 ++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5f8ac5270beb..9499ef25da5f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2364,6 +2364,12 @@
 			This option might be useful if running inside a virtual machine
 			that does not expose the correct model number.
 
+	mds=verw	[X86, Intel]
+			Enable microcode based ("VERW") mitigation for Microarchitectural
+			Data Sampling (MDS). This is normally automatically enabled,
+			but may need to be set manually in guests when the VM
+			does not export all the CPUIDs from the host microcode.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index faec1f0dd801..b24d93fb0564 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1075,6 +1075,7 @@ static const __initconst struct x86_cpu_id cpu_mds_clear_cpu_hsw[] = {
 };
 
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+static bool __read_mostly forced_mb_clear;
 
 /* Export here to avoid warnings */
 extern __visible void do_clear_cpu(void);
@@ -1089,7 +1090,10 @@ static void mds_select_mitigation(void)
 		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
 		return;
 	}
-
+	if (cmdline_find_option_bool(boot_command_line, "mds=verw")) {
+		setup_force_cpu_cap(X86_FEATURE_MB_CLEAR);
+		forced_mb_clear = true;
+	}
 	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
 		x86_match_cpu(cpu_mds_clear_cpu)) ||
 		cmdline_find_option_bool(boot_command_line, "mds=swclear"))
@@ -1209,9 +1213,12 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 	case X86_BUG_MDS:
 		/* Assumes Hypervisor exposed HT state to us if in guest */
 		if (boot_cpu_has(X86_FEATURE_MB_CLEAR)) {
+			char *forced = forced_mb_clear ? ", forced" : "";
+
 			if (cpu_smt_control != CPU_SMT_ENABLED)
-				return sprintf(buf, "Mitigation: microcode\n");
-			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+				return sprintf(buf, "Mitigation: microcode%s\n", forced);
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable%s\n",
+					forced);
 		}
 		if (boot_cpu_has_bug(X86_BUG_MDS_CLEAR_CPU)) {
 			if (cpu_smt_control != CPU_SMT_ENABLED)
-- 
2.17.2

* [MODERATED] [PATCH v3 12/32] MDSv3 19
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (10 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 11/32] MDSv3 29 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 13/32] MDSv3 6 Andi Kleen
                   ` (23 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Export MB_CLEAR CPUID to KVM
 guests.

Export the MB_CLEAR CPUID bit set by new microcode to KVM guests,
to signal that VERW implements the cpu buffer clearing side effect.

This also requires corresponding qemu patches.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7bcfa61375c0..0fd8a4fb8f09 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,7 +411,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
-		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES);
+		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) |
+		F(MB_CLEAR);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
2.17.2

* [MODERATED] [PATCH v3 13/32] MDSv3 6
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (11 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 12/32] MDSv3 19 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 14/32] MDSv3 28 Andi Kleen
                   ` (22 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

If we don't expose MB_CLEAR to the guest it could be using software
sequences to clear the CPU. If the hypervisor interrupts any of these
sequences the data will not be fully cleared. The only way to fix that
is for us to clear unconditionally on each guest entry.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 82ec518811a0..38db94df097a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10675,8 +10675,19 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 		flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
 		kvm_clear_cpu_l1tf_flush_l1d();
 
-		if (!flush_l1d)
+		if (!flush_l1d) {
+			/*
+			 * If we don't expose MB_CLEAR to the guest it
+			 * could be using software sequences for clear
+			 * cpu. If the hypervisor interrupts any of
+			 * these sequences the data will not be fully
+			 * cleared. The only way to fix that is for
+			 * us to clear unconditionally on each entry.
+			 */
+			if (!guest_cpuid_has(vcpu, X86_FEATURE_MB_CLEAR))
+				clear_cpu();
 			return;
+		}
 	}
 
 	vcpu->stat.l1d_flush++;
-- 
2.17.2

* [MODERATED] [PATCH v3 14/32] MDSv3 28
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (12 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 13/32] MDSv3 6 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 15/32] MDSv3 27 Andi Kleen
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Add documentation for clear cpu usage

Including the theory, and some guidelines for subsystem/driver
maintainers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/clearcpu.txt | 179 +++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 Documentation/clearcpu.txt

diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..786a207e6449
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,179 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which might later be sampled through side channels.
+For more details see CVE-2018-12126, CVE-2018-12130 and CVE-2018-12127.
+
+This can be avoided by explicitly clearing the CPU state.
+
+We are trying to avoid leaking data between different processes,
+and also to avoid leaking sensitive data, like cryptographic
+keys or other processes' user data.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of the document discusses (2).
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+Kernel data is sensitive when it contains cryptographic keys.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+When you touch user supplied data of *other* processes in system call
+context, add lazy_clear_cpu().
+
+For the cases below we care only about data from other processes.
+Touching non-cryptographic data from the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt does not touch user data directly, consider marking
+it with IRQF_NO_USER.
+
+When your tasklet does not touch user data directly, consider marking
+it with TASKLET_NO_USER using tasklet_init_flags or the
+DECLARE_TASKLET*_NOUSER macros.
+
+When your timer does not touch user data mark it with TIMER_NO_USER.
+If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
+
+When your irq poll handler does not touch user data, mark it
+with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
+
+For networking code make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that is not ensured, add lazy_clear_cpu() or
+lazy_clear_cpu_interrupt(). When the non-skb data access happens
+only in a hardware interrupt controlled by the driver, the driver
+can rely on not setting IRQF_NO_USER for that interrupt.
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree.
+
+If your RCU callback touches user data add lazy_clear_cpu().
+
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which currently means only x86. But it might be worth being prepared
+in case other architectures become affected too.
+
+Implementation details/assumptions
+----------------------------------
+
+If a system call touches data, it is data of its own process, so it does
+not need to be cleared; the process already has access to it.
+
+When context switching we clear data, unless the context switch
+is inside a process, or from/to idle. We also clear after any
+context switches from kernel threads.
+
+Idle does not have sensitive data, except for interrupts, which
+are handled separately.
+
+Cryptographic keys inside the kernel should be protected.
+We assume they use kzfree() or memzero_explicit() to clear
+state, so these functions trigger a cpu clear.
+
+Hard interrupts, tasklets and timers which can run asynchronously are
+assumed to touch arbitrary user data, unless they have been audited and
+marked with NO_USER flags.
+
+Most interrupt handlers for modern devices should not touch
+user data because they rely on DMA and only manipulate
+pointers. This needs auditing to confirm though.
+
+For softirqs we assume that if they touch user data they use
+lazy_clear_cpu()/lazy_clear_cpu_interrupt() as needed.
+Networking is handled through the skb_* functions below.
+Timers, tasklets and IRQ poll handlers are covered by the NO_USER opt-outs.
+
+Scheduler softirq is assumed to not touch user data.
+
+Block softirq done callbacks are assumed to not touch user data.
+
+For networking code, any skb functions that are likely
+touching non-header packet data schedule a cpu clear at next
+kernel exit. This includes skb_copy and related, skb_put/push,
+checksum functions.  We assume that any networking code touching
+packet data uses these functions.
+
+[In principle packet data should be encrypted for the wire anyway,
+but we still try to avoid leaking it.]
+
+Some IO related functions like string PIO and memcpy_from/to_io, or
+the software pci dma bounce function, which touch data, schedule a
+buffer clear.
+
+We assume NMI/machine check code does not touch other
+processes' data.
+
+Any buffer clearing is done lazily on next kernel exit, so it can be
+triggered in fast paths.
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process, the process itself should
+take care of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+Assume BPF execution does not touch other users' data, so it does
+not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent exploits written in BPF from using side channels to read
+data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a cpu clear is
+already scheduled (which means, for example, there was a context
+switch or crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user, and we check that the BPF program being run was loaded
+by the same user, so even if data leaked it would not cross privilege
+boundaries.
+
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that nearly all do.
+
+This could be further optimized by allowing callers that do
+a lot of individual BPF runs, and are sure they don't touch
+other users' data in between, to do the clear only once
+at the beginning. We can add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM we clear to avoid any leakage to the guest.
+Normally this is done implicitly as part of the L1TF mitigation.
+It relies on the L1TF mitigation being enabled. It also uses the "fast exit"
+optimization that only clears if an interrupt or context switch
+happened.
+
+There's an exception: if we don't expose MB_CLEAR to the guest it
+may be using software sequences. Unlike VERW, the software sequences
+are not atomic; they can be interrupted by the hypervisor and then
+not clear the data correctly. To avoid this we clear unconditionally
+on guest entry if MB_CLEAR is not exposed.
-- 
2.17.2

* [MODERATED] [PATCH v3 15/32] MDSv3 27
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (13 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 14/32] MDSv3 28 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 16/32] MDSv3 4 Andi Kleen
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add a Documentation file for administrators that describes MDS
at a high level.

So far not covering SMT.

Needs updates later for public URLs of supporting documentation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/mds.rst | 128 ++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 Documentation/admin-guide/mds.rst

diff --git a/Documentation/admin-guide/mds.rst b/Documentation/admin-guide/mds.rst
new file mode 100644
index 000000000000..accae1497ae9
--- /dev/null
+++ b/Documentation/admin-guide/mds.rst
@@ -0,0 +1,128 @@
+MDS - Microarchitectural Data Sampling
+=======================================
+
+Microarchitectural Data Sampling is a side channel vulnerability that
+allows an attacker to sample data that has been used earlier during
+program execution. Internal buffers in the CPU may keep old data
+for some limited time, which can then later be determined by an attacker
+with side channel analysis. MDS can be used to occasionally observe
+some values accessed earlier, but it cannot be used to observe values
+not recently touched by other code running on the same core.
+
+It is difficult to target particular data on a system using MDS,
+but attackers may be able to infer secrets by collecting
+and analyzing large amounts of data. MDS does not modify
+memory.
+
+MDS consists of multiple sub-vulnerabilities:
+Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
+Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
+Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127),
+with the first leaking store data, the second load and sometimes
+store data, and the third load data.
+
+The effects and mitigations are similar for all three, so the Linux
+kernel handles and reports them all as a single vulnerability called
+MDS. This also reduces the number of acronyms in use.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors.
+Not all CPUs are affected by all of the sub-vulnerabilities;
+however, the kernel always handles them the same way.
+
+The vulnerability is not present in
+
+    - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+The kernel will automatically detect future CPUs with hardware
+mitigations for these issues and disable any workarounds.
+
+The kernel reports if the current CPU is vulnerable and any
+mitigations used in
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+Kernel mitigation
+-----------------
+
+By default, the kernel automatically ensures no data leakage between
+different processes, or between kernel threads and interrupt handlers
+and user processes, or from any cryptographic code in the kernel.
+
+It does not isolate kernel code that only touches data of the
+current process.  If protecting such kernel code is desired,
+mds=full can be specified.
+
+The mitigation is automatically enabled, but can be further controlled
+with the command line options documented below.
+
+The mitigation can be done either with microcode support (requiring
+updated microcode), or through software sequences on some CPUs.
+On Skylake based CPUs only mitigation through microcode is supported.
+In general microcode mitigation is preferred.
+
+The microcode should be loaded at early boot using the initrd. Hot
+updating microcode will not enable the mitigations.
+
+Virtual machine mitigation
+--------------------------
+
+The mitigation is enabled by default and controlled by the same options
+as L1TF cache clearing. See l1tf.rst for more details. In the default
+setting the CPU buffers are cleared when entering a guest.
+
+To enable the mitigation in guests it may also be necessary to update
+VM configurations to include the "MB_CLEAR" CPUID bit. This will
+communicate to the guest kernel that the host has the microcode
+with mitigations applied.
+
+Kernel command line options
+---------------------------
+
+Normally the kernel selects reasonable defaults and no special configuration
+is needed. The default behavior can be overridden by the mds= kernel
+command line options.
+
+These options can be specified in the boot loader. Any changes require a reboot.
+
+When the system only runs trusted code, MDS mitigation can be disabled with
+mds=off.
+
+By default the kernel only clears CPU data after execution
+that is known or likely to have touched user data of other processes,
+or cryptographic data. This relies on code audits done in the
+mainline Linux kernel. When running large unaudited out of tree code
+or binary drivers, which might violate these constraints, it is possible
+to use mds=full to always flush the CPU data on each kernel exit.
+
+By default the kernel automatically selects the microcode based ("VERW")
+mitigation, a software based mitigation, or no mitigation, based on the
+CPUID information reported by the CPU. When running virtualized
+inside a guest the CPUID information might be incomplete, or report
+a different system.
+
+In this case, and when the VM configuration cannot be fixed,
+the following options can be used to select the right mitigation:
+
+   - mds=off      Disable the workarounds, e.g. if the CPU is not affected.
+   - mds=swclear  Host CPU doesn't have updated microcode.
+                  Use the software sequence applicable to Nehalem through IvyBridge.
+   - mds=swclearhsw
+                  Host CPU doesn't have updated microcode.
+                  Use the software sequence applicable to Haswell and Broadwell.
+   - mds=verw     Host CPU has updated microcode.
+                  Use the microcode based ("VERW") mitigation.
+
+TBD describe SMT
+
+References
+----------
+
+For more details on the kernel internal implementation of the MDS mitigations,
+please see Documentation/clearcpu.txt
+
+TBD Add URL for Intel white paper
+
+TBD add reference to microcodes
-- 
2.17.2

* [MODERATED] [PATCH v3 16/32] MDSv3 4
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (14 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 15/32] MDSv3 27 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 17/32] MDSv3 13 Andi Kleen
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add basic infrastructure for code to request CPU buffer clearing
on the next kernel exit.

We have two functions: lazy_clear_cpu to request clearing,
and lazy_clear_cpu_interrupt to request clearing only when
running in an interrupt.

Non-architecture-specific code can include linux/clearcpu.h
and use lazy_clear_cpu / lazy_clear_cpu_interrupt. On x86
we provide low level implementations that set the TIF_CLEAR_CPU
bit.
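
As an illustration (hypothetical code, not part of this patch), a
driver helper that may run in interrupt context and copy another
process' data could use the API like this:

#include <linux/string.h>
#include <linux/clearcpu.h>

/*
 * Illustrative only: copy payload that may belong to another
 * process. In interrupt context this schedules a CPU buffer
 * clear on the next kernel exit; in process context it is a
 * no-op, because we assume the data belongs to current.
 */
static void example_copy_payload(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	lazy_clear_cpu_interrupt();
}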

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/Kconfig                    |  3 +++
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/clearcpu.h |  5 +++++
 include/linux/clearcpu.h        | 36 +++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)
 create mode 100644 include/linux/clearcpu.h

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..32b6cd5dfe0f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,9 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_HAS_CLEAR_CPU
+	def_bool n
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d76ef308a47f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_CLEAR_CPU
 	select BUILDTIME_EXTABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index cc03ca14140b..6e6f68a0cab1 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -46,6 +46,11 @@ static inline void clear_cpu_idle(void)
 	}
 }
 
+static inline void lazy_clear_cpu(void)
+{
+	set_thread_flag(TIF_CLEAR_CPU);
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #else
diff --git a/include/linux/clearcpu.h b/include/linux/clearcpu.h
new file mode 100644
index 000000000000..63a6952b46fa
--- /dev/null
+++ b/include/linux/clearcpu.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CLEARCPU_H
+#define _LINUX_CLEARCPU_H 1
+
+#include <linux/preempt.h>
+
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearcpu.h>
+#else
+static inline void lazy_clear_cpu(void)
+{
+}
+#endif
+
+/*
+ * Use this function when potentially touching (reading or writing)
+ * user data in an interrupt. In this case schedule to clear the
+ * CPU buffers on kernel exit to avoid any potential side channels.
+ *
+ * If not in an interrupt we assume the touched data belongs to the
+ * current process and doesn't need to be cleared.
+ *
+ * This version is for code who might be in an interrupt.
+ * If you know for sure you're in interrupt context call
+ * lazy_clear_cpu directly.
+ *
+ * lazy_clear_cpu is reasonably cheap (just sets a bit) and
+ * can be used in fast paths.
+ */
+static inline void lazy_clear_cpu_interrupt(void)
+{
+	if (in_interrupt())
+		lazy_clear_cpu();
+}
+
+#endif
-- 
2.17.2

* [MODERATED] [PATCH v3 17/32] MDSv3 13
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (15 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 16/32] MDSv3 4 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 18/32] MDSv3 32 Andi Kleen
                   ` (18 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Schedule cpu clear on context
 switch

On context switch we need to schedule a cpu clear on the next
kernel exit when:

- We're switching between different processes
- We're switching from a kernel thread that is not idle.
For idle we assume only interrupts are sensitive, which
are already handled elsewhere. For kernel threads
like work queues we assume they might touch
sensitive (other users' or crypto) data.

The code hooks into the generic context switch, not
the mm context switch, because the mm context switch
doesn't handle the idle thread case.

This also transfers the clear cpu bit to the next task.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/process.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 898e97cf6629..e61a4d5ce917 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -2,6 +2,7 @@
 //
 // Code shared between 32 and 64 bit
 
+#include <linux/clearcpu.h>
 #include <asm/spec-ctrl.h>
 
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
@@ -29,6 +30,32 @@ static inline void switch_to_extra(struct task_struct *prev,
 		}
 	}
 
+	/*
+	 * When we switch to a different process, or we switch
+	 * from a kernel thread that was not idle, clear the CPU
+	 * buffers on next kernel exit.
+	 *
+	 * We assume that idle does not touch user data, except
+	 * for interrupts, which schedule their own clears as needed.
+	 * But other kernel threads, like work queues, might
+	 * touch user data, so flush in this case.
+	 *
+	 * This has to be here because switch_mm doesn't get
+	 * called in the kernel thread case.
+	 */
+	if (static_cpu_has(X86_BUG_MDS)) {
+		if (prev->pid && (next->mm != prev->mm || prev->mm == NULL))
+			lazy_clear_cpu();
+		/*
+		 * Also transfer the clearcpu flag from the previous task.
+		 * Can be done non atomically because interrupts are off.
+		 */
+		task_thread_info(next)->status |=
+			task_thread_info(prev)->status & _TIF_CLEAR_CPU;
+		task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;
+	}
+
+
 	/*
 	 * __switch_to_xtra() handles debug registers, i/o bitmaps,
 	 * speculation mitigations etc.
-- 
2.17.2

* [MODERATED] [PATCH v3 18/32] MDSv3 32
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (16 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 17/32] MDSv3 13 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 19/32] MDSv3 16 Andi Kleen
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add tracing for clear_cpu

Add trace points for clear_cpu and lazy_clear_cpu. This is useful
for debugging and performance testing.

The trace points have to be partially out of line to avoid
include loops, but the fast path jump labels are inlined.

The idle case cannot be traced because trace points
cannot be used in idle context.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h       | 38 ++++++++++++++++++++++++---
 arch/x86/include/asm/trace/clearcpu.h | 27 +++++++++++++++++++
 arch/x86/kernel/cpu/bugs.c            | 17 ++++++++++++
 3 files changed, 79 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 6e6f68a0cab1..d9709a86ef1a 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -6,8 +6,31 @@
 
 #include <linux/jump_label.h>
 #include <linux/sched/smt.h>
-#include <asm/alternative.h>
 #include <linux/thread_info.h>
+#include <asm/alternative.h>
+
+/*
+ * We cannot directly include the trace point header here
+ * because it leads to include loops with other trace point
+ * files pulling this one in. Define the static
+ * key manually here, which handles noping the fast path,
+ * and the actual tracing is done out of line.
+ */
+#ifdef CONFIG_TRACEPOINTS
+#include <asm/atomic.h>
+#include <linux/tracepoint-defs.h>
+
+extern struct tracepoint __tracepoint_clear_cpu;
+extern struct tracepoint __tracepoint_lazy_clear_cpu;
+#define cc_tracepoint_active(t) static_key_false(&(t).key)
+
+extern void do_trace_clear_cpu(void);
+extern void do_trace_lazy_clear_cpu(void);
+#else
+#define cc_tracepoint_active(t) false
+static inline void do_trace_clear_cpu(void) {}
+static inline void do_trace_lazy_clear_cpu(void) {}
+#endif
 
 /*
  * Clear CPU buffers to avoid side channels.
@@ -15,7 +38,7 @@
  * "VERW" instruction), or special out of line clear sequences.
  */
 
-static inline void clear_cpu(void)
+static inline void __clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use an register */
@@ -27,6 +50,13 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+static inline void clear_cpu(void)
+{
+	if (cc_tracepoint_active(__tracepoint_clear_cpu))
+		do_trace_clear_cpu();
+	__clear_cpu();
+}
+
 /*
  * Clear CPU buffers before going idle, so that no state is leaked to SMT
  * siblings taking over thread resources.
@@ -42,12 +72,14 @@ static inline void clear_cpu_idle(void)
 {
 	if (sched_smt_active()) {
 		clear_thread_flag(TIF_CLEAR_CPU);
-		clear_cpu();
+		__clear_cpu();
 	}
 }
 
 static inline void lazy_clear_cpu(void)
 {
+	if (cc_tracepoint_active(__tracepoint_lazy_clear_cpu))
+		do_trace_lazy_clear_cpu();
 	set_thread_flag(TIF_CLEAR_CPU);
 }
 
diff --git a/arch/x86/include/asm/trace/clearcpu.h b/arch/x86/include/asm/trace/clearcpu.h
new file mode 100644
index 000000000000..e742b5cd8ee9
--- /dev/null
+++ b/arch/x86/include/asm/trace/clearcpu.h
@@ -0,0 +1,27 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM clearcpu
+
+#if !defined(_TRACE_CLEARCPU_H) || defined(TRACE_HEADER_MULTI_READ)
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(clear_cpu,
+		    TP_PROTO(int dummy),
+		    TP_ARGS(dummy),
+		    TP_STRUCT__entry(__field(int, dummy)),
+		    TP_fast_assign(),
+		    TP_printk("%d", __entry->dummy));
+
+DEFINE_EVENT(clear_cpu, clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+DEFINE_EVENT(clear_cpu, lazy_clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+
+#define _TRACE_CLEARCPU_H
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE clearcpu
+#endif /* _TRACE_CLEARCPU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index b24d93fb0564..ba4f2bb203a5 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1045,6 +1045,23 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/clearcpu.h>
+
+void do_trace_clear_cpu(void)
+{
+	trace_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(clear_cpu);
+
+void do_trace_lazy_clear_cpu(void)
+{
+	trace_lazy_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_lazy_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(lazy_clear_cpu);
+
 static const __initconst struct x86_cpu_id cpu_mds_clear_cpu[] = {
 	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM	 },
 	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_G	 },
-- 
2.17.2

* [MODERATED] [PATCH v3 19/32] MDSv3 16
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (17 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 18/32] MDSv3 32 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 20/32] MDSv3 24 Andi Kleen
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

When the kernel is preempted we need to force a cpu clear,
because the preemption might happen before the code
has a chance to set TIF_CLEAR_CPU later.

We cannot rely on kernel code setting the flag before
touching sensitive data: the flag setting could
be implicit, like in memzero_explicit, which is always
called later.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/sched/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6fedf3a98581..2a5e40be3cb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11,6 +11,8 @@
 
 #include <linux/kcov.h>
 
+#include <linux/clearcpu.h>
+
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
 
@@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
+	/*
+	 * For kernel preemption we need to force a cpu clear
+	 * because it could happen before the code has a chance
+	 * to set TIF_CLEAR_CPU.
+	 */
+	lazy_clear_cpu();
+
 	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
-- 
2.17.2

* [MODERATED] [PATCH v3 20/32] MDSv3 24
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (18 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 19/32] MDSv3 16 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 21/32] MDSv3 25 Andi Kleen
                   ` (15 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Schedule cpu clear for memzero_explicit and
 kzfree

Assume that any code using these functions is sensitive and shouldn't
leak any data.

This handles clearing for key data used in the kernel.
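
As an illustration (hypothetical code, names made up), key handling
along these lines now implicitly schedules the clear:

#include <linux/types.h>
#include <linux/string.h>
#include <linux/slab.h>

static void example_destroy_key(u8 *key, size_t len)
{
	/* Zeroing key material now also schedules lazy_clear_cpu() */
	memzero_explicit(key, len);
	kfree(key);
	/* equivalently, a single kzfree(key) instead, which now
	 * uses memzero_explicit() internally. */
}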

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 lib/string.c     | 6 ++++++
 mm/slab_common.c | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/string.c b/lib/string.c
index 38e4ca08e757..9ce59dd86541 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -28,6 +28,7 @@
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <linux/clearcpu.h>
 
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
@@ -715,12 +716,17 @@ EXPORT_SYMBOL(memset);
  * necessary, memzero_explicit() should be used instead in
  * order to prevent the compiler from optimising away zeroing.
  *
+ * As a side effect this may also trigger extra cleaning
+ * of CPU state before the next kernel exit to avoid
+ * side channels.
+ *
  * memzero_explicit() doesn't need an arch-specific version as
  * it just invokes the one of memset() implicitly.
  */
 void memzero_explicit(void *s, size_t count)
 {
 	memset(s, 0, count);
+	lazy_clear_cpu();
 	barrier_data(s);
 }
 EXPORT_SYMBOL(memzero_explicit);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7eb8dc136c1c..141024fd43f8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1551,6 +1551,9 @@ EXPORT_SYMBOL(krealloc);
  * Note: this function zeroes the whole allocated buffer which can be a good
  * deal bigger than the requested buffer size passed to kmalloc(). So be
  * careful when using this function in performance sensitive code.
+ *
+ * As a side effect this may also clear CPU state later before the
+ * next kernel exit to avoid side channels.
  */
 void kzfree(const void *p)
 {
@@ -1560,7 +1563,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.17.2

* [MODERATED] [PATCH v3 21/32] MDSv3 25
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (19 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 20/32] MDSv3 24 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 22/32] MDSv3 23 Andi Kleen
                   ` (14 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Interrupts might touch user data from other processes
in any context.

By default we clear the CPU on the next kernel exit.

Add a new IRQF_NO_USER interrupt flag. When the flag
is not set, we schedule a cpu state clear on the next
kernel exit after the interrupt handler has run.

This allows interrupts to opt out of the extra clearing
overhead, but is safe by default.

Over time as more interrupt code is audited it can set the opt-out.
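
For illustration (device, register and names are made up), an audited
handler that only acknowledges its device and never touches user data
could opt out like this:

#include <linux/interrupt.h>
#include <linux/io.h>

#define EXAMPLE_IRQ_ACK	0x10	/* hypothetical register offset */

struct example_dev {
	void __iomem *mmio;
	int irq;
};

static irqreturn_t example_irq(int irq, void *dev_id)
{
	struct example_dev *ed = dev_id;

	/* Audited: touches only device registers and pointers */
	writel(1, ed->mmio + EXAMPLE_IRQ_ACK);
	return IRQ_HANDLED;
}

static int example_probe(struct example_dev *ed)
{
	return request_irq(ed->irq, example_irq, IRQF_NO_USER,
			   "example", ed);
}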

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 2 ++
 kernel/irq/handle.c       | 8 ++++++++
 kernel/irq/manage.c       | 1 +
 3 files changed, 11 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1d6711c28271..65c957e3db68 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,7 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_USER	- Interrupt does not touch user data
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +75,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_USER		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e5910938ce2b 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/clearcpu.h>
 
 #include <trace/events/irq.h>
 
@@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 		res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
 
+		/*
+		 * We aren't sure if the interrupt handler did or did not
+		 * touch user data. Schedule a cpu clear just in case.
+		 */
+		if (!(action->flags & IRQF_NO_USER))
+			lazy_clear_cpu();
+
 		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pF enabled interrupts\n",
 			      irq, action->handler))
 			local_irq_disable();
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 9dbdccab3b6a..80a9383ea993 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1793,6 +1793,7 @@ EXPORT_SYMBOL(free_irq);
  *
  *	IRQF_SHARED		Interrupt is shared
  *	IRQF_TRIGGER_*		Specify active edge(s) or level
+ *	IRQF_NO_USER		Does not touch user data.
  *
  */
 int request_threaded_irq(unsigned int irq, irq_handler_t handler,
-- 
2.17.2

* [MODERATED] [PATCH v3 22/32] MDSv3 23
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (20 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 21/32] MDSv3 25 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 23/32] MDSv3 31 Andi Kleen
                   ` (13 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Clear cpu on all timers, unless the timer
 opts-out

By default we assume timers might touch user data and schedule
a cpu clear on next kernel exit.

Support opt-outs where timer and hrtimer handlers can declare
that they don't touch any user data.

Note this takes one bit away from the timer wheel index field,
but it seems there are fewer wheels in use anyway, so that
should be ok.
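
For illustration (all names made up), an audited timer and hrtimer
would opt out like this:

#include <linux/init.h>
#include <linux/timer.h>
#include <linux/hrtimer.h>

static struct timer_list example_timer;
static struct hrtimer example_hrtimer;

static void example_timer_fn(struct timer_list *t)
{
	/* audited: touches no user data */
}

static enum hrtimer_restart example_hrtimer_fn(struct hrtimer *h)
{
	/* audited: touches no user data */
	return HRTIMER_NORESTART;
}

static int __init example_init(void)
{
	timer_setup(&example_timer, example_timer_fn, TIMER_NO_USER);

	hrtimer_init(&example_hrtimer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_REL | HRTIMER_MODE_NO_USER);
	example_hrtimer.function = example_hrtimer_fn;
	return 0;
}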

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/hrtimer.h | 4 ++++
 include/linux/timer.h   | 9 ++++++---
 kernel/time/hrtimer.c   | 5 +++++
 kernel/time/timer.c     | 8 ++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 3892e9c8b2de..463579d05415 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -35,6 +35,7 @@ struct hrtimer_cpu_base;
  *				  when starting the timer)
  * HRTIMER_MODE_SOFT		- Timer callback function will be executed in
  *				  soft irq context
+ * HRTIMER_MODE_NO_USER		- Handler does not touch user data.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -51,6 +52,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
 	HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,
 
+	HRTIMER_MODE_NO_USER	= 0x08,
 };
 
 /*
@@ -104,6 +106,7 @@ enum hrtimer_restart {
  * @state:	state information (See bit values above)
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
+ * @no_user:	function does not touch user data.
  *
  * The hrtimer structure must be initialized by hrtimer_init()
  */
@@ -115,6 +118,7 @@ struct hrtimer {
 	u8				state;
 	u8				is_rel;
 	u8				is_soft;
+	u8				no_user;
 };
 
 /**
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 7b066fd38248..222e72432be3 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -56,10 +56,13 @@ struct timer_list {
 #define TIMER_DEFERRABLE	0x00080000
 #define TIMER_PINNED		0x00100000
 #define TIMER_IRQSAFE		0x00200000
-#define TIMER_ARRAYSHIFT	22
-#define TIMER_ARRAYMASK		0xFFC00000
+#define TIMER_NO_USER		0x00400000
+#define TIMER_ARRAYSHIFT	23
+#define TIMER_ARRAYMASK		0xFF800000
 
-#define TIMER_TRACE_FLAGMASK	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
+#define TIMER_TRACE_FLAGMASK	\
+	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | \
+	 TIMER_NO_USER)
 
 #define __TIMER_INITIALIZER(_function, _flags) {		\
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 9cdd74bd2d27..7e8e89a47d12 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -51,6 +51,7 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 
@@ -1285,6 +1286,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		clock_id = CLOCK_MONOTONIC;
 
 	base += hrtimer_clockid_to_base(clock_id);
+	timer->no_user = !!(mode & HRTIMER_MODE_NO_USER);
 	timer->is_soft = softtimer;
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
@@ -1399,6 +1401,9 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	trace_hrtimer_expire_exit(timer);
 	raw_spin_lock_irq(&cpu_base->lock);
 
+	if (!timer->no_user)
+		lazy_clear_cpu();
+
 	/*
 	 * Note: We clear the running state after enqueue_hrtimer and
 	 * we do not reprogram the event hardware. Happens either in
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index fa49cd753dea..d05ba85bdc4b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -44,6 +44,7 @@
 #include <linux/sched/debug.h>
 #include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -1339,6 +1340,13 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(struct timer_list
 		 */
 		preempt_count_set(count);
 	}
+
+	/*
+	 * The timer might have touched user data. Schedule
+	 * a cpu clear on the next kernel exit.
+	 */
+	if (!(timer->flags & TIMER_NO_USER))
+		lazy_clear_cpu();
 }
 
 static void expire_timers(struct timer_base *base, struct hlist_head *head)
-- 
2.17.2

* [MODERATED] [PATCH v3 23/32] MDSv3 31
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (21 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 22/32] MDSv3 23 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 24/32] MDSv3 30 Andi Kleen
                   ` (12 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Clear CPU on tasklets, unless opted-out

By default we assume tasklets might touch user data and schedule
a cpu clear on next kernel exit.

Add new interfaces to allow audited tasklets to opt-out.
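
For illustration (names made up), an audited tasklet can opt out
either statically or at init time:

#include <linux/interrupt.h>

static void example_tasklet_fn(unsigned long data)
{
	/* audited: touches no user data */
}

/* static declaration: */
static DECLARE_TASKLET_NOUSER(example_tasklet, example_tasklet_fn, 0);

/* or dynamically: */
static struct tasklet_struct example_tasklet2;

static void example_setup(void)
{
	tasklet_init_flags(&example_tasklet2, example_tasklet_fn, 0,
			   TASKLET_NO_USER);
}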

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 16 +++++++++++++++-
 kernel/softirq.c          | 25 +++++++++++++++++++------
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 65c957e3db68..65158a13c8cb 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -556,11 +556,22 @@ struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 #define DECLARE_TASKLET_DISABLED(name, func, data) \
 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
 
+#define DECLARE_TASKLET_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(0), func, data }
+
+#define DECLARE_TASKLET_DISABLED_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(1), func, data }
 
 enum
 {
 	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
-	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
+	TASKLET_STATE_RUN,	/* Tasklet is running (SMP only) */
+
+	/*
+	 * Set this flag when the tasklet is known to not touch user data,
+	 * so doesn't need extra CPU state clearing.
+	 */
+	TASKLET_NO_USER		= 1 << 5,
 };
 
 #ifdef CONFIG_SMP
@@ -624,6 +635,9 @@ extern void tasklet_kill(struct tasklet_struct *t);
 extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
+extern void tasklet_init_flags(struct tasklet_struct *t,
+			 void (*func)(unsigned long), unsigned long data,
+			 unsigned flags);
 
 struct tasklet_hrtimer {
 	struct hrtimer		timer;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d28813306b2c..fdd4e3be3db7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/clearcpu.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -522,6 +523,8 @@ static void tasklet_action_common(struct softirq_action *a,
 					BUG();
 				t->func(t->data);
 				tasklet_unlock(t);
+				if (!(t->state & TASKLET_NO_USER))
+					lazy_clear_cpu();
 				continue;
 			}
 			tasklet_unlock(t);
@@ -546,15 +549,23 @@ static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
 }
 
-void tasklet_init(struct tasklet_struct *t,
-		  void (*func)(unsigned long), unsigned long data)
+void tasklet_init_flags(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data,
+		  unsigned flags)
 {
 	t->next = NULL;
-	t->state = 0;
+	t->state = flags;
 	atomic_set(&t->count, 0);
 	t->func = func;
 	t->data = data;
 }
+EXPORT_SYMBOL(tasklet_init_flags);
+
+void tasklet_init(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data)
+{
+	tasklet_init_flags(t, func, data, 0);
+}
 EXPORT_SYMBOL(tasklet_init);
 
 void tasklet_kill(struct tasklet_struct *t)
@@ -609,7 +620,8 @@ static void __tasklet_hrtimer_trampoline(unsigned long data)
  * @ttimer:	 tasklet_hrtimer which is initialized
  * @function:	 hrtimer callback function which gets called from softirq context
  * @which_clock: clock id (CLOCK_MONOTONIC/CLOCK_REALTIME)
- * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL)
+ * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL),
+ *		 HRTIMER_MODE_NO_USER
  */
 void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 			  enum hrtimer_restart (*function)(struct hrtimer *),
@@ -617,8 +629,9 @@ void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 {
 	hrtimer_init(&ttimer->timer, which_clock, mode);
 	ttimer->timer.function = __hrtimer_tasklet_trampoline;
-	tasklet_init(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
-		     (unsigned long)ttimer);
+	tasklet_init_flags(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
+		     (unsigned long)ttimer,
+		     (mode & HRTIMER_MODE_NO_USER) ? TASKLET_NO_USER : 0);
 	ttimer->function = function;
 }
 EXPORT_SYMBOL_GPL(tasklet_hrtimer_init);
-- 
2.17.2

* [MODERATED] [PATCH v3 24/32] MDSv3 30
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (22 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 23/32] MDSv3 31 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 25/32] MDSv3 9 Andi Kleen
                   ` (11 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

By default we assume that irq poll handlers running in the irq poll
softirq might touch user data and we schedule a cpu clear on next
kernel exit.

Add interfaces for audited handlers to declare that they are safe.
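
For illustration (names made up), an audited poll handler would be
registered like this:

#include <linux/irq_poll.h>

static int example_iopoll(struct irq_poll *iop, int budget)
{
	/* audited: completes descriptors, touches no user data */
	return 0;
}

static struct irq_poll example_iop;

static void example_setup(void)
{
	irq_poll_init_flags(&example_iop, 32, example_iopoll,
			    IRQ_POLL_F_NO_USER);
}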

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/irq_poll.h |  2 ++
 lib/irq_poll.c           | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..5f13582f1b8e 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -15,6 +15,8 @@ struct irq_poll {
 enum {
 	IRQ_POLL_F_SCHED	= 0,
 	IRQ_POLL_F_DISABLE	= 1,
+
+	IRQ_POLL_F_NO_USER	= 1<<4,
 };
 
 extern void irq_poll_sched(struct irq_poll *);
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..cb19431f53ec 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -11,6 +11,7 @@
 #include <linux/cpu.h>
 #include <linux/irq_poll.h>
 #include <linux/delay.h>
+#include <linux/clearcpu.h>
 
 static unsigned int irq_poll_budget __read_mostly = 256;
 
@@ -111,6 +112,9 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
+		if (!(iop->state & IRQ_POLL_F_NO_USER))
+			lazy_clear_cpu();
+
 		local_irq_disable();
 
 		/*
@@ -168,21 +172,31 @@ void irq_poll_enable(struct irq_poll *iop)
 EXPORT_SYMBOL(irq_poll_enable);
 
 /**
- * irq_poll_init - Initialize this @iop
+ * irq_poll_init_flags - Initialize this @iop
  * @iop:      The parent iopoll structure
  * @weight:   The default weight (or command completion budget)
  * @poll_fn:  The handler to invoke
+ * @flags:    IRQ_POLL_F_NO_USER if callback does not touch user data.
  *
  * Description:
  *     Initialize and enable this irq_poll structure.
  **/
-void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+void irq_poll_init_flags(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn,
+			 int flags)
 {
 	memset(iop, 0, sizeof(*iop));
 	INIT_LIST_HEAD(&iop->list);
 	iop->weight = weight;
 	iop->poll = poll_fn;
+	iop->state = flags;
 }
+EXPORT_SYMBOL(irq_poll_init_flags);
+
+void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+{
+	irq_poll_init_flags(iop, weight, poll_fn, 0);
+}
+
 EXPORT_SYMBOL(irq_poll_init);
 
 static int irq_poll_cpu_dead(unsigned int cpu)
-- 
2.17.2

* [MODERATED] [PATCH v3 25/32] MDSv3 9
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (23 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 24/32] MDSv3 30 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 26/32] MDSv3 14 Andi Kleen
                   ` (10 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Schedule a clear cpu on next kernel exit for string PIO
or memcpy_from/to_io calls, when they are called in
interrupts.

The PIO case is likely already covered because old drivers
won't have opted their interrupt handlers out of clearing,
but let's do it just to be sure.
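
For illustration (names made up), a driver that copies receive data
out of device memory in its interrupt handler is now covered by the
helper itself, even if the handler was marked IRQF_NO_USER after an
incomplete audit:

#include <linux/io.h>

#define EXAMPLE_RX_FIFO	0x100	/* hypothetical register offset */
#define EXAMPLE_RX_LEN	256

static void example_rx(void *rx_buf, void __iomem *mmio)
{
	/* memcpy_fromio() now calls lazy_clear_cpu_interrupt() itself */
	memcpy_fromio(rx_buf, mmio + EXAMPLE_RX_FIFO, EXAMPLE_RX_LEN);
}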

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/io.h | 3 +++
 include/asm-generic/io.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 832da8229cc7..2b9fb7890f0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/clearcpu.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -313,6 +314,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 			     : "+S"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }									\
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
@@ -329,6 +331,7 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 			     : "+D"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }
 
 BUILDIO(b, b, char)
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index d356f802945a..cf58bceea042 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -14,6 +14,7 @@
 #include <asm/page.h> /* I/O is all done through memory accesses */
 #include <linux/string.h> /* for memset() and memcpy() */
 #include <linux/types.h>
+#include <linux/clearcpu.h>
 
 #ifdef CONFIG_GENERIC_IOMAP
 #include <asm-generic/iomap.h>
@@ -1115,6 +1116,7 @@ static inline void memcpy_fromio(void *buffer,
 				 size_t size)
 {
 	memcpy(buffer, __io_virt(addr), size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
@@ -1132,6 +1134,7 @@ static inline void memcpy_toio(volatile void __iomem *addr, const void *buffer,
 			       size_t size)
 {
 	memcpy(__io_virt(addr), buffer, size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
-- 
2.17.2

* [MODERATED] [PATCH v3 26/32] MDSv3 14
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (24 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 25/32] MDSv3 9 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 27/32] MDSv3 18 Andi Kleen
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Schedule a cpu clear on next kernel exit for swiotlb running
in interrupt context, since it touches user data with the CPU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/dma/swiotlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..a72b9dbb39ae 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -35,6 +35,7 @@
 #include <linux/scatterlist.h>
 #include <linux/mem_encrypt.h>
 #include <linux/set_memory.h>
+#include <linux/clearcpu.h>
 
 #include <asm/io.h>
 #include <asm/dma.h>
@@ -426,6 +427,7 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 	} else {
 		memcpy(phys_to_virt(orig_addr), vaddr, size);
 	}
+	lazy_clear_cpu_interrupt();
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
-- 
2.17.2

* [MODERATED] [PATCH v3 27/32] MDSv3 18
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (25 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 26/32] MDSv3 14 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 28/32] MDSv3 20 Andi Kleen
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Instrument skb functions to clear cpu
 automatically

Instrument some strategic skbuff functions that either touch
packet data directly, or are likely followed by a user
data touch like a memcpy, to schedule a cpu clear on next
kernel exit. This is only done inside interrupts; outside
of them we assume only the current process's data is touched.

In principle network data should be encrypted anyway,
but it's better not to leak it.

This provides protection for the network softirq.

Needs more auditing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0d1b2c3f127b..af90474c122f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <linux/clearcpu.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -2077,6 +2078,7 @@ static inline void *__skb_put(struct sk_buff *skb, unsigned int len)
 	SKB_LINEAR_ASSERT(skb);
 	skb->tail += len;
 	skb->len  += len;
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a8217e221e19..3e5060b7712b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1184,6 +1184,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (!num_frags)
 		goto release;
 
+	/* Likely to copy user data */
+	lazy_clear_cpu_interrupt();
+
 	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
@@ -1348,6 +1351,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 	if (!n)
 		return NULL;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	/* Set the data pointer */
 	skb_reserve(n, headerlen);
 	/* Set the tail pointer and length */
@@ -1583,6 +1589,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	if (!n)
 		return NULL;
 
+	/* May copy user data */
+	lazy_clear_cpu_interrupt();
+
 	skb_reserve(n, newheadroom);
 
 	/* Set the tail pointer and length */
@@ -1671,6 +1680,8 @@ EXPORT_SYMBOL(__skb_pad);
 
 void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len)
 {
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	if (tail != skb) {
 		skb->data_len += len;
 		skb->len += len;
@@ -1696,6 +1707,8 @@ void *skb_put(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->tail > skb->end))
 		skb_over_panic(skb, len, __builtin_return_address(0));
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 EXPORT_SYMBOL(skb_put);
@@ -1715,6 +1728,7 @@ void *skb_push(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->data < skb->head))
 		skb_under_panic(skb, len, __builtin_return_address(0));
+	/* No clear cpu, assume this is only header data */
 	return skb->data;
 }
 EXPORT_SYMBOL(skb_push);
@@ -2023,6 +2037,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2397,6 +2414,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2477,6 +2497,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Checksum header. */
 	if (copy > 0) {
 		if (copy > len)
@@ -2569,6 +2592,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Copy header. */
 	if (copy > 0) {
 		if (copy > len)
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] [PATCH v3 28/32] MDSv3 20
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (26 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 27/32] MDSv3 18 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 29/32] MDSv3 26 Andi Kleen
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Opt out tcp tasklet to not touch user data

Mark the tcp tasklet as not needing an implicit CPU clear.
If one is needed it will be triggered by the skb_* hooks.
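
A rough sketch of the assumed TASKLET_NO_USER handling from the
earlier tasklet patch in this series (illustrative; the flag storage
shown here is not the exact implementation):

	/* Sketch: the softirq tasklet loop clears unless opted out. */
	t->func(t->data);
	if (!(t->flags & TASKLET_NO_USER))
		lazy_clear_cpu_interrupt();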

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3f510cad0b3e..40c2c6134b4b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -903,9 +903,10 @@ void __init tcp_tasklet_init(void)
 		struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
 
 		INIT_LIST_HEAD(&tsq->head);
-		tasklet_init(&tsq->tasklet,
+		tasklet_init_flags(&tsq->tasklet,
 			     tcp_tasklet_func,
-			     (unsigned long)tsq);
+			     (unsigned long)tsq,
+			     TASKLET_NO_USER);
 	}
 }
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] [PATCH v3 29/32] MDSv3 26
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (27 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 28/32] MDSv3 20 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 30/32] MDSv3 17 Andi Kleen
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Some preliminary auditing of kernel/* shows no timers touching
other processes' user data. Mark all the timers in kernel/*
as not needing an implicit CPU clear.

More auditing here would be useful.
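
A sketch of the assumed HRTIMER_MODE_NO_USER handling from the earlier
timer patch in this series (illustrative; field and function names are
not the exact implementation):

	/* Sketch: in the hrtimer core, clear for every expired timer
	 * callback unless the timer was marked as not touching user data. */
	if (!(timer->mode & HRTIMER_MODE_NO_USER))
		lazy_clear_cpu_interrupt();
	restart = fn(timer);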

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/events/core.c       | 6 ++++--
 kernel/fork.c              | 3 ++-
 kernel/futex.c             | 6 +++---
 kernel/sched/core.c        | 5 +++--
 kernel/sched/deadline.c    | 6 ++++--
 kernel/sched/fair.c        | 6 ++++--
 kernel/sched/idle.c        | 3 ++-
 kernel/sched/rt.c          | 3 ++-
 kernel/time/alarmtimer.c   | 2 +-
 kernel/time/hrtimer.c      | 6 +++---
 kernel/time/posix-timers.c | 6 ++++--
 kernel/time/sched_clock.c  | 3 ++-
 kernel/time/tick-sched.c   | 6 ++++--
 kernel/watchdog.c          | 3 ++-
 14 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84530ab358c3..1a96e35ce95a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1102,7 +1102,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
 	raw_spin_lock_init(&cpuctx->hrtimer_lock);
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -9202,7 +9203,8 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hwc->hrtimer.function = perf_swevent_hrtimer;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff89c7b..b54d3efbd9b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1540,7 +1540,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 #ifdef CONFIG_POSIX_TIMERS
 	INIT_LIST_HEAD(&sig->posix_timers);
-	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sig->real_timer.function = it_real_fn;
 #endif
 
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..bd71f7887a4d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2626,7 +2626,7 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
@@ -2727,7 +2727,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires(&to->timer, *time);
 	}
@@ -3127,7 +3127,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a5e40be3cb4..0e9d8d450dae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -302,7 +302,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 */
 	delay = max_t(u64, delay, 10000LL);
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
-		      HRTIMER_MODE_REL_PINNED);
+		      HRTIMER_MODE_REL_PINNED|HRTIMER_MODE_NO_USER);
 }
 #endif /* CONFIG_SMP */
 
@@ -316,7 +316,8 @@ static void hrtick_rq_init(struct rq *rq)
 	rq->hrtick_csd.info = rq;
 #endif
 
-	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rq->hrtick_timer.function = hrtick;
 }
 #else	/* CONFIG_SCHED_HRTICK */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..471413fa8bb0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1054,7 +1054,8 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = dl_task_timer;
 }
 
@@ -1293,7 +1294,8 @@ void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = inactive_task_timer;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98e7f1e64a0f..89f1bf663c42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4880,9 +4880,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
-	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->distribute_running = 0;
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f5516bae0c1b..6a4cc46d8c4b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -330,7 +330,8 @@ void play_idle(unsigned long duration_ms)
 	cpuidle_use_deepest_state(true);
 
 	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC,
+			      HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	it.timer.function = idle_inject_timer_fn;
 	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a21ea6021929..ef81a93cc87b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -46,7 +46,8 @@ void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
 	raw_spin_lock_init(&rt_b->rt_runtime_lock);
 
 	hrtimer_init(&rt_b->rt_period_timer,
-			CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+			CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rt_b->rt_period_timer.function = sched_rt_period_timer;
 }
 
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index fa5de5e8de61..736d3bdbcf25 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -347,7 +347,7 @@ void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		enum alarmtimer_restart (*function)(struct alarm *, ktime_t))
 {
 	hrtimer_init(&alarm->timer, alarm_bases[type].base_clockid,
-		     HRTIMER_MODE_ABS);
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	__alarm_init(alarm, type, function);
 }
 EXPORT_SYMBOL_GPL(alarm_init);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 7e8e89a47d12..1fe30427f81a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1722,7 +1722,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	int ret;
 
 	hrtimer_init_on_stack(&t.timer, restart->nanosleep.clockid,
-				HRTIMER_MODE_ABS);
+				HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
 
 	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
@@ -1742,7 +1742,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
 	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
-	hrtimer_init_on_stack(&t.timer, clockid, mode);
+	hrtimer_init_on_stack(&t.timer, clockid, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, timespec64_to_ktime(*rqtp), slack);
 	ret = do_nanosleep(&t, mode);
 	if (ret != -ERESTART_RESTARTBLOCK)
@@ -1941,7 +1941,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
 		return -EINTR;
 	}
 
-	hrtimer_init_on_stack(&t.timer, clock_id, mode);
+	hrtimer_init_on_stack(&t.timer, clock_id, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, *expires, delta);
 
 	hrtimer_init_sleeper(&t, current);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bd62b5eeb5a0..1435ad7f8360 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -488,7 +488,8 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 
 static int common_timer_create(struct k_itimer *new_timer)
 {
-	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock, 0);
+	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock,
+		HRTIMER_MODE_NO_USER);
 	return 0;
 }
 
@@ -813,7 +814,8 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
 	if (timr->it_clock == CLOCK_REALTIME)
 		timr->kclock = absolute ? &clock_realtime : &clock_monotonic;
 
-	hrtimer_init(&timr->it.real.timer, timr->it_clock, mode);
+	hrtimer_init(&timr->it.real.timer, timr->it_clock,
+		     mode|HRTIMER_MODE_NO_USER);
 	timr->it.real.timer.function = posix_timer_fn;
 
 	if (!absolute)
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index cbc72c2c1fca..cda4185c4324 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -252,7 +252,8 @@ void __init generic_sched_clock_init(void)
 	 * Start the timer to keep sched_clock() properly updated and
 	 * sets the initial epoch.
 	 */
-	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sched_clock_timer.function = sched_clock_poll;
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b88474..19f06e71fce3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1208,7 +1208,8 @@ static void tick_nohz_switch_to_nohz(void)
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	/* Get the next period */
 	next = tick_init_jiffy_update();
 
@@ -1305,7 +1306,8 @@ void tick_setup_sched_timer(void)
 	/*
 	 * Emulate tick processing via per-CPU hrtimers:
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	ts->sched_timer.function = tick_sched_timer;
 
 	/* Get the next period (per-CPU) */
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 977918d5d350..d3c9da0a4fce 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -483,7 +483,8 @@ static void watchdog_enable(unsigned int cpu)
 	 * Start the timer first to prevent the NMI watchdog triggering
 	 * before the timer has a chance to fire.
 	 */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hrtimer->function = watchdog_timer_fn;
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] [PATCH v3 30/32] MDSv3 17
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (28 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 29/32] MDSv3 26 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 31/32] MDSv3 1 Andi Kleen
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

AHCI interrupt handlers never touch user data with the CPU, so
mark them as not needing a CPU clear.

This is mainly to get the number of clears down on my test system.
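
IRQF_NO_USER comes from the generic interrupt patch earlier in this
series; a sketch of the assumed behavior in the core handler
(illustrative, not the exact code):

	/* Sketch: every hardirq schedules a clear unless the handler
	 * was registered with IRQF_NO_USER. */
	res = action->handler(irq, action->dev_id);
	if (!(action->flags & IRQF_NO_USER))
		lazy_clear_cpu_interrupt();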

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/ata/ahci.c    |  2 +-
 drivers/ata/ahci.h    |  2 ++
 drivers/ata/libahci.c | 40 ++++++++++++++++++++++++----------------
 3 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 021ce46e2e57..1455ad89d2f9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1865,7 +1865,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	pci_set_master(pdev);
 
-	rc = ahci_host_activate(host, &ahci_sht);
+	rc = ahci_host_activate_irqflags(host, &ahci_sht, IRQF_NO_USER);
 	if (rc)
 		return rc;
 
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index ef356e70e6de..42a3474f26b6 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -430,6 +430,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 int ahci_reset_em(struct ata_host *host);
 void ahci_print_info(struct ata_host *host, const char *scc_s);
 int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht);
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags);
 void ahci_error_handler(struct ata_port *ap);
 u32 ahci_handle_port_intr(struct ata_host *host, u32 irq_masked);
 
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index b5f57c69c487..b32664c7d8a1 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -2548,7 +2548,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 EXPORT_SYMBOL_GPL(ahci_set_em_messages);
 
 static int ahci_host_activate_multi_irqs(struct ata_host *host,
-					 struct scsi_host_template *sht)
+					 struct scsi_host_template *sht,
+					 int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int i, rc;
@@ -2571,7 +2572,7 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 		}
 
 		rc = devm_request_irq(host->dev, irq, ahci_multi_irqs_intr_hard,
-				0, pp->irq_desc, host->ports[i]);
+				irqflags, pp->irq_desc, host->ports[i]);
 
 		if (rc)
 			return rc;
@@ -2581,18 +2582,8 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 	return ata_host_register(host, sht);
 }
 
-/**
- *	ahci_host_activate - start AHCI host, request IRQs and register it
- *	@host: target ATA host
- *	@sht: scsi_host_template to use when registering the host
- *
- *	LOCKING:
- *	Inherited from calling layer (may sleep).
- *
- *	RETURNS:
- *	0 on success, -errno otherwise.
- */
-int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int irq = hpriv->irq;
@@ -2608,15 +2599,32 @@ int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
 			return -EIO;
 		}
 
-		rc = ahci_host_activate_multi_irqs(host, sht);
+		rc = ahci_host_activate_multi_irqs(host, sht, irqflags);
 	} else {
 		rc = ata_host_activate(host, irq, hpriv->irq_handler,
-				       IRQF_SHARED, sht);
+				       irqflags|IRQF_SHARED, sht);
 	}
 
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(ahci_host_activate_irqflags);
+
+/**
+ *	ahci_host_activate - start AHCI host, request IRQs and register it
+ *	@host: target ATA host
+ *	@sht: scsi_host_template to use when registering the host
+ *
+ *	LOCKING:
+ *	Inherited from calling layer (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, -errno otherwise.
+ */
+int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+{
+	return ahci_host_activate_irqflags(host, sht, 0);
+}
 EXPORT_SYMBOL_GPL(ahci_host_activate);
 
 MODULE_AUTHOR("Jeff Garzik");
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] [PATCH v3 31/32] MDSv3 1
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (29 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 30/32] MDSv3 17 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 32/32] MDSv3 2 Andi Kleen
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

The ACPI interrupt handler doesn't touch any user data, so it
doesn't need a CPU clear.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/acpi/osl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index b48874b8e1ea..380b6ba8f0ce 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,7 +572,8 @@ acpi_os_install_interrupt_handler(u32 gsi, acpi_osd_handler handler,
 
 	acpi_irq_handler = handler;
 	acpi_irq_context = context;
-	if (request_irq(irq, acpi_irq, IRQF_SHARED, "acpi", acpi_irq)) {
+	if (request_irq(irq, acpi_irq, IRQF_SHARED|IRQF_NO_USER,
+				"acpi", acpi_irq)) {
 		printk(KERN_ERR PREFIX "SCI (IRQ%d) allocation failed\n", irq);
 		acpi_irq_handler = NULL;
 		return AE_NOT_ACQUIRED;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] [PATCH v3 32/32] MDSv3 2
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (30 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 31/32] MDSv3 1 Andi Kleen
@ 2018-12-21  0:27 ` Andi Kleen
  2019-01-09 17:09 ` [MODERATED] Re: [PATCH v3 00/32] MDSv3 12 Linus Torvalds
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2018-12-21  0:27 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Mitigate BPF

BPF allows the user to run untrusted code in the kernel.

Normally MDS would allow some information leakage either from other
processes or from sensitive kernel code to the user-controlled BPF
code. We cannot rule out that BPF code contains an MDS exploit, and
such code is difficult to pattern match.

This patch adds a limited number of CPU clears before BPF executions
to make eBPF execution safe.

We assume BPF execution does not touch other users' data, so it does
not need to schedule a clear for itself.

For eBPF programs loaded privileged we never clear.

When the BPF program was loaded unprivileged, clear the CPU before the
BPF execution, depending on the context it is running in:

We only do this when running in an interrupt, or if a CPU clear is
already scheduled (which means, for example, there was a context
switch or a crypto operation before).

In process context we check whether the current process has the same
userns+euid as the process that created the BPF program. This handles
the common seccomp filter case without any extra clears, but still
adds clears when e.g. a socket filter runs on a socket inherited by a
process with a different user id.

We also always clear when an earlier kernel subsystem scheduled
a clear, e.g. after a context switch or running crypto code.

Technically we would only need to do this if the BPF program contains
conditional branches and loads dominated by them, but let's assume
that nearly all do.

For example, when running chromium with seccomp filters I see only
15-18% of all sandbox system calls getting a clear; most of those are
likely caused by context switches.

Unprivileged eBPF usage in interrupts currently always clears.

This could be further optimized by allowing callers that do a lot of
individual BPF runs, and are sure they don't touch other users' data
(which is not accessible to the eBPF program anyway) in between, to do
the clear only once at the beginning. We can add such optimizations
later based on profile data.
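
The BPF_PROG_RUN() wrapper below keeps every existing call site
unchanged; as a purely illustrative example:

	/* seccomp, socket filters etc. keep calling the same macro;
	 * the clear, when needed, now happens inside the wrapper. */
	ret = BPF_PROG_RUN(prog, ctx);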

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearbpf.h | 29 +++++++++++++++++++++++++++++
 include/linux/filter.h          | 21 +++++++++++++++++++--
 kernel/bpf/core.c               |  2 ++
 3 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/clearbpf.h

diff --git a/arch/x86/include/asm/clearbpf.h b/arch/x86/include/asm/clearbpf.h
new file mode 100644
index 000000000000..dc1756722b48
--- /dev/null
+++ b/arch/x86/include/asm/clearbpf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARBPF_H
+#define _ASM_CLEARBPF_H 1
+
+#include <linux/clearcpu.h>
+#include <linux/cred.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * When the BPF program was loaded unprivileged, clear the CPU
+ * to prevent any exploits written in BPF using side channels to read
+ * data leaked from other kernel code. In some cases, like
+ * process context with the same uid, we can avoid it.
+ *
+ * See Documentation/clearcpu.txt for more details.
+ */
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+	if (!static_cpu_has(X86_BUG_MDS))
+		return;
+	if (in_interrupt() ||
+		test_thread_flag(TIF_CLEAR_CPU) ||
+		!uid_eq(current_euid(), uid)) {
+		clear_cpu();
+		clear_thread_flag(TIF_CLEAR_CPU);
+	}
+}
+
+#endif
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 448dcc448f1f..d49bdaaefd02 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,12 +20,21 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/clearcpu.h>
 
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
 #include <uapi/linux/bpf.h>
 
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearbpf.h>
+#else
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+}
+#endif
+
 struct sk_buff;
 struct sock;
 struct seccomp_data;
@@ -487,7 +496,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				priv:1;		/* Was loaded privileged */
+	kuid_t			uid;		/* Original uid who created it */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -510,7 +521,13 @@ struct sk_filter {
 	struct bpf_prog	*prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+static inline unsigned _bpf_prog_run(const struct bpf_prog *bp, const void *ctx)
+{
+	if (!bp->priv)
+		arch_bpf_prepare_nonpriv(bp->uid);
+	return bp->bpf_func(ctx, bp->insnsi);
+}
+#define BPF_PROG_RUN(filter, ctx) _bpf_prog_run(filter, ctx)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..90f13b1a8d67 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -96,6 +96,8 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 	fp->aux = aux;
 	fp->aux->prog = fp;
 	fp->jit_requested = ebpf_jit_enabled();
+	fp->priv = !!capable(CAP_SYS_ADMIN);
+	fp->uid = current_euid();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
 
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (31 preceding siblings ...)
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 32/32] MDSv3 2 Andi Kleen
@ 2019-01-09 17:09 ` Linus Torvalds
  2019-01-09 17:31   ` Andi Kleen
  2019-01-09 17:18 ` Konrad Rzeszutek Wilk
                   ` (2 subsequent siblings)
  35 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2019-01-09 17:09 UTC (permalink / raw)
  To: speck

No.

Stop this.

We talked about it already, this is garbage.

On Wed, Jan 9, 2019 at 3:01 AM speck for Andi Kleen <speck@linutronix.de> wrote:
>
> I kept the support for software sequences because from what I'm hearing
> some CPUs might need them. If that's not the case they can be still
> removed.

I'm not taking the crazy code that even Intel says is not needed.

Guys, you need to realize that we still have taste and we still care
about maintainability. "Theoretical attack on older CPUs" is not an
excuse to throw that out the window.

I'm not even going to look at this series, since it's already shown
itself to not care about sanity.

                Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (32 preceding siblings ...)
  2019-01-09 17:09 ` [MODERATED] Re: [PATCH v3 00/32] MDSv3 12 Linus Torvalds
@ 2019-01-09 17:18 ` Konrad Rzeszutek Wilk
  2019-01-09 17:41   ` Andi Kleen
  2019-01-09 17:35 ` Linus Torvalds
  2019-01-09 17:39 ` Andi Kleen
  35 siblings, 1 reply; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-09 17:18 UTC (permalink / raw)
  To: speck

On Thu, Dec 20, 2018 at 04:27:10PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  MDSv3
> 
> Here's a new version of flushing CPU buffers for group 4.

Could you also send a git bundle of them please? The patch titles are
not in sync with the XX/YY numbering. I can probably figure out the
right order, but a bundle would help (it also helps in review).

Thank you!

And one more thing - I see 'MB' and 'MDS' and also 'MSB'
(Microarchitectural Store Buffer).

> VERW is not done unconditionally because it doesn't allow reporting
> the correct status in the vulnerabilities file, which I consider important.
> Instead we now have a mds=verw option that can be set as needed,

mds=force

That is how it was handled for SSBD and L1TF, I believe? Could that
be used?

> but is reported explicitely in the mitigation status.
> 
> Some notes:
> - Against 4.20-rc5
> - There's a new (bogus) build time warning from objtool about unreachable code.
> 
> Changes against previous versions:
> - By default now flushes only when needed
> - Define security model
> - New administrator document
> - Added mds=verw and mds=full
> - Renamed mds_disable to mds=off
> - KVM virtualization much improved
> - Too many others to list. Most things different now.
> 
> Andi Kleen (32):
>   x86/speculation/mds: Add basic bug infrastructure for MDS
>   x86/speculation/mds: Support clearing CPU data on kernel exit
>   x86/speculation/mds: Support mds=full
>   x86/speculation/mds: Clear CPU buffers on entering idle
>   x86/speculation/mds: Add sysfs reporting
>   x86/speculation/mds: Add software sequences for older CPUs.
>   x86/speculation/mds: Support mds=full for NMIs
>   x86/speculation/mds: Avoid NMI races with software sequences
>   x86/speculation/mds: Call software sequences on KVM entry
>   x86/speculation/mds: Clear buffers on NMI exit on 32bit kernels.
>   x86/speculation/mds: Add mds=verw
>   x86/speculation/mds: Export MB_CLEAR CPUID to KVM guests.
>   x86/speculation/mds: Always clear when entering guest without MB_CLEAR
>   mds: Add documentation for clear cpu usage
>   mds: Add preliminary administrator documentation
>   x86/speculation/mds: Introduce lazy_clear_cpu
>   x86/speculation/mds: Schedule cpu clear on context switch
>   x86/speculation/mds: Add tracing for clear_cpu
>   mds: Force clear cpu on kernel preemption
>   mds: Schedule cpu clear for memzero_explicit and kzfree
>   mds: Mark interrupts clear cpu, unless opted-out
>   mds: Clear cpu on all timers, unless the timer opts-out
>   mds: Clear CPU on tasklets, unless opted-out
>   mds: Clear CPU on irq poll, unless opted-out
>   mds: Clear cpu for string io/memcpy_*io in interrupts
>   mds: Schedule clear cpu in swiotlb
>   mds: Instrument skb functions to clear cpu automatically
>   mds: Opt out tcp tasklet to not touch user data
>   mds: mark kernel/* timers safe as not touching user data
>   mds: Mark AHCI interrupt as not needing cpu clear
>   mds: Mark ACPI interrupt as not needing cpu clear
>   mds: Mitigate BPF
> 
>  .../ABI/testing/sysfs-devices-system-cpu      |   1 +
>  .../admin-guide/kernel-parameters.txt         |  29 +++
>  Documentation/admin-guide/mds.rst             | 128 +++++++++++++
>  Documentation/clearcpu.txt                    | 179 ++++++++++++++++++
>  arch/Kconfig                                  |   3 +
>  arch/x86/Kconfig                              |   1 +
>  arch/x86/entry/common.c                       |  24 ++-
>  arch/x86/entry/entry_32.S                     |   7 +
>  arch/x86/entry/entry_64.S                     |  24 +++
>  arch/x86/include/asm/clearbpf.h               |  29 +++
>  arch/x86/include/asm/clearcpu.h               | 100 ++++++++++
>  arch/x86/include/asm/cpufeatures.h            |   4 +
>  arch/x86/include/asm/io.h                     |   3 +
>  arch/x86/include/asm/msr-index.h              |   1 +
>  arch/x86/include/asm/thread_info.h            |   2 +
>  arch/x86/include/asm/trace/clearcpu.h         |  27 +++
>  arch/x86/kernel/acpi/cstate.c                 |   2 +
>  arch/x86/kernel/cpu/bugs.c                    | 108 +++++++++++
>  arch/x86/kernel/cpu/common.c                  |  14 ++
>  arch/x86/kernel/kvm.c                         |   3 +
>  arch/x86/kernel/process.c                     |   5 +
>  arch/x86/kernel/process.h                     |  27 +++
>  arch/x86/kernel/smpboot.c                     |   3 +
>  arch/x86/kvm/cpuid.c                          |   3 +-
>  arch/x86/kvm/vmx.c                            |  23 ++-
>  arch/x86/lib/Makefile                         |   1 +
>  arch/x86/lib/clear_cpu.S                      | 104 ++++++++++
>  drivers/acpi/acpi_pad.c                       |   2 +
>  drivers/acpi/osl.c                            |   3 +-
>  drivers/acpi/processor_idle.c                 |   3 +
>  drivers/ata/ahci.c                            |   2 +-
>  drivers/ata/ahci.h                            |   2 +
>  drivers/ata/libahci.c                         |  40 ++--
>  drivers/base/cpu.c                            |   8 +
>  drivers/idle/intel_idle.c                     |   5 +
>  include/asm-generic/io.h                      |   3 +
>  include/linux/clearcpu.h                      |  36 ++++
>  include/linux/filter.h                        |  21 +-
>  include/linux/hrtimer.h                       |   4 +
>  include/linux/interrupt.h                     |  18 +-
>  include/linux/irq_poll.h                      |   2 +
>  include/linux/skbuff.h                        |   2 +
>  include/linux/timer.h                         |   9 +-
>  kernel/bpf/core.c                             |   2 +
>  kernel/dma/swiotlb.c                          |   2 +
>  kernel/events/core.c                          |   6 +-
>  kernel/fork.c                                 |   3 +-
>  kernel/futex.c                                |   6 +-
>  kernel/irq/handle.c                           |   8 +
>  kernel/irq/manage.c                           |   1 +
>  kernel/sched/core.c                           |  14 +-
>  kernel/sched/deadline.c                       |   6 +-
>  kernel/sched/fair.c                           |   7 +-
>  kernel/sched/idle.c                           |   3 +-
>  kernel/sched/rt.c                             |   3 +-
>  kernel/softirq.c                              |  25 ++-
>  kernel/time/alarmtimer.c                      |   2 +-
>  kernel/time/hrtimer.c                         |  11 +-
>  kernel/time/posix-timers.c                    |   6 +-
>  kernel/time/sched_clock.c                     |   3 +-
>  kernel/time/tick-sched.c                      |   6 +-
>  kernel/time/timer.c                           |   8 +
>  kernel/watchdog.c                             |   3 +-
>  lib/irq_poll.c                                |  18 +-
>  lib/string.c                                  |   6 +
>  mm/slab_common.c                              |   5 +-
>  net/core/skbuff.c                             |  26 +++
>  net/ipv4/tcp_output.c                         |   5 +-
>  68 files changed, 1138 insertions(+), 62 deletions(-)
>  create mode 100644 Documentation/admin-guide/mds.rst
>  create mode 100644 Documentation/clearcpu.txt
>  create mode 100644 arch/x86/include/asm/clearbpf.h
>  create mode 100644 arch/x86/include/asm/clearcpu.h
>  create mode 100644 arch/x86/include/asm/trace/clearcpu.h
>  create mode 100644 arch/x86/lib/clear_cpu.S
>  create mode 100644 include/linux/clearcpu.h
> 
> -- 
> 2.17.2

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:09 ` [MODERATED] Re: [PATCH v3 00/32] MDSv3 12 Linus Torvalds
@ 2019-01-09 17:31   ` Andi Kleen
  2019-01-09 17:38     ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 17:31 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 09:09:24AM -0800, speck for Linus Torvalds wrote:
> >
> > I kept the support for software sequences because from what I'm hearing
> > some CPUs might need them. If that's not the case they can be still
> > removed.
> 
> I'm  not taking the crazy code that even Intel says is not needed.

I don't know what parts of Intel you're talking with, but the parts
I talk with say that it's likely some CPUs won't be able to do the
microcode update for VERW.

> 
> Gus, you need to realize that we still have taste and we still care
> about maintainability. "Theoretical attack on older CPU's" is not an
> excuse to throw that out the window.

Are you saying we shouldn't support an MDS fix on those CPUs?
What do you suggest we do instead?

BTW it may be possible to remove one of the sequences I have today
(the Haswell+ version) if those CPUs get covered, but I expect to
need to add a new version of the software sequence (for KNL) too,
so it would need two at least.

Also the software sequence stuff is ugly -- I fully agree on that --
but it's all in a single file and doesn't have much impact elsewhere.

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (33 preceding siblings ...)
  2019-01-09 17:18 ` Konrad Rzeszutek Wilk
@ 2019-01-09 17:35 ` Linus Torvalds
  2019-01-09 18:14   ` Andi Kleen
  2019-01-09 17:39 ` Andi Kleen
  35 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2019-01-09 17:35 UTC (permalink / raw)
  To: speck

On Wed, Jan 9, 2019 at 3:01 AM speck for Andi Kleen <speck@linutronix.de> wrote:
>
> VERW is not done unconditionally because it doesn't allow reporting
> the correct status in the vulnerabilities file, which I consider important.
> Instead we now have a mds=verw option that can be set as needed,
> but is reported explicitely in the mitigation status.

I also don't see what the logic of this is AT ALL.

"Reporting" has absolutely nothign to do with "use VERW". The fact
that you link the two is crazy.

The rule for VERW should be simple: use VERW if the CPU doesn't have
the NOMDS bit set (or whatever the name is today).

And you never ever use the sw code, so the VERW conditional is a
trivial simple one-instruction alternate() thing. No out-of-line
garbage, no unnecessary jump-overs, just a single instruction that
gets nop'ed out (let gcc always generate the stack slot).
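
A minimal sketch of the single-instruction alternative being described,
assuming the MB_CLEAR feature bit from patch 01 (the function name here
is made up, illustrative only):

	static inline void clear_cpu_buffers(void)
	{
		u16 ds = __KERNEL_DS;

		/* Single instruction, NOP'ed out without MB_CLEAR;
		 * VERW needs a memory operand, hence the stack slot. */
		asm volatile(ALTERNATIVE("", "verw %[ds]", X86_FEATURE_MB_CLEAR)
			     : : [ds] "m" (ds) : "cc");
	}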

The reporting is equally simple: consider it safe if it has the new
microcode with VERW support _or_ it has NOMDS set.

But notice how the conditions aren't the same?

              Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:31   ` Andi Kleen
@ 2019-01-09 17:38     ` Linus Torvalds
  2019-01-09 18:06       ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2019-01-09 17:38 UTC (permalink / raw)
  To: speck

On Wed, Jan 9, 2019 at 9:31 AM speck for Andi Kleen <speck@linutronix.de> wrote:
>
> On Wed, Jan 09, 2019 at 09:09:24AM -0800, speck for Linus Torvalds wrote:
> > >
> > > I kept the support for software sequences because from what I'm hearing
> > > some CPUs might need them. If that's not the case they can be still
> > > removed.
> >
> > I'm  not taking the crazy code that even Intel says is not needed.
>
> I don't know what parts of Intel you're talking with, but the parts
> I talk with say that it's likely some CPUs won't be able to do the
> microcode update for VERW.

Right. And they say that because nobody cares.

And if somebody *does* care, they can damn well blame Intel. It's not
*our* problem.

Stop trying to make Intel bugs be our problems forever. We will be
very clear in who to blame.

I will not take that crazy per-microarchitecture software sequence
THAT DOESN'T EVEN WORK. Even on the micro-architectures it is designed
for, virtualization and SMM break it.

It's unmaintainable garbage.

End of story.

                    Linus

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 01/32] MDSv3 7
  2018-12-21  0:27 ` [MODERATED] [PATCH v3 01/32] MDSv3 7 Andi Kleen
@ 2019-01-09 17:38   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-09 17:38 UTC (permalink / raw)
  To: speck

> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 28c4a502b419..93fab3a1e046 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -342,6 +342,7 @@
>  /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
>  #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
>  #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
> +#define X86_FEATURE_MB_CLEAR		(18*32+10) /* Flush state on VERW */

I think this is MD_CLEAR? Perhaps the doc is incorrect?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2018-12-21  0:27 [MODERATED] [PATCH v3 00/32] MDSv3 12 Andi Kleen
                   ` (34 preceding siblings ...)
  2019-01-09 17:35 ` Linus Torvalds
@ 2019-01-09 17:39 ` Andi Kleen
  35 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 17:39 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 49 bytes --]


Here's a mailbox with all the patches in order


[-- Attachment #2: mbox --]
[-- Type: text/plain, Size: 121709 bytes --]

From 45e504714969e4f58550bb310793d5aba909d272 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 7 Nov 2018 16:08:39 -0800
Subject: [PATCH 01/32] x86/speculation/mds: Add basic bug infrastructure for
 MDS

MDS stands for Microarchitectural Data Sampling, a family of side
channel attacks on internal buffers in Intel CPUs. They all have
the same mitigations for the single thread case, so we lump them
all together as a single MDS issue.

This addresses CVE-2018-12126, CVE-2018-12130 and CVE-2018-12127
for the single threaded case.

This patch adds the basic infrastructure to detect whether the
current CPU is affected by MDS, and if so, sets the right BUG bit.

We also provide a command line option "mds=off" to disable
any workarounds.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 arch/x86/include/asm/cpufeatures.h              |  2 ++
 arch/x86/include/asm/msr-index.h                |  1 +
 arch/x86/kernel/cpu/bugs.c                      | 10 ++++++++++
 arch/x86/kernel/cpu/common.c                    | 14 ++++++++++++++
 5 files changed, 30 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..f5c14b721eef 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2341,6 +2341,9 @@
 			Format: <first>,<last>
 			Specifies range of consoles to be captured by the MDA.
 
+	mds=off		[X86, Intel]
+			Disable workarounds for Micro-architectural Data Sampling.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 28c4a502b419..93fab3a1e046 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -342,6 +342,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MB_CLEAR		(18*32+10) /* Flush state on VERW */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -379,5 +380,6 @@
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index c8f73efb4ece..303064a9a0a9 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -77,6 +77,7 @@
 						    * attack, so no Speculative Store Bypass
 						    * control required.
 						    */
+#define ARCH_CAP_MDS_NO			(1 << 5)   /* No Microarchitectural data sampling */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			(1 << 0)   /*
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 500278f5308e..13eb623fe0b1 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -35,6 +35,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -99,6 +100,8 @@ void __init check_bugs(void)
 
 	l1tf_select_mitigation();
 
+	mds_select_mitigation();
+
 #ifdef CONFIG_X86_32
 	/*
 	 * Check whether we are able to run this kernel safely on SMP.
@@ -1041,6 +1044,13 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static void __init mds_select_mitigation(void)
+{
+	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
+	    !boot_cpu_has_bug(X86_BUG_MDS))
+		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
+}
+
 #ifdef CONFIG_SYSFS
 
 #define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ffb181f959d2..bebeb67015fc 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -998,6 +998,14 @@ static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
 	{}
 };
 
+static const __initconst struct x86_cpu_id cpu_no_mds[] = {
+	/* in addition to cpu_no_speculation */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_X	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_PLUS	},
+	{}
+};
+
 static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
@@ -1019,6 +1027,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	if (ia32_cap & ARCH_CAP_IBRS_ALL)
 		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
 
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    !x86_match_cpu(cpu_no_mds)) &&
+	    !(ia32_cap & ARCH_CAP_MDS_NO) &&
+	    !(ia32_cap & ARCH_CAP_RDCL_NO))
+		setup_force_cpu_bug(X86_BUG_MDS);
+
 	if (x86_match_cpu(cpu_no_meltdown))
 		return;
 
-- 
2.17.2


From 9cb31b65b4e0d2f32fd59d12d96ac79787321656 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 7 Nov 2018 16:12:17 -0800
Subject: [PATCH 02/32] x86/speculation/mds: Support clearing CPU data on
 kernel exit

Add infrastructure for clearing CPU data on kernel exit.

Instead of clearing unconditionally, we support clearing lazily when
some kernel subsystem has touched sensitive data and set the new
TIF_CLEAR_CPU flag.

We handle TIF_CLEAR_CPU on kernel exit, similar to the other kernel
exit action flags.

The flushing is provided by new microcode as a new side effect of the
otherwise unused VERW instruction.

So far this patch doesn't do anything by itself; it relies on later
patches to set TIF_CLEAR_CPU.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c            |  8 +++++++-
 arch/x86/include/asm/clearcpu.h    | 26 ++++++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  2 ++
 3 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/clearcpu.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b81918..07cf8d32df67 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -29,6 +29,7 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
+#include <asm/clearcpu.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
 
@@ -132,7 +133,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 }
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_CLEAR_CPU |\
 	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
@@ -170,6 +171,11 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_CLEAR_CPU) {
+			clear_thread_flag(TIF_CLEAR_CPU);
+			clear_cpu();
+		}
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
new file mode 100644
index 000000000000..c45f0c28867e
--- /dev/null
+++ b/arch/x86/include/asm/clearcpu.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARCPU_H
+#define _ASM_CLEARCPU_H 1
+
+#include <linux/jump_label.h>
+#include <linux/sched/smt.h>
+#include <asm/alternative.h>
+#include <linux/thread_info.h>
+
+/*
+ * Clear CPU buffers to avoid side channels.
+ * We use either microcode (as a side effect of the obsolete
+ * "VERW" instruction), or special out of line clear sequences.
+ */
+
+static inline void clear_cpu(void)
+{
+	unsigned kernel_ds = __KERNEL_DS;
+	/* Has to be memory form, don't modify to use a register */
+	alternative_input("",
+		"verw %[kernelds]",
+		X86_FEATURE_MB_CLEAR,
+		[kernelds] "m" (kernel_ds));
+}
+
+#endif
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 82b73b75d67c..f50c05d5bc8c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_CLEAR_CPU		23	/* clear CPU on kernel exit */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
+#define _TIF_CLEAR_CPU		(1 << TIF_CLEAR_CPU)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-- 
2.17.2


From 5b7a5d1c23a5dea1d5aaaae9eac8626acfff635f Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 16:40:11 -0800
Subject: [PATCH 03/32] x86/speculation/mds: Support mds=full

Add a new command line option to force unconditional flushing on
each kernel exit. This is not enabled by default.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 5 +++++
 arch/x86/entry/common.c                         | 7 ++++++-
 arch/x86/include/asm/clearcpu.h                 | 2 ++
 arch/x86/kernel/cpu/bugs.c                      | 5 +++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index f5c14b721eef..b764b4ebb1f8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2344,6 +2344,11 @@
 	mds=off		[X86, Intel]
 			Disable workarounds for Micro-architectural Data Sampling.
 
+	mds=full	[X86, Intel]
+			Always flush CPU buffers when exiting the kernel for MDS.
+			Normally the kernel decides dynamically whether flushing
+			is needed.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 07cf8d32df67..6662444b33cf 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
-			clear_cpu();
+			/* Don't do it twice if forced */
+			if (!static_key_enabled(&force_cpu_clear))
+				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
@@ -217,6 +219,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
+	if (static_key_enabled(&force_cpu_clear))
+		clear_cpu();
+
 	user_enter_irqoff();
 }
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index c45f0c28867e..35fecc86e54f 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -23,4 +23,6 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
+
 #endif
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 13eb623fe0b1..5fbdf425a84a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1044,11 +1049,18 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
 	    !boot_cpu_has_bug(X86_BUG_MDS)) {
 		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
+		return;
+	}
+
+	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
+		static_branch_enable(&force_cpu_clear);
 }
 
 #ifdef CONFIG_SYSFS
-- 
2.17.2


From 4417672653b459eaaa04d2e04f95fcd1fe2756c1 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 7 Nov 2018 16:15:28 -0800
Subject: [PATCH 04/32] x86/speculation/mds: Clear CPU buffers on entering idle

When entering idle the internal state of the current CPU might
become visible to the thread sibling because the CPU "frees" some
internal resources.

To ensure there is no MDS leakage, always clear the CPU state
before idling. We only do this when SMT is enabled, as otherwise
no leakage is possible.

This is not needed for idle poll, because polling does not free
any resources to the sibling.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h | 19 +++++++++++++++++++
 arch/x86/kernel/acpi/cstate.c   |  2 ++
 arch/x86/kernel/kvm.c           |  3 +++
 arch/x86/kernel/process.c       |  5 +++++
 arch/x86/kernel/smpboot.c       |  3 +++
 drivers/acpi/acpi_pad.c         |  2 ++
 drivers/acpi/processor_idle.c   |  3 +++
 drivers/idle/intel_idle.c       |  5 +++++
 kernel/sched/fair.c             |  1 +
 9 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 35fecc86e54f..9e389c8a5679 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -23,6 +23,25 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+/*
+ * Clear CPU buffers before going idle, so that no state is leaked to SMT
+ * siblings taking over thread resources.
+ * Out of line to avoid include hell.
+ *
+ * Assumes that interrupts are disabled and only get re-enabled
+ * just before idling; otherwise the data from a racing interrupt
+ * might not get cleared. There are some callers that violate this,
+ * but they are only used in unattackable cases.
+ */
+
+static inline void clear_cpu_idle(void)
+{
+	if (sched_smt_active()) {
+		clear_thread_flag(TIF_CLEAR_CPU);
+		clear_cpu();
+	}
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #endif
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 158ad1483c43..48adea5afacf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
 #include <acpi/processor.h>
 #include <asm/mwait.h>
 #include <asm/special_insns.h>
+#include <asm/clearcpu.h>
 
 /*
  * Initialize bm_flags based on the CPU cache properties
@@ -157,6 +158,7 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
+	clear_cpu_idle();
 	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ba4bfb7f6a36..c9206ad40a5b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -159,6 +159,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
 			/*
 			 * We cannot reschedule. So halt.
 			 */
+			clear_cpu_idle();
 			native_safe_halt();
 			local_irq_disable();
 		}
@@ -785,6 +786,8 @@ static void kvm_wait(u8 *ptr, u8 val)
 	if (READ_ONCE(*ptr) != val)
 		goto out;
 
+	clear_cpu_idle();
+
 	/*
 	 * halt until it's our turn and kicked. Note that we do safe halt
 	 * for irq enabled case to avoid hang when lock info is overwritten
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7d31192296a8..72c0fe5f69e0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,7 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 #include <asm/spec-ctrl.h>
+#include <asm/clearcpu.h>
 
 #include "process.h"
 
@@ -586,6 +587,8 @@ void stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
+	clear_cpu_idle();
+
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
 	 * for performing a successful kexec when going from SME inactive
@@ -672,6 +675,8 @@ static __cpuidle void mwait_idle(void)
 			mb(); /* quirk */
 		}
 
+		clear_cpu_idle();
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__sti_mwait(0, 0);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index a9134d1910b9..4b873873476f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -81,6 +81,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
+#include <asm/clearcpu.h>
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -1635,6 +1636,7 @@ static inline void mwait_play_dead(void)
 	wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		/*
 		 * The CLFLUSH is a workaround for erratum AAI65 for
 		 * the Xeon 7400 series.  It's not clear it is actually
@@ -1662,6 +1664,7 @@ void hlt_play_dead(void)
 		wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		native_halt();
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index a47676a55b84..2dcbc38d0880 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <asm/mwait.h>
+#include <asm/clearcpu.h>
 #include <xen/xen.h>
 
 #define ACPI_PROCESSOR_AGGREGATOR_CLASS	"acpi_pad"
@@ -175,6 +176,7 @@ static int power_saving_thread(void *data)
 			tick_broadcast_enable();
 			tick_broadcast_enter();
 			stop_critical_timings();
+			clear_cpu_idle();
 
 			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b2131c4ea124..0342daa122fe 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -33,6 +33,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <acpi/processor.h>
+#include <asm/clearcpu.h>
 
 /*
  * Include the apic definitions for x86 to have the APIC timer related defines
@@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
  */
 static void __cpuidle acpi_safe_halt(void)
 {
+	clear_cpu_idle();
 	if (!tif_need_resched()) {
 		safe_halt();
 		local_irq_disable();
@@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 
 	ACPI_FLUSH_CPU_CACHE();
 
+	clear_cpu_idle();
 	while (1) {
 
 		if (cx->entry_method == ACPI_CSTATE_HALT)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 8b5d85c91e9d..ddaa7603d53a 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,6 +65,7 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/clearcpu.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
 
@@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 		}
 	}
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	if (!static_cpu_has(X86_FEATURE_ARAT) && tick)
@@ -953,6 +956,8 @@ static void intel_idle_s2idle(struct cpuidle_device *dev,
 	unsigned long ecx = 1; /* break on interrupt flag */
 	unsigned long eax = flg2MWAIT(drv->states[index].flags);
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..98e7f1e64a0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5935,6 +5935,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL(sched_smt_present);
 
 static inline void set_idle_cores(int cpu, int val)
 {
-- 
2.17.2


From 27ad10658e61bce6ae29f7d01b605d00ee7e8828 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 7 Nov 2018 16:17:11 -0800
Subject: [PATCH 05/32] x86/speculation/mds: Add sysfs reporting

Report mds mitigation state in sysfs vulnerabilities.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../ABI/testing/sysfs-devices-system-cpu         |  1 +
 arch/x86/kernel/cpu/bugs.c                       | 16 ++++++++++++++++
 drivers/base/cpu.c                               |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 73318225a368..02b7bb711214 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -477,6 +477,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
+		/sys/devices/system/cpu/vulnerabilities/mds
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5fbdf425a84a..a66e29a4c4f2 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1157,6 +1157,16 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
 			return l1tf_show_state(buf);
 		break;
+
+	case X86_BUG_MDS:
+		/* Assumes Hypervisor exposed HT state to us if in guest */
+		if (boot_cpu_has(X86_FEATURE_MB_CLEAR)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: microcode\n");
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+		}
+		return sprintf(buf, "Vulnerable\n");
+
 	default:
 		break;
 	}
@@ -1188,4 +1198,10 @@ ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *b
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
 }
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
+
 #endif
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index eb9443d5bae1..2fd6ca1021c2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_spectre_v2.attr,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
+	&dev_attr_mds.attr,
 	NULL
 };
 
-- 
2.17.2


From 6645a3b26690b703dff1ce87fb471abacb256ca1 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 9 Nov 2018 12:19:04 -0800
Subject: [PATCH 06/32] x86/speculation/mds: Add software sequences for older
 CPUs.

On some older CPUs before Broadwell the VERW based clearing of the
CPU buffers is not available, so we implement software sequences
instead. These are automatically patched in as needed.

Support mitigation for Nehalem up to Broadwell. Broadwell strictly
doesn't need it because it should have the microcode update
implementing VERW, which is preferred. Some other CPUs may also not
need the sequences due to microcode updates, but enable them for now.

There are two different sequences: one for Nehalem to IvyBridge,
and another for Haswell/Broadwell.

We add command line options to force the two different sequences,
so that it's possible to select the right (or less wrong) one in
VMs that don't report the correct CPU in CPUID. In normal
operation the kernel automatically selects the right
sequence based on the current CPU and its microcode update
status.

Note to backporters: this patch requires eager FPU support.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../admin-guide/kernel-parameters.txt         |  15 +++
 arch/x86/include/asm/clearcpu.h               |   4 +-
 arch/x86/include/asm/cpufeatures.h            |   2 +
 arch/x86/kernel/cpu/bugs.c                    |  51 ++++++++
 arch/x86/lib/Makefile                         |   1 +
 arch/x86/lib/clear_cpu.S                      | 104 ++++++++++++++++++
 6 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/lib/clear_cpu.S

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b764b4ebb1f8..5f8ac5270beb 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2349,6 +2349,21 @@
 			Normally the kernel decides dynamically when flushing is
 			needed or not.
 
+	mds=swclear	[X86, Intel]
+			Force use of the software sequence for clearing data that
+			could be exploited by Micro-architectural Data Sampling.
+			Normally automatically enabled when needed. This
+			option might be useful if running inside a virtual machine
+			that does not expose the correct model number. This
+			option requires a CPU with SSE support.
+
+	mds=swclearhsw	[X86, Intel]
+			Use Haswell/Broadwell specific sequence for clearing
+			data that could be exploited by Micro-architectural Data
+			Sampling. Normally automatically enabled when needed.
+			This option might be useful if running inside a virtual machine
+			that does not expose the correct model number.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 9e389c8a5679..4a570b3b0f5e 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -17,9 +17,11 @@ static inline void clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use a register */
-	alternative_input("",
+	alternative_input_2("",
 		"verw %[kernelds]",
 		X86_FEATURE_MB_CLEAR,
+		"call do_clear_cpu",
+		X86_BUG_MDS_CLEAR_CPU,
 		[kernelds] "m" (kernel_ds));
 }
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 93fab3a1e046..110759334c88 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -381,5 +381,7 @@
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
 #define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
+#define X86_BUG_MDS_CLEAR_CPU		X86_BUG(20) /* CPU needs call to clear_cpu on kernel exit/idle for MDS */
+#define X86_BUG_MDS_CLEAR_CPU_HSW	X86_BUG(21) /* CPU needs Haswell version of clear cpu */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index a66e29a4c4f2..faec1f0dd801 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -31,6 +31,7 @@
 #include <asm/intel-family.h>
 #include <asm/e820/api.h>
 #include <asm/hypervisor.h>
+#include <asm/cpu_device_id.h>
 
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
@@ -1044,16 +1045,61 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static const __initconst struct x86_cpu_id cpu_mds_clear_cpu[] = {
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_G	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_EP	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_EX	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE_EP	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_WESTMERE_EX	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_SANDYBRIDGE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_SANDYBRIDGE_X },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_IVYBRIDGE	 },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_IVYBRIDGE_X	 },
+	{}
+};
+
+static const __initconst struct x86_cpu_id cpu_mds_clear_cpu_hsw[] = {
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_CORE	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_X	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_ULT	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_HASWELL_GT3E	    },
+
+	/* Have MB_CLEAR with microcode update, but list just in case: */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_CORE   },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_GT3E   },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_X	    },
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_BROADWELL_XEON_D },
+	{}
+};
+
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
 
+/* Export here to avoid warnings */
+extern __visible void do_clear_cpu(void);
+EXPORT_SYMBOL(do_clear_cpu);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off") ||
 	    !boot_cpu_has_bug(X86_BUG_MDS)) {
 		setup_clear_cpu_cap(X86_FEATURE_MB_CLEAR);
+		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU_HSW);
+		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
 		return;
 	}
 
+	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
+		x86_match_cpu(cpu_mds_clear_cpu)) ||
+		cmdline_find_option_bool(boot_command_line, "mds=swclear"))
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
+	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
+		x86_match_cpu(cpu_mds_clear_cpu_hsw)) ||
+		cmdline_find_option_bool(boot_command_line, "mds=swclearhsw")) {
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
+		setup_force_cpu_cap(X86_BUG_MDS_CLEAR_CPU_HSW);
+	}
 	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
 		static_branch_enable(&force_cpu_clear);
 }
@@ -1165,6 +1213,11 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 				return sprintf(buf, "Mitigation: microcode\n");
 			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
 		}
+		if (boot_cpu_has_bug(X86_BUG_MDS_CLEAR_CPU)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: software buffer clearing\n");
+			return sprintf(buf, "Mitigation: software buffer clearing, HT vulnerable\n");
+		}
 		return sprintf(buf, "Vulnerable\n");
 
 	default:
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 25a972c61b0a..ce07225e53e1 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -28,6 +28,7 @@ lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
 lib-$(CONFIG_FUNCTION_ERROR_INJECTION)	+= error-inject.o
 lib-$(CONFIG_RETPOLINE) += retpoline.o
+lib-y += clear_cpu.o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o
 
diff --git a/arch/x86/lib/clear_cpu.S b/arch/x86/lib/clear_cpu.S
new file mode 100644
index 000000000000..b619aca1449b
--- /dev/null
+++ b/arch/x86/lib/clear_cpu.S
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/linkage.h>
+#include <asm/alternative-asm.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * Clear internal CPU buffers on kernel boundaries.
+ *
+ * These sequences are somewhat fragile, please don't add
+ * or change instructions in the middle of the areas marked with
+ * start/end.
+ *
+ * Interrupts and NMIs are dealt with by re-clearing. We clear parts
+ * of the kernel stack, which has other advantages too.
+ *
+ * Save all registers to make it easier to use for callers.
+ *
+ * This sequence is for Nehalem-IvyBridge. For Haswell/Broadwell we
+ * jump to hsw_clear_cpu.
+ *
+ * These functions need to be called on a full stack, as they may
+ * use up to 1.5k of stack. They should also be called with
+ * interrupts disabled. NMIs etc. are handled by letting every
+ * NMI do its own clear sequence.
+ */
+ENTRY(ivb_clear_cpu)
+GLOBAL(do_clear_cpu)
+	/*
+	 * objtool complains about unreachable code here,
+	 * which appears to be spurious.
+	 */
+	ALTERNATIVE "", "jmp hsw_clear_cpu", X86_BUG_MDS_CLEAR_CPU_HSW
+	push %__ASM_REG(si)
+	push %__ASM_REG(di)
+	push %__ASM_REG(cx)
+	mov %_ASM_SP, %__ASM_REG(si)
+	sub  $16, %_ASM_SP
+	and  $-16,%_ASM_SP
+	movdqa %xmm0, (%_ASM_SP)
+	sub  $672, %_ASM_SP
+	xorpd %xmm0,%xmm0
+	movdqa %xmm0, (%_ASM_SP)
+	mov %_ASM_SP, %__ASM_REG(di)
+	/* Clear sequence start */
+	movdqu %xmm0,(%__ASM_REG(di))
+	lfence
+	orpd (%__ASM_REG(di)), %xmm0
+	orpd (%__ASM_REG(di)), %xmm1
+	mfence
+	movl $40, %ecx
+	add  $32, %__ASM_REG(di)
+1:	movntdq %xmm0, (%__ASM_REG(di))
+	add  $16, %__ASM_REG(di)
+	decl %ecx
+	jnz  1b
+	mfence
+	/* Clear sequence end */
+	add  $672, %_ASM_SP
+	movdqu (%_ASM_SP), %xmm0
+	mov  %__ASM_REG(si),%_ASM_SP
+	pop %__ASM_REG(cx)
+	pop %__ASM_REG(di)
+	pop %__ASM_REG(si)
+	ret
+END(ivb_clear_cpu)
+
+/*
+ * Version for Haswell/Broadwell.
+ */
+ENTRY(hsw_clear_cpu)
+	push %__ASM_REG(si)
+	push %__ASM_REG(di)
+	push %__ASM_REG(cx)
+	push %__ASM_REG(ax)
+	mov  %_ASM_SP, %__ASM_REG(ax)
+	sub  $16, %_ASM_SP
+	and  $-16,%_ASM_SP
+	movdqa %xmm0, (%_ASM_SP)
+	sub  $1536,%_ASM_SP
+	/* Clear sequence start */
+	xorpd %xmm0,%xmm0
+	mov  %_ASM_SP, %__ASM_REG(si)
+	mov  %__ASM_REG(si), %__ASM_REG(di)
+	movl $40,%ecx
+1:	movntdq %xmm0, (%__ASM_REG(di))
+	add  $16, %__ASM_REG(di)
+	decl %ecx
+	jnz  1b
+	mfence
+	mov  %__ASM_REG(si), %__ASM_REG(di)
+	mov $1536, %ecx
+	rep movsb
+	lfence
+	/* Clear sequence end */
+	add $1536,%_ASM_SP
+	movdqa (%_ASM_SP), %xmm0
+	mov %__ASM_REG(ax),%_ASM_SP
+	pop %__ASM_REG(ax)
+	pop %__ASM_REG(cx)
+	pop %__ASM_REG(di)
+	pop %__ASM_REG(si)
+	ret
+END(hsw_clear_cpu)
-- 
2.17.2


From 9c8a184d16e326ed0af123d112bedb89ecc34e6b Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 16:57:27 -0800
Subject: [PATCH 07/32] x86/speculation/mds: Support mds=full for NMIs

NMIs don't go through C code when exiting to user space, so we need
to add an assembler clear cpu for this case. Only used with
mds=full, because otherwise we assume NMIs don't touch
other users or kernel sensitive data.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_64.S       | 12 ++++++++++++
 arch/x86/include/asm/clearcpu.h | 14 ++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index ce25d84023c0..19b235ca2878 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -39,6 +39,7 @@
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
 #include <linux/err.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1403,6 +1404,17 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	/*
+	 * Clear only when forced clearing is enabled. Otherwise
+	 * we assume the NMI code does not touch sensitive data.
+	 * Without jump label support we always clear.
+	 */
+#ifdef HAVE_JUMP_LABEL
+	STATIC_BRANCH_JMP l_yes=.Lno_clear_cpu key=force_cpu_clear, branch=1
+#endif
+	CLEAR_CPU
+.Lno_clear_cpu:
+
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
 	 * work, because we don't want to enable interrupts.
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 4a570b3b0f5e..cc03ca14140b 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_CLEARCPU_H
 #define _ASM_CLEARCPU_H 1
 
+#ifndef __ASSEMBLY__
+
 #include <linux/jump_label.h>
 #include <linux/sched/smt.h>
 #include <asm/alternative.h>
@@ -46,4 +48,16 @@ static inline void clear_cpu_idle(void)
 
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
+#else
+
+.macro CLEAR_CPU
+	/* Clear CPU buffers that could leak. Instruction must be in memory form. */
+	ALTERNATIVE_2 "", __stringify(push $__USER_DS ; verw (% _ASM_SP ) ; add $8, % _ASM_SP ),\
+		X86_FEATURE_MB_CLEAR, \
+		"call do_clear_cpu", \
+		X86_BUG_MDS_CLEAR_CPU
+.endm
+
+#endif
+
 #endif
-- 
2.17.2


From 10e2d8257c3852685bb121df9078edb6741ee88d Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 16 Nov 2018 17:05:07 -0800
Subject: [PATCH 08/32] x86/speculation/mds: Avoid NMI races with software
 sequences

When we use a software sequence for clearing CPU buffers an NMI
or similar interrupt could interrupt the clearing sequence.
In this case make sure we really flush by always doing the extra
clearing on paranoid interrupt exit.

This is only needed for the software sequence because VERW
is an instruction that cannot be interrupted.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c   | 13 ++++++++++++-
 arch/x86/entry/entry_64.S | 12 ++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6662444b33cf..fd86f1e9e164 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -174,13 +174,24 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
 			/* Don't do it twice if forced */
-			if (!static_key_enabled(&force_cpu_clear))
+			if (!static_key_enabled(&force_cpu_clear) &&
+			    !static_cpu_has(X86_BUG_MDS_CLEAR_CPU))
 				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
+		/*
+		 * Software sequences can be interrupted, so we have
+		 * to run them with interrupts off. NMIs etc. are
+		 * handled by always clearing even when returning
+		 * to the kernel.
+		 */
+		if (static_cpu_has(X86_BUG_MDS_CLEAR_CPU) &&
+			(cached_flags & _TIF_CLEAR_CPU))
+			clear_cpu();
+
 		cached_flags = READ_ONCE(current_thread_info()->flags);
 
 		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 19b235ca2878..4a41e2abd909 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1206,6 +1206,14 @@ ENTRY(paranoid_exit)
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
 	TRACE_IRQS_IRETQ_DEBUG
+	/*
+	 * Always do a cpu clear in case we're racing with an MDS
+	 * software clear sequence on kernel exit.
+	 * Only needed if MB_CLEAR is not available, because VERW is atomic.
+	 */
+	ALTERNATIVE "", "jmp 1f", X86_FEATURE_MB_CLEAR
+	CLEAR_CPU
+1:
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 .Lparanoid_exit_restore:
@@ -1628,6 +1636,10 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	ALTERNATIVE "", "jmp 1f", X86_FEATURE_MB_CLEAR
+	CLEAR_CPU
+1:
+
 	/* Always restore stashed CR3 value (see paranoid_entry) */
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
-- 
2.17.2


From 523855f4dd32cf724a68b9df4683bb56f1b50968 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 16 Nov 2018 16:42:00 -0800
Subject: [PATCH 09/32] x86/speculation/mds: Call software sequences on KVM
 entry

CPU buffers need to be cleared before entering a guest.
For VERW based cpu clearing we rely on the L1 cache flush for L1TF
doing it implicitly.

When using software sequences this is not done, so in this case
we need to call the software sequence explicitly.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 02edd9960e9d..82ec518811a0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -41,6 +41,7 @@
 
 #include <asm/asm.h>
 #include <asm/cpu.h>
+#include <asm/clearcpu.h>
 #include <asm/io.h>
 #include <asm/desc.h>
 #include <asm/vmx.h>
@@ -10680,6 +10681,15 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 
 	vcpu->stat.l1d_flush++;
 
+	/*
+	 * When the CPU has MB_CLEAR the cpu buffer flush is done implicitly
+	 * by the L1D_FLUSH below. But if software sequences are used
+	 * we need to call them explicitly.
+	 */
+	if (static_cpu_has(X86_BUG_MDS) &&
+	    !static_cpu_has(X86_FEATURE_MB_CLEAR))
+		clear_cpu();
+
 	if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
 		wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
 		return;
-- 
2.17.2


From e0d70b913495b8ff95b5ac7305deec422097d2c7 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 6 Dec 2018 16:49:30 -0800
Subject: [PATCH 10/32] x86/speculation/mds: Clear buffers on NMI exit on 32bit
 kernels.

The main kernel exits on 32bit kernels are already handled by
earlier patches.

But for NMIs we need to clear in the assembler code, because
the NMI could be returning into an interrupted software sequence,
or because mds=full requires it.

Add an unconditional cpu clear on NMI exit for 32bit
for now.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_32.S | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..0334e58e4720 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -45,6 +45,7 @@
 #include <asm/smap.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1446,6 +1447,12 @@ ENTRY(nmi)
 	movl	%ebx, %esp
 
 .Lnmi_return:
+	/*
+	 * Only needed when returning to the kernel with software
+	 * sequences in use, or if clearing is forced. But for now
+	 * do it unconditionally.
+	 */
+	CLEAR_CPU
+.Lno_clear_cpu:
 	CHECK_AND_APPLY_ESPFIX
 	RESTORE_ALL_NMI cr3_reg=%edi pop=4
 	jmp	.Lirq_return
-- 
2.17.2


From cb957b5b94ac55674ab0aebe43d47619f9a90a25 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Tue, 18 Dec 2018 16:37:50 -0800
Subject: [PATCH 11/32] x86/speculation/mds: Add mds=verw

Some Hypervisors might be unable to expose the new MB_CLEAR CPUID
to guests, even though they have an updated microcode that implements
MB_CLEAR/VERW.

We won't use VERW unconditionally because we need to know whether
it is implemented to correctly report the status in
/sys/devices/system/cpu/vulnerabilities/mds

However we should have a way to let guests running under such
hypervisors enable VERW even if its CPUID bit is not visible.

Add a mds=verw option to force enable VERW buffer clearing.

When VERW is forced the vulnerabilities file will report the
mitigation as enabled, but with "forced" appended.
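
For example, with the microcode mitigation forced and SMT off,
/sys/devices/system/cpu/vulnerabilities/mds will then show
"Mitigation: microcode, forced".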

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 arch/x86/kernel/cpu/bugs.c                      | 13 ++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5f8ac5270beb..9499ef25da5f 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2364,6 +2364,12 @@
 			This option might be useful if running inside a virtual machine
 			that does not expose the correct model number.
 
+	mds=verw	[X86, Intel]
+			Enable microcode based ("VERW") mitigation for Microarchitectural
+			Data Sampling (MDS). This is normally automatically enabled,
+			but may need to be set manually in guests when the VM
+			configuration does not expose all the CPUID bits from
+			the host microcode.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index faec1f0dd801..b24d93fb0564 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1075,6 +1075,7 @@ static const __initconst struct x86_cpu_id cpu_mds_clear_cpu_hsw[] = {
 };
 
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+static bool __read_mostly forced_mb_clear;
 
 /* Export here to avoid warnings */
 extern __visible void do_clear_cpu(void);
@@ -1089,7 +1090,10 @@ static void mds_select_mitigation(void)
 		setup_clear_cpu_cap(X86_BUG_MDS_CLEAR_CPU);
 		return;
 	}
-
+	if (cmdline_find_option_bool(boot_command_line, "mds=verw")) {
+		setup_force_cpu_cap(X86_FEATURE_MB_CLEAR);
+		forced_mb_clear = true;
+	}
 	if ((!boot_cpu_has(X86_FEATURE_MB_CLEAR) &&
 		x86_match_cpu(cpu_mds_clear_cpu)) ||
 		cmdline_find_option_bool(boot_command_line, "mds=swclear"))
@@ -1209,9 +1213,12 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 	case X86_BUG_MDS:
 		/* Assumes Hypervisor exposed HT state to us if in guest */
 		if (boot_cpu_has(X86_FEATURE_MB_CLEAR)) {
+			char *forced = forced_mb_clear ? ", forced" : "";
+
 			if (cpu_smt_control != CPU_SMT_ENABLED)
-				return sprintf(buf, "Mitigation: microcode\n");
-			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+				return sprintf(buf, "Mitigation: microcode%s\n", forced);
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable%s\n",
+					forced);
 		}
 		if (boot_cpu_has_bug(X86_BUG_MDS_CLEAR_CPU)) {
 			if (cpu_smt_control != CPU_SMT_ENABLED)
-- 
2.17.2


From 9e890395992055235b1448580c3767713c29db53 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:50:07 -0800
Subject: [PATCH 12/32] x86/speculation/mds: Export MB_CLEAR CPUID to KVM
 guests.

Export the MB_CLEAR CPUID set by new microcode to signal
that VERW implements the clear cpu side effect to KVM guests.

Also requires corresponding qemu patches.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7bcfa61375c0..0fd8a4fb8f09 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,7 +411,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
-		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES);
+		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) |
+		F(MB_CLEAR);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
2.17.2


From a11a6b56c8eedd6e5f7b79c6e2f0e1225431fca3 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Tue, 18 Dec 2018 16:44:49 -0800
Subject: [PATCH 13/32] x86/speculation/mds: Always clear when entering guest
 without MB_CLEAR

If we don't expose MB_CLEAR to the guest it could be using software
sequences to clear the CPU buffers. If the hypervisor interrupts any of these
sequences the data will not be fully cleared. The only way to fix that
is for us to clear unconditionally on each entry.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 82ec518811a0..38db94df097a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10675,8 +10675,19 @@ static void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 		flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
 		kvm_clear_cpu_l1tf_flush_l1d();
 
-		if (!flush_l1d)
+		if (!flush_l1d) {
+			/*
+			 * If we don't expose MB_CLEAR to the guest it
+			 * could be using software sequences for clear
+			 * cpu. If the hypervisor interrupts any of
+			 * these sequences the data will not be fully
+			 * cleared. The only way to fix that is for
+			 * us to clear unconditionally on each entry.
+			 */
+			if (!guest_cpuid_has(vcpu, X86_FEATURE_MB_CLEAR))
+				clear_cpu();
 			return;
+		}
 	}
 
 	vcpu->stat.l1d_flush++;
-- 
2.17.2


From 89150ddf45350eaa9f153ca61cefb7680641414c Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 12:55:54 -0800
Subject: [PATCH 14/32] mds: Add documentation for clear cpu usage

Including the theory, and some guidelines for subsystem/driver
maintainers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/clearcpu.txt | 179 +++++++++++++++++++++++++++++++++++++
 1 file changed, 179 insertions(+)
 create mode 100644 Documentation/clearcpu.txt

diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..786a207e6449
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,179 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which might later be sampled through side channels.
+For more details see CVE-2018-12126, CVE-2018-12130, CVE-2018-12127.
+
+This can be avoided by explicitly clearing the CPU state.
+
+We try to avoid leaking data between different processes,
+and also to protect sensitive kernel data, like cryptographic keys.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of the document discusses (2).
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+Kernel data is sensitive when it contains cryptographic keys.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+When you touch user supplied data of *other* processes in system call
+context add lazy_clear_cpu().
+
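+A minimal sketch (illustrative only; the function and the
+copy_data_from_task() helper are made up, lazy_clear_cpu() is the
+interface added by this patch series):
+
+	/* Syscall path that reads another process's memory */
+	static int peek_other_task(struct task_struct *other,
+				   void __user *buf, size_t len)
+	{
+		int ret = copy_data_from_task(other, buf, len);
+
+		/* We touched another process's data: clear on kernel exit */
+		lazy_clear_cpu();
+		return ret;
+	}
+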
+For the cases below we care only about data from other processes.
+Touching non-cryptographic data of the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt does not touch user data directly consider marking
+it with IRQF_NO_USER.
+
+When your tasklet does not touch user data directly consider marking
+it with TASKLET_NO_USER using tasklet_init_flags or
+DECLARE_TASKLET*_NOUSER.
+
+When your timer does not touch user data mark it with TIMER_NO_USER.
+If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
+
+When your irq poll handler does not touch user data, mark it
+with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
+
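+For example, a hardware interrupt handler that only manages DMA
+descriptors could be registered like this (sketch; IRQF_NO_USER is
+added later in this series, the device names are made up):
+
+	err = request_irq(irq, mydev_interrupt, IRQF_NO_USER,
+			  "mydev", mydev);
+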
+For networking code make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that is not ensured add lazy_clear_cpu or
+lazy_clear_cpu_interrupt. When the non-skb data access is only in a
+hardware interrupt controlled by the driver, it can rely on not
+setting IRQF_NO_USER for that interrupt.
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree.
+
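+For example (sketch; the context structure is made up):
+
+	static void mycipher_free_ctx(struct mycipher_ctx *ctx)
+	{
+		/* per this series, memzero_explicit also schedules a clear */
+		memzero_explicit(ctx->key, sizeof(ctx->key));
+		kfree(ctx);
+	}
+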
+If your RCU callback touches user data add lazy_clear_cpu().
+
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which currently means only x86. But it might be worth being
+prepared in case other architectures become affected too.
+
+Implementation details/assumptions
+----------------------------------
+
+If a system call touches data it is normally data of its own process,
+so no clear is needed, because the process already has access to it.
+
+When context switching we clear data, unless the context switch
+is inside a process, or from/to idle. We also clear after any
+context switches from kernel threads.
+
+Idle does not have sensitive data, except for interrupts, which
+are handled separately.
+
+Cryptographic keys inside the kernel should be protected.
+We assume they use kzfree() or memzero_explicit() to clear
+state, so these functions trigger a cpu clear.
+
+Hard interrupts, tasklets and timers which can run asynchronously are
+assumed to touch random user data, unless they have been audited, and
+marked with NO_USER flags.
+
+Most interrupt handlers for modern devices should not touch
+user data because they rely on DMA and only manipulate
+pointers. This needs auditing to confirm though.
+
+For softirqs we assume that if they touch user data they use
+lazy_clear_cpu()/lazy_clear_cpu_interrupt() as needed.
+Networking is handled through skb_* below.
+Timer and Tasklets and IRQ poll are handled through opt-in.
+
+Scheduler softirq is assumed to not touch user data.
+
+Block softirq done callbacks are assumed to not touch user data.
+
+For networking code, any skb functions that are likely
+touching non header packet data schedule a clear cpu at next
+kernel exit. This includes skb_copy and related, skb_put/push,
+checksum functions.  We assume that any networking code touching
+packet data uses these functions.
+
+[In principle packet data should be encrypted for the wire anyway,
+but we still try to avoid leaking it.]
+
+Some IO related functions like string PIO and memcpy_from/to_io, or
+the software pci dma bounce function, which touch data, schedule a
+buffer clear.
+
+We assume NMI/machine check code does not touch other
+processes' data.
+
+Any buffer clearing is done lazily on the next kernel exit, so it
+can be triggered cheaply from fast paths.
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process the process should take care
+itself of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+We assume BPF execution does not touch other users' data, so it
+does not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent any exploits written in BPF using side channels to read
+data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a cpu clear is
+already scheduled (which means for example there was a context
+switch or a crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user and check that the BPF program being run was loaded by
+the same user, so even if data leaked it would not cross privilege
+boundaries.
+
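+A sketch of the resulting check (illustrative only; the exact code
+and field names in the later BPF patch of this series may differ):
+
+	if (!prog->aux->privileged &&
+	    (in_interrupt() || test_thread_flag(TIF_CLEAR_CPU)))
+		lazy_clear_cpu();
+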
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that nearly all do.
+
+This could be further optimized by allowing callers that do
+a lot of individual BPF runs and are sure they don't touch
+other users' data in between to do the clear only once
+at the beginning. We can add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM we clear the CPU buffers to avoid any
+leakage to the guest. Normally this is done implicitly as part of
+the L1TF mitigation.
+It relies on this being enabled. It also uses the "fast exit"
+optimization that only clears if an interrupt or context switch
+happened.
+
+There is one exception: if we don't expose MB_CLEAR to the guest it
+may be using software sequences. Unlike VERW, the software sequences
+are not atomic; they can be interrupted by the hypervisor and then
+not clear the data correctly. To avoid this we clear unconditionally
+on guest entry if MB_CLEAR is not exposed.
-- 
2.17.2


From 48b5e80a7854bbd8025ed796c47183bfc24ebf39 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Tue, 18 Dec 2018 16:40:41 -0800
Subject: [PATCH 15/32] mds: Add preliminary administrator documentation

Add a Documentation file for administrators that describes MDS on a
high level.

So far not covering SMT.

Needs updates later for public URLs of supporting documentation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/mds.rst | 128 ++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 Documentation/admin-guide/mds.rst

diff --git a/Documentation/admin-guide/mds.rst b/Documentation/admin-guide/mds.rst
new file mode 100644
index 000000000000..accae1497ae9
--- /dev/null
+++ b/Documentation/admin-guide/mds.rst
@@ -0,0 +1,128 @@
+MDS - Microarchitectural Data Sampling
+=======================================
+
+Microarchitectural Data Sampling is a side channel vulnerability that
+allows an attacker to sample data that was used earlier during
+program execution. Internal buffers in the CPU may keep old data
+for some limited time, which can then later be determined by an
+attacker with side channel analysis. MDS can be used to occasionally observe
+some values accessed earlier, but it cannot be used to observe values
+not recently touched by other code running on the same core.
+
+It is difficult to target particular data on a system using MDS,
+but attackers may be able to infer secrets by collecting
+and analyzing large amounts of data. MDS does not modify
+memory.
+
+MDS consists of multiple sub-vulnerabilities:
+
+    - Microarchitectural Store Buffer Data Sampling (MSBDS)
+      (CVE-2018-12126), which leaks store data.
+    - Microarchitectural Fill Buffer Data Sampling (MFBDS)
+      (CVE-2018-12130), which leaks load data and sometimes store data.
+    - Microarchitectural Load Port Data Sampling (MLPDS)
+      (CVE-2018-12127), which leaks load data.
+
+The effects and mitigations are similar for all three, so the Linux
+kernel handles and reports them all as a single vulnerability called
+MDS. This also reduces the number of acronyms in use.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors.
+Not all CPUs are affected by all of the sub-vulnerabilities,
+however the kernel always handles them the same way.
+
+The vulnerability is not present in:
+
+    - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+The kernel will automatically detect future CPUs with hardware
+mitigations for these issues and disable any workarounds.
+
+The kernel reports whether the current CPU is vulnerable, and any
+mitigation used, in
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
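+For example, on an affected system with updated microcode and SMT
+disabled, the file might show::
+
+    Mitigation: microcode
+
+Other possible values include "Not affected", "Vulnerable",
+"Mitigation: software buffer clearing", and variants with
+", HT vulnerable" appended while SMT is enabled.
+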
+Kernel mitigation
+-----------------
+
+By default, the kernel automatically ensures no data leakage between
+different processes, or between kernel threads and interrupt handlers
+and user processes, or from any cryptographic code in the kernel.
+
+It does not isolate kernel code that only touches data of the
+current process.  If protecting such kernel code is desired,
+mds=full can be specified.
+
+The mitigation is automatically enabled, but can be further controlled
+with the command line options documented below.
+
+The mitigation can be done either with microcode support (requiring
+updated microcode), or through software sequences on some CPUs.
+On Skylake-based CPUs only the mitigation through microcode is supported.
+In general microcode mitigation is preferred.
+
+The microcode should be loaded at early boot using the initrd. Hot
+updating microcode will not enable the mitigations.
+
+Virtual machine mitigation
+--------------------------
+
+The mitigation is enabled by default and controlled by the same options
+as the L1TF cache clearing. See l1tf.rst for more details. In the
+default setting the CPU buffers are cleared together with the L1TF
+L1D flush on guest entry, only when potentially needed.
+
+To enable the mitigation in guests it may also be needed to update
+VM configurations to include the "MB_CLEAR" CPUID bit. This will
+communicate to the guest kernel that the host has the microcode
+with mitigations applied.
+
+Kernel command line options
+---------------------------
+
+Normally the kernel selects reasonable defaults and no special configuration
+is needed. The default behavior can be overridden by the mds= kernel
+command line options.
+
+These options can be specified in the boot loader. Any changes require a reboot.
+
+When the system only runs trusted code, MDS mitigation can be disabled with
+mds=off.
+
+By default the kernel only clears CPU data after execution
+that is known or likely to have touched user data of other processes,
+or cryptographic data. This relies on code audits done in the
+mainline Linux kernel. When running large unaudited out-of-tree code,
+or binary drivers, which might violate these constraints, it is possible
+to use mds=full to always flush the CPU data on each kernel exit.
+
+By default the kernel automatically selects the microcode based ("VERW")
+mitigation, software based mitigation, or no mitigation, based on the
+CPUID information reported by the CPU. When running virtualized
+inside a guest the CPUID information might be incomplete, or report
+a different system.
+
+In this case, and when the VM configuration cannot be fixed,
+the following options can be used to select the right mitigation:
+
+   - mds=off      Disable workarounds if the CPU is not affected.
+   - mds=swclear  Host CPU doesn't have updated microcode.
+                  Use the software sequence applicable to Nehalem
+                  up to IvyBridge.
+   - mds=swclearhsw
+                  Host CPU doesn't have updated microcode.
+                  Use the software sequence applicable to Haswell
+                  and Broadwell.
+   - mds=verw     Host CPU has updated microcode.
+                  Use the microcode based ("VERW") mitigation.
+
+TBD describe SMT
+
+References
+----------
+
+For more details on the kernel internal implementation of the MDS mitigations,
+please see Documentation/clearcpu.txt
+
+TBD Add URL for Intel white paper
+
+TBD add reference to microcodes
-- 
2.17.2


From 99f84337bd1f7e510325671cd546d581a7eda4d7 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:43:43 -0800
Subject: [PATCH 16/32] x86/speculation/mds: Introduce lazy_clear_cpu

Add basic infrastructure for code to request CPU buffer clearing
on the next kernel exit.

We have two functions: lazy_clear_cpu to request clearing,
and lazy_clear_cpu_interrupt to request clearing only when
running in an interrupt.

Non architecture specific code can include linux/clearcpu.h
and use lazy_clear_cpu / lazy_clear_cpu_interrupt. On x86
we provide low level implementations that set the TIF_CLEAR_CPU
bit.
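
A typical use in code that can be called both from process context
and from an interrupt would be (sketch; the helper is made up):

	/* Called from both syscall and softirq paths */
	static void copy_payload(struct sk_buff *skb, void *dst, int len)
	{
		skb_copy_bits(skb, 0, dst, len);
		/* In an interrupt this may be another user's data */
		lazy_clear_cpu_interrupt();
	}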

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/Kconfig                    |  3 +++
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/clearcpu.h |  5 +++++
 include/linux/clearcpu.h        | 36 +++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)
 create mode 100644 include/linux/clearcpu.h

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540ffa979..32b6cd5dfe0f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,9 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_HAS_CLEAR_CPU
+	def_bool n
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d76ef308a47f 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -85,6 +85,7 @@ config X86
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_CLEAR_CPU
 	select BUILDTIME_EXTABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index cc03ca14140b..6e6f68a0cab1 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -46,6 +46,11 @@ static inline void clear_cpu_idle(void)
 	}
 }
 
+static inline void lazy_clear_cpu(void)
+{
+	set_thread_flag(TIF_CLEAR_CPU);
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #else
diff --git a/include/linux/clearcpu.h b/include/linux/clearcpu.h
new file mode 100644
index 000000000000..63a6952b46fa
--- /dev/null
+++ b/include/linux/clearcpu.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CLEARCPU_H
+#define _LINUX_CLEARCPU_H 1
+
+#include <linux/preempt.h>
+
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearcpu.h>
+#else
+static inline void lazy_clear_cpu(void)
+{
+}
+#endif
+
+/*
+ * Use this function when potentially touching (reading or writing)
+ * user data in an interrupt. In this case schedule to clear the
+ * CPU buffers on kernel exit to avoid any potential side channels.
+ *
+ * If not in an interrupt we assume the touched data belongs to the
+ * current process and doesn't need to be cleared.
+ *
+ * This version is for code that might be running in an interrupt.
+ * If you know for sure you're in interrupt context call
+ * lazy_clear_cpu directly.
+ *
+ * lazy_clear_cpu is reasonably cheap (just sets a bit) and
+ * can be used in fast paths.
+ */
+static inline void lazy_clear_cpu_interrupt(void)
+{
+	if (in_interrupt())
+		lazy_clear_cpu();
+}
+
+#endif
-- 
2.17.2


From edba96597d6e98304e50382e617adad271d843bb Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:49:56 -0800
Subject: [PATCH 17/32] x86/speculation/mds: Schedule cpu clear on context
 switch

On context switch we need to schedule a cpu clear on the next
kernel exit when:

- We're switching between different processes
- We're switching from a kernel thread that is not idle.
For idle we assume only interrupts are sensitive, which
are already handled elsewhere. For kernel threads
like work queue we assume they might contain
sensitive (other user's or crypto) data.

The code hooks into the generic context switch, not
the mm context switch, because the mm context switch
doesn't handle the idle thread case.

This also transfers the clear cpu bit to the next task.

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/process.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 898e97cf6629..e61a4d5ce917 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -2,6 +2,7 @@
 //
 // Code shared between 32 and 64 bit
 
+#include <linux/clearcpu.h>
 #include <asm/spec-ctrl.h>
 
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
@@ -29,6 +30,32 @@ static inline void switch_to_extra(struct task_struct *prev,
 		}
 	}
 
+	/*
+	 * When we switch to a different process, or we switch
+	 * from a kernel thread that was not idle, clear the CPU
+	 * buffers on next kernel exit.
+	 *
+	 * We assume that idle does not touch user data, except
+	 * for interrupts, which schedule their own clears as needed.
+	 * But other kernel threads, like work queues, might
+	 * touch user data, so flush in this case.
+	 *
+	 * This has to be here because switch_mm doesn't get
+	 * called in the kernel thread case.
+	 */
+	if (static_cpu_has(X86_BUG_MDS)) {
+		if (prev->pid && (next->mm != prev->mm || prev->mm == NULL))
+			lazy_clear_cpu();
+		/*
+		 * Also transfer the clearcpu flag from the previous task.
+		 * Can be done non atomically because interrupts are off.
+		 */
+		task_thread_info(next)->status |=
+			task_thread_info(prev)->status & _TIF_CLEAR_CPU;
+		task_thread_info(prev)->status &= ~_TIF_CLEAR_CPU;
+	}
+
+
 	/*
 	 * __switch_to_xtra() handles debug registers, i/o bitmaps,
 	 * speculation mitigations etc.
-- 
2.17.2


From ada2f08082073f066fac7c1a78847f0f8a1c9236 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 13:33:03 -0800
Subject: [PATCH 18/32] x86/speculation/mds: Add tracing for clear_cpu

Add trace points for clear_cpu and lazy_clear_cpu. This is useful
for debugging and performance testing.

The trace points have to be partially out of line to avoid
include loops, but the fast path jump labels are inlined.

The idle case cannot be traced because trace points
don't like idle context.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h       | 38 ++++++++++++++++++++++++---
 arch/x86/include/asm/trace/clearcpu.h | 27 +++++++++++++++++++
 arch/x86/kernel/cpu/bugs.c            | 17 ++++++++++++
 3 files changed, 79 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 6e6f68a0cab1..d9709a86ef1a 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -6,8 +6,31 @@
 
 #include <linux/jump_label.h>
 #include <linux/sched/smt.h>
-#include <asm/alternative.h>
 #include <linux/thread_info.h>
+#include <asm/alternative.h>
+
+/*
+ * We cannot directly include the trace point header here
+ * because it leads to include loops with other trace point
+ * files pulling this one in. Define the static
+ * key manually here, which handles noping the fast path,
+ * and the actual tracing is done out of line.
+ */
+#ifdef CONFIG_TRACEPOINTS
+#include <asm/atomic.h>
+#include <linux/tracepoint-defs.h>
+
+extern struct tracepoint __tracepoint_clear_cpu;
+extern struct tracepoint __tracepoint_lazy_clear_cpu;
+#define cc_tracepoint_active(t) static_key_false(&(t).key)
+
+extern void do_trace_clear_cpu(void);
+extern void do_trace_lazy_clear_cpu(void);
+#else
+#define cc_tracepoint_active(t) false
+static inline void do_trace_clear_cpu(void) {}
+static inline void do_trace_lazy_clear_cpu(void) {}
+#endif
 
 /*
  * Clear CPU buffers to avoid side channels.
@@ -15,7 +38,7 @@
  * "VERW" instruction), or special out of line clear sequences.
  */
 
-static inline void clear_cpu(void)
+static inline void __clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use a register */
@@ -27,6 +50,13 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+static inline void clear_cpu(void)
+{
+	if (cc_tracepoint_active(__tracepoint_clear_cpu))
+		do_trace_clear_cpu();
+	__clear_cpu();
+}
+
 /*
  * Clear CPU buffers before going idle, so that no state is leaked to SMT
  * siblings taking over thread resources.
@@ -42,12 +72,14 @@ static inline void clear_cpu_idle(void)
 {
 	if (sched_smt_active()) {
 		clear_thread_flag(TIF_CLEAR_CPU);
-		clear_cpu();
+		__clear_cpu();
 	}
 }
 
 static inline void lazy_clear_cpu(void)
 {
+	if (cc_tracepoint_active(__tracepoint_lazy_clear_cpu))
+		do_trace_lazy_clear_cpu();
 	set_thread_flag(TIF_CLEAR_CPU);
 }
 
diff --git a/arch/x86/include/asm/trace/clearcpu.h b/arch/x86/include/asm/trace/clearcpu.h
new file mode 100644
index 000000000000..e742b5cd8ee9
--- /dev/null
+++ b/arch/x86/include/asm/trace/clearcpu.h
@@ -0,0 +1,27 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM clearcpu
+
+#if !defined(_TRACE_CLEARCPU_H) || defined(TRACE_HEADER_MULTI_READ)
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(clear_cpu,
+		    TP_PROTO(int dummy),
+		    TP_ARGS(dummy),
+		    TP_STRUCT__entry(__field(int, dummy)),
+		    TP_fast_assign(),
+		    TP_printk("%d", __entry->dummy));
+
+DEFINE_EVENT(clear_cpu, clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+DEFINE_EVENT(clear_cpu, lazy_clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+
+#define _TRACE_CLEARCPU_H
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE clearcpu
+#endif /* _TRACE_CLEARCPU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index b24d93fb0564..ba4f2bb203a5 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1045,6 +1045,23 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/clearcpu.h>
+
+void do_trace_clear_cpu(void)
+{
+	trace_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(clear_cpu);
+
+void do_trace_lazy_clear_cpu(void)
+{
+	trace_lazy_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_lazy_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(lazy_clear_cpu);
+
 static const __initconst struct x86_cpu_id cpu_mds_clear_cpu[] = {
 	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM	 },
 	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_NEHALEM_G	 },
-- 
2.17.2


From ebe494ab667fb1c767e351101271cca6ee8ceb8c Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 17:26:14 -0800
Subject: [PATCH 19/32] mds: Force clear cpu on kernel preemption

When the kernel is preempted we need to force a cpu clear,
because the preemption might happen before the code
has had a chance to set TIF_CLEAR_CPU.

We cannot rely on kernel code setting the flag before
touching sensitive data: the flag setting can be
implicit, as with memzero_explicit(), which is only
called after the data has already been touched.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/sched/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6fedf3a98581..2a5e40be3cb4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11,6 +11,8 @@
 
 #include <linux/kcov.h>
 
+#include <linux/clearcpu.h>
+
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
 
@@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
+	/*
+	 * For kernel preemption we need to force a cpu clear
+	 * because it could happen before the code has a chance
+	 * to set TIF_CLEAR_CPU.
+	 */
+	lazy_clear_cpu();
+
 	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
-- 
2.17.2


From 42307b1d4d3e2b766abf83726edf6c0f49f3f800 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:44:52 -0800
Subject: [PATCH 20/32] mds: Schedule cpu clear for memzero_explicit and kzfree

Assume that any code using these functions is sensitive and shouldn't
leak any data.

This handles clearing for key data used in the kernel.
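
For illustration, key handling code that already wipes secrets with
memzero_explicit() now gets the cpu clear implicitly. A minimal sketch
(the foo_* helpers are hypothetical):

	u8 key[32];

	foo_derive_key(key, sizeof(key));
	foo_use_key(key, sizeof(key));
	/* Wipe the key; also schedules a cpu clear on next kernel exit */
	memzero_explicit(key, sizeof(key));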

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 lib/string.c     | 6 ++++++
 mm/slab_common.c | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/string.c b/lib/string.c
index 38e4ca08e757..9ce59dd86541 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -28,6 +28,7 @@
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <linux/clearcpu.h>
 
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
@@ -715,12 +716,17 @@ EXPORT_SYMBOL(memset);
  * necessary, memzero_explicit() should be used instead in
  * order to prevent the compiler from optimising away zeroing.
  *
+ * As a side effect this may also trigger extra cleaning
+ * of CPU state before the next kernel exit to avoid
+ * side channels.
+ *
  * memzero_explicit() doesn't need an arch-specific version as
  * it just invokes the one of memset() implicitly.
  */
 void memzero_explicit(void *s, size_t count)
 {
 	memset(s, 0, count);
+	lazy_clear_cpu();
 	barrier_data(s);
 }
 EXPORT_SYMBOL(memzero_explicit);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 7eb8dc136c1c..141024fd43f8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1551,6 +1551,9 @@ EXPORT_SYMBOL(krealloc);
  * Note: this function zeroes the whole allocated buffer which can be a good
  * deal bigger than the requested buffer size passed to kmalloc(). So be
  * careful when using this function in performance sensitive code.
+ *
+ * As a side effect this may also clear CPU state later before the
+ * next kernel exit to avoid side channels.
  */
 void kzfree(const void *p)
 {
@@ -1560,7 +1563,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.17.2


From 7212e63c1b56abe135bcd5119344ac5c98b863bc Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:51:44 -0800
Subject: [PATCH 21/32] mds: Mark interrupts clear cpu, unless opted-out

Interrupts might touch user data from other processes
in any context.

By default we clear the CPU on the next kernel exit.

Add a new IRQF_NO_USER interrupt flag. When the flag
is not set on interrupt execution we clear the cpu state on
next kernel exit.

This allows interrupts to opt-out from the extra clearing
overhead, but is safe by default.

Over time as more interrupt code is audited it can set the opt-out.
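
For illustration, an audited driver whose handler only touches device
registers could opt out like this (a sketch; the foo_* names are
hypothetical):

	err = request_irq(irq, foo_interrupt,
			  IRQF_SHARED | IRQF_NO_USER, "foo", foo_dev);
	if (err)
		return err;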

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 2 ++
 kernel/irq/handle.c       | 8 ++++++++
 kernel/irq/manage.c       | 1 +
 3 files changed, 11 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 1d6711c28271..65c957e3db68 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,7 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_USER	- Interrupt does not touch user data
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +75,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_USER		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e5910938ce2b 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/clearcpu.h>
 
 #include <trace/events/irq.h>
 
@@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 		res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
 
+		/*
+		 * We aren't sure if the interrupt handler did or did not
+		 * touch user data. Schedule a cpu clear just in case.
+		 */
+		if (!(action->flags & IRQF_NO_USER))
+			lazy_clear_cpu();
+
 		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pF enabled interrupts\n",
 			      irq, action->handler))
 			local_irq_disable();
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 9dbdccab3b6a..80a9383ea993 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1793,6 +1793,7 @@ EXPORT_SYMBOL(free_irq);
  *
  *	IRQF_SHARED		Interrupt is shared
  *	IRQF_TRIGGER_*		Specify active edge(s) or level
+ *	IRQF_NO_USER		Does not touch user data.
  *
  */
 int request_threaded_irq(unsigned int irq, irq_handler_t handler,
-- 
2.17.2


From 8a70407bdcceeac0ce254bab4f3c2977c2a779ae Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:46:09 -0800
Subject: [PATCH 22/32] mds: Clear cpu on all timers, unless the timer opts-out

By default we assume timers might touch user data and schedule
a cpu clear on next kernel exit.

Support opt-outs where timer and hrtimer handlers can declare
that they don't touch any user data.

Note this takes one bit away from the timer wheel index field,
but it seems fewer wheel levels are needed anyway, so that
should be ok.
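
For illustration, an audited hrtimer whose handler never touches user
data would be set up like this (a sketch; the foo_* names are
hypothetical):

	hrtimer_init(&foo->timer, CLOCK_MONOTONIC,
		     HRTIMER_MODE_REL | HRTIMER_MODE_NO_USER);
	foo->timer.function = foo_timer_fn;
	hrtimer_start(&foo->timer, ms_to_ktime(100), HRTIMER_MODE_REL);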

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/hrtimer.h | 4 ++++
 include/linux/timer.h   | 9 ++++++---
 kernel/time/hrtimer.c   | 5 +++++
 kernel/time/timer.c     | 8 ++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 3892e9c8b2de..463579d05415 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -35,6 +35,7 @@ struct hrtimer_cpu_base;
  *				  when starting the timer)
  * HRTIMER_MODE_SOFT		- Timer callback function will be executed in
  *				  soft irq context
+ * HRTIMER_MODE_NO_USER		- Handler does not touch user data.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -51,6 +52,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
 	HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,
 
+	HRTIMER_MODE_NO_USER	= 0x08,
 };
 
 /*
@@ -104,6 +106,7 @@ enum hrtimer_restart {
  * @state:	state information (See bit values above)
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
+ * @no_user:	function does not touch user data.
  *
  * The hrtimer structure must be initialized by hrtimer_init()
  */
@@ -115,6 +118,7 @@ struct hrtimer {
 	u8				state;
 	u8				is_rel;
 	u8				is_soft;
+	u8				no_user;
 };
 
 /**
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 7b066fd38248..222e72432be3 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -56,10 +56,13 @@ struct timer_list {
 #define TIMER_DEFERRABLE	0x00080000
 #define TIMER_PINNED		0x00100000
 #define TIMER_IRQSAFE		0x00200000
-#define TIMER_ARRAYSHIFT	22
-#define TIMER_ARRAYMASK		0xFFC00000
+#define TIMER_NO_USER		0x00400000
+#define TIMER_ARRAYSHIFT	23
+#define TIMER_ARRAYMASK		0xFF800000
 
-#define TIMER_TRACE_FLAGMASK	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
+#define TIMER_TRACE_FLAGMASK	\
+	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | \
+	 TIMER_NO_USER)
 
 #define __TIMER_INITIALIZER(_function, _flags) {		\
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 9cdd74bd2d27..7e8e89a47d12 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -51,6 +51,7 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 
@@ -1285,6 +1286,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		clock_id = CLOCK_MONOTONIC;
 
 	base += hrtimer_clockid_to_base(clock_id);
+	timer->no_user = !!(mode & HRTIMER_MODE_NO_USER);
 	timer->is_soft = softtimer;
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
@@ -1399,6 +1401,9 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	trace_hrtimer_expire_exit(timer);
 	raw_spin_lock_irq(&cpu_base->lock);
 
+	if (!timer->no_user)
+		lazy_clear_cpu();
+
 	/*
 	 * Note: We clear the running state after enqueue_hrtimer and
 	 * we do not reprogram the event hardware. Happens either in
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index fa49cd753dea..d05ba85bdc4b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -44,6 +44,7 @@
 #include <linux/sched/debug.h>
 #include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -1339,6 +1340,13 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(struct timer_list
 		 */
 		preempt_count_set(count);
 	}
+
+	/*
+	 * The timer might have touched user data. Schedule
+	 * a cpu clear on the next kernel exit.
+	 */
+	if (!(timer->flags & TIMER_NO_USER))
+		lazy_clear_cpu();
 }
 
 static void expire_timers(struct timer_base *base, struct hlist_head *head)
-- 
2.17.2


From fcc041b1e75faf557c1f7cdeff7bbdd3014a2ca4 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 13 Dec 2018 11:28:55 -0800
Subject: [PATCH 23/32] mds: Clear CPU on tasklets, unless opted-out

By default we assume tasklets might touch user data and schedule
a cpu clear on next kernel exit.

Add new interfaces to allow audited tasklets to opt-out.
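
For illustration, an audited tasklet can opt out either statically or
at init time (a sketch; the foo_* names are hypothetical):

	DECLARE_TASKLET_NOUSER(foo_tasklet, foo_func, 0);

	/* or, for dynamically initialized tasklets: */
	tasklet_init_flags(&foo->tasklet, foo_func,
			   (unsigned long)foo, TASKLET_NO_USER);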

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 16 +++++++++++++++-
 kernel/softirq.c          | 25 +++++++++++++++++++------
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 65c957e3db68..65158a13c8cb 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -556,11 +556,22 @@ struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 #define DECLARE_TASKLET_DISABLED(name, func, data) \
 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
 
+#define DECLARE_TASKLET_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(0), func, data }
+
+#define DECLARE_TASKLET_DISABLED_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(1), func, data }
 
 enum
 {
 	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
-	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
+	TASKLET_STATE_RUN,	/* Tasklet is running (SMP only) */
+
+	/*
+	 * Set this flag when the tasklet is known to not touch user data,
+	 * so doesn't need extra CPU state clearing.
+	 */
+	TASKLET_NO_USER		= 1 << 5,
 };
 
 #ifdef CONFIG_SMP
@@ -624,6 +635,9 @@ extern void tasklet_kill(struct tasklet_struct *t);
 extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
+extern void tasklet_init_flags(struct tasklet_struct *t,
+			 void (*func)(unsigned long), unsigned long data,
+			 unsigned flags);
 
 struct tasklet_hrtimer {
 	struct hrtimer		timer;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d28813306b2c..fdd4e3be3db7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/clearcpu.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -522,6 +523,8 @@ static void tasklet_action_common(struct softirq_action *a,
 					BUG();
 				t->func(t->data);
 				tasklet_unlock(t);
+				if (!(t->state & TASKLET_NO_USER))
+					lazy_clear_cpu();
 				continue;
 			}
 			tasklet_unlock(t);
@@ -546,15 +549,23 @@ static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
 }
 
-void tasklet_init(struct tasklet_struct *t,
-		  void (*func)(unsigned long), unsigned long data)
+void tasklet_init_flags(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data,
+		  unsigned flags)
 {
 	t->next = NULL;
-	t->state = 0;
+	t->state = flags;
 	atomic_set(&t->count, 0);
 	t->func = func;
 	t->data = data;
 }
+EXPORT_SYMBOL(tasklet_init_flags);
+
+void tasklet_init(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data)
+{
+	tasklet_init_flags(t, func, data, 0);
+}
 EXPORT_SYMBOL(tasklet_init);
 
 void tasklet_kill(struct tasklet_struct *t)
@@ -609,7 +620,8 @@ static void __tasklet_hrtimer_trampoline(unsigned long data)
  * @ttimer:	 tasklet_hrtimer which is initialized
  * @function:	 hrtimer callback function which gets called from softirq context
  * @which_clock: clock id (CLOCK_MONOTONIC/CLOCK_REALTIME)
- * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL)
+ * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL),
+ *		 HRTIMER_MODE_NO_USER
  */
 void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 			  enum hrtimer_restart (*function)(struct hrtimer *),
@@ -617,8 +629,9 @@ void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 {
 	hrtimer_init(&ttimer->timer, which_clock, mode);
 	ttimer->timer.function = __hrtimer_tasklet_trampoline;
-	tasklet_init(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
-		     (unsigned long)ttimer);
+	tasklet_init_flags(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
+		     (unsigned long)ttimer,
+		     (mode & HRTIMER_MODE_NO_USER) ? TASKLET_NO_USER : 0);
 	ttimer->function = function;
 }
 EXPORT_SYMBOL_GPL(tasklet_hrtimer_init);
-- 
2.17.2


From 5dfb69f574de53f23f543ff5d46def246768b7c5 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 13:29:35 -0800
Subject: [PATCH 24/32] mds: Clear CPU on irq poll, unless opted-out

By default we assume that irq poll handlers running in the irq poll
softirq might touch user data and we schedule a cpu clear on next
kernel exit.

Add interfaces for audited handlers to declare that they are safe.
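
For illustration, an audited irq poll handler would be initialized like
this (a sketch; the foo_* names are hypothetical):

	irq_poll_init_flags(&foo->iop, foo_budget, foo_poll_fn,
			    IRQ_POLL_F_NO_USER);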

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/irq_poll.h |  2 ++
 lib/irq_poll.c           | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..5f13582f1b8e 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -15,6 +15,8 @@ struct irq_poll {
 enum {
 	IRQ_POLL_F_SCHED	= 0,
 	IRQ_POLL_F_DISABLE	= 1,
+
+	IRQ_POLL_F_NO_USER	= 1<<4,
 };
 
 extern void irq_poll_sched(struct irq_poll *);
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..cb19431f53ec 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -11,6 +11,7 @@
 #include <linux/cpu.h>
 #include <linux/irq_poll.h>
 #include <linux/delay.h>
+#include <linux/clearcpu.h>
 
 static unsigned int irq_poll_budget __read_mostly = 256;
 
@@ -111,6 +112,9 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
+		if (!(iop->state & IRQ_POLL_F_NO_USER))
+			lazy_clear_cpu();
+
 		local_irq_disable();
 
 		/*
@@ -168,21 +172,31 @@ void irq_poll_enable(struct irq_poll *iop)
 EXPORT_SYMBOL(irq_poll_enable);
 
 /**
- * irq_poll_init - Initialize this @iop
+ * irq_poll_init_flags - Initialize this @iop
  * @iop:      The parent iopoll structure
  * @weight:   The default weight (or command completion budget)
  * @poll_fn:  The handler to invoke
+ * @flags:    IRQ_POLL_F_NO_USER if callback does not touch user data.
  *
  * Description:
  *     Initialize and enable this irq_poll structure.
  **/
-void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+void irq_poll_init_flags(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn,
+			 int flags)
 {
 	memset(iop, 0, sizeof(*iop));
 	INIT_LIST_HEAD(&iop->list);
 	iop->weight = weight;
 	iop->poll = poll_fn;
+	iop->state = flags;
 }
+EXPORT_SYMBOL(irq_poll_init_flags);
+
+void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+{
+	irq_poll_init_flags(iop, weight, poll_fn, 0);
+}
+
 EXPORT_SYMBOL(irq_poll_init);
 
 static int irq_poll_cpu_dead(unsigned int cpu)
-- 
2.17.2


From aa0c2c2815c8c9bb12437e6d0ec2a4613f5dc7d5 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 13 Dec 2018 11:28:23 -0800
Subject: [PATCH 25/32] mds: Clear cpu for string io/memcpy_*io in interrupts

Schedule a cpu clear on next kernel exit for string PIO
or memcpy_from/to_io calls when they are called in
interrupts.

The PIO case is likely already covered because old drivers
do not opt their interrupt handlers out of clearing,
but let's do it just to be sure.
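
lazy_clear_cpu_interrupt() is introduced earlier in this series; based
on its usage here it is assumed to behave roughly like this sketch:

	static inline void lazy_clear_cpu_interrupt(void)
	{
		/* Only schedule a clear when running in interrupt context */
		if (in_interrupt())
			lazy_clear_cpu();
	}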

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/io.h | 3 +++
 include/asm-generic/io.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 832da8229cc7..2b9fb7890f0e 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/clearcpu.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -313,6 +314,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 			     : "+S"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }									\
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
@@ -329,6 +331,7 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 			     : "+D"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }
 
 BUILDIO(b, b, char)
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index d356f802945a..cf58bceea042 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -14,6 +14,7 @@
 #include <asm/page.h> /* I/O is all done through memory accesses */
 #include <linux/string.h> /* for memset() and memcpy() */
 #include <linux/types.h>
+#include <linux/clearcpu.h>
 
 #ifdef CONFIG_GENERIC_IOMAP
 #include <asm-generic/iomap.h>
@@ -1115,6 +1116,7 @@ static inline void memcpy_fromio(void *buffer,
 				 size_t size)
 {
 	memcpy(buffer, __io_virt(addr), size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
@@ -1132,6 +1134,7 @@ static inline void memcpy_toio(volatile void __iomem *addr, const void *buffer,
 			       size_t size)
 {
 	memcpy(__io_virt(addr), buffer, size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
-- 
2.17.2


From af779e6d1a2f3ab7d6ec5a9ac9b9c8c6ef438b42 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 13 Dec 2018 11:29:09 -0800
Subject: [PATCH 26/32] mds: Schedule clear cpu in swiotlb

Schedule a cpu clear on next kernel exit for swiotlb running
in interrupt context, since it touches user data with the CPU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/dma/swiotlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..a72b9dbb39ae 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -35,6 +35,7 @@
 #include <linux/scatterlist.h>
 #include <linux/mem_encrypt.h>
 #include <linux/set_memory.h>
+#include <linux/clearcpu.h>
 
 #include <asm/io.h>
 #include <asm/dma.h>
@@ -426,6 +427,7 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 	} else {
 		memcpy(phys_to_virt(orig_addr), vaddr, size);
 	}
+	lazy_clear_cpu_interrupt();
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
-- 
2.17.2


From b8986a192df44b94924293028f75940f89fd855c Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:44:07 -0800
Subject: [PATCH 27/32] mds: Instrument skb functions to clear cpu
 automatically

Instrument some strategic skbuff functions that either touch
packet data directly, or are likely followed by a user
data touch like a memcpy, to schedule a cpu clear on next
kernel exit. This is only done inside interrupts;
outside of them we assume the code only touches the current process's data.

In principle network data should be encrypted anyway,
but it's better not to leak it.

This provides protection for the network softirq.

Needs more auditing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0d1b2c3f127b..af90474c122f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <linux/clearcpu.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -2077,6 +2078,7 @@ static inline void *__skb_put(struct sk_buff *skb, unsigned int len)
 	SKB_LINEAR_ASSERT(skb);
 	skb->tail += len;
 	skb->len  += len;
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a8217e221e19..3e5060b7712b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1184,6 +1184,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (!num_frags)
 		goto release;
 
+	/* Likely to copy user data */
+	lazy_clear_cpu_interrupt();
+
 	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
@@ -1348,6 +1351,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 	if (!n)
 		return NULL;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	/* Set the data pointer */
 	skb_reserve(n, headerlen);
 	/* Set the tail pointer and length */
@@ -1583,6 +1589,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	if (!n)
 		return NULL;
 
+	/* May copy user data */
+	lazy_clear_cpu_interrupt();
+
 	skb_reserve(n, newheadroom);
 
 	/* Set the tail pointer and length */
@@ -1671,6 +1680,8 @@ EXPORT_SYMBOL(__skb_pad);
 
 void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len)
 {
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	if (tail != skb) {
 		skb->data_len += len;
 		skb->len += len;
@@ -1696,6 +1707,8 @@ void *skb_put(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->tail > skb->end))
 		skb_over_panic(skb, len, __builtin_return_address(0));
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 EXPORT_SYMBOL(skb_put);
@@ -1715,6 +1728,7 @@ void *skb_push(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->data < skb->head))
 		skb_under_panic(skb, len, __builtin_return_address(0));
+	/* No clear cpu, assume this is only header data */
 	return skb->data;
 }
 EXPORT_SYMBOL(skb_push);
@@ -2023,6 +2037,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2397,6 +2414,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2477,6 +2497,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Checksum header. */
 	if (copy > 0) {
 		if (copy > len)
@@ -2569,6 +2592,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Copy header. */
 	if (copy > 0) {
 		if (copy > len)
-- 
2.17.2


From 76c40f15daaf7209e9f91a34f56475d797dba851 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 13 Dec 2018 11:31:11 -0800
Subject: [PATCH 28/32] mds: Opt out tcp tasklet to not touch user data

Mark the tcp tasklet as not needing an implicit cpu clear.
If any clear is needed it will be triggered by the skb_*
hooks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3f510cad0b3e..40c2c6134b4b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -903,9 +903,10 @@ void __init tcp_tasklet_init(void)
 		struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
 
 		INIT_LIST_HEAD(&tsq->head);
-		tasklet_init(&tsq->tasklet,
+		tasklet_init_flags(&tsq->tasklet,
 			     tcp_tasklet_func,
-			     (unsigned long)tsq);
+			     (unsigned long)tsq,
+			     TASKLET_NO_USER);
 	}
 }
 
-- 
2.17.2


From f27651c48795556c2f96cc73395e303453c653e7 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 13 Dec 2018 11:30:30 -0800
Subject: [PATCH 29/32] mds: mark kernel/* timers safe as not touching user
 data

Some preliminary auditing of kernel/* shows no timers touch
other processes' user data. Mark all the timers in kernel/*
as not needing an implicit cpu clear.

More auditing here would be useful.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/events/core.c       | 6 ++++--
 kernel/fork.c              | 3 ++-
 kernel/futex.c             | 6 +++---
 kernel/sched/core.c        | 5 +++--
 kernel/sched/deadline.c    | 6 ++++--
 kernel/sched/fair.c        | 6 ++++--
 kernel/sched/idle.c        | 3 ++-
 kernel/sched/rt.c          | 3 ++-
 kernel/time/alarmtimer.c   | 2 +-
 kernel/time/hrtimer.c      | 6 +++---
 kernel/time/posix-timers.c | 6 ++++--
 kernel/time/sched_clock.c  | 3 ++-
 kernel/time/tick-sched.c   | 6 ++++--
 kernel/watchdog.c          | 3 ++-
 14 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 84530ab358c3..1a96e35ce95a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1102,7 +1102,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
 	raw_spin_lock_init(&cpuctx->hrtimer_lock);
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -9202,7 +9203,8 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hwc->hrtimer.function = perf_swevent_hrtimer;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff89c7b..b54d3efbd9b4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1540,7 +1540,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 #ifdef CONFIG_POSIX_TIMERS
 	INIT_LIST_HEAD(&sig->posix_timers);
-	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sig->real_timer.function = it_real_fn;
 #endif
 
diff --git a/kernel/futex.c b/kernel/futex.c
index f423f9b6577e..bd71f7887a4d 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2626,7 +2626,7 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
@@ -2727,7 +2727,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires(&to->timer, *time);
 	}
@@ -3127,7 +3127,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a5e40be3cb4..0e9d8d450dae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -302,7 +302,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 */
 	delay = max_t(u64, delay, 10000LL);
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
-		      HRTIMER_MODE_REL_PINNED);
+		      HRTIMER_MODE_REL_PINNED|HRTIMER_MODE_NO_USER);
 }
 #endif /* CONFIG_SMP */
 
@@ -316,7 +316,8 @@ static void hrtick_rq_init(struct rq *rq)
 	rq->hrtick_csd.info = rq;
 #endif
 
-	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rq->hrtick_timer.function = hrtick;
 }
 #else	/* CONFIG_SCHED_HRTICK */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91e4202b0634..471413fa8bb0 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1054,7 +1054,8 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = dl_task_timer;
 }
 
@@ -1293,7 +1294,8 @@ void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = inactive_task_timer;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 98e7f1e64a0f..89f1bf663c42 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4880,9 +4880,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
-	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->distribute_running = 0;
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f5516bae0c1b..6a4cc46d8c4b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -330,7 +330,8 @@ void play_idle(unsigned long duration_ms)
 	cpuidle_use_deepest_state(true);
 
 	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC,
+			      HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	it.timer.function = idle_inject_timer_fn;
 	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a21ea6021929..ef81a93cc87b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -46,7 +46,8 @@ void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
 	raw_spin_lock_init(&rt_b->rt_runtime_lock);
 
 	hrtimer_init(&rt_b->rt_period_timer,
-			CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+			CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rt_b->rt_period_timer.function = sched_rt_period_timer;
 }
 
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index fa5de5e8de61..736d3bdbcf25 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -347,7 +347,7 @@ void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		enum alarmtimer_restart (*function)(struct alarm *, ktime_t))
 {
 	hrtimer_init(&alarm->timer, alarm_bases[type].base_clockid,
-		     HRTIMER_MODE_ABS);
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	__alarm_init(alarm, type, function);
 }
 EXPORT_SYMBOL_GPL(alarm_init);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 7e8e89a47d12..1fe30427f81a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1722,7 +1722,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	int ret;
 
 	hrtimer_init_on_stack(&t.timer, restart->nanosleep.clockid,
-				HRTIMER_MODE_ABS);
+				HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
 
 	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
@@ -1742,7 +1742,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
 	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
-	hrtimer_init_on_stack(&t.timer, clockid, mode);
+	hrtimer_init_on_stack(&t.timer, clockid, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, timespec64_to_ktime(*rqtp), slack);
 	ret = do_nanosleep(&t, mode);
 	if (ret != -ERESTART_RESTARTBLOCK)
@@ -1941,7 +1941,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
 		return -EINTR;
 	}
 
-	hrtimer_init_on_stack(&t.timer, clock_id, mode);
+	hrtimer_init_on_stack(&t.timer, clock_id, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, *expires, delta);
 
 	hrtimer_init_sleeper(&t, current);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index bd62b5eeb5a0..1435ad7f8360 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -488,7 +488,8 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 
 static int common_timer_create(struct k_itimer *new_timer)
 {
-	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock, 0);
+	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock,
+		HRTIMER_MODE_NO_USER);
 	return 0;
 }
 
@@ -813,7 +814,8 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
 	if (timr->it_clock == CLOCK_REALTIME)
 		timr->kclock = absolute ? &clock_realtime : &clock_monotonic;
 
-	hrtimer_init(&timr->it.real.timer, timr->it_clock, mode);
+	hrtimer_init(&timr->it.real.timer, timr->it_clock,
+		     mode|HRTIMER_MODE_NO_USER);
 	timr->it.real.timer.function = posix_timer_fn;
 
 	if (!absolute)
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index cbc72c2c1fca..cda4185c4324 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -252,7 +252,8 @@ void __init generic_sched_clock_init(void)
 	 * Start the timer to keep sched_clock() properly updated and
 	 * sets the initial epoch.
 	 */
-	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sched_clock_timer.function = sched_clock_poll;
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b88474..19f06e71fce3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1208,7 +1208,8 @@ static void tick_nohz_switch_to_nohz(void)
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	/* Get the next period */
 	next = tick_init_jiffy_update();
 
@@ -1305,7 +1306,8 @@ void tick_setup_sched_timer(void)
 	/*
 	 * Emulate tick processing via per-CPU hrtimers:
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	ts->sched_timer.function = tick_sched_timer;
 
 	/* Get the next period (per-CPU) */
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 977918d5d350..d3c9da0a4fce 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -483,7 +483,8 @@ static void watchdog_enable(unsigned int cpu)
 	 * Start the timer first to prevent the NMI watchdog triggering
 	 * before the timer has a chance to fire.
 	 */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hrtimer->function = watchdog_timer_fn;
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
-- 
2.17.2


From 7d2634341f2584976e8f66e4caf266342ef57c5b Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Wed, 12 Dec 2018 16:51:22 -0800
Subject: [PATCH 30/32] mds: Mark AHCI interrupt as not needing cpu clear

AHCI interrupt handlers never touch user data with the CPU.

This is mainly to get the number of clears down on my test system.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/ata/ahci.c    |  2 +-
 drivers/ata/ahci.h    |  2 ++
 drivers/ata/libahci.c | 40 ++++++++++++++++++++++++----------------
 3 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 021ce46e2e57..1455ad89d2f9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1865,7 +1865,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	pci_set_master(pdev);
 
-	rc = ahci_host_activate(host, &ahci_sht);
+	rc = ahci_host_activate_irqflags(host, &ahci_sht, IRQF_NO_USER);
 	if (rc)
 		return rc;
 
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index ef356e70e6de..42a3474f26b6 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -430,6 +430,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 int ahci_reset_em(struct ata_host *host);
 void ahci_print_info(struct ata_host *host, const char *scc_s);
 int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht);
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags);
 void ahci_error_handler(struct ata_port *ap);
 u32 ahci_handle_port_intr(struct ata_host *host, u32 irq_masked);
 
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index b5f57c69c487..b32664c7d8a1 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -2548,7 +2548,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 EXPORT_SYMBOL_GPL(ahci_set_em_messages);
 
 static int ahci_host_activate_multi_irqs(struct ata_host *host,
-					 struct scsi_host_template *sht)
+					 struct scsi_host_template *sht,
+					 int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int i, rc;
@@ -2571,7 +2572,7 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 		}
 
 		rc = devm_request_irq(host->dev, irq, ahci_multi_irqs_intr_hard,
-				0, pp->irq_desc, host->ports[i]);
+				irqflags, pp->irq_desc, host->ports[i]);
 
 		if (rc)
 			return rc;
@@ -2581,18 +2582,8 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 	return ata_host_register(host, sht);
 }
 
-/**
- *	ahci_host_activate - start AHCI host, request IRQs and register it
- *	@host: target ATA host
- *	@sht: scsi_host_template to use when registering the host
- *
- *	LOCKING:
- *	Inherited from calling layer (may sleep).
- *
- *	RETURNS:
- *	0 on success, -errno otherwise.
- */
-int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int irq = hpriv->irq;
@@ -2608,15 +2599,32 @@ int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
 			return -EIO;
 		}
 
-		rc = ahci_host_activate_multi_irqs(host, sht);
+		rc = ahci_host_activate_multi_irqs(host, sht, irqflags);
 	} else {
 		rc = ata_host_activate(host, irq, hpriv->irq_handler,
-				       IRQF_SHARED, sht);
+				       irqflags|IRQF_SHARED, sht);
 	}
 
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(ahci_host_activate_irqflags);
+
+/**
+ *	ahci_host_activate - start AHCI host, request IRQs and register it
+ *	@host: target ATA host
+ *	@sht: scsi_host_template to use when registering the host
+ *
+ *	LOCKING:
+ *	Inherited from calling layer (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, -errno otherwise.
+ */
+int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+{
+	return ahci_host_activate_irqflags(host, sht, 0);
+}
 EXPORT_SYMBOL_GPL(ahci_host_activate);
 
 MODULE_AUTHOR("Jeff Garzik");
-- 
2.17.2


From ff500d0db42cab1c71f83f646685db083b26ec2c Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 14 Dec 2018 15:21:07 -0800
Subject: [PATCH 31/32] mds: Mark ACPI interrupt as not needing cpu clear

ACPI doesn't touch any user data, so it doesn't need a cpu clear.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/acpi/osl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index b48874b8e1ea..380b6ba8f0ce 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,7 +572,8 @@ acpi_os_install_interrupt_handler(u32 gsi, acpi_osd_handler handler,
 
 	acpi_irq_handler = handler;
 	acpi_irq_context = context;
-	if (request_irq(irq, acpi_irq, IRQF_SHARED, "acpi", acpi_irq)) {
+	if (request_irq(irq, acpi_irq, IRQF_SHARED|IRQF_NO_USER,
+				"acpi", acpi_irq)) {
 		printk(KERN_ERR PREFIX "SCI (IRQ%d) allocation failed\n", irq);
 		acpi_irq_handler = NULL;
 		return AE_NOT_ACQUIRED;
-- 
2.17.2


From e05fa9405819713476e75859f8784508f8808f32 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Tue, 18 Dec 2018 16:46:10 -0800
Subject: [PATCH 32/32] mds: Mitigate BPF

BPF allows the user to run untrusted code in the kernel.

Normally MDS would allow some information leakage, either
from other processes or from sensitive kernel code, to the
user-controlled BPF code. We cannot rule out that BPF code
contains an MDS exploit, and it is difficult to pattern match.

The patch aims to add a limited number of cpu clears
before BPF executions to make eBPF execution safe.

We assume BPF execution does not touch other users' data, so it
does not need to schedule a clear for itself.

For eBPF programs loaded by a privileged user we never clear.

When the BPF program was loaded unprivileged, clear the CPU
before the BPF execution, depending on the context it is running in:

We only do this when running in an interrupt, or if a cpu clear is
already scheduled (which means, for example, there was a context
switch or crypto operation before).

In process context we check if the current process
has the same userns+euid as the process that created the BPF
program. This handles the common seccomp filter case without
any extra clears, but still adds clears when e.g. a socket
filter runs on a socket inherited by a process with a different user id.

We also always clear when an earlier kernel subsystem scheduled
a clear, e.g. after a context switch or running crypto code.

Technically we would only need to do this if the BPF program
contains conditional branches and loads dominated by them, but
let's assume that nearly all do.

For example, when running chromium with seccomp filters I see
that only 15-18% of all sandbox system calls have a clear; most
are likely caused by context switches.

Unprivileged eBPF usage in interrupts currently always clears.

This could be further optimized by allowing callers that do
a lot of individual BPF runs, and are sure they don't touch
other users' data (which is not accessible to the eBPF program
anyway) in between, to do the clear only once at the beginning.
We can add such optimizations later based on profile data.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearbpf.h | 29 +++++++++++++++++++++++++++++
 include/linux/filter.h          | 21 +++++++++++++++++++--
 kernel/bpf/core.c               |  2 ++
 3 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/clearbpf.h

diff --git a/arch/x86/include/asm/clearbpf.h b/arch/x86/include/asm/clearbpf.h
new file mode 100644
index 000000000000..dc1756722b48
--- /dev/null
+++ b/arch/x86/include/asm/clearbpf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARBPF_H
+#define _ASM_CLEARBPF_H 1
+
+#include <linux/clearcpu.h>
+#include <linux/cred.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * When the BPF program was loaded unprivileged, clear the CPU
+ * to prevent any exploits written in BPF using side channels to read
+ * data leaked from other kernel code. In some cases, like
+ * process context with the same uid, we can avoid it.
+ *
+ * See Documentation/clearcpu.txt for more details.
+ */
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+	if (!static_cpu_has(X86_BUG_MDS))
+		return;
+	if (in_interrupt() ||
+		test_thread_flag(TIF_CLEAR_CPU) ||
+		!uid_eq(current_euid(), uid)) {
+		clear_cpu();
+		clear_thread_flag(TIF_CLEAR_CPU);
+	}
+}
+
+#endif
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 448dcc448f1f..d49bdaaefd02 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,12 +20,21 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/clearcpu.h>
 
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
 #include <uapi/linux/bpf.h>
 
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearbpf.h>
+#else
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+}
+#endif
+
 struct sk_buff;
 struct sock;
 struct seccomp_data;
@@ -487,7 +496,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				priv:1;		/* Was loaded privileged */
+	kuid_t			uid;		/* Original uid who created it */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -510,7 +521,13 @@ struct sk_filter {
 	struct bpf_prog	*prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+static inline unsigned _bpf_prog_run(const struct bpf_prog *bp, const void *ctx)
+{
+	if (!bp->priv)
+		arch_bpf_prepare_nonpriv(bp->uid);
+	return bp->bpf_func(ctx, bp->insnsi);
+}
+#define BPF_PROG_RUN(filter, ctx) _bpf_prog_run(filter, ctx)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..90f13b1a8d67 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -96,6 +96,8 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 	fp->aux = aux;
 	fp->aux->prog = fp;
 	fp->jit_requested = ebpf_jit_enabled();
+	fp->priv = !!capable(CAP_SYS_ADMIN);
+	fp->uid = current_euid();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
 
-- 
2.17.2


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:18 ` Konrad Rzeszutek Wilk
@ 2019-01-09 17:41   ` Andi Kleen
  2019-01-09 18:09     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 17:41 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 12:18:04PM -0500, speck for Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 20, 2018 at 04:27:10PM -0800, speck for Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > Subject:  MDSv3
> > 
> > Here's a new version of flushing CPU buffers for group 4.
> 
> Could you send also a git bundle of them please? The titles of them is not in sync with the XX/YY.
> I can probably figure out the right flow but it would help (also helps in review).
> 
> Thank you!
> 
> And one more thing - I see 'MB' and 'MDS' and also 'MSB' (Microarchitectural
> Store Buffer).

The code all uses MDS / mds

The only thing called MB_* is the cpuid bit because that is the name
the documentation uses.

MSBDS should only appear in documentation/descriptions, referring to the
official acronym from the white papers.

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:38     ` Linus Torvalds
@ 2019-01-09 18:06       ` Andi Kleen
  2019-01-09 18:14         ` Linus Torvalds
  0 siblings, 1 reply; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 18:06 UTC (permalink / raw)
  To: speck

> > I don't know what parts of Intel you're talking with, but the parts
> > I talk with say that it's likely some CPUs won't be able to do the
> > microcode update for VERW.
> 
> Right. And they say that because nobody cares.

There are two different cases:

- Some old CPUs won't get any support. That's the "nobody cares"
case. I already don't support those in the patch (e.g. pre-Nehalem).

- Some newer CPUs where people care and which are widely used, but which ran
out of space for the microcode patch. So they would like to, but can't. Those
are the cases I'm trying to handle.

> I will not take that crazy per-microarchitecture software sequence
> THAT DOESN'T EVEN WORK. Even on the micro-architectures it is designed
> for, virtualization and SMM break it.

Virtualization works, as long as the hypervisor supports it. I fixed
KVM to do so (it's in the patchkit). Basically it needs to always
flush on entry if VERW is not exposed to the guest. This ensures
that even if the sequence is interrupted, a clear happens anyway.

SMM is a problem on CPUs without a microcode update. It could be handled
by a BIOS update (and some vendors might do that).
However SMM is quite rare and normally not user controllable (with the
recent fixes we did for retpoline), so I think it falls into your
"theoretical only / uninteresting" cases bucket.

Does that change anything in your opinion?

-Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:41   ` Andi Kleen
@ 2019-01-09 18:09     ` Konrad Rzeszutek Wilk
  2019-01-09 18:42       ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Konrad Rzeszutek Wilk @ 2019-01-09 18:09 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 09:41:27AM -0800, speck for Andi Kleen wrote:
> On Wed, Jan 09, 2019 at 12:18:04PM -0500, speck for Konrad Rzeszutek Wilk wrote:
> > On Thu, Dec 20, 2018 at 04:27:10PM -0800, speck for Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > Subject:  MDSv3
> > > 
> > > Here's a new version of flushing CPU buffers for group 4.
> > 
> > Could you send also a git bundle of them please? The titles of them is not in sync with the XX/YY.
> > I can probably figure out the right flow but it would help (also helps in review).
> > 
> > Thank you!
> > 
> > And one more thing - I see 'MB' and 'MDS' and also 'MSB' (Microarchitectural
> > Store Buffer).
> 
> The code all uses MDS / mds
> 
> The only thing called MB_* is the cpuid bit because that is the name
> the documentation uses.

Could you confirm that please? The doc in Keybase which has been circulating
has it as 'MD_CLEAR'. Or maybe I am behind a version or two. Let me check.


> 
> MSBDS should be only in documentation/descriptions to refer to the official
> acronym from the white papers.

> 
> -Andi

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 18:06       ` Andi Kleen
@ 2019-01-09 18:14         ` Linus Torvalds
  2019-01-09 19:49           ` Andi Kleen
  0 siblings, 1 reply; 50+ messages in thread
From: Linus Torvalds @ 2019-01-09 18:14 UTC (permalink / raw)
  To: speck

On Wed, Jan 9, 2019 at 10:07 AM speck for Andi Kleen
<speck@linutronix.de> wrote:
>
> Does that change anything in your opinion?

No. Nothing you say is new, and nothing you say is relevant to my argument.

The code is garbage. It's designed for one single microarchitecture,
and doesn't work reliably EVEN THERE. It's unmaintainable code that
adds stupid special cases and it's all Intel's problem.

             Linus

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 17:35 ` Linus Torvalds
@ 2019-01-09 18:14   ` Andi Kleen
  2019-01-09 18:32     ` Linus Torvalds
  2019-01-10  6:01     ` Jiri Kosina
  0 siblings, 2 replies; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 18:14 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 09:35:22AM -0800, speck for Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 3:01 AM speck for Andi Kleen <speck@linutronix.de> wrote:
> >
> > VERW is not done unconditionally because it doesn't allow reporting
> > the correct status in the vulnerabilities file, which I consider important.
> > Instead we now have a mds=verw option that can be set as needed,
> > but is reported explicitly in the mitigation status.
> 
> I also don't see what the logic of this is AT ALL.
> 
> "Reporting" has absolutely nothign to do with "use VERW". The fact
> that you link the two is crazy.
> 
> The rule for VERW should be simple: use VERW if the CPU doesn't have
> the NOMDS bit set (or whatever the name is today).

The case that worried me is that with this we would end up
with some systems which are actually protected, but report vulnerable.

So you could not simply say

"If the sysfs file says you're vulnerable you're vulnerable"

but would need

"If the sysfs file says you're vulnerable, you're vulnerable except
<add some long paragraph of small print enumerating different
VMware and other hypervisor versions>"

Doesn't seem like a clear message to me.

With the current scheme, at least only the people who see "vulnerable"
have to read the documentation and figure out the VMware mess.

-Andi

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 18:14   ` Andi Kleen
@ 2019-01-09 18:32     ` Linus Torvalds
  2019-01-10  6:01     ` Jiri Kosina
  1 sibling, 0 replies; 50+ messages in thread
From: Linus Torvalds @ 2019-01-09 18:32 UTC (permalink / raw)
  To: speck

On Wed, Jan 9, 2019 at 10:20 AM speck for Andi Kleen
<speck@linutronix.de> wrote:
>
> The case that worried me is that with this we would end up
> with some systems which are actually protected, but report vulnerable.

I agree that it's not optimal, but who cares?

It will report vulnerable, and people will complain to VMware, and
it's _their_ problem.

Again, you seem to think that we should somehow solve all the world's
problems. No. We should do a good job, but we should also put the onus
for bugs on the proper people and companies.

It's perfectly ok to say "that hardware is simply buggy, and the
manufacturer isn't maintaining it any more, and the workaround is
unmaintainable and doesn't even work reliably".

It's also perfectly ok to say "that virtualization vendor doesn't
report that they've fixed the problem properly, because they have a
bad design for cpuid upgrades".

Neither of them is our problem to solve. The mainline kernel is for
sane situations and for the majority of people. If some cloud vendor
has a million machines that are no longer properly supported by Intel,
they can sue Intel, or they can add some patch to their kernel that
_they_ maintain, but there's absolutely no reason for mainline to do
so.

We need to push back on insane vendor patches.

                 Linus

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 18:09     ` Konrad Rzeszutek Wilk
@ 2019-01-09 18:42       ` Andi Kleen
  0 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 18:42 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 01:09:17PM -0500, speck for Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 09, 2019 at 09:41:27AM -0800, speck for Andi Kleen wrote:
> > On Wed, Jan 09, 2019 at 12:18:04PM -0500, speck for Konrad Rzeszutek Wilk wrote:
> > > On Thu, Dec 20, 2018 at 04:27:10PM -0800, speck for Andi Kleen wrote:
> > > > From: Andi Kleen <ak@linux.intel.com>
> > > > Subject:  MDSv3
> > > > 
> > > > Here's a new version of flushing CPU buffers for group 4.
> > > 
> > > Could you also send a git bundle of them please? Their titles are not in sync with the XX/YY numbering.
> > > I can probably figure out the right flow, but it would help (and also helps in review).
> > > 
> > > Thank you!
> > > 
> > > And one more thing - I see 'MB' and 'MDS' and also 'MSB' (Microarchitectural
> > > Store Buffer).
> > 
> > The code all uses MDS / mds.
> > 
> > The only thing called MB_* is the CPUID bit, because that is the name
> > the documentation uses.
> 
> Could you confirm that please? The doc in Keybase that has been circulating
> has it as 'MD_CLEAR'. Or maybe I am behind a version or two. Let me check.

Frankly, I don't really care. Please do some real code review
instead of bikeshedding.

-Andi

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 18:14         ` Linus Torvalds
@ 2019-01-09 19:49           ` Andi Kleen
  0 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2019-01-09 19:49 UTC (permalink / raw)
  To: speck

On Wed, Jan 09, 2019 at 10:14:35AM -0800, speck for Linus Torvalds wrote:
> On Wed, Jan 9, 2019 at 10:07 AM speck for Andi Kleen
> <speck@linutronix.de> wrote:
> >
> > Does that change anything in your opinion?
> 
> No. Nothing you say is new, and nothing you say is relevant to my argument.
> 
> The code is garbage. It's designed for one single microarchitecture,
> and doesn't work reliably EVEN THERE. It's unmaintainable code that
> adds stupid special cases and it's all Intel's problem.

Okay. I'll ask for guidance.

BTW the software sequences are all separate patches. If you don't
want to see them, please ignore patches 6, 8, 9, 13. Everything
else is independent and can be independently reviewed.

I would be especially interested in a review of the design
document in patch 14.

-Andi

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-09 18:14   ` Andi Kleen
  2019-01-09 18:32     ` Linus Torvalds
@ 2019-01-10  6:01     ` Jiri Kosina
  2019-01-10 16:05       ` Andi Kleen
  1 sibling, 1 reply; 50+ messages in thread
From: Jiri Kosina @ 2019-01-10  6:01 UTC (permalink / raw)
  To: speck

On Wed, 9 Jan 2019, speck for Andi Kleen wrote:

> The case that worried me is that with this we would end up with some 
> systems which are actually protected, but report vulnerable.
> 
> So you could not simply say
> 
> "If the sysfs file says you're vulnerable you're vulnerable"
> 
> but would need
> 
> "If the sysfs file says you're vulnerable, you're vulnerable except
> <add some long paragraph of small print enumerating different
> VMware and other hypervisor versions>"
> 
> Doesn't seem like a clear message to me.

Please see what I did for Meltdown and XenPV in commit 6cb2b08ff92. I
believe something similar could easily be used here.
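
Applied here, the pattern would look roughly like this (a sketch only;
the MDS-side names are hypothetical):

    ssize_t cpu_show_mds(struct device *dev,
                         struct device_attribute *attr, char *buf)
    {
            if (!boot_cpu_has_bug(X86_BUG_MDS))
                    return sprintf(buf, "Not affected\n");
            /* Report "Unknown" instead of "Vulnerable" when the
             * hypervisor has to provide the mitigation for us. */
            if (hypervisor_is_type(X86_HYPER_XEN_PV))
                    return sprintf(buf,
                            "Unknown (hypervisor mitigation required)\n");
            /* ... normal mitigation reporting ... */
    }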

-- 
Jiri Kosina
SUSE Labs

* [MODERATED] Re: [PATCH v3 00/32] MDSv3 12
  2019-01-10  6:01     ` Jiri Kosina
@ 2019-01-10 16:05       ` Andi Kleen
  0 siblings, 0 replies; 50+ messages in thread
From: Andi Kleen @ 2019-01-10 16:05 UTC (permalink / raw)
  To: speck

On Thu, Jan 10, 2019 at 07:01:21AM +0100, speck for Jiri Kosina wrote:
> On Wed, 9 Jan 2019, speck for Andi Kleen wrote:
> 
> > The case that worried me is that with this we would end up with some 
> > systems which are actually protected, but report vulnerable.
> > 
> > So you could not simply say
> > 
> > "If the sysfs file says you're vulnerable you're vulnerable"
> > 
> > but would need
> > 
> > "If the sysfs file says you're vulnerable, you're vulnerable except
> > <add some long paragraph of small print enumerating different
> > VMware and other hypervisor versions>"
> > 
> > Doesn't seem like a clear message to me.
> 
> Please see what I did for Meltdown and XenPV in commit 6cb2b08ff92. I 
> believe something similar could easily be used here.

This case is different because it doesn't involve paravirtualization.
With paravirtualization we are guaranteed that Linux knows all the
cases and can enumerate them, and yes, with that it's possible to add
ifs for the specific cases.

But VERW is not paravirtualized.

For a non-PV hypervisor there is no guarantee Linux even knows about
all the hypervisors in existence (undoubtedly there are many that
Linux has no clue about).

I don't think this is a scalable solution.
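
To illustrate the scaling problem: a guest identifies its hypervisor by
the CPUID leaf 0x40000000 vendor signature, and any whitelist of those
signatures is necessarily incomplete. A sketch:

    /* A few well-known CPUID 0x40000000 hypervisor signatures.
     * Whatever gets enumerated here, there will always be hypervisors
     * missing from the list, which is exactly the problem. */
    static const char * const hypervisor_signatures[] = {
            "KVMKVMKVM\0\0\0",      /* KVM */
            "VMwareVMware",         /* VMware */
            "Microsoft Hv",         /* Hyper-V */
            "XenVMMXenVMM",         /* Xen HVM */
    };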

-Andi
