* [MODERATED] [PATCH v4 00/28] MDSv4 2
@ 2019-01-12  1:29 Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 01/28] MDSv4 3 Andi Kleen
                   ` (28 more replies)
  0 siblings, 29 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Here's a new version of flushing CPU buffers for group 4.

This mainly covers single thread, not SMT (except for the idle case).

I lumped all the issues together under the Microarchitectural Data
Sampling (MDS) name because they need the same mitigations,
and it doesn't seem worth duplicating the sysfs files and bug entries.

This version drops support for software sequences, and also
does VERW unconditionally unless disabled.

This version implements Linus' suggestion to only clear the CPU
buffer when needed. The patch kit is now a lot more complicated:
different subsystems determine if they might touch other users'
or otherwise sensitive data, and schedule a CPU clear on the next
kernel exit.

Generally process context doesn't clear (unless it is cryptographic
or does context switches), and interrupt context schedules a clear.
There are some exceptions to these rules.
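
As a rough illustration of the rule above (not part of the patch kit), the
lazy scheme can be modelled in plain userspace C. Everything named model_*
below is hypothetical; in the real patches the pending state is the
TIF_CLEAR_CPU thread flag and the flush itself is a VERW instruction.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread TIF_CLEAR_CPU flag. */
struct task_model {
	bool clear_cpu_pending;     /* a clear is scheduled for the next exit */
	unsigned long clears_done;  /* how many flushes actually happened */
};

/* A subsystem that touched sensitive data schedules a clear. */
static void model_lazy_clear_cpu(struct task_model *t)
{
	t->clear_cpu_pending = true;
}

/* Interrupt context schedules a clear unless the handler opted out. */
static void model_irq_handler(struct task_model *t, bool opted_out)
{
	if (!opted_out)
		model_lazy_clear_cpu(t);
}

/*
 * On kernel exit, flush only if a clear was scheduled.
 * Returns true if a flush happened.
 */
static bool model_exit_to_user(struct task_model *t)
{
	if (!t->clear_cpu_pending)
		return false;
	t->clear_cpu_pending = false;
	t->clears_done++;           /* the real code would execute VERW here */
	return true;
}
```

In this model a plain syscall that never calls model_lazy_clear_cpu() exits
without a flush, while a non-opted-out interrupt causes exactly one flush on
the next exit.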

For details on the security model see the Documentation/clearcpu.txt
file. In my tests the number of clears is much lower now.

For most benchmarks we tried the difference is in the noise
level now. ebizzy and loopback apache both show about 1.7%
degradation.

It makes various assumptions on how kernel code behaves.
I did some auditing, but wasn't able to do it for everything.
Please double check the assumptions laid out in the document.

Likely a lot more interrupt and timer handlers (and tasklets
and irq poll handlers) could be white listed to not need a clear, but I
only did a fairly minimal set for now that I could test.

For some of the white listed code, especially the networking and
block softirqs, as well as the eBPF mitigation, additional auditing to
confirm that no rules are violated would be useful.

Some notes:
- Against 5.0-rc1

Changes against previous versions:
- Remove software sequences
- Make VERW unconditional
- Improved documentation
- Some other minor changes

Changes against previous versions:
- By default now flushes only when needed
- Define security model
- New administrator document
- Added mds=verw and mds=full
- Renamed mds_disable to mds=off
- KVM virtualization much improved
- Too many others to list. Most things different now.

Andi Kleen (28):
  x86/speculation/mds: Add basic bug infrastructure for MDS
  x86/speculation/mds: Add mds=off
  x86/speculation/mds: Support clearing CPU data on kernel exit
  x86/speculation/mds: Support mds=full
  x86/speculation/mds: Clear CPU buffers on entering idle
  x86/speculation/mds: Add sysfs reporting
  x86/speculation/mds: Support mds=full for NMIs
  x86/speculation/mds: Support mds=full for 32bit NMI
  x86/speculation/mds: Export MD_CLEAR CPUID to KVM guests.
  mds: Add documentation for clear cpu usage
  mds: Add preliminary administrator documentation
  x86/speculation/mds: Introduce lazy_clear_cpu
  x86/speculation/mds: Schedule cpu clear on context switch
  x86/speculation/mds: Add tracing for clear_cpu
  mds: Force clear cpu on kernel preemption
  mds: Schedule cpu clear for memzero_explicit and kzfree
  mds: Mark interrupts clear cpu, unless opted-out
  mds: Clear cpu on all timers, unless the timer opts-out
  mds: Clear CPU on tasklets, unless opted-out
  mds: Clear CPU on irq poll, unless opted-out
  mds: Clear cpu for string io/memcpy_*io in interrupts
  mds: Schedule clear cpu in swiotlb
  mds: Instrument skb functions to clear cpu automatically
  mds: Opt out tcp tasklet to not touch user data
  mds: mark kernel/* timers safe as not touching user data
  mds: Mark AHCI interrupt as not needing cpu clear
  mds: Mark ACPI interrupt as not needing cpu clear
  mds: Mitigate BPF

 .../ABI/testing/sysfs-devices-system-cpu      |   1 +
 .../admin-guide/kernel-parameters.txt         |   8 +
 Documentation/admin-guide/mds.rst             | 108 +++++++++++
 Documentation/clearcpu.txt                    | 173 ++++++++++++++++++
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/common.c                       |  13 +-
 arch/x86/entry/entry_32.S                     |   6 +
 arch/x86/entry/entry_64.S                     |  12 ++
 arch/x86/include/asm/clearbpf.h               |  29 +++
 arch/x86/include/asm/clearcpu.h               |  92 ++++++++++
 arch/x86/include/asm/cpufeatures.h            |   3 +
 arch/x86/include/asm/io.h                     |   3 +
 arch/x86/include/asm/msr-index.h              |   1 +
 arch/x86/include/asm/thread_info.h            |   2 +
 arch/x86/include/asm/trace/clearcpu.h         |  27 +++
 arch/x86/kernel/acpi/cstate.c                 |   2 +
 arch/x86/kernel/cpu/bugs.c                    |  46 +++++
 arch/x86/kernel/cpu/common.c                  |  14 ++
 arch/x86/kernel/kvm.c                         |   3 +
 arch/x86/kernel/process.c                     |   5 +
 arch/x86/kernel/process.h                     |  27 +++
 arch/x86/kernel/smpboot.c                     |   3 +
 arch/x86/kvm/cpuid.c                          |   3 +-
 drivers/acpi/acpi_pad.c                       |   2 +
 drivers/acpi/osl.c                            |   3 +-
 drivers/acpi/processor_idle.c                 |   3 +
 drivers/ata/ahci.c                            |   2 +-
 drivers/ata/ahci.h                            |   2 +
 drivers/ata/libahci.c                         |  40 ++--
 drivers/base/cpu.c                            |   8 +
 drivers/idle/intel_idle.c                     |   5 +
 include/asm-generic/io.h                      |   3 +
 include/linux/clearcpu.h                      |  36 ++++
 include/linux/filter.h                        |  21 ++-
 include/linux/hrtimer.h                       |   4 +
 include/linux/interrupt.h                     |  18 +-
 include/linux/irq_poll.h                      |   2 +
 include/linux/skbuff.h                        |   2 +
 include/linux/timer.h                         |   9 +-
 kernel/bpf/core.c                             |   2 +
 kernel/dma/swiotlb.c                          |   2 +
 kernel/events/core.c                          |   6 +-
 kernel/fork.c                                 |   3 +-
 kernel/futex.c                                |   6 +-
 kernel/irq/handle.c                           |   8 +
 kernel/irq/manage.c                           |   1 +
 kernel/sched/core.c                           |  14 +-
 kernel/sched/deadline.c                       |   6 +-
 kernel/sched/fair.c                           |   7 +-
 kernel/sched/idle.c                           |   3 +-
 kernel/sched/rt.c                             |   3 +-
 kernel/softirq.c                              |  25 ++-
 kernel/time/alarmtimer.c                      |   2 +-
 kernel/time/hrtimer.c                         |  11 +-
 kernel/time/posix-timers.c                    |   6 +-
 kernel/time/sched_clock.c                     |   3 +-
 kernel/time/tick-sched.c                      |   6 +-
 kernel/time/timer.c                           |   8 +
 kernel/watchdog.c                             |   3 +-
 lib/irq_poll.c                                |  18 +-
 lib/string.c                                  |   6 +
 mm/slab_common.c                              |   5 +-
 net/core/skbuff.c                             |  26 +++
 net/ipv4/tcp_output.c                         |   5 +-
 65 files changed, 869 insertions(+), 61 deletions(-)
 create mode 100644 Documentation/admin-guide/mds.rst
 create mode 100644 Documentation/clearcpu.txt
 create mode 100644 arch/x86/include/asm/clearbpf.h
 create mode 100644 arch/x86/include/asm/clearcpu.h
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h
 create mode 100644 include/linux/clearcpu.h

-- 
2.17.2


* [MODERATED] [PATCH v4 01/28] MDSv4 3
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-15 14:11   ` [MODERATED] " Andrew Cooper
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 02/28] MDSv4 22 Andi Kleen
                   ` (27 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

MDS is Microarchitectural Data Sampling, a side channel
attack on internal buffers in Intel CPUs.

MDS consists of multiple sub-vulnerabilities:
Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126)
Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130)
Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127)
The first leaks store data, the second leaks load data and sometimes
store data, and the third leaks load data.

They all have the same mitigations for single thread, so we lump them all
together as a single MDS issue.

This patch adds the basic infrastructure to detect whether the current
CPU is affected by MDS and, if so, to set the right BUG bits.
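
For readers who want the detection rule in isolation, here is a minimal
userspace sketch of the condition in the diff below. The function name and
boolean parameters are hypothetical stand-ins for the vendor check and the
x86_match_cpu(cpu_no_mds) lookup. ARCH_CAP_MDS_NO is bit 5 as defined in this
patch; the value of ARCH_CAP_RDCL_NO (bit 0) is taken from the existing
msr-index.h and is not shown in this excerpt.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARCH_CAP_RDCL_NO (1u << 0) /* assumed from existing msr-index.h */
#define ARCH_CAP_MDS_NO  (1u << 5) /* added by this patch */

/*
 * Model of the X86_BUG_MDS decision: an Intel CPU is considered affected
 * unless it is on the exemption list or IA32_ARCH_CAPABILITIES reports
 * MDS_NO or RDCL_NO.
 */
static bool model_cpu_has_mds_bug(bool is_intel, bool on_no_mds_list,
				  uint64_t ia32_cap)
{
	if (!is_intel || on_no_mds_list)
		return false;
	if (ia32_cap & (ARCH_CAP_MDS_NO | ARCH_CAP_RDCL_NO))
		return false;
	return true;
}
```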

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  2 ++
 arch/x86/include/asm/msr-index.h   |  1 +
 arch/x86/kernel/cpu/common.c       | 14 ++++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6d6122524711..233ca598826f 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -344,6 +344,7 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:0 (EDX), word 18 */
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
+#define X86_FEATURE_MD_CLEAR		(18*32+10) /* Flush state on VERW */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
@@ -381,5 +382,6 @@
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS	X86_BUG(17) /* CPU is affected by speculative store bypass attack */
 #define X86_BUG_L1TF			X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
+#define X86_BUG_MDS			X86_BUG(19) /* CPU is affected by Microarchitectural data sampling */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 8e40c2446fd1..3e486d9d6e6c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -77,6 +77,7 @@
 						    * attack, so no Speculative Store Bypass
 						    * control required.
 						    */
+#define ARCH_CAP_MDS_NO			(1 << 5)   /* No Microarchitectural data sampling */
 
 #define MSR_IA32_FLUSH_CMD		0x0000010b
 #define L1D_FLUSH			(1 << 0)   /*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index cb28e98a0659..0c900eb6f829 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -998,6 +998,14 @@ static const __initconst struct x86_cpu_id cpu_no_l1tf[] = {
 	{}
 };
 
+static const __initconst struct x86_cpu_id cpu_no_mds[] = {
+	/* in addition to cpu_no_speculation */
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_X	},
+	{ X86_VENDOR_INTEL,	6,	INTEL_FAM6_ATOM_GOLDMONT_PLUS	},
+	{}
+};
+
 static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;
@@ -1019,6 +1027,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 	if (ia32_cap & ARCH_CAP_IBRS_ALL)
 		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
 
+	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+	    !x86_match_cpu(cpu_no_mds)) &&
+	    !(ia32_cap & ARCH_CAP_MDS_NO) &&
+	    !(ia32_cap & ARCH_CAP_RDCL_NO))
+		setup_force_cpu_bug(X86_BUG_MDS);
+
 	if (x86_match_cpu(cpu_no_meltdown))
 		return;
 
-- 
2.17.2


* [MODERATED] [PATCH v4 02/28] MDSv4 22
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 01/28] MDSv4 3 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 03/28] MDSv4 20 Andi Kleen
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Normally we execute VERW unconditionally to clear the CPU on kernel exits
that might have touched sensitive data. Add a new flag to disable VERW usage.
This is intended for systems that only run trusted code and don't
want the performance impact of the extra clearing.

This patch just sets the flag; the actual implementation is in later patches.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 3 +++
 arch/x86/include/asm/cpufeatures.h              | 1 +
 arch/x86/kernel/cpu/bugs.c                      | 9 +++++++++
 3 files changed, 13 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b799bcf67d7b..9c967d0caeca 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2357,6 +2357,9 @@
 			Format: <first>,<last>
 			Specifies range of consoles to be captured by the MDA.
 
+	mds=off		[X86, Intel]
+			Disable workarounds for Micro-architectural Data Sampling.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 233ca598826f..09347c6a8901 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -221,6 +221,7 @@
 #define X86_FEATURE_ZEN			( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
 #define X86_FEATURE_L1TF_PTEINV		( 7*32+29) /* "" L1TF workaround PTE inversion */
 #define X86_FEATURE_IBRS_ENHANCED	( 7*32+30) /* Enhanced IBRS */
+#define X86_FEATURE_NO_VERW		( 7*32+31) /* "" No VERW for MDS on kernel exit */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 8654b8b0c848..5426467143c9 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -37,6 +37,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init l1tf_select_mitigation(void);
+static void __init mds_select_mitigation(void);
 
 /* The base value of the SPEC_CTRL MSR that always has to be preserved. */
 u64 x86_spec_ctrl_base;
@@ -101,6 +102,8 @@ void __init check_bugs(void)
 
 	l1tf_select_mitigation();
 
+	mds_select_mitigation();
+
 #ifdef CONFIG_X86_32
 	/*
 	 * Check whether we are able to run this kernel safely on SMP.
@@ -1058,6 +1061,12 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+static void mds_select_mitigation(void)
+{
+	if (cmdline_find_option_bool(boot_command_line, "mds=off"))
+		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
+}
+
 #ifdef CONFIG_SYSFS
 
 #define L1TF_DEFAULT_MSG "Mitigation: PTE Inversion"
-- 
2.17.2


* [MODERATED] [PATCH v4 03/28] MDSv4 20
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 01/28] MDSv4 3 Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 02/28] MDSv4 22 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-14 18:50   ` [MODERATED] " Dave Hansen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 04/28] MDSv4 8 Andi Kleen
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add infrastructure for clearing CPU data on kernel exit.

Instead of clearing unconditionally, we support clearing
lazily when some kernel subsystem touches sensitive data
and sets the new TIF_CLEAR_CPU flag.

We handle TIF_CLEAR_CPU in kernel exit, similar to
other kernel exit action flags.

The flushing is provided by new microcode as a new side
effect of the otherwise unused VERW instruction.

On its own this patch doesn't do anything; it relies on
later patches to set TIF_CLEAR_CPU.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/common.c            |  8 +++++++-
 arch/x86/include/asm/clearcpu.h    | 23 +++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  2 ++
 3 files changed, 32 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/asm/clearcpu.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 7bc105f47d21..924f8dab2068 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -29,6 +29,7 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/vdso.h>
+#include <asm/clearcpu.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
 
@@ -132,7 +133,7 @@ static long syscall_trace_enter(struct pt_regs *regs)
 }
 
 #define EXIT_TO_USERMODE_LOOP_FLAGS				\
-	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |	\
+	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | _TIF_CLEAR_CPU |\
 	 _TIF_NEED_RESCHED | _TIF_USER_RETURN_NOTIFY | _TIF_PATCH_PENDING)
 
 static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
@@ -170,6 +171,11 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 		if (cached_flags & _TIF_USER_RETURN_NOTIFY)
 			fire_user_return_notifiers();
 
+		if (cached_flags & _TIF_CLEAR_CPU) {
+			clear_thread_flag(TIF_CLEAR_CPU);
+			clear_cpu();
+		}
+
 		/* Disable IRQs and retry */
 		local_irq_disable();
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
new file mode 100644
index 000000000000..530ef619ac1b
--- /dev/null
+++ b/arch/x86/include/asm/clearcpu.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARCPU_H
+#define _ASM_CLEARCPU_H 1
+
+#include <linux/jump_label.h>
+#include <linux/sched/smt.h>
+#include <asm/alternative.h>
+#include <linux/thread_info.h>
+
+/*
+ * Clear CPU buffers to avoid side channels.
+ * We use microcode as a side effect of the obsolete VERW instruction
+ */
+
+static inline void clear_cpu(void)
+{
+	unsigned kernel_ds = __KERNEL_DS;
+	/* Has to be memory form, don't modify to use a register */
+	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
+		[kernelds] "m" (kernel_ds));
+}
+
+#endif
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index e0eccbcb8447..0c1e3d71018e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -95,6 +95,7 @@ struct thread_info {
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
+#define TIF_CLEAR_CPU		23	/* clear CPU on kernel exit */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
 #define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOHZ		(1 << TIF_NOHZ)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
+#define _TIF_CLEAR_CPU		(1 << TIF_CLEAR_CPU)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
 #define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
 #define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-- 
2.17.2


* [MODERATED] [PATCH v4 04/28] MDSv4 8
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (2 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 03/28] MDSv4 20 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add a new command line option, mds=full, for unconditional flushing
on each kernel exit. This is not enabled by default.
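
To make the resulting option set easier to follow, here is a small userspace
sketch of how the boot options map to a flushing mode. The enum and the
model_* helpers are hypothetical. The real code uses
cmdline_find_option_bool() and sets a CPU capability and a static key
independently, whereas this sketch simply lets mds=off take precedence if
both options are given.

```c
#include <stdbool.h>
#include <string.h>

enum mds_mode {
	MDS_MODE_LAZY,  /* default: flush only when a clear was scheduled */
	MDS_MODE_OFF,   /* mds=off: never execute VERW */
	MDS_MODE_FULL   /* mds=full: flush on every kernel exit */
};

/* True if opt appears in cmdline as a whole space-delimited word. */
static bool model_has_option(const char *cmdline, const char *opt)
{
	size_t len = strlen(opt);
	const char *p = cmdline;

	if (len == 0)
		return false;

	while ((p = strstr(p, opt)) != NULL) {
		bool starts = (p == cmdline) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}

/* Simplification: mds=off wins if both options are present. */
static enum mds_mode model_mds_select(const char *cmdline)
{
	if (model_has_option(cmdline, "mds=off"))
		return MDS_MODE_OFF;
	if (model_has_option(cmdline, "mds=full"))
		return MDS_MODE_FULL;
	return MDS_MODE_LAZY;
}
```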

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 5 +++++
 arch/x86/entry/common.c                         | 7 ++++++-
 arch/x86/include/asm/clearcpu.h                 | 2 ++
 arch/x86/kernel/cpu/bugs.c                      | 4 ++++
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 9c967d0caeca..5f5a8808c475 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2360,6 +2360,11 @@
 	mds=off		[X86, Intel]
 			Disable workarounds for Micro-architectural Data Sampling.
 
+	mds=full	[X86, Intel]
+			Always flush cpu buffers when exiting kernel for MDS.
+			Normally the kernel decides dynamically when flushing is
+			needed or not.
+
 	mem=nn[KMG]	[KNL,BOOT] Force usage of a specific amount of memory
 			Amount of memory to be used when the kernel is not able
 			to see the whole system memory or for test.
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 924f8dab2068..66c08e1d493a 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -173,7 +173,9 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
 
 		if (cached_flags & _TIF_CLEAR_CPU) {
 			clear_thread_flag(TIF_CLEAR_CPU);
-			clear_cpu();
+			/* Don't do it twice if forced */
+			if (!static_key_enabled(&force_cpu_clear))
+				clear_cpu();
 		}
 
 		/* Disable IRQs and retry */
@@ -217,6 +219,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	ti->status &= ~(TS_COMPAT|TS_I386_REGS_POKED);
 #endif
 
+	if (static_key_enabled(&force_cpu_clear))
+		clear_cpu();
+
 	user_enter_irqoff();
 }
 
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 530ef619ac1b..3b8ee76b9c07 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -20,4 +20,6 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
+
 #endif
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 5426467143c9..40f7415dcd7e 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1061,10 +1061,14 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
+
 static void mds_select_mitigation(void)
 {
 	if (cmdline_find_option_bool(boot_command_line, "mds=off"))
 		setup_force_cpu_cap(X86_FEATURE_NO_VERW);
+	if (cmdline_find_option_bool(boot_command_line, "mds=full"))
+		static_branch_enable(&force_cpu_clear);
 }
 
 #ifdef CONFIG_SYSFS
-- 
2.17.2


* [MODERATED] [PATCH v4 05/28] MDSv4 10
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (3 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 04/28] MDSv4 8 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
  2019-01-14 23:39   ` Tim Chen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 06/28] MDSv4 11 Andi Kleen
                   ` (23 subsequent siblings)
  28 siblings, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

When entering idle the internal state of the current CPU might
become visible to the thread sibling because the CPU "frees" some
internal resources.

To ensure there is no MDS leakage, always clear the CPU state
before doing any idling. We only do this if SMT is enabled,
as otherwise there is no leakage possible.

This is not needed for idle poll, because it does not share resources.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h | 19 +++++++++++++++++++
 arch/x86/kernel/acpi/cstate.c   |  2 ++
 arch/x86/kernel/kvm.c           |  3 +++
 arch/x86/kernel/process.c       |  5 +++++
 arch/x86/kernel/smpboot.c       |  3 +++
 drivers/acpi/acpi_pad.c         |  2 ++
 drivers/acpi/processor_idle.c   |  3 +++
 drivers/idle/intel_idle.c       |  5 +++++
 kernel/sched/fair.c             |  1 +
 9 files changed, 43 insertions(+)

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 3b8ee76b9c07..b83ef1a5268f 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -20,6 +20,25 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+/*
+ * Clear CPU buffers before going idle, so that no state is leaked to SMT
+ * siblings taking over thread resources.
+ * Out of line to avoid include hell.
+ *
+ * Assumes that interrupts are disabled and only get reenabled
+ * before idle, otherwise the data from a racing interrupt might not
+ * get cleared. There are some callers who violate this,
+ * but they are only used in unattackable cases.
+ */
+
+static inline void clear_cpu_idle(void)
+{
+	if (sched_smt_active()) {
+		clear_thread_flag(TIF_CLEAR_CPU);
+		clear_cpu();
+	}
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #endif
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index 158ad1483c43..48adea5afacf 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -14,6 +14,7 @@
 #include <acpi/processor.h>
 #include <asm/mwait.h>
 #include <asm/special_insns.h>
+#include <asm/clearcpu.h>
 
 /*
  * Initialize bm_flags based on the CPU cache properties
@@ -157,6 +158,7 @@ void __cpuidle acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 	unsigned int cpu = smp_processor_id();
 	struct cstate_entry *percpu_entry;
 
+	clear_cpu_idle();
 	percpu_entry = per_cpu_ptr(cpu_cstate_entry, cpu);
 	mwait_idle_with_hints(percpu_entry->states[cx->index].eax,
 	                      percpu_entry->states[cx->index].ecx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ba4bfb7f6a36..c9206ad40a5b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -159,6 +159,7 @@ void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
 			/*
 			 * We cannot reschedule. So halt.
 			 */
+			clear_cpu_idle();
 			native_safe_halt();
 			local_irq_disable();
 		}
@@ -785,6 +786,8 @@ static void kvm_wait(u8 *ptr, u8 val)
 	if (READ_ONCE(*ptr) != val)
 		goto out;
 
+	clear_cpu_idle();
+
 	/*
 	 * halt until it's our turn and kicked. Note that we do safe halt
 	 * for irq enabled case to avoid hang when lock info is overwritten
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 90ae0ca51083..9d9f2d2b209d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -42,6 +42,7 @@
 #include <asm/prctl.h>
 #include <asm/spec-ctrl.h>
 #include <asm/proto.h>
+#include <asm/clearcpu.h>
 
 #include "process.h"
 
@@ -589,6 +590,8 @@ void stop_this_cpu(void *dummy)
 	disable_local_APIC();
 	mcheck_cpu_clear(this_cpu_ptr(&cpu_info));
 
+	clear_cpu_idle();
+
 	/*
 	 * Use wbinvd on processors that support SME. This provides support
 	 * for performing a successful kexec when going from SME inactive
@@ -675,6 +678,8 @@ static __cpuidle void mwait_idle(void)
 			mb(); /* quirk */
 		}
 
+		clear_cpu_idle();
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__sti_mwait(0, 0);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index ccd1f2a8e557..c7fff6b09253 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -81,6 +81,7 @@
 #include <asm/cpu_device_id.h>
 #include <asm/spec-ctrl.h>
 #include <asm/hw_irq.h>
+#include <asm/clearcpu.h>
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_map);
@@ -1635,6 +1636,7 @@ static inline void mwait_play_dead(void)
 	wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		/*
 		 * The CLFLUSH is a workaround for erratum AAI65 for
 		 * the Xeon 7400 series.  It's not clear it is actually
@@ -1662,6 +1664,7 @@ void hlt_play_dead(void)
 		wbinvd();
 
 	while (1) {
+		clear_cpu_idle();
 		native_halt();
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index a47676a55b84..2dcbc38d0880 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -27,6 +27,7 @@
 #include <linux/slab.h>
 #include <linux/acpi.h>
 #include <asm/mwait.h>
+#include <asm/clearcpu.h>
 #include <xen/xen.h>
 
 #define ACPI_PROCESSOR_AGGREGATOR_CLASS	"acpi_pad"
@@ -175,6 +176,7 @@ static int power_saving_thread(void *data)
 			tick_broadcast_enable();
 			tick_broadcast_enter();
 			stop_critical_timings();
+			clear_cpu_idle();
 
 			mwait_idle_with_hints(power_saving_mwait_eax, 1);
 
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index b2131c4ea124..0342daa122fe 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -33,6 +33,7 @@
 #include <linux/cpuidle.h>
 #include <linux/cpu.h>
 #include <acpi/processor.h>
+#include <asm/clearcpu.h>
 
 /*
  * Include the apic definitions for x86 to have the APIC timer related defines
@@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
  */
 static void __cpuidle acpi_safe_halt(void)
 {
+	clear_cpu_idle();
 	if (!tif_need_resched()) {
 		safe_halt();
 		local_irq_disable();
@@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
 
 	ACPI_FLUSH_CPU_CACHE();
 
+	clear_cpu_idle();
 	while (1) {
 
 		if (cx->entry_method == ACPI_CSTATE_HALT)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 8b5d85c91e9d..ddaa7603d53a 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,6 +65,7 @@
 #include <asm/intel-family.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/clearcpu.h>
 
 #define INTEL_IDLE_VERSION "0.4.1"
 
@@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
 		}
 	}
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 
 	if (!static_cpu_has(X86_FEATURE_ARAT) && tick)
@@ -953,6 +956,8 @@ static void intel_idle_s2idle(struct cpuidle_device *dev,
 	unsigned long ecx = 1; /* break on interrupt flag */
 	unsigned long eax = flg2MWAIT(drv->states[index].flags);
 
+	clear_cpu_idle();
+
 	mwait_idle_with_hints(eax, ecx);
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50aa2aba69bd..b5a1bd4a1a46 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 
 #ifdef CONFIG_SCHED_SMT
 DEFINE_STATIC_KEY_FALSE(sched_smt_present);
+EXPORT_SYMBOL(sched_smt_present);
 
 static inline void set_idle_cores(int cpu, int val)
 {
-- 
2.17.2


* [MODERATED] [PATCH v4 06/28] MDSv4 11
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (4 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-14 19:23   ` [MODERATED] " Dave Hansen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 07/28] MDSv4 0 Andi Kleen
                   ` (22 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Report mds mitigation state in sysfs vulnerabilities.
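
As a rough illustration of the reporting logic this patch adds to
cpu_show_common(), here is a userspace sketch. It is not kernel code.
The names mds_state_string() and mds_state_usage() and the two boolean
parameters are made up for this example; the parameters stand in for
boot_cpu_has(X86_FEATURE_MD_CLEAR) and cpu_smt_control == CPU_SMT_ENABLED.
Only the X86_BUG_MDS case from the diff is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Userspace model only. has_md_clear and smt_enabled are hypothetical
 * stand-ins for boot_cpu_has(X86_FEATURE_MD_CLEAR) and
 * cpu_smt_control == CPU_SMT_ENABLED in the patch.
 */
static const char *mds_state_string(bool has_md_clear, bool smt_enabled)
{
	if (has_md_clear) {
		/* Microcode support present; SMT siblings remain exposed. */
		if (!smt_enabled)
			return "Mitigation: microcode\n";
		return "Mitigation: microcode, HT vulnerable\n";
	}
	/* Without MD_CLEAR the patch reports the CPU as vulnerable. */
	return "Vulnerable\n";
}

/* Usage: the three strings this case can produce. */
static void mds_state_usage(void)
{
	assert(strcmp(mds_state_string(true, false),
		      "Mitigation: microcode\n") == 0);
	assert(strcmp(mds_state_string(true, true),
		      "Mitigation: microcode, HT vulnerable\n") == 0);
	assert(strcmp(mds_state_string(false, true), "Vulnerable\n") == 0);
}
```

On CPUs without an architecture override, the __weak cpu_show_mds() in
drivers/base/cpu.c reports "Not affected" instead.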

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 .../ABI/testing/sysfs-devices-system-cpu         |  1 +
 arch/x86/kernel/cpu/bugs.c                       | 16 ++++++++++++++++
 drivers/base/cpu.c                               |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9605dbd4b5b5..2db5c3407fd6 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -484,6 +484,7 @@ What:		/sys/devices/system/cpu/vulnerabilities
 		/sys/devices/system/cpu/vulnerabilities/spectre_v2
 		/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
 		/sys/devices/system/cpu/vulnerabilities/l1tf
+		/sys/devices/system/cpu/vulnerabilities/mds
 Date:		January 2018
 Contact:	Linux kernel mailing list <linux-kernel@vger.kernel.org>
 Description:	Information about CPU vulnerabilities
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 40f7415dcd7e..582b1cd019f7 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1174,6 +1174,16 @@ static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr
 		if (boot_cpu_has(X86_FEATURE_L1TF_PTEINV))
 			return l1tf_show_state(buf);
 		break;
+
+	case X86_BUG_MDS:
+		/* Assumes Hypervisor exposed HT state to us if in guest */
+		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
+			if (cpu_smt_control != CPU_SMT_ENABLED)
+				return sprintf(buf, "Mitigation: microcode\n");
+			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
+		}
+		return sprintf(buf, "Vulnerable\n");
+
 	default:
 		break;
 	}
@@ -1205,4 +1215,10 @@ ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *b
 {
 	return cpu_show_common(dev, attr, buf, X86_BUG_L1TF);
 }
+
+ssize_t cpu_show_mds(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return cpu_show_common(dev, attr, buf, X86_BUG_MDS);
+}
+
 #endif
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index eb9443d5bae1..2fd6ca1021c2 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -546,11 +546,18 @@ ssize_t __weak cpu_show_l1tf(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_mds(struct device *dev,
+			    struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
 static DEVICE_ATTR(spec_store_bypass, 0444, cpu_show_spec_store_bypass, NULL);
 static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
+static DEVICE_ATTR(mds, 0444, cpu_show_mds, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
@@ -558,6 +565,7 @@ static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_spectre_v2.attr,
 	&dev_attr_spec_store_bypass.attr,
 	&dev_attr_l1tf.attr,
+	&dev_attr_mds.attr,
 	NULL
 };
 
-- 
2.17.2


* [MODERATED] [PATCH v4 07/28] MDSv4 0
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (5 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 06/28] MDSv4 11 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-14  4:03   ` [MODERATED] " Josh Poimboeuf
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 08/28] MDSv4 19 Andi Kleen
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Support mds=full for NMIs

NMIs don't go through C code when exiting to user space, so we need
to add an assembler clear cpu for this case. It is only used with
mds=full, because otherwise we assume NMIs don't touch
other users' or sensitive kernel data.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_64.S       | 12 ++++++++++++
 arch/x86/include/asm/clearcpu.h | 11 +++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..57f194e3e253 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -39,6 +39,7 @@
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
 #include <linux/err.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1407,6 +1408,17 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	/*
+	 * Clear only when force clearing was enabled. Otherwise
+	 * we assume NMI code is not sensitive.
+	 * If you don't have jump labels we always clear too.
+	 */
+#ifdef HAVE_JUMP_LABEL
+	STATIC_BRANCH_JMP l_yes=.Lno_clear_cpu key=force_cpu_clear, branch=1
+#endif
+	CLEAR_CPU
+.Lno_clear_cpu:
+
 	/*
 	 * Return back to user mode.  We must *not* do the normal exit
 	 * work, because we don't want to enable interrupts.
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index b83ef1a5268f..67c4e0d38802 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_CLEARCPU_H
 #define _ASM_CLEARCPU_H 1
 
+#ifndef __ASSEMBLY__
+
 #include <linux/jump_label.h>
 #include <linux/sched/smt.h>
 #include <asm/alternative.h>
@@ -41,4 +43,13 @@ static inline void clear_cpu_idle(void)
 
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
+#else
+
+.macro CLEAR_CPU
+	ALTERNATIVE __stringify(push $__USER_DS ; verw (% _ASM_SP ) ; add $8, % _ASM_SP ),\
+		"", X86_FEATURE_NO_VERW
+.endm
+
+#endif
+
 #endif
-- 
2.17.2


* [MODERATED] [PATCH v4 08/28] MDSv4 19
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (6 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 07/28] MDSv4 0 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 09/28] MDSv4 16 Andi Kleen
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

The main kernel exits on 32bit kernels are already handled by
earlier patches.

But for NMIs we need to clear in the assembler code because
NMIs don't go through C code on exit, but they still
might need to clear due to mds=full.

This could be handled with a static key like 64bit, but
for now just add an unconditional cpu clear on NMI exit.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_32.S | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..28b640f37f8d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -45,6 +45,7 @@
 #include <asm/smap.h>
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
+#include <asm/clearcpu.h>
 
 #include "calling.h"
 
@@ -1446,6 +1447,11 @@ ENTRY(nmi)
 	movl	%ebx, %esp
 
 .Lnmi_return:
+	/*
+	 * Only needed with mds=full
+	 * But for now do it unconditionally.
+	 */
+	CLEAR_CPU
 	CHECK_AND_APPLY_ESPFIX
 	RESTORE_ALL_NMI cr3_reg=%edi pop=4
 	jmp	.Lirq_return
-- 
2.17.2


* [MODERATED] [PATCH v4 09/28] MDSv4 16
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (7 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 08/28] MDSv4 19 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Export the MD_CLEAR CPUID bit to KVM guests. New microcode sets it
to signal that VERW implements the clear cpu side effect.

Also requires corresponding qemu patches.
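
For reference, a guest could test the exported bit roughly as follows.
This is a sketch and not part of the patch. MD_CLEAR is bit 10 of
CPUID.(EAX=7,ECX=0):EDX; the helper names are made up for this example.
The EDX value is passed in so the snippet stays portable. On x86 with
GCC or Clang it could be obtained with __get_cpuid_count() from <cpuid.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position of MD_CLEAR in CPUID.(EAX=7,ECX=0):EDX. */
#define MD_CLEAR_EDX_BIT 10

/* Hypothetical helper: true if the given leaf 7 EDX value has MD_CLEAR. */
static bool cpuid7_edx_has_md_clear(uint32_t edx)
{
	return (edx >> MD_CLEAR_EDX_BIT) & 1u;
}

/* Usage with hand-built EDX values instead of a real CPUID query. */
static void md_clear_usage(void)
{
	uint32_t spec_ctrl_only = UINT32_C(1) << 26;	/* SPEC_CTRL, no MD_CLEAR */
	uint32_t with_md_clear = spec_ctrl_only | (UINT32_C(1) << MD_CLEAR_EDX_BIT);

	assert(!cpuid7_edx_has_md_clear(spec_ctrl_only));
	assert(cpuid7_edx_has_md_clear(with_md_clear));
}
```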

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index bbffa6c54697..d61272f50aed 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -409,7 +409,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
-		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP);
+		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
+		F(MD_CLEAR);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
-- 
2.17.2


* [MODERATED] [PATCH v4 10/28] MDSv4 24
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (8 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 09/28] MDSv4 16 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-15  1:05   ` [MODERATED] Encrypted Message Tim Chen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 11/28] MDSv4 21 Andi Kleen
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Document the MDS security model, including the theory and some
guidelines for subsystem/driver maintainers.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/clearcpu.txt | 173 +++++++++++++++++++++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 Documentation/clearcpu.txt

diff --git a/Documentation/clearcpu.txt b/Documentation/clearcpu.txt
new file mode 100644
index 000000000000..b204b1e7051c
--- /dev/null
+++ b/Documentation/clearcpu.txt
@@ -0,0 +1,173 @@
+
+Security model for Microarchitectural Data Sampling
+===================================================
+
+Some CPUs can leave read or written data in internal buffers,
+which then later might be sampled through side effects.
+For more details see CVE-2018-12126, CVE-2018-12130 and CVE-2018-12127.
+
+This can be avoided by explicitly clearing the CPU state.
+
+We are trying to avoid leaking data between different processes,
+and also some sensitive data, like cryptographic data,
+or user data from other processes.
+
+We support three modes:
+
+(1) mitigation off (mds=off)
+(2) clear only when needed (default)
+(3) clear on every kernel exit, or guest entry (mds=full)
+
+(1) and (3) are trivial; the rest of the document discusses (2).
+
+Basic requirements and assumptions
+----------------------------------
+
+Kernel addresses and kernel temporary data are not sensitive.
+
+User data is sensitive, but only for other processes.
+
+Kernel data is sensitive when it contains cryptographic keys.
+
+Guidance for driver/subsystem developers
+----------------------------------------
+
+When you touch user supplied data of *other* processes in system call
+context add lazy_clear_cpu().
+
+For the cases below we care only about data from other processes.
+Touching non cryptographic data from the current process is always allowed.
+
+Touching only pointers to user data is always allowed.
+
+When your interrupt does not touch user data directly consider marking
+it with IRQF_NO_USER.
+
+When your tasklet does not touch user data directly consider marking
+it with TASKLET_NO_USER using tasklet_init_flags or
+DECLARE_TASKLET*_NOUSER.
+
+When your timer does not touch user data mark it with TIMER_NO_USER.
+If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.
+
+When your irq poll handler does not touch user data, mark it
+with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
+
+For networking code make sure to only touch user data through
+skb_push/put/copy [add more], unless it is data from the current
+process. If that is not ensured add lazy_clear_cpu or
+lazy_clear_cpu_interrupt. When the non skb data access is only in a
+hardware interrupt controlled by the driver, it can rely on not
+setting IRQF_NO_USER for that interrupt.
+
+Any cryptographic code touching key data should use memzero_explicit
+or kzfree.
+
+If your RCU callback touches user data add lazy_clear_cpu().
+
+These steps are currently only needed for code that runs on MDS affected
+CPUs, which is currently only x86. But it might be worth being prepared
+in case other architectures become affected too.
+
+Implementation details/assumptions
+----------------------------------
+
+If a system call touches data, that data belongs to its own process, so
+it does not need to be cleared, because the process already has access to it.
+
+When context switching we clear data, unless the context switch
+is inside a process, or from/to idle. We also clear after any
+context switches from kernel threads.
+
+Idle does not have sensitive data, except for in interrupts, which
+are handled separately.
+
+Cryptographic keys inside the kernel should be protected.
+We assume they use kzfree() or memzero_explicit() to clear
+state, so these functions trigger a cpu clear.
+
+Hard interrupts, tasklets and timers, which can run asynchronously, are
+assumed to touch random user data, unless they have been audited, and
+marked with NO_USER flags.
+
+Most interrupt handlers for modern devices should not touch
+user data because they rely on DMA and only manipulate
+pointers. This needs auditing to confirm though.
+
+For softirqs we assume that if they touch user data they use
+lazy_clear_cpu()/lazy_clear_cpu_interrupt() as needed.
+Networking is handled through skb_* below.
+Timers, tasklets and IRQ poll are handled through opt-in.
+
+Scheduler softirq is assumed to not touch user data.
+
+Block softirq done callbacks are assumed to not touch user data.
+
+For networking code, any skb functions that are likely
+touching non header packet data schedule a clear cpu at next
+kernel exit. This includes skb_copy and related, skb_put/push,
+checksum functions.  We assume that any networking code touching
+packet data uses these functions.
+
+[In principle packet data should be encrypted for the wire anyway,
+but we try to avoid leaking it regardless]
+
+Some IO related functions like string PIO and memcpy_from/to_io, or
+the software pci dma bounce function, which touch data, schedule a
+buffer clear.
+
+We assume NMI/machine check code does not touch other
+processes' data.
+
+Any buffer clearing is done lazily on next kernel exit, so can be
+triggered in fast paths.
+
+Sandboxes
+---------
+
+We don't do anything special for seccomp processes.
+
+If there is a sandbox inside the process, the process itself should take
+care of clearing its own sensitive data before running sandbox
+code. This would include data touched by system calls.
+
+BPF
+---
+
+We assume BPF execution does not touch other users' data, so it does
+not need to schedule a clear for itself.
+
+BPF could attack the rest of the kernel if it can successfully
+measure side channel side effects.
+
+When the BPF program was loaded unprivileged, always clear the CPU
+to prevent any exploits written in BPF using side channels to read
+data leaked from other kernel code.
+
+We only do this when running in an interrupt, or if a clear cpu is
+already scheduled (which means for example there was a context
+switch, or crypto operation before).
+
+In process context we assume the code only accesses data of the
+current user and check that the BPF running was loaded by the
+same user so even if data leaked it would not cross privilege
+boundaries.
+
+Technically we would only need to do this if the BPF program
+contains conditional branches and loads dominated by them, but
+let's assume that nearly all do.
+
+This could be further optimized by allowing callers that do
+a lot of individual BPF runs and are sure they don't touch
+other users' data in between to do the clear only once
+at the beginning. We can add such optimizations later based on
+profile data.
+
+Virtualization
+--------------
+
+When entering a guest in KVM we clear to avoid any leakage to a guest.
+Normally this is done implicitly as part of the L1TF mitigation.
+It relies on this being enabled. It also uses the "fast exit"
+optimization that only clears if an interrupt or context switch
+happened.
-- 
2.17.2


* [MODERATED] [PATCH v4 11/28] MDSv4 21
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (9 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 12/28] MDSv4 25 Andi Kleen
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Add preliminary administrator documentation

Add a Documentation file for administrators that describes MDS on a
high level.

So far not covering SMT.

Needs updates later for public URLs of supporting documentation.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/admin-guide/mds.rst | 108 ++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)
 create mode 100644 Documentation/admin-guide/mds.rst

diff --git a/Documentation/admin-guide/mds.rst b/Documentation/admin-guide/mds.rst
new file mode 100644
index 000000000000..1f3021d20953
--- /dev/null
+++ b/Documentation/admin-guide/mds.rst
@@ -0,0 +1,108 @@
+MDS - Microarchitectural Data Sampling
+======================================
+
+Microarchitectural Data Sampling is a side channel vulnerability that
+allows an attacker to sample data that has been earlier used during
+program execution. Internal buffers in the CPU may keep old data
+for some limited time, which can then later be determined by an attacker
+with side channel analysis. MDS can be used to occasionally observe
+some values accessed earlier, but it cannot be used to observe values
+not recently touched by other code running on the same core.
+
+It is difficult to target particular data on a system using MDS,
+but attackers may be able to infer secrets by collecting
+and analyzing large amounts of data. MDS does not modify
+memory.
+
+MDS consists of multiple sub-vulnerabilities:
+Microarchitectural Store Buffer Data Sampling (MSBDS) (CVE-2018-12126),
+Microarchitectural Fill Buffer Data Sampling (MFBDS) (CVE-2018-12130) and
+Microarchitectural Load Port Data Sampling (MLPDS) (CVE-2018-12127),
+with the first leaking store data, the second load and sometimes
+store data, and the third load data.
+
+The effects and mitigations are similar for all three, so the Linux
+kernel handles and reports them all as a single vulnerability called
+MDS. This also reduces the number of acronyms in use.
+
+Affected processors
+-------------------
+
+This vulnerability affects a wide range of Intel processors.
+Not all CPUs are affected by all of the sub-vulnerabilities,
+but the kernel always handles them the same way.
+
+The vulnerability is not present in
+
+    - Some Atoms (Bonnell, Saltwell, Goldmont, GoldmontPlus)
+
+The kernel will automatically detect future CPUs with hardware
+mitigations for these issues and disable any workarounds.
+
+The kernel reports if the current CPU is vulnerable and any
+mitigations used in
+
+/sys/devices/system/cpu/vulnerabilities/mds
+
+Kernel mitigation
+-----------------
+
+By default, the kernel automatically ensures no data leakage between
+different processes, or between kernel threads and interrupt handlers
+and user processes, or from any cryptographic code in the kernel.
+
+It does not isolate kernel code that only touches data of the
+current process.  If protecting such kernel code is desired,
+mds=full can be specified.
+
+The mitigation is automatically enabled, but can be further controlled
+with the command line options documented below.
+
+The mitigation can be done with microcode support, requiring
+updated microcode.
+
+The microcode should be loaded at early boot using the initrd. Hot
+updating microcode will not enable the mitigations.
+
+Virtual machine mitigation
+--------------------------
+
+The mitigation is enabled by default and controlled by the same options
+as L1TF cache clearing. See l1tf.rst for more details. In the default
+setting MDS for leaking data out of the guest into other processes
+will be mitigated.
+
+Kernel command line options
+---------------------------
+
+Normally the kernel selects reasonable defaults and no special configuration
+is needed. The default behavior can be overridden by the mds= kernel
+command line option.
+
+These options can be specified in the boot loader. Any changes require a reboot.
+
+When the system only runs trusted code, MDS mitigation can be disabled with
+mds=off as a performance optimization.
+
+   - mds=off      Disable workarounds if the CPU is not affected.
+
+By default the kernel only clears CPU data after execution
+that is known or likely to have touched user data of other processes,
+or cryptographic data. This relies on code audits done in the
+mainline Linux kernel. When running large unaudited out of tree code,
+or binary drivers, which might violate these constraints, it is possible
+to use mds=full to always flush the CPU data on each kernel exit.
+
+   - mds=full     Always clear cpu state on exiting from kernel.
+
+TBD describe SMT
+
+References
+----------
+
+For more details on the kernel internal implementation of the MDS mitigations,
+please see Documentation/clearcpu.txt.
+
+TBD Add URL for Intel white paper
+
+TBD add reference to microcodes
-- 
2.17.2


* [MODERATED] [PATCH v4 12/28] MDSv4 25
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (10 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 11/28] MDSv4 21 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 13/28] MDSv4 4 Andi Kleen
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Add basic infrastructure for code to request CPU buffer clearing
on the next kernel exit.

We have two functions: lazy_clear_cpu to request clearing,
and lazy_clear_cpu_interrupt to request clearing if running
in an interrupt.

Non architecture specific code can include linux/clearcpu.h
and use lazy_clear_cpu / lazy_clear_cpu_interrupt. On x86
we provide low level implementations that set the TIF_CLEAR_CPU
bit.
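
The intended flow can be modelled in userspace roughly like this. It is
a sketch under stated assumptions, not the kernel implementation: all
model_* names are made up, a plain bool stands in for TIF_CLEAR_CPU and
for in_interrupt(), and model_kernel_exit() assumes the exit path clears
once per pending request and then resets the flag.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_clear_pending;	/* models TIF_CLEAR_CPU */
static bool model_in_irq;		/* models in_interrupt() */
static unsigned int model_clears_done;	/* counts simulated buffer clears */

/* Request a clear on the next kernel exit; cheap, just sets a flag. */
static void model_lazy_clear_cpu(void)
{
	model_clear_pending = true;
}

/* Request a clear only when called from (modelled) interrupt context. */
static void model_lazy_clear_cpu_interrupt(void)
{
	if (model_in_irq)
		model_lazy_clear_cpu();
}

/* Assumed exit path: clear once if requested, then drop the request. */
static void model_kernel_exit(void)
{
	if (model_clear_pending) {
		model_clear_pending = false;
		model_clears_done++;
	}
}

/* Usage: several requests before one exit collapse into a single clear. */
static void model_lazy_usage(void)
{
	model_lazy_clear_cpu_interrupt();	/* process context: no request */
	model_kernel_exit();
	assert(model_clears_done == 0);

	model_in_irq = true;
	model_lazy_clear_cpu_interrupt();
	model_lazy_clear_cpu_interrupt();
	model_in_irq = false;
	model_kernel_exit();
	assert(model_clears_done == 1);
}
```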

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/Kconfig                    |  3 +++
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/clearcpu.h |  5 +++++
 include/linux/clearcpu.h        | 36 +++++++++++++++++++++++++++++++++
 4 files changed, 45 insertions(+)
 create mode 100644 include/linux/clearcpu.h

diff --git a/arch/Kconfig b/arch/Kconfig
index 4cfb6de48f79..e6b7bf9174aa 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -808,6 +808,9 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config ARCH_HAS_CLEAR_CPU
+	def_bool n
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6185d4f33296..ccf05eff4151 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -84,6 +84,7 @@ config X86
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_WANTS_THP_SWAP		if X86_64
+	select ARCH_HAS_CLEAR_CPU
 	select BUILDTIME_EXTABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_VALIDATE_LAST_CYCLE
diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 67c4e0d38802..35386628be6d 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -41,6 +41,11 @@ static inline void clear_cpu_idle(void)
 	}
 }
 
+static inline void lazy_clear_cpu(void)
+{
+	set_thread_flag(TIF_CLEAR_CPU);
+}
+
 DECLARE_STATIC_KEY_FALSE(force_cpu_clear);
 
 #else
diff --git a/include/linux/clearcpu.h b/include/linux/clearcpu.h
new file mode 100644
index 000000000000..63a6952b46fa
--- /dev/null
+++ b/include/linux/clearcpu.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CLEARCPU_H
+#define _LINUX_CLEARCPU_H 1
+
+#include <linux/preempt.h>
+
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearcpu.h>
+#else
+static inline void lazy_clear_cpu(void)
+{
+}
+#endif
+
+/*
+ * Use this function when potentially touching (reading or writing)
+ * user data in an interrupt. In this case schedule to clear the
+ * CPU buffers on kernel exit to avoid any potential side channels.
+ *
+ * If not in an interrupt we assume the touched data belongs to the
+ * current process and doesn't need to be cleared.
+ *
+ * This version is for code that might be in an interrupt.
+ * If you know for sure you're in interrupt context call
+ * lazy_clear_cpu directly.
+ *
+ * lazy_clear_cpu is reasonably cheap (just sets a bit) and
+ * can be used in fast paths.
+ */
+static inline void lazy_clear_cpu_interrupt(void)
+{
+	if (in_interrupt())
+		lazy_clear_cpu();
+}
+
+#endif
-- 
2.17.2


* [MODERATED] [PATCH v4 13/28] MDSv4 4
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (11 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 12/28] MDSv4 25 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 14/28] MDSv4 17 Andi Kleen
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

On context switch we need to schedule a cpu clear on the next
kernel exit when:

- We're switching between different processes
- We're switching from a kernel thread that is not idle.

For idle we assume only interrupts are sensitive, which
are already handled elsewhere. For kernel threads
like work queues we assume they might contain
sensitive (other users' or crypto) data.

The code hooks into the generic context switch, not
the mm context switch, because the mm context switch
doesn't handle the idle thread case.

This also transfers the clear cpu bit to the next task.
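
The decision can be sketched in userspace as follows. This is a model of
the condition in the diff, not kernel code. struct model_task and the
model_* functions are made up for this example: pid 0 stands for the
idle task, a NULL mm for a kernel thread, and a bool for TIF_CLEAR_CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_task {
	int pid;		/* 0 models the idle task */
	const void *mm;		/* NULL models a kernel thread */
	bool clear_cpu;		/* models TIF_CLEAR_CPU */
};

/* Mirrors: prev->pid && (next->mm != prev->mm || prev->mm == NULL) */
static bool model_switch_needs_clear(const struct model_task *prev,
				     const struct model_task *next)
{
	return prev->pid != 0 &&
	       (next->mm != prev->mm || prev->mm == NULL);
}

static void model_switch_to(struct model_task *prev, struct model_task *next)
{
	if (model_switch_needs_clear(prev, next))
		next->clear_cpu = true;
	/* A pending request follows the CPU, so hand it to the next task. */
	if (prev->clear_cpu) {
		next->clear_cpu = true;
		prev->clear_cpu = false;
	}
}

/* Usage: the cases listed in the commit message. */
static void model_switch_usage(void)
{
	int mm_a, mm_b;		/* only their addresses are used */
	struct model_task idle = { 0, NULL, false };
	struct model_task kthread = { 7, NULL, false };
	struct model_task a1 = { 100, &mm_a, false };
	struct model_task a2 = { 101, &mm_a, false };
	struct model_task b = { 200, &mm_b, false };

	assert(model_switch_needs_clear(&a1, &b));	/* different processes */
	assert(!model_switch_needs_clear(&a1, &a2));	/* threads of one process */
	assert(!model_switch_needs_clear(&idle, &a1));	/* switching from idle */
	assert(model_switch_needs_clear(&kthread, &a1)); /* from a kernel thread */

	a1.clear_cpu = true;	/* request pending on the outgoing task */
	model_switch_to(&a1, &a2);
	assert(!a1.clear_cpu && a2.clear_cpu);
}
```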

Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/process.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/process.h b/arch/x86/kernel/process.h
index 320ab978fb1f..52f97ccbf2dc 100644
--- a/arch/x86/kernel/process.h
+++ b/arch/x86/kernel/process.h
@@ -2,6 +2,7 @@
 //
 // Code shared between 32 and 64 bit
 
+#include <linux/clearcpu.h>
 #include <asm/spec-ctrl.h>
 
 void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p);
@@ -29,6 +30,32 @@ static inline void switch_to_extra(struct task_struct *prev,
 		}
 	}
 
+	/*
+	 * When we switch to a different process, or we switch
+	 * from a kernel thread that was not idle, clear the CPU
+	 * buffers on next kernel exit.
+	 *
+	 * We assume that idle does not touch user data, except
+	 * for interrupts, which schedule their own clears as needed.
+	 * But other kernel threads, like work queues, might
+	 * touch user data, so flush in this case.
+	 *
+	 * This has to be here because switch_mm doesn't get
+	 * called in the kernel thread case.
+	 */
+	if (static_cpu_has(X86_BUG_MDS)) {
+		if (prev->pid && (next->mm != prev->mm || prev->mm == NULL))
+			lazy_clear_cpu();
+		/*
+		 * Also transfer the clearcpu flag from the previous task.
+		 * Can be done non atomically because interrupts are off.
+		 */
+		task_thread_info(next)->flags |=
+			task_thread_info(prev)->flags & _TIF_CLEAR_CPU;
+		task_thread_info(prev)->flags &= ~_TIF_CLEAR_CPU;
+	}
+
+
 	/*
 	 * __switch_to_xtra() handles debug registers, i/o bitmaps,
 	 * speculation mitigations etc.
-- 
2.17.2


* [MODERATED] [PATCH v4 14/28] MDSv4 17
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (12 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 13/28] MDSv4 4 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 15/28] MDSv4 9 Andi Kleen
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  x86/speculation/mds: Add tracing for clear_cpu

Add trace points for clear_cpu and lazy_clear_cpu. This is useful
for debugging and performance testing.

The trace points have to be partially out of line to avoid
include loops, but the fast path jump labels are inlined.

The idle case cannot be traced because trace points
don't like idle context.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearcpu.h       | 36 +++++++++++++++++++++++++--
 arch/x86/include/asm/trace/clearcpu.h | 27 ++++++++++++++++++++
 arch/x86/kernel/cpu/bugs.c            | 17 +++++++++++++
 3 files changed, 78 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/trace/clearcpu.h

diff --git a/arch/x86/include/asm/clearcpu.h b/arch/x86/include/asm/clearcpu.h
index 35386628be6d..935b827a4175 100644
--- a/arch/x86/include/asm/clearcpu.h
+++ b/arch/x86/include/asm/clearcpu.h
@@ -9,12 +9,35 @@
 #include <asm/alternative.h>
 #include <linux/thread_info.h>
 
+/*
+ * We cannot directly include the trace point header here
+ * because it leads to include loops with other trace point
+ * files pulling this one in. Define the static
+ * key manually here, which handles noping the fast path,
+ * and the actual tracing is done out of line.
+ */
+#ifdef CONFIG_TRACEPOINTS
+#include <asm/atomic.h>
+#include <linux/tracepoint-defs.h>
+
+extern struct tracepoint __tracepoint_clear_cpu;
+extern struct tracepoint __tracepoint_lazy_clear_cpu;
+#define cc_tracepoint_active(t) static_key_false(&(t).key)
+
+extern void do_trace_clear_cpu(void);
+extern void do_trace_lazy_clear_cpu(void);
+#else
+#define cc_tracepoint_active(t) false
+static inline void do_trace_clear_cpu(void) {}
+static inline void do_trace_lazy_clear_cpu(void) {}
+#endif
+
 /*
  * Clear CPU buffers to avoid side channels.
  * We use microcode as a side effect of the obsolete VERW instruction
  */
 
-static inline void clear_cpu(void)
+static inline void __clear_cpu(void)
 {
 	unsigned kernel_ds = __KERNEL_DS;
 	/* Has to be memory form, don't modify to use an register */
@@ -22,6 +45,13 @@ static inline void clear_cpu(void)
 		[kernelds] "m" (kernel_ds));
 }
 
+static inline void clear_cpu(void)
+{
+	if (cc_tracepoint_active(__tracepoint_clear_cpu))
+		do_trace_clear_cpu();
+	__clear_cpu();
+}
+
 /*
  * Clear CPU buffers before going idle, so that no state is leaked to SMT
  * siblings taking over thread resources.
@@ -37,12 +67,14 @@ static inline void clear_cpu_idle(void)
 {
 	if (sched_smt_active()) {
 		clear_thread_flag(TIF_CLEAR_CPU);
-		clear_cpu();
+		__clear_cpu();
 	}
 }
 
 static inline void lazy_clear_cpu(void)
 {
+	if (cc_tracepoint_active(__tracepoint_lazy_clear_cpu))
+		do_trace_lazy_clear_cpu();
 	set_thread_flag(TIF_CLEAR_CPU);
 }
 
diff --git a/arch/x86/include/asm/trace/clearcpu.h b/arch/x86/include/asm/trace/clearcpu.h
new file mode 100644
index 000000000000..e742b5cd8ee9
--- /dev/null
+++ b/arch/x86/include/asm/trace/clearcpu.h
@@ -0,0 +1,27 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM clearcpu
+
+#if !defined(_TRACE_CLEARCPU_H) || defined(TRACE_HEADER_MULTI_READ)
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(clear_cpu,
+		    TP_PROTO(int dummy),
+		    TP_ARGS(dummy),
+		    TP_STRUCT__entry(__field(int, dummy)),
+		    TP_fast_assign(),
+		    TP_printk("%d", __entry->dummy));
+
+DEFINE_EVENT(clear_cpu, clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+DEFINE_EVENT(clear_cpu, lazy_clear_cpu, TP_PROTO(int dummy), TP_ARGS(dummy));
+
+#define _TRACE_CLEARCPU_H
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH asm/trace/
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE clearcpu
+#endif /* _TRACE_CLEARCPU_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 582b1cd019f7..e54df06dd462 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -1061,6 +1061,23 @@ early_param("l1tf", l1tf_cmdline);
 
 #undef pr_fmt
 
+#define CREATE_TRACE_POINTS
+#include <asm/trace/clearcpu.h>
+
+void do_trace_clear_cpu(void)
+{
+	trace_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(clear_cpu);
+
+void do_trace_lazy_clear_cpu(void)
+{
+	trace_lazy_clear_cpu(0);
+}
+EXPORT_SYMBOL(do_trace_lazy_clear_cpu);
+EXPORT_TRACEPOINT_SYMBOL(lazy_clear_cpu);
+
 DEFINE_STATIC_KEY_FALSE(force_cpu_clear);
 
 static void mds_select_mitigation(void)
-- 
2.17.2


* [MODERATED] [PATCH v4 15/28] MDSv4 9
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (13 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 14/28] MDSv4 17 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 16/28] MDSv4 6 Andi Kleen
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Force clear cpu on kernel preemption

When the kernel is preempted we need to force a cpu clear,
because the preemption might happen before the code
has a chance to set TIF_CLEAR_CPU later.

We cannot rely on kernel code setting the flag before
touching sensitive data: the flag setting could
be implicit, like in memzero_explicit, which is always
called later.
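
The ordering problem can be sketched with a small user-space model. This
is only an illustration: the model_* names and struct task_model are
hypothetical, the flag stands in for TIF_CLEAR_CPU, and the separate
context-switch handling elsewhere in the series is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct task_model {
	bool clear_cpu_pending;	/* stands in for TIF_CLEAR_CPU */
	unsigned int clears;	/* how many buffer clears were performed */
};

/* Model of lazy_clear_cpu(): only record that a clear is wanted. */
static void model_lazy_clear_cpu(struct task_model *t)
{
	t->clear_cpu_pending = true;
}

/* Model of kernel preemption: the flag is forced before switching away. */
static void model_preempt_schedule(struct task_model *t)
{
	model_lazy_clear_cpu(t);
	/* ... the scheduler would pick another task here ... */
}

/* Model of the exit-to-user path: clear once if a clear was requested. */
static void model_kernel_exit(struct task_model *t)
{
	if (t->clear_cpu_pending) {
		t->clear_cpu_pending = false;
		t->clears++;	/* stands in for the VERW based clear */
	}
}

/*
 * Scenario from the commit message: code touches sensitive data, is
 * preempted, and would only have requested the clear afterwards.
 */
static unsigned int model_preempted_before_flag(void)
{
	struct task_model t = { false, 0 };

	/* sensitive data is touched here, flag not yet set */
	model_preempt_schedule(&t);
	model_kernel_exit(&t);
	return t.clears;
}
```

In this model model_preempted_before_flag() reports one clear even though
the sensitive code never got to set the flag itself.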

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/sched/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a674c7db2f29..b04918e9115c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -11,6 +11,8 @@
 
 #include <linux/kcov.h>
 
+#include <linux/clearcpu.h>
+
 #include <asm/switch_to.h>
 #include <asm/tlb.h>
 
@@ -3619,6 +3621,13 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
+	/*
+	 * For kernel preemption we need to force a cpu clear
+	 * because it could happen before the code has a chance
+	 * to set TIF_CLEAR_CPU.
+	 */
+	lazy_clear_cpu();
+
 	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
-- 
2.17.2


* [MODERATED] [PATCH v4 16/28] MDSv4 6
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (14 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 15/28] MDSv4 9 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 17/28] MDSv4 18 Andi Kleen
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Assume that any code using memzero_explicit() or kzfree() is sensitive
and shouldn't leak any data.

This handles clearing for key data used in the kernel.
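
As a rough user-space sketch of the resulting behaviour (model_* names
are hypothetical, the boolean stands in for TIF_CLEAR_CPU, and the inline
asm is a GCC/Clang approximation of barrier_data()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool model_clear_pending;	/* stands in for TIF_CLEAR_CPU */

static void model_lazy_clear_cpu(void)
{
	model_clear_pending = true;
}

/* Model of memzero_explicit() after the patch. */
static void model_memzero_explicit(void *s, size_t count)
{
	memset(s, 0, count);
	model_lazy_clear_cpu();
	/* Compiler barrier so the memset is not optimised away (GCC/Clang). */
	__asm__ __volatile__("" : : "r"(s) : "memory");
}

/* Model of kzfree() minus the actual free: wiping also flags a clear. */
static bool model_wipe_key(unsigned char *key, size_t len)
{
	model_clear_pending = false;
	model_memzero_explicit(key, len);
	return model_clear_pending;
}
```

Any caller that wipes key material this way gets the lazy clear without
having to know about it.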

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 lib/string.c     | 6 ++++++
 mm/slab_common.c | 5 ++++-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/lib/string.c b/lib/string.c
index 38e4ca08e757..9ce59dd86541 100644
--- a/lib/string.c
+++ b/lib/string.c
@@ -28,6 +28,7 @@
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
+#include <linux/clearcpu.h>
 
 #include <asm/byteorder.h>
 #include <asm/word-at-a-time.h>
@@ -715,12 +716,17 @@ EXPORT_SYMBOL(memset);
  * necessary, memzero_explicit() should be used instead in
  * order to prevent the compiler from optimising away zeroing.
  *
+ * As a side effect this may also trigger extra cleaning
+ * of CPU state before the next kernel exit to avoid
+ * side channels.
+ *
  * memzero_explicit() doesn't need an arch-specific version as
  * it just invokes the one of memset() implicitly.
  */
 void memzero_explicit(void *s, size_t count)
 {
 	memset(s, 0, count);
+	lazy_clear_cpu();
 	barrier_data(s);
 }
 EXPORT_SYMBOL(memzero_explicit);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 81732d05e74a..7b5e2e1318a2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1576,6 +1576,9 @@ EXPORT_SYMBOL(krealloc);
  * Note: this function zeroes the whole allocated buffer which can be a good
  * deal bigger than the requested buffer size passed to kmalloc(). So be
  * careful when using this function in performance sensitive code.
+ *
+ * As a side effect this may also clear CPU state later before the
+ * next kernel exit to avoid side channels.
  */
 void kzfree(const void *p)
 {
@@ -1585,7 +1588,7 @@ void kzfree(const void *p)
 	if (unlikely(ZERO_OR_NULL_PTR(mem)))
 		return;
 	ks = ksize(mem);
-	memset(mem, 0, ks);
+	memzero_explicit(mem, ks);
 	kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.17.2


* [MODERATED] [PATCH v4 17/28] MDSv4 18
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (15 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 16/28] MDSv4 6 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 18/28] MDSv4 26 Andi Kleen
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Mark interrupts clear cpu, unless opted-out

Interrupts might touch user data from other processes
in any context.

By default we clear the CPU on the next kernel exit.

Add a new IRQF_NO_USER interrupt flag. When the flag
is not set on interrupt execution we clear the cpu state on
next kernel exit.

This allows interrupts to opt-out from the extra clearing
overhead, but is safe by default.

Over time as more interrupt code is audited it can set the opt-out.
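
The effect on a shared interrupt line can be sketched in user space. The
model_* names and struct model_action are hypothetical stand-ins for the
irqaction chain; only the flag value is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_IRQF_NO_USER 0x00080000u	/* same value as in the patch */

/* Hypothetical stand-in for one entry of a shared-IRQ action chain. */
struct model_action {
	void (*handler)(void);
	unsigned int flags;
	struct model_action *next;
};

static bool model_clear_pending;	/* stands in for TIF_CLEAR_CPU */

static void model_handler(void) { }

/* Run every handler on the line; report whether a clear got scheduled. */
static bool model_handle_irq(struct model_action *chain)
{
	model_clear_pending = false;
	for (struct model_action *a = chain; a != NULL; a = a->next) {
		a->handler();
		if (!(a->flags & MODEL_IRQF_NO_USER))
			model_clear_pending = true;
	}
	return model_clear_pending;
}

/* Two handlers share a line; the first one is always opted out. */
static bool model_shared_irq_demo(bool second_opted_out)
{
	struct model_action b = {
		model_handler, second_opted_out ? MODEL_IRQF_NO_USER : 0, NULL
	};
	struct model_action a = { model_handler, MODEL_IRQF_NO_USER, &b };

	return model_handle_irq(&a);
}
```

A single unaudited handler on the line is enough to schedule the clear,
which is what makes the default safe.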

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 2 ++
 kernel/irq/handle.c       | 8 ++++++++
 kernel/irq/manage.c       | 1 +
 3 files changed, 11 insertions(+)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c672f34235e7..291b7fee3afe 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -61,6 +61,7 @@
  *                interrupt handler after suspending interrupts. For system
  *                wakeup devices users need to implement wakeup detection in
  *                their interrupt handlers.
+ * IRQF_NO_USER	- Interrupt does not touch user data
  */
 #define IRQF_SHARED		0x00000080
 #define IRQF_PROBE_SHARED	0x00000100
@@ -74,6 +75,7 @@
 #define IRQF_NO_THREAD		0x00010000
 #define IRQF_EARLY_RESUME	0x00020000
 #define IRQF_COND_SUSPEND	0x00040000
+#define IRQF_NO_USER		0x00080000
 
 #define IRQF_TIMER		(__IRQF_TIMER | IRQF_NO_SUSPEND | IRQF_NO_THREAD)
 
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 38554bc35375..e5910938ce2b 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -13,6 +13,7 @@
 #include <linux/sched.h>
 #include <linux/interrupt.h>
 #include <linux/kernel_stat.h>
+#include <linux/clearcpu.h>
 
 #include <trace/events/irq.h>
 
@@ -149,6 +150,13 @@ irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc, unsigned int *flags
 		res = action->handler(irq, action->dev_id);
 		trace_irq_handler_exit(irq, action, res);
 
+		/*
+		 * We aren't sure if the interrupt handler did or did not
+		 * touch user data. Schedule a cpu clear just in case.
+		 */
+		if (!(action->flags & IRQF_NO_USER))
+			lazy_clear_cpu();
+
 		if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pF enabled interrupts\n",
 			      irq, action->handler))
 			local_irq_disable();
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index a4888ce4667a..3f0c99240638 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1793,6 +1793,7 @@ EXPORT_SYMBOL(free_irq);
  *
  *	IRQF_SHARED		Interrupt is shared
  *	IRQF_TRIGGER_*		Specify active edge(s) or level
+ *	IRQF_NO_USER		Does not touch user data.
  *
  */
 int request_threaded_irq(unsigned int irq, irq_handler_t handler,
-- 
2.17.2


* [MODERATED] [PATCH v4 18/28] MDSv4 26
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (16 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 17/28] MDSv4 18 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 19/28] MDSv4 14 Andi Kleen
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Clear cpu on all timers, unless the timer
 opts-out

By default we assume timers might touch user data and schedule
a cpu clear on next kernel exit.

Support opt-outs where timer and hrtimer handlers can declare
that they don't touch any user data.

Note this takes one bit away from the timer wheel index field,
but it seems there are fewer wheels available anyway, so that
should be ok.
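
The arithmetic behind that can be checked with a small stand-alone
sketch. The mask and shift values are the ones from the timer.h hunk in
this patch; index_capacity() and wheel_fits() are hypothetical helpers,
and any wheel size passed in (for example 64 buckets per level times 8
or 9 levels) is an assumption not taken from this patch and should be
verified against kernel/time/timer.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values before and after the patch, from include/linux/timer.h. */
#define OLD_ARRAYSHIFT 22
#define OLD_ARRAYMASK  0xFFC00000u
#define NEW_ARRAYSHIFT 23
#define NEW_ARRAYMASK  0xFF800000u
#define TIMER_NO_USER  0x00400000u

/* Number of distinct bucket indices the index field can hold. */
static uint32_t index_capacity(uint32_t mask, unsigned int shift)
{
	return (mask >> shift) + 1;
}

/* True if every index in [0, wheel_size) fits into the index field. */
static bool wheel_fits(uint32_t wheel_size, uint32_t mask, unsigned int shift)
{
	return wheel_size <= index_capacity(mask, shift);
}
```

The field shrinks from 10 bits (1024 indices) to 9 bits (512 indices), so
the "should be ok" holds only if the wheel never needs an index above 511.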

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/hrtimer.h | 4 ++++
 include/linux/timer.h   | 9 ++++++---
 kernel/time/hrtimer.c   | 5 +++++
 kernel/time/timer.c     | 8 ++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 2e8957eac4d4..b32c76919f78 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -32,6 +32,7 @@ struct hrtimer_cpu_base;
  *				  when starting the timer)
  * HRTIMER_MODE_SOFT		- Timer callback function will be executed in
  *				  soft irq context
+ * HRTIMER_MODE_NO_USER		- Handler does not touch user data.
  */
 enum hrtimer_mode {
 	HRTIMER_MODE_ABS	= 0x00,
@@ -48,6 +49,7 @@ enum hrtimer_mode {
 	HRTIMER_MODE_ABS_PINNED_SOFT = HRTIMER_MODE_ABS_PINNED | HRTIMER_MODE_SOFT,
 	HRTIMER_MODE_REL_PINNED_SOFT = HRTIMER_MODE_REL_PINNED | HRTIMER_MODE_SOFT,
 
+	HRTIMER_MODE_NO_USER	= 0x08,
 };
 
 /*
@@ -101,6 +103,7 @@ enum hrtimer_restart {
  * @state:	state information (See bit values above)
  * @is_rel:	Set if the timer was armed relative
  * @is_soft:	Set if hrtimer will be expired in soft interrupt context.
+ * @no_user:	function does not touch user data.
  *
  * The hrtimer structure must be initialized by hrtimer_init()
  */
@@ -112,6 +115,7 @@ struct hrtimer {
 	u8				state;
 	u8				is_rel;
 	u8				is_soft;
+	u8				no_user;
 };
 
 /**
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 7b066fd38248..222e72432be3 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -56,10 +56,13 @@ struct timer_list {
 #define TIMER_DEFERRABLE	0x00080000
 #define TIMER_PINNED		0x00100000
 #define TIMER_IRQSAFE		0x00200000
-#define TIMER_ARRAYSHIFT	22
-#define TIMER_ARRAYMASK		0xFFC00000
+#define TIMER_NO_USER		0x00400000
+#define TIMER_ARRAYSHIFT	23
+#define TIMER_ARRAYMASK		0xFF800000
 
-#define TIMER_TRACE_FLAGMASK	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE)
+#define TIMER_TRACE_FLAGMASK	\
+	(TIMER_MIGRATING | TIMER_DEFERRABLE | TIMER_PINNED | TIMER_IRQSAFE | \
+	 TIMER_NO_USER)
 
 #define __TIMER_INITIALIZER(_function, _flags) {		\
 		.entry = { .next = TIMER_ENTRY_STATIC },	\
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index f5cfa1b73d6f..e2c8776ba2a4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -42,6 +42,7 @@
 #include <linux/timer.h>
 #include <linux/freezer.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 
@@ -1276,6 +1277,7 @@ static void __hrtimer_init(struct hrtimer *timer, clockid_t clock_id,
 		clock_id = CLOCK_MONOTONIC;
 
 	base += hrtimer_clockid_to_base(clock_id);
+	timer->no_user = !!(mode & HRTIMER_MODE_NO_USER);
 	timer->is_soft = softtimer;
 	timer->base = &cpu_base->clock_base[base];
 	timerqueue_init(&timer->node);
@@ -1390,6 +1392,9 @@ static void __run_hrtimer(struct hrtimer_cpu_base *cpu_base,
 	trace_hrtimer_expire_exit(timer);
 	raw_spin_lock_irq(&cpu_base->lock);
 
+	if (!timer->no_user)
+		lazy_clear_cpu();
+
 	/*
 	 * Note: We clear the running state after enqueue_hrtimer and
 	 * we do not reprogram the event hardware. Happens either in
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 444156debfa0..e6ab6986ffc8 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -43,6 +43,7 @@
 #include <linux/sched/debug.h>
 #include <linux/slab.h>
 #include <linux/compat.h>
+#include <linux/clearcpu.h>
 
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
@@ -1338,6 +1339,13 @@ static void call_timer_fn(struct timer_list *timer, void (*fn)(struct timer_list
 		 */
 		preempt_count_set(count);
 	}
+
+	/*
+	 * The timer might have touched user data. Schedule
+	 * a cpu clear on the next kernel exit.
+	 */
+	if (!(timer->flags & TIMER_NO_USER))
+		lazy_clear_cpu();
 }
 
 static void expire_timers(struct timer_base *base, struct hlist_head *head)
-- 
2.17.2


* [MODERATED] [PATCH v4 19/28] MDSv4 14
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (17 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 18/28] MDSv4 26 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 20/28] MDSv4 23 Andi Kleen
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

By default we assume tasklets might touch user data and schedule
a cpu clear on next kernel exit.

Add new interfaces to allow audited tasklets to opt-out.
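
Note that the tasklet state word then carries two kinds of values: the
existing TASKLET_STATE_* entries are bit numbers, while the new flag is
a bit mask. A minimal user-space sketch of why they can coexist (the
model_* helpers are hypothetical, non-atomic stand-ins for set_bit() and
clear_bit()):

```c
#include <assert.h>
#include <stdbool.h>

enum {
	MODEL_STATE_SCHED = 0,		/* bit number */
	MODEL_STATE_RUN   = 1,		/* bit number */
	MODEL_NO_USER     = 1 << 5,	/* bit mask, as in the patch */
};

/* Simplified, non-atomic stand-in for set_bit(). */
static unsigned long model_set_bit(int nr, unsigned long word)
{
	return word | (1UL << nr);
}

/* Simplified, non-atomic stand-in for clear_bit(). */
static unsigned long model_clear_bit(int nr, unsigned long word)
{
	return word & ~(1UL << nr);
}

/* True if the opt-out flag survives a schedule/run/finish cycle. */
static bool model_flag_survives(unsigned long initial_state)
{
	unsigned long state = initial_state;

	state = model_set_bit(MODEL_STATE_SCHED, state);
	state = model_set_bit(MODEL_STATE_RUN, state);
	state = model_clear_bit(MODEL_STATE_SCHED, state);
	state = model_clear_bit(MODEL_STATE_RUN, state);
	return (state & MODEL_NO_USER) != 0;
}
```

This only holds as long as the state word is manipulated bit by bit; code
that assigns the whole word would drop the opt-out.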

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/interrupt.h | 16 +++++++++++++++-
 kernel/softirq.c          | 25 +++++++++++++++++++------
 2 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 291b7fee3afe..81b852fb5ecf 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -571,11 +571,22 @@ struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }
 #define DECLARE_TASKLET_DISABLED(name, func, data) \
 struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
 
+#define DECLARE_TASKLET_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(0), func, data }
+
+#define DECLARE_TASKLET_DISABLED_NOUSER(name, func, data) \
+struct tasklet_struct name = { NULL, TASKLET_NO_USER, ATOMIC_INIT(1), func, data }
 
 enum
 {
 	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
-	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
+	TASKLET_STATE_RUN,	/* Tasklet is running (SMP only) */
+
+	/*
+	 * Set this flag when the tasklet is known to not touch user data,
+	 * so doesn't need extra CPU state clearing.
+	 */
+	TASKLET_NO_USER		= 1 << 5,
 };
 
 #ifdef CONFIG_SMP
@@ -639,6 +650,9 @@ extern void tasklet_kill(struct tasklet_struct *t);
 extern void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
 extern void tasklet_init(struct tasklet_struct *t,
 			 void (*func)(unsigned long), unsigned long data);
+extern void tasklet_init_flags(struct tasklet_struct *t,
+			 void (*func)(unsigned long), unsigned long data,
+			 unsigned flags);
 
 struct tasklet_hrtimer {
 	struct hrtimer		timer;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d28813306b2c..fdd4e3be3db7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
 #include <linux/smpboot.h>
 #include <linux/tick.h>
 #include <linux/irq.h>
+#include <linux/clearcpu.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/irq.h>
@@ -522,6 +523,8 @@ static void tasklet_action_common(struct softirq_action *a,
 					BUG();
 				t->func(t->data);
 				tasklet_unlock(t);
+				if (!(t->state & TASKLET_NO_USER))
+					lazy_clear_cpu();
 				continue;
 			}
 			tasklet_unlock(t);
@@ -546,15 +549,23 @@ static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
 	tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
 }
 
-void tasklet_init(struct tasklet_struct *t,
-		  void (*func)(unsigned long), unsigned long data)
+void tasklet_init_flags(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data,
+		  unsigned flags)
 {
 	t->next = NULL;
-	t->state = 0;
+	t->state = flags;
 	atomic_set(&t->count, 0);
 	t->func = func;
 	t->data = data;
 }
+EXPORT_SYMBOL(tasklet_init_flags);
+
+void tasklet_init(struct tasklet_struct *t,
+		  void (*func)(unsigned long), unsigned long data)
+{
+	tasklet_init_flags(t, func, data, 0);
+}
 EXPORT_SYMBOL(tasklet_init);
 
 void tasklet_kill(struct tasklet_struct *t)
@@ -609,7 +620,8 @@ static void __tasklet_hrtimer_trampoline(unsigned long data)
  * @ttimer:	 tasklet_hrtimer which is initialized
  * @function:	 hrtimer callback function which gets called from softirq context
  * @which_clock: clock id (CLOCK_MONOTONIC/CLOCK_REALTIME)
- * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL)
+ * @mode:	 hrtimer mode (HRTIMER_MODE_ABS/HRTIMER_MODE_REL),
+ *		 optionally ORed with HRTIMER_MODE_NO_USER
  */
 void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 			  enum hrtimer_restart (*function)(struct hrtimer *),
@@ -617,8 +629,9 @@ void tasklet_hrtimer_init(struct tasklet_hrtimer *ttimer,
 {
 	hrtimer_init(&ttimer->timer, which_clock, mode);
 	ttimer->timer.function = __hrtimer_tasklet_trampoline;
-	tasklet_init(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
-		     (unsigned long)ttimer);
+	tasklet_init_flags(&ttimer->tasklet, __tasklet_hrtimer_trampoline,
+		     (unsigned long)ttimer,
+		     (mode & HRTIMER_MODE_NO_USER) ? TASKLET_NO_USER : 0);
 	ttimer->function = function;
 }
 EXPORT_SYMBOL_GPL(tasklet_hrtimer_init);
-- 
2.17.2


* [MODERATED] [PATCH v4 20/28] MDSv4 23
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (18 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 19/28] MDSv4 14 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 21/28] MDSv4 15 Andi Kleen
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Clear CPU on irq poll, unless opted-out

By default we assume that irq poll handlers running in the irq poll
softirq might touch user data and we schedule a cpu clear on next
kernel exit.

Add interfaces for audited handlers to declare that they are safe.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/irq_poll.h |  2 ++
 lib/irq_poll.c           | 18 ++++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/irq_poll.h b/include/linux/irq_poll.h
index 16aaeccb65cb..5f13582f1b8e 100644
--- a/include/linux/irq_poll.h
+++ b/include/linux/irq_poll.h
@@ -15,6 +15,8 @@ struct irq_poll {
 enum {
 	IRQ_POLL_F_SCHED	= 0,
 	IRQ_POLL_F_DISABLE	= 1,
+
+	IRQ_POLL_F_NO_USER	= 1<<4,
 };
 
 extern void irq_poll_sched(struct irq_poll *);
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 86a709954f5a..cb19431f53ec 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -11,6 +11,7 @@
 #include <linux/cpu.h>
 #include <linux/irq_poll.h>
 #include <linux/delay.h>
+#include <linux/clearcpu.h>
 
 static unsigned int irq_poll_budget __read_mostly = 256;
 
@@ -111,6 +112,9 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
+		if (!(iop->state & IRQ_POLL_F_NO_USER))
+			lazy_clear_cpu();
+
 		local_irq_disable();
 
 		/*
@@ -168,21 +172,31 @@ void irq_poll_enable(struct irq_poll *iop)
 EXPORT_SYMBOL(irq_poll_enable);
 
 /**
- * irq_poll_init - Initialize this @iop
+ * irq_poll_init_flags - Initialize this @iop
  * @iop:      The parent iopoll structure
  * @weight:   The default weight (or command completion budget)
  * @poll_fn:  The handler to invoke
+ * @flags:    IRQ_POLL_F_NO_USER if callback does not touch user data.
  *
  * Description:
  *     Initialize and enable this irq_poll structure.
  **/
-void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+void irq_poll_init_flags(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn,
+			 int flags)
 {
 	memset(iop, 0, sizeof(*iop));
 	INIT_LIST_HEAD(&iop->list);
 	iop->weight = weight;
 	iop->poll = poll_fn;
+	iop->state = flags;
 }
+EXPORT_SYMBOL(irq_poll_init_flags);
+
+void irq_poll_init(struct irq_poll *iop, int weight, irq_poll_fn *poll_fn)
+{
+	irq_poll_init_flags(iop, weight, poll_fn, 0);
+}
+
 EXPORT_SYMBOL(irq_poll_init);
 
 static int irq_poll_cpu_dead(unsigned int cpu)
-- 
2.17.2


* [MODERATED] [PATCH v4 21/28] MDSv4 15
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (19 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 20/28] MDSv4 23 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 22/28] MDSv4 5 Andi Kleen
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Schedule a clear cpu on next kernel exit for string PIO
or memcpy_from/to_io calls, when they are called in
interrupts.

The PIO case is likely already covered, because old drivers
will not have opted their interrupt handlers out of clearing,
but let's do it just to be sure.
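
lazy_clear_cpu_interrupt() is defined elsewhere in the series and is not
shown in this patch. Assuming it only schedules a clear when called from
interrupt context, the behaviour can be sketched in user space as follows
(all model_* names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool model_in_interrupt;		/* stands in for in_interrupt() */
static bool model_clear_pending;	/* stands in for TIF_CLEAR_CPU */

/* Assumed behaviour: schedule a clear only in interrupt context. */
static void model_lazy_clear_cpu_interrupt(void)
{
	if (model_in_interrupt)
		model_clear_pending = true;
}

/* Stand-in for memcpy_fromio() with the hook added by the patch. */
static void model_memcpy_fromio(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	model_lazy_clear_cpu_interrupt();
}

/* Copy a few bytes and report whether a clear got scheduled. */
static bool model_copy_schedules_clear(bool in_interrupt)
{
	char src[4] = "abc", dst[4];

	model_in_interrupt = in_interrupt;
	model_clear_pending = false;
	model_memcpy_fromio(dst, src, sizeof(src));
	return model_clear_pending;
}
```

In process context the copy is assumed to touch only the current
process's data, so nothing is scheduled there.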

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/io.h | 3 +++
 include/asm-generic/io.h  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 686247db3106..19e2208eaa94 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -40,6 +40,7 @@
 
 #include <linux/string.h>
 #include <linux/compiler.h>
+#include <linux/clearcpu.h>
 #include <asm/page.h>
 #include <asm/early_ioremap.h>
 #include <asm/pgtable_types.h>
@@ -321,6 +322,7 @@ static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 			     : "+S"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }									\
 									\
 static inline void ins##bwl(int port, void *addr, unsigned long count)	\
@@ -337,6 +339,7 @@ static inline void ins##bwl(int port, void *addr, unsigned long count)	\
 			     : "+D"(addr), "+c"(count)			\
 			     : "d"(port) : "memory");			\
 	}								\
+	lazy_clear_cpu_interrupt();					\
 }
 
 BUILDIO(b, b, char)
diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index d356f802945a..cf58bceea042 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -14,6 +14,7 @@
 #include <asm/page.h> /* I/O is all done through memory accesses */
 #include <linux/string.h> /* for memset() and memcpy() */
 #include <linux/types.h>
+#include <linux/clearcpu.h>
 
 #ifdef CONFIG_GENERIC_IOMAP
 #include <asm-generic/iomap.h>
@@ -1115,6 +1116,7 @@ static inline void memcpy_fromio(void *buffer,
 				 size_t size)
 {
 	memcpy(buffer, __io_virt(addr), size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
@@ -1132,6 +1134,7 @@ static inline void memcpy_toio(volatile void __iomem *addr, const void *buffer,
 			       size_t size)
 {
 	memcpy(__io_virt(addr), buffer, size);
+	lazy_clear_cpu_interrupt();
 }
 #endif
 
-- 
2.17.2


* [MODERATED] [PATCH v4 22/28] MDSv4 5
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (20 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 21/28] MDSv4 15 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 23/28] MDSv4 13 Andi Kleen
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Schedule clear cpu in swiotlb

Schedule a cpu clear on next kernel exit for swiotlb running
in interrupt context, since it touches user data with the CPU.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/dma/swiotlb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d6361776dc5c..e11ff1e45a4c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -34,6 +34,7 @@
 #include <linux/scatterlist.h>
 #include <linux/mem_encrypt.h>
 #include <linux/set_memory.h>
+#include <linux/clearcpu.h>
 
 #include <asm/io.h>
 #include <asm/dma.h>
@@ -420,6 +421,7 @@ static void swiotlb_bounce(phys_addr_t orig_addr, phys_addr_t tlb_addr,
 	} else {
 		memcpy(phys_to_virt(orig_addr), vaddr, size);
 	}
+	lazy_clear_cpu_interrupt();
 }
 
 phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
-- 
2.17.2


* [MODERATED] [PATCH v4 23/28] MDSv4 13
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (21 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 22/28] MDSv4 5 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 24/28] MDSv4 28 Andi Kleen
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Instrument skb functions to clear cpu
 automatically

Instrument some strategic skbuff functions that either touch
packet data directly, or are likely followed by a user
data touch like a memcpy, to schedule a cpu clear on next
kernel exit. This is only done inside interrupts;
outside of them we assume the code only touches the current process's data.

In principle network data should be encrypted anyways,
but it's better to not leak it.

This provides protection for the network softirq.

Needs more auditing.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 26 ++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 93f56fddd92a..5e147afa07e4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -40,6 +40,7 @@
 #include <linux/in6.h>
 #include <linux/if_packet.h>
 #include <net/flow.h>
+#include <linux/clearcpu.h>
 
 /* The interface for checksum offload between the stack and networking drivers
  * is as follows...
@@ -2093,6 +2094,7 @@ static inline void *__skb_put(struct sk_buff *skb, unsigned int len)
 	SKB_LINEAR_ASSERT(skb);
 	skb->tail += len;
 	skb->len  += len;
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37317ffec146..eda9ef0ff63d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1189,6 +1189,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	if (!num_frags)
 		goto release;
 
+	/* Likely to copy user data */
+	lazy_clear_cpu_interrupt();
+
 	new_frags = (__skb_pagelen(skb) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	for (i = 0; i < new_frags; i++) {
 		page = alloc_page(gfp_mask);
@@ -1353,6 +1356,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 	if (!n)
 		return NULL;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	/* Set the data pointer */
 	skb_reserve(n, headerlen);
 	/* Set the tail pointer and length */
@@ -1588,6 +1594,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	if (!n)
 		return NULL;
 
+	/* May copy user data */
+	lazy_clear_cpu_interrupt();
+
 	skb_reserve(n, newheadroom);
 
 	/* Set the tail pointer and length */
@@ -1676,6 +1685,8 @@ EXPORT_SYMBOL(__skb_pad);
 
 void *pskb_put(struct sk_buff *skb, struct sk_buff *tail, int len)
 {
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	if (tail != skb) {
 		skb->data_len += len;
 		skb->len += len;
@@ -1701,6 +1712,8 @@ void *skb_put(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->tail > skb->end))
 		skb_over_panic(skb, len, __builtin_return_address(0));
+	/* Likely to be followed by a user data copy */
+	lazy_clear_cpu_interrupt();
 	return tmp;
 }
 EXPORT_SYMBOL(skb_put);
@@ -1720,6 +1733,7 @@ void *skb_push(struct sk_buff *skb, unsigned int len)
 	skb->len  += len;
 	if (unlikely(skb->data < skb->head))
 		skb_under_panic(skb, len, __builtin_return_address(0));
+	/* No clear cpu, assume this is only header data */
 	return skb->data;
 }
 EXPORT_SYMBOL(skb_push);
@@ -2026,6 +2040,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2387,6 +2404,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len)
 	struct sk_buff *frag_iter;
 	int i, copy;
 
+	/* Copies user data */
+	lazy_clear_cpu_interrupt();
+
 	if (offset > (int)skb->len - len)
 		goto fault;
 
@@ -2467,6 +2487,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Checksum header. */
 	if (copy > 0) {
 		if (copy > len)
@@ -2559,6 +2582,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 	struct sk_buff *frag_iter;
 	int pos = 0;
 
+	/* Reads packet data */
+	lazy_clear_cpu_interrupt();
+
 	/* Copy header. */
 	if (copy > 0) {
 		if (copy > len)
-- 
2.17.2


* [MODERATED] [PATCH v4 24/28] MDSv4 28
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (22 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 23/28] MDSv4 13 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 25/28] MDSv4 1 Andi Kleen
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Mark the tcp tasklet as not needing an implicit cpu clear
flush. If any is needed it will be triggered by the skb_*
hooks.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 net/ipv4/tcp_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 730bc44dbad9..06bc635a54ca 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -903,9 +903,10 @@ void __init tcp_tasklet_init(void)
 		struct tsq_tasklet *tsq = &per_cpu(tsq_tasklet, i);
 
 		INIT_LIST_HEAD(&tsq->head);
-		tasklet_init(&tsq->tasklet,
+		tasklet_init_flags(&tsq->tasklet,
 			     tcp_tasklet_func,
-			     (unsigned long)tsq);
+			     (unsigned long)tsq,
+			     TASKLET_NO_USER);
 	}
 }
 
-- 
2.17.2

* [MODERATED] [PATCH v4 25/28] MDSv4 1
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (23 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 24/28] MDSv4 28 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 26/28] MDSv4 27 Andi Kleen
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: mark kernel/* timers safe as not touching user
 data

Some preliminary auditing of kernel/* shows no timers touch
other processes' user data. Mark all the timers in kernel/*
as not needing an implicit cpu clear.

More auditing here would be useful.
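
As an illustration of the marking scheme, here is a minimal userspace
sketch of how a NO_USER bit OR-ed into a timer mode could be consulted
before scheduling a clear. The flag values, struct and function names
are hypothetical stand-ins and are not the kernel's definitions; how
the real flag is consumed is defined by the earlier patches in this
series, not by this sketch.

```c
#include <stdbool.h>

/* Hypothetical mode bits for illustration only; not the kernel's values. */
enum model_mode {
	MODEL_MODE_ABS     = 0x00,
	MODEL_MODE_REL     = 0x01,
	MODEL_MODE_PINNED  = 0x02,
	MODEL_MODE_NO_USER = 0x10,	/* handler is audited not to touch user data */
};

/* Hypothetical timer that just remembers the mode it was initialised with. */
struct model_timer {
	unsigned int mode;
};

static void model_timer_init(struct model_timer *t, unsigned int mode)
{
	t->mode = mode;
}

/*
 * Assumed policy: a timer handler schedules a cpu clear by default and
 * only skips it when the timer was explicitly marked NO_USER.
 */
static bool model_timer_needs_clear(const struct model_timer *t)
{
	return !(t->mode & MODEL_MODE_NO_USER);
}
```

The point of the opt-out design is that an unaudited timer stays on the
safe side: forgetting the flag costs performance, not security.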

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 kernel/events/core.c       | 6 ++++--
 kernel/fork.c              | 3 ++-
 kernel/futex.c             | 6 +++---
 kernel/sched/core.c        | 5 +++--
 kernel/sched/deadline.c    | 6 ++++--
 kernel/sched/fair.c        | 6 ++++--
 kernel/sched/idle.c        | 3 ++-
 kernel/sched/rt.c          | 3 ++-
 kernel/time/alarmtimer.c   | 2 +-
 kernel/time/hrtimer.c      | 6 +++---
 kernel/time/posix-timers.c | 6 ++++--
 kernel/time/sched_clock.c  | 3 ++-
 kernel/time/tick-sched.c   | 6 ++++--
 kernel/watchdog.c          | 3 ++-
 14 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3cd13a30f732..5d9a4ed0cf58 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1102,7 +1102,8 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
 	raw_spin_lock_init(&cpuctx->hrtimer_lock);
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
@@ -9202,7 +9203,8 @@ static void perf_swevent_init_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hwc->hrtimer.function = perf_swevent_hrtimer;
 
 	/*
diff --git a/kernel/fork.c b/kernel/fork.c
index a60459947f18..d1edd0bce062 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1541,7 +1541,8 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 
 #ifdef CONFIG_POSIX_TIMERS
 	INIT_LIST_HEAD(&sig->posix_timers);
-	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sig->real_timer.function = it_real_fn;
 #endif
 
diff --git a/kernel/futex.c b/kernel/futex.c
index be3bff2315ff..4ac7a412f04b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2691,7 +2691,7 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
 
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
@@ -2792,7 +2792,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
 	if (time) {
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, CLOCK_REALTIME,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires(&to->timer, *time);
 	}
@@ -3192,7 +3192,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
 		to = &timeout;
 		hrtimer_init_on_stack(&to->timer, (flags & FLAGS_CLOCKRT) ?
 				      CLOCK_REALTIME : CLOCK_MONOTONIC,
-				      HRTIMER_MODE_ABS);
+				      HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 		hrtimer_init_sleeper(to, current);
 		hrtimer_set_expires_range_ns(&to->timer, *abs_time,
 					     current->timer_slack_ns);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b04918e9115c..6ca60c91cf30 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -302,7 +302,7 @@ void hrtick_start(struct rq *rq, u64 delay)
 	 */
 	delay = max_t(u64, delay, 10000LL);
 	hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delay),
-		      HRTIMER_MODE_REL_PINNED);
+		      HRTIMER_MODE_REL_PINNED|HRTIMER_MODE_NO_USER);
 }
 #endif /* CONFIG_SMP */
 
@@ -316,7 +316,8 @@ static void hrtick_rq_init(struct rq *rq)
 	rq->hrtick_csd.info = rq;
 #endif
 
-	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&rq->hrtick_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rq->hrtick_timer.function = hrtick;
 }
 #else	/* CONFIG_SCHED_HRTICK */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fb8b7b5d745d..dce637e0b3bd 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1054,7 +1054,8 @@ void init_dl_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->dl_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = dl_task_timer;
 }
 
@@ -1293,7 +1294,8 @@ void init_dl_inactive_task_timer(struct sched_dl_entity *dl_se)
 {
 	struct hrtimer *timer = &dl_se->inactive_timer;
 
-	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	timer->function = inactive_task_timer;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5a1bd4a1a46..b9d2a617b105 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4889,9 +4889,11 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
-	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS_PINNED|HRTIMER_MODE_NO_USER);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
-	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	cfs_b->slack_timer.function = sched_cfs_slack_timer;
 	cfs_b->distribute_running = 0;
 }
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index f5516bae0c1b..6a4cc46d8c4b 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -330,7 +330,8 @@ void play_idle(unsigned long duration_ms)
 	cpuidle_use_deepest_state(true);
 
 	it.done = 0;
-	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init_on_stack(&it.timer, CLOCK_MONOTONIC,
+			      HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	it.timer.function = idle_inject_timer_fn;
 	hrtimer_start(&it.timer, ms_to_ktime(duration_ms), HRTIMER_MODE_REL_PINNED);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e4f398ad9e73..24b90b260682 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -46,7 +46,8 @@ void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
 	raw_spin_lock_init(&rt_b->rt_runtime_lock);
 
 	hrtimer_init(&rt_b->rt_period_timer,
-			CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+			CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	rt_b->rt_period_timer.function = sched_rt_period_timer;
 }
 
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 2c97e8c2d29f..f2efd9b5d0b7 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -344,7 +344,7 @@ void alarm_init(struct alarm *alarm, enum alarmtimer_type type,
 		enum alarmtimer_restart (*function)(struct alarm *, ktime_t))
 {
 	hrtimer_init(&alarm->timer, alarm_bases[type].base_clockid,
-		     HRTIMER_MODE_ABS);
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	__alarm_init(alarm, type, function);
 }
 EXPORT_SYMBOL_GPL(alarm_init);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e2c8776ba2a4..58beefd3543a 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1713,7 +1713,7 @@ static long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
 	int ret;
 
 	hrtimer_init_on_stack(&t.timer, restart->nanosleep.clockid,
-				HRTIMER_MODE_ABS);
+				HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_tv64(&t.timer, restart->nanosleep.expires);
 
 	ret = do_nanosleep(&t, HRTIMER_MODE_ABS);
@@ -1733,7 +1733,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
 	if (dl_task(current) || rt_task(current))
 		slack = 0;
 
-	hrtimer_init_on_stack(&t.timer, clockid, mode);
+	hrtimer_init_on_stack(&t.timer, clockid, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, timespec64_to_ktime(*rqtp), slack);
 	ret = do_nanosleep(&t, mode);
 	if (ret != -ERESTART_RESTARTBLOCK)
@@ -1932,7 +1932,7 @@ schedule_hrtimeout_range_clock(ktime_t *expires, u64 delta,
 		return -EINTR;
 	}
 
-	hrtimer_init_on_stack(&t.timer, clock_id, mode);
+	hrtimer_init_on_stack(&t.timer, clock_id, mode|HRTIMER_MODE_NO_USER);
 	hrtimer_set_expires_range_ns(&t.timer, *expires, delta);
 
 	hrtimer_init_sleeper(&t, current);
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 0e84bb72a3da..0faf661cb4c8 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -464,7 +464,8 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 
 static int common_timer_create(struct k_itimer *new_timer)
 {
-	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock, 0);
+	hrtimer_init(&new_timer->it.real.timer, new_timer->it_clock,
+		HRTIMER_MODE_NO_USER);
 	return 0;
 }
 
@@ -789,7 +790,8 @@ static void common_hrtimer_arm(struct k_itimer *timr, ktime_t expires,
 	if (timr->it_clock == CLOCK_REALTIME)
 		timr->kclock = absolute ? &clock_realtime : &clock_monotonic;
 
-	hrtimer_init(&timr->it.real.timer, timr->it_clock, mode);
+	hrtimer_init(&timr->it.real.timer, timr->it_clock,
+		     mode|HRTIMER_MODE_NO_USER);
 	timr->it.real.timer.function = posix_timer_fn;
 
 	if (!absolute)
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 094b82ca95e5..e0a59ed9199f 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -249,7 +249,8 @@ void __init generic_sched_clock_init(void)
 	 * Start the timer to keep sched_clock() properly updated and
 	 * sets the initial epoch.
 	 */
-	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(&sched_clock_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	sched_clock_timer.function = sched_clock_poll;
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6fa52cd6df0b..b95f6f1e7bc3 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1205,7 +1205,8 @@ static void tick_nohz_switch_to_nohz(void)
 	 * Recycle the hrtimer in ts, so we can share the
 	 * hrtimer_forward with the highres code.
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	/* Get the next period */
 	next = tick_init_jiffy_update();
 
@@ -1302,7 +1303,8 @@ void tick_setup_sched_timer(void)
 	/*
 	 * Emulate tick processing via per-CPU hrtimers:
 	 */
-	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	hrtimer_init(&ts->sched_timer, CLOCK_MONOTONIC,
+		     HRTIMER_MODE_ABS|HRTIMER_MODE_NO_USER);
 	ts->sched_timer.function = tick_sched_timer;
 
 	/* Get the next period (per-CPU) */
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 977918d5d350..d3c9da0a4fce 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -483,7 +483,8 @@ static void watchdog_enable(unsigned int cpu)
 	 * Start the timer first to prevent the NMI watchdog triggering
 	 * before the timer has a chance to fire.
 	 */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL|HRTIMER_MODE_NO_USER);
 	hrtimer->function = watchdog_timer_fn;
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
-- 
2.17.2

* [MODERATED] [PATCH v4 26/28] MDSv4 27
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (24 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 25/28] MDSv4 1 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 27/28] MDSv4 7 Andi Kleen
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

AHCI interrupt handlers never touch user data with the CPU.

Just to get the number of clears down on my test system.
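
The change keeps the existing entry point as a thin wrapper around a
new variant that takes extra IRQ flags. A minimal userspace sketch of
that shape, assuming hypothetical flag values and function names (the
real functions request IRQs and register the host; this model only
computes the flags that would be passed on):

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's IRQF_* values. */
enum {
	MODEL_IRQF_SHARED  = 0x1,
	MODEL_IRQF_NO_USER = 0x2,
};

/*
 * Models the flags that end up on the interrupt request: the multi-IRQ
 * path passes the caller's flags through unchanged, while the single-IRQ
 * path additionally ORs in the shared flag.
 */
static unsigned int model_activate_irqflags(bool multi_irq, unsigned int irqflags)
{
	if (multi_irq)
		return irqflags;
	return irqflags | MODEL_IRQF_SHARED;
}

/* Existing callers keep their behaviour by passing no extra flags. */
static unsigned int model_activate(bool multi_irq)
{
	return model_activate_irqflags(multi_irq, 0);
}
```

Only the driver that has been audited opts in to the NO_USER flag; other
users of the library function are unaffected.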

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/ata/ahci.c    |  2 +-
 drivers/ata/ahci.h    |  2 ++
 drivers/ata/libahci.c | 40 ++++++++++++++++++++++++----------------
 3 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 021ce46e2e57..1455ad89d2f9 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1865,7 +1865,7 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
 	pci_set_master(pdev);
 
-	rc = ahci_host_activate(host, &ahci_sht);
+	rc = ahci_host_activate_irqflags(host, &ahci_sht, IRQF_NO_USER);
 	if (rc)
 		return rc;
 
diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index ef356e70e6de..42a3474f26b6 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -430,6 +430,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 int ahci_reset_em(struct ata_host *host);
 void ahci_print_info(struct ata_host *host, const char *scc_s);
 int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht);
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags);
 void ahci_error_handler(struct ata_port *ap);
 u32 ahci_handle_port_intr(struct ata_host *host, u32 irq_masked);
 
diff --git a/drivers/ata/libahci.c b/drivers/ata/libahci.c
index b5f57c69c487..b32664c7d8a1 100644
--- a/drivers/ata/libahci.c
+++ b/drivers/ata/libahci.c
@@ -2548,7 +2548,8 @@ void ahci_set_em_messages(struct ahci_host_priv *hpriv,
 EXPORT_SYMBOL_GPL(ahci_set_em_messages);
 
 static int ahci_host_activate_multi_irqs(struct ata_host *host,
-					 struct scsi_host_template *sht)
+					 struct scsi_host_template *sht,
+					 int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int i, rc;
@@ -2571,7 +2572,7 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 		}
 
 		rc = devm_request_irq(host->dev, irq, ahci_multi_irqs_intr_hard,
-				0, pp->irq_desc, host->ports[i]);
+				irqflags, pp->irq_desc, host->ports[i]);
 
 		if (rc)
 			return rc;
@@ -2581,18 +2582,8 @@ static int ahci_host_activate_multi_irqs(struct ata_host *host,
 	return ata_host_register(host, sht);
 }
 
-/**
- *	ahci_host_activate - start AHCI host, request IRQs and register it
- *	@host: target ATA host
- *	@sht: scsi_host_template to use when registering the host
- *
- *	LOCKING:
- *	Inherited from calling layer (may sleep).
- *
- *	RETURNS:
- *	0 on success, -errno otherwise.
- */
-int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+int ahci_host_activate_irqflags(struct ata_host *host, struct scsi_host_template *sht,
+				int irqflags)
 {
 	struct ahci_host_priv *hpriv = host->private_data;
 	int irq = hpriv->irq;
@@ -2608,15 +2599,32 @@ int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
 			return -EIO;
 		}
 
-		rc = ahci_host_activate_multi_irqs(host, sht);
+		rc = ahci_host_activate_multi_irqs(host, sht, irqflags);
 	} else {
 		rc = ata_host_activate(host, irq, hpriv->irq_handler,
-				       IRQF_SHARED, sht);
+				       irqflags|IRQF_SHARED, sht);
 	}
 
 
 	return rc;
 }
+EXPORT_SYMBOL_GPL(ahci_host_activate_irqflags);
+
+/**
+ *	ahci_host_activate - start AHCI host, request IRQs and register it
+ *	@host: target ATA host
+ *	@sht: scsi_host_template to use when registering the host
+ *
+ *	LOCKING:
+ *	Inherited from calling layer (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, -errno otherwise.
+ */
+int ahci_host_activate(struct ata_host *host, struct scsi_host_template *sht)
+{
+	return ahci_host_activate_irqflags(host, sht, 0);
+}
 EXPORT_SYMBOL_GPL(ahci_host_activate);
 
 MODULE_AUTHOR("Jeff Garzik");
-- 
2.17.2

* [MODERATED] [PATCH v4 27/28] MDSv4 7
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (25 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 26/28] MDSv4 27 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 28/28] MDSv4 12 Andi Kleen
  2019-01-12  3:04 ` [MODERATED] Re: [PATCH v4 00/28] MDSv4 2 Andi Kleen
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

ACPI doesn't touch any user data, so doesn't need a cpu clear.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 drivers/acpi/osl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index f29e427d0d1d..f31064134b37 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,7 +572,8 @@ acpi_os_install_interrupt_handler(u32 gsi, acpi_osd_handler handler,
 
 	acpi_irq_handler = handler;
 	acpi_irq_context = context;
-	if (request_irq(irq, acpi_irq, IRQF_SHARED, "acpi", acpi_irq)) {
+	if (request_irq(irq, acpi_irq, IRQF_SHARED|IRQF_NO_USER,
+				"acpi", acpi_irq)) {
 		printk(KERN_ERR PREFIX "SCI (IRQ%d) allocation failed\n", irq);
 		acpi_irq_handler = NULL;
 		return AE_NOT_ACQUIRED;
-- 
2.17.2

* [MODERATED] [PATCH v4 28/28] MDSv4 12
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (26 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 27/28] MDSv4 7 Andi Kleen
@ 2019-01-12  1:29 ` Andi Kleen
  2019-01-12  3:04 ` [MODERATED] Re: [PATCH v4 00/28] MDSv4 2 Andi Kleen
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  1:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

From: Andi Kleen <ak@linux.intel.com>
Subject:  mds: Mitigate BPF

BPF allows the user to run untrusted code in the kernel.

Normally MDS would allow some information leakage either
from other processes or sensitive kernel code to the user
controlled BPF code.  We cannot rule out that BPF code contains
an MDS exploit and it is difficult to pattern match.

The patch aims to add a limited number of cpu clears
before BPF executions to make EBPF executions safe.

Assume BPF execution does not touch other user's data, so does
not need to schedule a clear for itself.

For EBPF programs loaded privileged we never clear.

When the BPF program was loaded unprivileged clear the CPU
before the BPF execution, depending on the context it is running in:

We only do this when running in an interrupt, or if a cpu clear is
already scheduled (which means for example there was a context
switch, or crypto operation before).

In process context we check if the current process context
has the same userns+euid as the process that created the BPF program.
This handles the common seccomp filter case without
any extra clears, but still adds clears when e.g. a socket
filter runs on a socket inherited by a process with a different user id.

We also always clear when an earlier kernel subsystem scheduled
a clear, e.g. after a context switch or running crypto code.

Technically we would only need to do this if the BPF program
contains conditional branches and loads dominated by them, but
let's assume that nearly all do.

For example, when running chromium with seccomp filters I see
only 15-18% of all sandbox system calls have a clear; most
are likely caused by context switches.

Unprivileged EBPF usages in interrupts currently always clear.

This could be further optimized by allowing callers that do
a lot of individual BPF runs and are sure they don't touch
other users' data (that is not accessible to the EBPF anyway)
in between to do the clear only once at the beginning. We can add
such optimizations later based on profile data.
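
The rules above can be summarised as a small decision function. This is
a userspace sketch only: the struct and function names are hypothetical,
it compares the euid alone (as the code in this patch does) rather than
userns+euid, and it leaves out the check for whether the CPU is affected
at all.

```c
#include <stdbool.h>

/* Hypothetical model of the context a BPF program is about to run in. */
struct model_exec_ctx {
	bool in_interrupt;	/* stands in for in_interrupt() */
	bool clear_pending;	/* stands in for TIF_CLEAR_CPU being set */
	unsigned int euid;	/* stands in for current_euid() */
};

/* Hypothetical model of what is recorded when the program is allocated. */
struct model_prog {
	bool priv;		/* loader had CAP_SYS_ADMIN */
	unsigned int uid;	/* euid of the loader */
};

/* Returns true when a cpu clear would be done before running the program. */
static bool model_bpf_needs_clear(const struct model_prog *p,
				  const struct model_exec_ctx *c)
{
	/* Privileged programs never trigger a clear. */
	if (p->priv)
		return false;

	/*
	 * Unprivileged: clear in interrupt context, when a clear is already
	 * scheduled, or when running on behalf of a different user.
	 */
	return c->in_interrupt || c->clear_pending || c->euid != p->uid;
}
```

In this model the common seccomp case (process context, same euid, no
clear pending) returns false, which matches the intent of avoiding extra
clears there.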

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/clearbpf.h | 29 +++++++++++++++++++++++++++++
 include/linux/filter.h          | 21 +++++++++++++++++++--
 kernel/bpf/core.c               |  2 ++
 3 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/clearbpf.h

diff --git a/arch/x86/include/asm/clearbpf.h b/arch/x86/include/asm/clearbpf.h
new file mode 100644
index 000000000000..dc1756722b48
--- /dev/null
+++ b/arch/x86/include/asm/clearbpf.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_CLEARBPF_H
+#define _ASM_CLEARBPF_H 1
+
+#include <linux/clearcpu.h>
+#include <linux/cred.h>
+#include <asm/cpufeatures.h>
+
+/*
+ * When the BPF program was loaded unprivileged, clear the CPU
+ * to prevent any exploits written in BPF using side channels to read
+ * data leaked from other kernel code. In some cases, like
+ * process context with the same uid, we can avoid it.
+ *
+ * See Documentation/clearcpu.txt for more details.
+ */
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+	if (!static_cpu_has(X86_BUG_MDS))
+		return;
+	if (in_interrupt() ||
+		test_thread_flag(TIF_CLEAR_CPU) ||
+		!uid_eq(current_euid(), uid)) {
+		clear_cpu();
+		clear_thread_flag(TIF_CLEAR_CPU);
+	}
+}
+
+#endif
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ad106d845b22..b32547b4bd92 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,12 +20,21 @@
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
 #include <linux/if_vlan.h>
+#include <linux/clearcpu.h>
 
 #include <net/sch_generic.h>
 
 #include <uapi/linux/filter.h>
 #include <uapi/linux/bpf.h>
 
+#ifdef CONFIG_ARCH_HAS_CLEAR_CPU
+#include <asm/clearbpf.h>
+#else
+static inline void arch_bpf_prepare_nonpriv(kuid_t uid)
+{
+}
+#endif
+
 struct sk_buff;
 struct sock;
 struct seccomp_data;
@@ -490,7 +499,9 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				priv:1;		/* Was loaded privileged */
+	kuid_t			uid;		/* Original uid who created it */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
@@ -513,7 +524,13 @@ struct sk_filter {
 	struct bpf_prog	*prog;
 };
 
-#define BPF_PROG_RUN(filter, ctx)  (*(filter)->bpf_func)(ctx, (filter)->insnsi)
+static inline unsigned _bpf_prog_run(const struct bpf_prog *bp, const void *ctx)
+{
+	if (!bp->priv)
+		arch_bpf_prepare_nonpriv(bp->uid);
+	return bp->bpf_func(ctx, bp->insnsi);
+}
+#define BPF_PROG_RUN(filter, ctx) _bpf_prog_run(filter, ctx)
 
 #define BPF_SKB_CB_LEN QDISC_CB_PRIV_LEN
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index f908b9356025..67d845229d46 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -99,6 +99,8 @@ struct bpf_prog *bpf_prog_alloc(unsigned int size, gfp_t gfp_extra_flags)
 	fp->aux = aux;
 	fp->aux->prog = fp;
 	fp->jit_requested = ebpf_jit_enabled();
+	fp->priv = !!capable(CAP_SYS_ADMIN);
+	fp->uid = current_euid();
 
 	INIT_LIST_HEAD_RCU(&fp->aux->ksym_lnode);
 
-- 
2.17.2

* [MODERATED] Re: [PATCH v4 00/28] MDSv4 2
  2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
                   ` (27 preceding siblings ...)
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 28/28] MDSv4 12 Andi Kleen
@ 2019-01-12  3:04 ` Andi Kleen
  28 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-12  3:04 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 44 bytes --]


Mailbox with the patches attached.

-Andi


[-- Attachment #2: mdsv4.mbox --]
[-- Type: application/mbox, Size: 104673 bytes --]

* [MODERATED] Re: [PATCH v4 07/28] MDSv4 0
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 07/28] MDSv4 0 Andi Kleen
@ 2019-01-14  4:03   ` Josh Poimboeuf
  2019-01-14  4:38     ` Andi Kleen
  0 siblings, 1 reply; 44+ messages in thread
From: Josh Poimboeuf @ 2019-01-14  4:03 UTC (permalink / raw)
  To: speck

On Fri, Jan 11, 2019 at 05:29:20PM -0800, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86/speculation/mds: Support mds=full for NMIs
> 
> NMIs don't go through C code when exiting to user space

What about do_nmi()?

-- 
Josh

* [MODERATED] Re: [PATCH v4 07/28] MDSv4 0
  2019-01-14  4:03   ` [MODERATED] " Josh Poimboeuf
@ 2019-01-14  4:38     ` Andi Kleen
  2019-01-14  4:55       ` Josh Poimboeuf
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-14  4:38 UTC (permalink / raw)
  To: speck

On Sun, Jan 13, 2019 at 10:03:02PM -0600, speck for Josh Poimboeuf wrote:
> On Fri, Jan 11, 2019 at 05:29:20PM -0800, speck for Andi Kleen wrote:
> > From: Andi Kleen <ak@linux.intel.com>
> > Subject:  x86/speculation/mds: Support mds=full for NMIs
> > 
> > NMIs don't go through C code when exiting to user space
> 
> What about do_nmi()?

NMIs = any NMI like exception, like machine check.

Yes they could be all handled in C, but it seems simpler
to just do it once in assembler.

-Andi

* [MODERATED] Re: [PATCH v4 07/28] MDSv4 0
  2019-01-14  4:38     ` Andi Kleen
@ 2019-01-14  4:55       ` Josh Poimboeuf
  0 siblings, 0 replies; 44+ messages in thread
From: Josh Poimboeuf @ 2019-01-14  4:55 UTC (permalink / raw)
  To: speck

On Sun, Jan 13, 2019 at 08:38:08PM -0800, speck for Andi Kleen wrote:
> On Sun, Jan 13, 2019 at 10:03:02PM -0600, speck for Josh Poimboeuf wrote:
> > On Fri, Jan 11, 2019 at 05:29:20PM -0800, speck for Andi Kleen wrote:
> > > From: Andi Kleen <ak@linux.intel.com>
> > > Subject:  x86/speculation/mds: Support mds=full for NMIs
> > > 
> > > NMIs don't go through C code when exiting to user space
> > 
> > What about do_nmi()?
> 
> NMIs = any NMI like exception, like machine check.
> 
> Yes they could be all handled in C, but it seems simpler
> to just do it once in assembler.

But this patch is NMI-only and has nothing to do with machine checks.
Machine checks and all the other non-NMI exceptions are handled by the
idtentry macro.

-- 
Josh

* [MODERATED] Re: [PATCH v4 03/28] MDSv4 20
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 03/28] MDSv4 20 Andi Kleen
@ 2019-01-14 18:50   ` Dave Hansen
  2019-01-14 19:29     ` Andi Kleen
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-01-14 18:50 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 453 bytes --]

On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> +static inline void clear_cpu(void)
> +{
> +	unsigned kernel_ds = __KERNEL_DS;
> +	/* Has to be memory form, don't modify to use an register */
> +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> +		[kernelds] "m" (kernel_ds));
> +}

I expected to see some boot_cpu_has_bug(X86_BUG_CPU_MDS) checks in here
somewhere.  Are those coming later on the "set" side or something?


* [MODERATED] Re: [PATCH v4 05/28] MDSv4 10
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
@ 2019-01-14 19:20   ` Dave Hansen
  2019-01-14 19:31     ` Andi Kleen
  2019-01-18  7:33     ` [MODERATED] Encrypted Message Jon Masters
  2019-01-14 23:39   ` Tim Chen
  1 sibling, 2 replies; 44+ messages in thread
From: Dave Hansen @ 2019-01-14 19:20 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 3487 bytes --]

On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> When entering idle the internal state of the current CPU might
> become visible to the thread sibling because the CPU "frees" some
> internal resources.

Is there some documentation somewhere about what "idle" means here?  It
looks like MWAIT and HLT certainly count, but is there anything else?

I'm just trying to figure out how we make sure we catch all of the
call-sites for these.  This sprinkles quite a few of them around, and
I'm wondering how you found these, how we know if we missed any, and how
we keep folks from reintroducing new call-sites that would make us
vulnerable again.

I did a quick "objdump | grep mwait" and this patch appears to catch all
the functions that I encountered.

> +/*
> + * Clear CPU buffers before going idle, so that no state is leaked to SMT
> + * siblings taking over thread resources.
> + * Out of line to avoid include hell.
> + *
> + * Assumes that interrupts are disabled and only get reenabled
> + * before idle, otherwise the data from a racing interrupt might not
> + * get cleared. There are some callers who violate this,
> + * but they are only used in unattackable cases.
> + */

Can we please document the unattackable cases, along with the reasons
they are unattackable?  This property also keeps us from being able to
annotate this site with lockdep checks for interrupts being off, which
is a bit unfortunate.

> +static inline void clear_cpu_idle(void)
> +{
> +	if (sched_smt_active()) {
> +		clear_thread_flag(TIF_CLEAR_CPU);
> +		clear_cpu();
> +	}
> +}

...
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index b2131c4ea124..0342daa122fe 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -33,6 +33,7 @@
>  #include <linux/cpuidle.h>
>  #include <linux/cpu.h>
>  #include <acpi/processor.h>
> +#include <asm/clearcpu.h>
>  
>  /*
>   * Include the apic definitions for x86 to have the APIC timer related defines
> @@ -120,6 +121,7 @@ static const struct dmi_system_id processor_power_dmi_table[] = {
>   */
>  static void __cpuidle acpi_safe_halt(void)
>  {
> +	clear_cpu_idle();
>  	if (!tif_need_resched()) {
>  		safe_halt();
>  		local_irq_disable();

Why is this one outside the if()?  Seems like it could be safely inside
next to safe_halt().

> @@ -681,6 +683,7 @@ static int acpi_idle_play_dead(struct cpuidle_device *dev, int index)
>  
>  	ACPI_FLUSH_CPU_CACHE();
>  
> +	clear_cpu_idle();
>  	while (1) {
>  
>  		if (cx->entry_method == ACPI_CSTATE_HALT)

At the risk of bike-shedding...  Why don't we just catch all these
*play_dead() sites inside play_dead() itself, or at arch_cpu_idle_dead()?

> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 8b5d85c91e9d..ddaa7603d53a 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -65,6 +65,7 @@
>  #include <asm/intel-family.h>
>  #include <asm/mwait.h>
>  #include <asm/msr.h>
> +#include <asm/clearcpu.h>
>  
>  #define INTEL_IDLE_VERSION "0.4.1"
>  
> @@ -933,6 +934,8 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
>  		}
>  	}
>  
> +	clear_cpu_idle();
> +
>  	mwait_idle_with_hints(eax, ecx);

And a similar bikeshed: It seems like this would be a much smaller patch,
and be less likely to have future code add vulnerabilities if we just
patched mwait_idle_with_hints().


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 06/28] MDSv4 11
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 06/28] MDSv4 11 Andi Kleen
@ 2019-01-14 19:23   ` Dave Hansen
  2019-01-15 12:01     ` Jiri Kosina
  0 siblings, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2019-01-14 19:23 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 574 bytes --]

> +	case X86_BUG_MDS:
> +		/* Assumes Hypervisor exposed HT state to us if in guest */
> +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> +			if (cpu_smt_control != CPU_SMT_ENABLED)
> +				return sprintf(buf, "Mitigation: microcode\n");
> +			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
> +		}
> +		return sprintf(buf, "Vulnerable\n");

What are we trying to convey by saying "HT vulnerable"?  There are a ton
of patches in this set that do HT mitigations, so just saying
"vulnerable" seems a bit cynical.

Seems like I'm missing something.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 03/28] MDSv4 20
  2019-01-14 18:50   ` [MODERATED] " Dave Hansen
@ 2019-01-14 19:29     ` Andi Kleen
  2019-01-14 19:38       ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2019-01-14 19:29 UTC (permalink / raw)
  To: speck

On Mon, Jan 14, 2019 at 10:50:27AM -0800, speck for Dave Hansen wrote:
> On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
> > +static inline void clear_cpu(void)
> > +{
> > +	unsigned kernel_ds = __KERNEL_DS;
> > +	/* Has to be memory form, don't modify to use an register */
> > +	alternative_input("verw %[kernelds]", "", X86_FEATURE_NO_VERW,
> > +		[kernelds] "m" (kernel_ds));
> > +}
> 
> I expected to see some boot_cpu_has_bug(X86_BUG_CPU_MDS) checks in here
> somewhere.  Are those coming later on the "set" side or something?

Linus wanted VERW unconditional for VMWare

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 05/28] MDSv4 10
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
@ 2019-01-14 19:31     ` Andi Kleen
  2019-01-18  7:33     ` [MODERATED] Encrypted Message Jon Masters
  1 sibling, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2019-01-14 19:31 UTC (permalink / raw)
  To: speck

> > +	clear_cpu_idle();
> > +
> >  	mwait_idle_with_hints(eax, ecx);
> 
> And a similar bikeshed: It seems like this would be a much smaller patch,
> and be less likely to have future code add vulnerabilities if we just
> patched mwait_idle_with_hints().

I had this originally, but it caused some issues.

For me it seems also unclean to pollute a low level "maps to an instruction"
inline with such functionality.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 03/28] MDSv4 20
  2019-01-14 19:29     ` Andi Kleen
@ 2019-01-14 19:38       ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2019-01-14 19:38 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On Tue, Jan 15, 2019, 07:29 speck for Andi Kleen <speck@linutronix.de> wrote:

> On Mon, Jan 14, 2019 at 10:50:27AM -0800, speck for Dave Hansen wrote:
> >
> > I expected to see some boot_cpu_has_bug(X86_BUG_CPU_MDS) checks in here
> > somewhere.  Are those coming later on the "set" side or something?
>
> Linus wanted VERW unconditional for VMWare
>

Yes. And we can make it conditional later in a development kernel, so that
it's essentially only unconditional in a certain age of stable kernels.

Because in a year, the crazy VMware cpuid upgrade cycle will be a
non-issue. But apparently it's something like "up to 9 months after
release" until all of the cpuid bit updates end up being visible.

     Linus

>

[-- Attachment #2: Type: text/html, Size: 1333 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Encrypted Message
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
@ 2019-01-14 23:39   ` Tim Chen
  1 sibling, 0 replies; 44+ messages in thread
From: Tim Chen @ 2019-01-14 23:39 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10

[-- Attachment #2: Type: text/plain, Size: 526 bytes --]


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50aa2aba69bd..b5a1bd4a1a46 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5980,6 +5980,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  
>  #ifdef CONFIG_SCHED_SMT
>  DEFINE_STATIC_KEY_FALSE(sched_smt_present);
> +EXPORT_SYMBOL(sched_smt_present);

This export is not needed since sched_smt_present is not used in the patch series.
Only sched_smt_active() is used.

Thanks.

Tim


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Encrypted Message
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
@ 2019-01-15  1:05   ` Tim Chen
  0 siblings, 0 replies; 44+ messages in thread
From: Tim Chen @ 2019-01-15  1:05 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 130 bytes --]

From: Tim Chen <tim.c.chen@linux.intel.com>
To: speck for Andi Kleen <speck@linutronix.de>
Subject: Re: [PATCH v4 10/28] MDSv4 24

[-- Attachment #2: Type: text/plain, Size: 5059 bytes --]


On 1/11/19 5:29 PM, speck for Andi Kleen wrote:

> +Some CPUs can leave read or written data in internal buffers,
> +which then later might be sampled through side effects.
> +For more details see CVE-2018-12126 CVE-2018-12130 CVE-2018-12127
> +
> +This can be avoided by explicitely clearing the CPU state.

s/explicitely/explicitly/

> +
> +We trying to avoid leaking data between different processes,

Suggest changing the above phrase to the below:

CPU state clearing prevents leaking data between different processes,

...

> +Basic requirements and assumptions
> +----------------------------------
> +
> +Kernel addresses and kernel temporary data are not sensitive.
> +
> +User data is sensitive, but only for other processes.
> +
> +Kernel data is sensitive when it is cryptographic keys.

s/when it is/when it involves/

> +
> +Guidance for driver/subsystem developers
> +----------------------------------------
> +
> +When you touch user supplied data of *other* processes in system call
> +context add lazy_clear_cpu().
> +
> +For the cases below we care only about data from other processes.
> +Touching non cryptographic data from the current process is always allowed.
> +
> +Touching only pointers to user data is always allowed.
> +
> +When your interrupt does not touch user data directly consider marking

Add a "," between "directly" and "consider"

> +it with IRQF_NO_USER.
> +
> +When your tasklet does not touch user data directly consider marking

Add a "," between "directly" and "consider"

> +it with TASKLET_NO_USER using tasklet_init_flags/or
> +DECLARE_TASKLET*_NOUSER.
> +
> +When your timer does not touch user data mark it with TIMER_NO_USER.

Add a "," between "data" and "mark"

> +If it is a hrtimer mark it with HRTIMER_MODE_NO_USER.

Add a "," between "hrtimer" and "mark"

> +
> +When your irq poll handler does not touch user data, mark it
> +with IRQ_POLL_F_NO_USER through irq_poll_init_flags.
> +
> +For networking code make sure to only touch user data through

Add a "," between "code" and "make"

> +skb_push/put/copy [add more], unless it is data from the current
> +process. If that is not ensured add lazy_clear_cpu or

Add a "," between "ensured" and "add"

> +lazy_clear_cpu_interrupt. When the non skb data access is only in a
> +hardware interrupt controlled by the driver, it can rely on not
> +setting IRQF_NO_USER for that interrupt.
> +
> +Any cryptographic code touching key data should use memzero_explicit
> +or kzfree.
> +
> +If your RCU callback touches user data add lazy_clear_cpu().
> +
> +These steps are currently only needed for code that runs on MDS affected
> +CPUs, which is currently only x86. But might be worth being prepared
> +if other architectures become affected too.
> +
> +Implementation details/assumptions
> +----------------------------------
> +
> +If a system call touches data it is for its own process, so does not

suggest rephrasing to 

If a system call touches data of its own process, cpu state does not

> +need to be cleared, because it has already access to it.
> +
> +When context switching we clear data, unless the context switch
> +is inside a process, or from/to idle. We also clear after any
> +context switches from kernel threads.
> +
> +Idle does not have sensitive data, except for in interrupts, which
> +are handled separately.
> +
> +Cryptographic keys inside the kernel should be protected.
> +We assume they use kzfree() or memzero_explicit() to clear
> +state, so these functions trigger a cpu clear.
> +
> +Hard interrupts, tasklets, timers which can run asynchronous are
> +assumed to touch random user data, unless they have been audited, and
> +marked with NO_USER flags.
> +
> +Most interrupt handlers for modern devices should not touch
> +user data because they rely on DMA and only manipulate
> +pointers. This needs auditing to confirm though.
> +
> +For softirqs we assume that if they touch user data they use

Add "," between "data" and "they"

...

> +Technically we would only need to do this if the BPF program
> +contains conditional branches and loads dominated by them, but
> +let's assume that near all do.
s/near/nearly/

> +
> +This could be further optimized by allowing callers that do
> +a lot of individual BPF runs and are sure they don't touch
> +other user's data inbetween to do the clear only once
> +at the beginning. 

Suggest breaking the above sentence.  It is quite difficult to read.

> We can add such optimizations later based on
> +profile data.
> +
> +Virtualization
> +--------------
> +
> +When entering a guest in KVM we clear to avoid any leakage to a guest.
... we clear CPU state to avoid ....

> +Normally this is done implicitely as part of the L1TF mitigation.

s/implicitely/implicitly/

> +It relies on this being enabled. It also uses the "fast exit"
> +optimization that only clears if an interrupt or context switch
> +happened.
> 
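
As a reading aid for the lazy clear scheme the quoted document describes (request a clear where other users' data may have been touched, perform it once on the next kernel exit), here is a small user-space model. It is a sketch under assumptions: pending_clear stands in for the TIF_CLEAR_CPU flag, and the model_* names and clear_count are hypothetical, not the patch's actual implementation.

```c
#include <stdbool.h>

static bool pending_clear;  /* models the TIF_CLEAR_CPU thread flag */
static int  clear_count;    /* counts modelled CPU buffer clears */

/* Models lazy_clear_cpu(): request a clear, do not perform it yet. */
static void model_lazy_clear_cpu(void)
{
	pending_clear = true;
}

/* Models the kernel exit path: clear only if a request is pending. */
static void model_exit_to_user(void)
{
	if (pending_clear) {
		pending_clear = false;
		clear_count++;
	}
}
```

Several requests before one exit collapse into a single clear, which is the property the lower clear counts in the cover letter depend on.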



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 06/28] MDSv4 11
  2019-01-14 19:23   ` [MODERATED] " Dave Hansen
@ 2019-01-15 12:01     ` Jiri Kosina
  0 siblings, 0 replies; 44+ messages in thread
From: Jiri Kosina @ 2019-01-15 12:01 UTC (permalink / raw)
  To: speck

On Mon, 14 Jan 2019, speck for Dave Hansen wrote:

> > +	case X86_BUG_MDS:
> > +		/* Assumes Hypervisor exposed HT state to us if in guest */
> > +		if (boot_cpu_has(X86_FEATURE_MD_CLEAR)) {
> > +			if (cpu_smt_control != CPU_SMT_ENABLED)
> > +				return sprintf(buf, "Mitigation: microcode\n");
> > +			return sprintf(buf, "Mitigation: microcode, HT vulnerable\n");
> > +		}
> > +		return sprintf(buf, "Vulnerable\n");
> 
> What are we trying to convey by saying "HT vulnerable"?  There are a ton
> of patches in this set that do HT mitigations, so just saying
> "vulnerable" seems a bit cynical.

If I read the code correctly, the only case where SMT is taken care of 
wrt. MDS is when one of the siblings is in idle (mwait/hlt). Other than 
that, SMT is pretty much uncovered if both threads are actually executing 
code AFAICS.
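
To make the three reported states concrete, the quoted switch case can be modelled in user space as below. This is a sketch: mds_status() and its two boolean parameters are hypothetical stand-ins for boot_cpu_has(X86_FEATURE_MD_CLEAR) and cpu_smt_control == CPU_SMT_ENABLED, while the strings are copied from the quoted patch.

```c
#include <stdbool.h>

/*
 * Models the sysfs text for X86_BUG_MDS.
 * has_md_clear: microcode exposes the buffer-clearing capability.
 * smt_enabled:  SMT control is in the enabled state.
 */
static const char *mds_status(bool has_md_clear, bool smt_enabled)
{
	if (!has_md_clear)
		return "Vulnerable\n";
	if (!smt_enabled)
		return "Mitigation: microcode\n";
	/* Siblings running concurrently are not covered by the clears. */
	return "Mitigation: microcode, HT vulnerable\n";
}
```

Read this way, "HT vulnerable" flags the residual case where both siblings are executing code, which the idle-time clears do not address.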

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Re: [PATCH v4 01/28] MDSv4 3
  2019-01-12  1:29 ` [MODERATED] [PATCH v4 01/28] MDSv4 3 Andi Kleen
@ 2019-01-15 14:11   ` Andrew Cooper
  0 siblings, 0 replies; 44+ messages in thread
From: Andrew Cooper @ 2019-01-15 14:11 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 931 bytes --]

On 12/01/2019 01:29, speck for Andi Kleen wrote:
> @@ -1019,6 +1027,12 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
>  	if (ia32_cap & ARCH_CAP_IBRS_ALL)
>  		setup_force_cpu_cap(X86_FEATURE_IBRS_ENHANCED);
>  
> +	if ((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> +	    !x86_match_cpu(cpu_no_mds)) &&
> +	    !(ia32_cap & ARCH_CAP_MDS_NO) &&
> +	    !(ia32_cap & ARCH_CAP_RDCL_NO))
> +		setup_force_cpu_bug(X86_BUG_MDS);

According to the latest doc I've got, RDCL_NO only indicates the absence
of MFBDS (FBBF), while MSBDS (PSF) and MLPDS (SVL) are still present.

It is only MDS_NO which indicates the absence of all the issues.

Furthermore, looking at the giant affected matrix, I see no processors
which are affected by FBBF but not by PSF, so unless we've decided that
we don't care about PSF and SVL, workarounds still need to be used even
when RDCL_NO is asserted.
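
A user-space model of the check with the RDCL_NO term dropped might look like the sketch below. The names are hypothetical: model_has_mds_bug() stands in for the condition in cpu_set_bug_bits(), and the MODEL_ARCH_CAP_* bit positions are placeholders for illustration, so the real values should be taken from msr-index.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define MODEL_ARCH_CAP_RDCL_NO (1ULL << 0)
#define MODEL_ARCH_CAP_MDS_NO  (1ULL << 5)

/*
 * Returns true when the modelled CPU should get the MDS bug bit.
 * is_intel:       vendor check from the quoted hunk.
 * on_no_mds_list: models x86_match_cpu(cpu_no_mds).
 */
static bool model_has_mds_bug(bool is_intel, bool on_no_mds_list,
			      uint64_t ia32_cap)
{
	if (!is_intel || on_no_mds_list)
		return false;
	/* Only MDS_NO rules out all variants; RDCL_NO is ignored here. */
	return !(ia32_cap & MODEL_ARCH_CAP_MDS_NO);
}
```

With this shape a part that asserts RDCL_NO but not MDS_NO is still treated as affected, which matches the reading of the document given above.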

~Andrew


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [MODERATED] Encrypted Message
  2019-01-14 19:20   ` [MODERATED] " Dave Hansen
  2019-01-14 19:31     ` Andi Kleen
@ 2019-01-18  7:33     ` Jon Masters
  1 sibling, 0 replies; 44+ messages in thread
From: Jon Masters @ 2019-01-18  7:33 UTC (permalink / raw)
  To: speck

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/rfc822-headers; protected-headers="v1", Size: 122 bytes --]

From: Jon Masters <jcm@redhat.com>
To: speck for Dave Hansen <speck@linutronix.de>
Subject: Re: [PATCH v4 05/28] MDSv4 10

[-- Attachment #2: Type: text/plain, Size: 1328 bytes --]

On 1/14/19 2:20 PM, speck for Dave Hansen wrote:

> On 1/11/19 5:29 PM, speck for Andi Kleen wrote:
>> When entering idle the internal state of the current CPU might
>> become visible to the thread sibling because the CPU "frees" some
>> internal resources.
> 
> Is there some documentation somewhere about what "idle" means here?  It
> looks like MWAIT and HLT certainly count, but is there anything else?

We know that power state transitions can additionally cause the peer to
dynamically sleep or wake up. MWAIT was the main example I got out of
Intel for how you'd explicitly cause a thread to be deallocated.

When Andi is talking about "frees" above he means (for example) the
dynamic allocation/deallocation of store buffer entries as threads come
and go - e.g. in Skylake there are 56 entries in a distributed store
buffer that splits into 2x28. I am not aware of fill buffer behavior
changing as threads come and go, and this isn't documented AFAICS.

I've been wondering whether we want a bit more detail in the docs. I
spent a /lot/ of time last week going through all of Intel's patents in
this area, which really help understand it. If folks feel we could do
with a bit more meaty summary, I can try to suggest something.

Jon.

-- 
Computer Architect | Sent with my Fedora powered laptop


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2019-01-18  7:33 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-12  1:29 [MODERATED] [PATCH v4 00/28] MDSv4 2 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 01/28] MDSv4 3 Andi Kleen
2019-01-15 14:11   ` [MODERATED] " Andrew Cooper
2019-01-12  1:29 ` [MODERATED] [PATCH v4 02/28] MDSv4 22 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 03/28] MDSv4 20 Andi Kleen
2019-01-14 18:50   ` [MODERATED] " Dave Hansen
2019-01-14 19:29     ` Andi Kleen
2019-01-14 19:38       ` Linus Torvalds
2019-01-12  1:29 ` [MODERATED] [PATCH v4 04/28] MDSv4 8 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 05/28] MDSv4 10 Andi Kleen
2019-01-14 19:20   ` [MODERATED] " Dave Hansen
2019-01-14 19:31     ` Andi Kleen
2019-01-18  7:33     ` [MODERATED] Encrypted Message Jon Masters
2019-01-14 23:39   ` Tim Chen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 06/28] MDSv4 11 Andi Kleen
2019-01-14 19:23   ` [MODERATED] " Dave Hansen
2019-01-15 12:01     ` Jiri Kosina
2019-01-12  1:29 ` [MODERATED] [PATCH v4 07/28] MDSv4 0 Andi Kleen
2019-01-14  4:03   ` [MODERATED] " Josh Poimboeuf
2019-01-14  4:38     ` Andi Kleen
2019-01-14  4:55       ` Josh Poimboeuf
2019-01-12  1:29 ` [MODERATED] [PATCH v4 08/28] MDSv4 19 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 09/28] MDSv4 16 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 10/28] MDSv4 24 Andi Kleen
2019-01-15  1:05   ` [MODERATED] Encrypted Message Tim Chen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 11/28] MDSv4 21 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 12/28] MDSv4 25 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 13/28] MDSv4 4 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 14/28] MDSv4 17 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 15/28] MDSv4 9 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 16/28] MDSv4 6 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 17/28] MDSv4 18 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 18/28] MDSv4 26 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 19/28] MDSv4 14 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 20/28] MDSv4 23 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 21/28] MDSv4 15 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 22/28] MDSv4 5 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 23/28] MDSv4 13 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 24/28] MDSv4 28 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 25/28] MDSv4 1 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 26/28] MDSv4 27 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 27/28] MDSv4 7 Andi Kleen
2019-01-12  1:29 ` [MODERATED] [PATCH v4 28/28] MDSv4 12 Andi Kleen
2019-01-12  3:04 ` [MODERATED] Re: [PATCH v4 00/28] MDSv4 2 Andi Kleen
