* [MODERATED] [PATCH 0/8] L1TFv3 4
@ 2018-05-04  3:23 Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 1/8] L1TFv3 8 Andi Kleen
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

This is v3 of the core VM L1TF OS mitigation.

- Addressed review comments
- Fixed a bug on 32bit PAE that prevented setting all needed
  bits to protect the full host memory.
- Added a new patch to forbid PROT_NONE on high MMIO again. We found some
  systems where it is needed. This version tries to minimize any breakage by
  limiting the check to non-root users, and by only refusing PROT_NONE
  when the underlying MMIO address is actually high.

Andi Kleen (7):
  x86, l1tf: Increase 32bit PAE __PHYSICAL_PAGE_MASK
  x86, l1tf: Protect PROT_NONE PTEs against speculation
  x86, l1tf: Make sure the first page is always reserved
  x86, l1tf: Add sysfs reporting for l1tf
  x86, l1tf: Report if too much memory for L1TF workaround
  x86, l1tf: Limit swap file size to MAX_PA/2
  mm, l1tf: Disallow non privileged high MMIO PROT_NONE mappings

Linus Torvalds (1):
  x86, l1tf: Protect swap entries against L1TF

 arch/x86/include/asm/cpufeatures.h    |  2 ++
 arch/x86/include/asm/page_32_types.h  |  9 +++++--
 arch/x86/include/asm/pgtable-2level.h | 12 +++++++++
 arch/x86/include/asm/pgtable-3level.h |  2 ++
 arch/x86/include/asm/pgtable-invert.h | 28 ++++++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 48 ++++++++++++++++++++++++----------
 arch/x86/include/asm/pgtable_64.h     | 38 +++++++++++++++++++--------
 arch/x86/kernel/cpu/bugs.c            | 11 ++++++++
 arch/x86/kernel/cpu/common.c          | 10 ++++++-
 arch/x86/kernel/setup.c               | 27 ++++++++++++++++++-
 arch/x86/mm/init.c                    | 17 ++++++++++++
 arch/x86/mm/mmap.c                    | 21 +++++++++++++++
 drivers/base/cpu.c                    |  8 ++++++
 include/asm-generic/pgtable.h         | 12 +++++++++
 include/linux/cpu.h                   |  2 ++
 include/linux/swapfile.h              |  2 ++
 mm/memory.c                           | 37 +++++++++++++++++++-------
 mm/mprotect.c                         | 49 +++++++++++++++++++++++++++++++++++
 mm/swapfile.c                         | 44 +++++++++++++++++++------------
 19 files changed, 325 insertions(+), 54 deletions(-)
 create mode 100644 arch/x86/include/asm/pgtable-invert.h

-- 
2.14.3

* [MODERATED] [PATCH 1/8] L1TFv3 8
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04 13:42   ` [MODERATED] " Michal Hocko
  2018-05-04  3:23 ` [MODERATED] [PATCH 2/8] L1TFv3 7 Andi Kleen
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

On 32bit PAE the max PTE mask is currently set to 44 bits because that is
the limit imposed by 32bit unsigned long PFNs in the VM.

The L1TF PROT_NONE protection code uses the PTE masks to determine
what bits to invert to make sure the higher bits are set for unmapped
entries to prevent L1TF speculation attacks against EPT inside guests.

But our inverted mask has to match the host, and the host is likely
64bit and may use more than 43 bits of memory. We want to set
all possible bits to be safe here.

So increase the mask on 32bit PAE to 52 to match 64bit. The real
limit is still 44 bits: outside the inverted PTEs these
higher bits are never set, so a bigger mask doesn't cause any problems.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/page_32_types.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index aa30c3241ea7..0d5c739eebd7 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -29,8 +29,13 @@
 #define N_EXCEPTION_STACKS 1
 
 #ifdef CONFIG_X86_PAE
-/* 44=32+12, the limit we can fit into an unsigned long pfn */
-#define __PHYSICAL_MASK_SHIFT	44
+/*
+ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
+ * but we need the full mask to make sure inverted PROT_NONE
+ * entries have all the host bits set in a guest.
+ * The real limit is still 44 bits.
+ */
+#define __PHYSICAL_MASK_SHIFT	52
 #define __VIRTUAL_MASK_SHIFT	32
 
 #else  /* !CONFIG_X86_PAE */
-- 
2.14.3

* [MODERATED] [PATCH 2/8] L1TFv3 7
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 1/8] L1TFv3 8 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-07 11:45   ` [MODERATED] " Vlastimil Babka
  2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

With L1 terminal fault the CPU speculates into unmapped PTEs, and the
resulting side effects allow reading the memory the PTE is pointing
to, if its contents are still in the L1 cache.

For swapped out pages Linux uses unmapped PTEs and stores a swap entry
into them.

We need to make sure the swap entry is not pointing to valid memory,
which requires setting higher bits (between bit 36 and bit 45) that
are inside the CPUs physical address space, but outside any real
memory.

To do this we invert the offset to make sure the higher bits are always
set, as long as the swap file is not too big.

Here's a patch that switches the order of "type" and
"offset" in the x86-64 encoding, in addition to doing the binary 'not' on
the offset.

That means that now the offset is bits 9-58 in the page table, and that
the offset is in the bits that hardware generally doesn't care about.

That, in turn, means that if you have a desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, you still have
to have 30 bits of offset actually *in use* until bit 39 ends up being
clear.

So that's 4 terabytes of swap space (because the offset is counted in
pages, so 30 bits of offset is 42 bits of actual coverage). With bigger
physical addressing, that obviously grows further, until you hit the limit
of the offset (at 50 bits of offset - 62 bits of actual swap file
coverage).

Note there is no workaround for 32bit !PAE, or on systems which
have more than MAX_PA/2 memory. The latter case is very unlikely
to happen on real systems.
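
As a minimal illustration of this encoding, here is a stand-alone user-space
sketch (not the kernel macros themselves) that mirrors the __swp_entry() and
__swp_offset() logic from the diff below, assuming SWP_OFFSET_FIRST_BIT = 9
and SWP_TYPE_BITS = 5 as in the new layout:

#include <stdio.h>
#include <stdint.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

/* Encode: invert the offset so the unused high bits end up set */
static uint64_t mk_entry(uint64_t type, uint64_t offset)
{
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

/* Decode: shift up to drop the type bits, then down, undoing the not */
static uint64_t get_offset(uint64_t val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	uint64_t e = mk_entry(1, 0x1234);

	printf("entry  %#018llx\n", (unsigned long long)e);
	printf("offset %#llx\n", (unsigned long long)get_offset(e));
	return 0;
}

For a small offset the printed entry has a long run of set bits right below
the type field, which is exactly the property used to keep the speculated
address out of real memory.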

[updated description and minor tweaks by AK]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/include/asm/pgtable_64.h | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 877bc27718ae..593c3cf259dd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -286,20 +286,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
+ *
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
-#define SWP_TYPE_BITS 5
-/* Place the offset above the type: */
-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
+#define SWP_TYPE_BITS		5
+
+#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)
+
+/* We always extract/encode the offset by shifting it all the way up, and then down again */
+#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
-					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
-#define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (SWP_TYPE_FIRST_BIT)) \
-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
+/* Extract the high bits for type */
+#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
+
+/* Shift up (to get rid of type), then down to get value */
+#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+
+/*
+ * Shift the offset up "too far" by TYPE bits, then down again
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
+ */
+#define __swp_entry(type, offset) ((swp_entry_t) { \
+	(~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+	| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
+
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
-- 
2.14.3

* [MODERATED] [PATCH 3/8] L1TFv3 1
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 1/8] L1TFv3 8 Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 2/8] L1TFv3 7 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04  3:55   ` [MODERATED] " Linus Torvalds
                     ` (2 more replies)
  2018-05-04  3:23 ` [MODERATED] [PATCH 4/8] L1TFv3 6 Andi Kleen
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

We also need to protect PTEs that are set to PROT_NONE against
L1TF speculation attacks.

This is important inside guests, because L1TF speculation
bypasses physical page remapping. While the VM has its own
mitigations preventing data from other VMs leaking into
the guest, this would still risk leaking the wrong page
inside the current guest.

This uses the same technique as Linus' swap entry patch:
while an entry is in PROTNONE state we invert the
complete PFN part of it. This ensures that the
highest bits are set, so the entry points to non-existing memory.

The inversion is done by pte/pmd/pud_modify and pfn_pte/pmd/pud for
PROTNONE entries, and pte/pmd/pud_pfn undoes it.

We assume that no one tries to touch the PFN part of
a PTE without using these primitives.

This doesn't handle the case where MMIO is at the top
of the CPU physical address space. If such an MMIO region
was exposed by a driver for unprivileged mmap
it would be possible to attack some real memory.
However this situation is rather unlikely.

For 32bit non-PAE we don't try inversion because
there are really not enough bits to protect anything.

Q: Why does the guest need to be protected when the
HyperVisor already has L1TF mitigations?
A: Here's an example:
You have physical pages 1 and 2. They get mapped into a guest as
GPA 1 -> PA 2
GPA 2 -> PA 1
through EPT.

The L1TF speculation ignores the EPT remapping.

Now the guest kernel maps GPA 1 to process A and GPA 2 to process B,
and they belong to different users and should be isolated.

A sets the GPA 1 -> PA 2 PTE to PROT_NONE to bypass the EPT remapping
and gets read access to the underlying physical page. With EPT bypassed
the GPA is used directly as the physical address, so this points to PA 1,
and A can read process B's data if it happened to be in L1.

So we broke isolation inside the guest.

There's nothing the hypervisor can do about this. This
mitigation has to be done in the guest.
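
To make the invert/uninvert round trip concrete, here is a stand-alone
user-space sketch. The flag and mask constants are illustrative stand-ins
for the real pgtable_types.h definitions; only the protnone_mask() logic is
taken from the patch below:

#include <assert.h>
#include <stdint.h>

#define _PAGE_PRESENT	0x001ULL
#define _PAGE_PROTNONE	0x100ULL	/* the reused global bit */
#define PTE_PFN_MASK	0x000ffffffffff000ULL
#define PAGE_SHIFT	12

/* Same check as protnone_mask() in pgtable-invert.h */
static uint64_t protnone_mask(uint64_t val)
{
	return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE ?
		~0ULL : 0;
}

int main(void)
{
	uint64_t pfn = 0x12345;
	uint64_t prot = _PAGE_PROTNONE;

	/* pfn_pte() path: store the inverted PFN for a PROT_NONE mapping */
	uint64_t pte = (((pfn << PAGE_SHIFT) ^ protnone_mask(prot)) &
			PTE_PFN_MASK) | prot;

	/* pte_pfn() path: undo the inversion before using the PFN */
	uint64_t back = ((pte ^ protnone_mask(pte)) & PTE_PFN_MASK)
			>> PAGE_SHIFT;

	assert(back == pfn);
	return 0;
}

While the entry is in the PROT_NONE state the stored PFN bits are the
complement of the real ones, so any speculated access lands in unpopulated
physical address space.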

v2: Use new helper to generate XOR mask to invert (Linus)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/include/asm/pgtable-2level.h | 12 ++++++++++
 arch/x86/include/asm/pgtable-3level.h |  2 ++
 arch/x86/include/asm/pgtable-invert.h | 28 ++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 44 ++++++++++++++++++++++++-----------
 arch/x86/include/asm/pgtable_64.h     |  2 ++
 5 files changed, 75 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/pgtable-invert.h

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8a0eaf..9e9d4ce4a2cc 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,4 +95,16 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
+/* No inverted PFNs on 2 level page tables */
+
+static inline u64 protnone_mask(u64 val)
+{
+	return 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+	return val;
+}
+
 #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f24df59c40b2..76ab26a99e6e 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -295,4 +295,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 	return pte;
 }
 
+#include <asm/pgtable-invert.h>
+
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
new file mode 100644
index 000000000000..457097f5f457
--- /dev/null
+++ b/arch/x86/include/asm/pgtable-invert.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_PGTABLE_INVERT_H
+#define _ASM_PGTABLE_INVERT_H 1
+
+#ifndef __ASSEMBLY__
+
+/* Get a mask to xor with the page table entry to get the correct pfn. */
+static inline u64 protnone_mask(u64 val)
+{
+	return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE ?
+		~0ull : 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+	/*
+	 * When a PTE transitions from NONE to !NONE or vice-versa
+	 * invert the PFN part to stop speculation.
+	 * pte_pfn undoes this when needed.
+	 */
+	if ((oldval & _PAGE_PROTNONE) != (val & _PAGE_PROTNONE))
+		val = (val & ~mask) | (~val & mask);
+	return val;
+}
+
+#endif /* __ASSEMBLY__ */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5f49b4ff0c24..f811e3257e87 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
 	return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
+/* Entries that were set to PROT_NONE are inverted */
+
+static inline u64 protnone_mask(u64 val);
+
 static inline unsigned long pte_pfn(pte_t pte)
 {
-	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
+	unsigned long pfn = pte_val(pte);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
-	return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
+	unsigned long pfn = pmd_val(pmd);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pud_pfn(pud_t pud)
 {
-	return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
+	unsigned long pfn = pud_val(pud);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long p4d_pfn(p4d_t p4d)
@@ -545,25 +555,33 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PTE_PFN_MASK;
+	return __pte(pfn | check_pgprot(pgprot));
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PHYSICAL_PMD_PAGE_MASK;
+	return __pmd(pfn | check_pgprot(pgprot));
 }
 
 static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PHYSICAL_PUD_PAGE_MASK;
+	return __pud(pfn | check_pgprot(pgprot));
 }
 
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
-	pteval_t val = pte_val(pte);
+	pteval_t val = pte_val(pte), oldval = val;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -571,17 +589,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
-
+	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
 	return __pte(val);
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
-	pmdval_t val = pmd_val(pmd);
+	pmdval_t val = pmd_val(pmd), oldval = val;
 
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
-
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
 	return __pmd(val);
 }
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 593c3cf259dd..ea99272ab63e 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -357,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
 	return true;
 }
 
+#include <asm/pgtable-invert.h>
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PGTABLE_64_H */
-- 
2.14.3

* [MODERATED] [PATCH 4/8] L1TFv3 6
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (2 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 5/8] L1TFv3 2 Andi Kleen
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

The L1TF workaround doesn't make any attempt to mitigate speculative
accesses to the first physical page for zeroed PTEs. Normally
it only contains some data from the early real mode BIOS.

I couldn't convince myself we always reserve the first page in
all configurations, so add an extra reservation call to
make sure it is really reserved. In most configurations (e.g.
with the standard reservations) it's likely a nop.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/setup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..fadbd41094d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -817,6 +817,9 @@ void __init setup_arch(char **cmdline_p)
 	memblock_reserve(__pa_symbol(_text),
 			 (unsigned long)__bss_stop - (unsigned long)_text);
 
+	/* Make sure page 0 is always reserved */
+	memblock_reserve(0, PAGE_SIZE);
+
 	early_reserve_initrd();
 
 	/*
-- 
2.14.3

* [MODERATED] [PATCH 5/8] L1TFv3 2
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (3 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 4/8] L1TFv3 6 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 6/8] L1TFv3 0 Andi Kleen
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

L1TF core kernel workarounds are cheap and normally always enabled.
However we still want to report in sysfs whether the system is vulnerable
or mitigated. Add the necessary checks.

- We use the same checks as Meltdown to determine if the system is
vulnerable. This excludes some Atom CPUs which don't have this
problem.
- We check for the (very unlikely) memory > MAX_PA/2 case.
- We check for 32bit non-PAE and warn.

Note this patch will likely conflict with some other workaround patches
floating around, but should be straightforward to fix.
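
As a quick smoke test, the new attribute can be read like the existing
meltdown/spectre_v* entries under /sys/devices/system/cpu/vulnerabilities/
that this patch extends (a sketch, not part of the patch itself):

#include <stdio.h>

int main(void)
{
	char line[64];
	FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/l1tf", "r");

	if (!f) {
		perror("l1tf attribute");	/* kernel without this patch */
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);	/* "Not affected", "Mitigated", ... */
	fclose(f);
	return 0;
}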

v2: Use positive instead of negative flag for WA. Fix override
reporting.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  2 ++
 arch/x86/kernel/cpu/bugs.c         | 11 +++++++++++
 arch/x86/kernel/cpu/common.c       | 10 +++++++++-
 drivers/base/cpu.c                 |  8 ++++++++
 include/linux/cpu.h                |  2 ++
 5 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d554c11e01ff..f1bfe8a37b84 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -214,6 +214,7 @@
 
 #define X86_FEATURE_USE_IBPB		( 7*32+21) /* "" Indirect Branch Prediction Barrier enabled */
 #define X86_FEATURE_USE_IBRS_FW		( 7*32+22) /* "" Use IBRS during runtime firmware calls */
+#define X86_FEATURE_L1TF_WA		( 7*32+23) /* "" L1TF workaround used */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
@@ -362,5 +363,6 @@
 #define X86_BUG_CPU_MELTDOWN		X86_BUG(14) /* CPU is affected by meltdown attack and needs kernel page table isolation */
 #define X86_BUG_SPECTRE_V1		X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
+#define X86_BUG_L1TF			X86_BUG(17) /* CPU is affected by L1 Terminal Fault */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index bfca937bdcc3..e1f67b7c5217 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -340,4 +340,15 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, c
 		       boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "",
 		       spectre_v2_module_string());
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return sprintf(buf, "Not affected\n");
+
+	if (boot_cpu_has(X86_FEATURE_L1TF_WA))
+		return sprintf(buf, "Mitigated\n");
+
+	return sprintf(buf, "Mitigation Unavailable\n");
+}
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8a5b185735e1..09c254979483 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -989,8 +989,16 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	setup_force_cpu_cap(X86_FEATURE_ALWAYS);
 
 	if (!x86_match_cpu(cpu_no_speculation)) {
-		if (cpu_vulnerable_to_meltdown(c))
+		if (cpu_vulnerable_to_meltdown(c)) {
 			setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
+			setup_force_cpu_bug(X86_BUG_L1TF);
+			/* Don't force so we can potentially clear it later */
+			set_cpu_cap(c, X86_FEATURE_L1TF_WA);
+#if CONFIG_PGTABLE_LEVELS == 2
+			pr_warn("Kernel not compiled for PAE. No workaround for L1TF\n");
+			setup_clear_cpu_cap(X86_FEATURE_L1TF_WA);
+#endif
+		}
 		setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 		setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
 	}
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 2da998baa75c..ed7b8591d461 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -534,14 +534,22 @@ ssize_t __weak cpu_show_spectre_v2(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_l1tf(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
+static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
 	&dev_attr_spectre_v1.attr,
 	&dev_attr_spectre_v2.attr,
+	&dev_attr_l1tf.attr,
 	NULL
 };
 
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 7b01bc11c692..75c430046ca0 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -53,6 +53,8 @@ extern ssize_t cpu_show_spectre_v1(struct device *dev,
 				   struct device_attribute *attr, char *buf);
 extern ssize_t cpu_show_spectre_v2(struct device *dev,
 				   struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_l1tf(struct device *dev,
+				   struct device_attribute *attr, char *buf);
 
 extern __printf(4, 5)
 struct device *cpu_device_create(struct device *parent, void *drvdata,
-- 
2.14.3

* [MODERATED] [PATCH 6/8] L1TFv3 0
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (4 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 5/8] L1TFv3 2 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04  3:23 ` [MODERATED] [PATCH 7/8] L1TFv3 5 Andi Kleen
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

If the system has more than MAX_PA/2 physical memory the
inverted page workarounds don't protect the system against
the L1TF attack anymore, because an inverted physical address
will point to valid memory.

We cannot do much here, since users after all want to use the
memory, but at least print a warning and report the system as
vulnerable in sysfs.

Note this is all extremely unlikely to happen on a real machine
because real machines typically have far more MAX_PA than can be
populated with DIMMs.

Some VMs also report fairly small PAs to the guest, e.g. only 36bits.
In this case the threshold will be lower, but it applies only
to the maximum guest size.
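
For reference, the cutoff enforced by the new check works out as below.
This is just a sketch of the arithmetic; the phys_bits values are arbitrary
examples:

#include <stdio.h>

int main(void)
{
	int bits[] = { 36, 42, 46 };	/* example x86_phys_bits values */
	int i;

	for (i = 0; i < 3; i++) {
		unsigned long long half_maxpa = 1ULL << (bits[i] - 1);

		printf("MAX_PA %d bits -> warn above %llu GiB of RAM\n",
		       bits[i], half_maxpa >> 30);
	}
	return 0;
}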

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/setup.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fadbd41094d2..4d0b8ba10cf5 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -779,7 +779,27 @@ static void __init trim_low_memory_range(void)
 {
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
 }
-	
+
+static __init void check_maxpa_memory(void)
+{
+	u64 len;
+
+	if (!boot_cpu_has(X86_BUG_L1TF))
+		return;
+
+	len = (1ULL << (boot_cpu_data.x86_phys_bits - 1)) - 1;
+
+	/*
+	 * This is extremely unlikely to happen because systems near always have far
+	 * more MAX_PA than DIMM slots.
+	 */
+	if (e820__mapped_any(len, ULLONG_MAX - len,
+				     E820_TYPE_RAM)) {
+		pr_warn("System has more than MAX_PA/2 memory. Disabled L1TF workaround\n");
+		setup_clear_cpu_cap(X86_FEATURE_L1TF_WA);
+	}
+}
+
 /*
  * Dump out kernel offset information on panic.
  */
@@ -1016,6 +1036,8 @@ void __init setup_arch(char **cmdline_p)
 	insert_resource(&iomem_resource, &data_resource);
 	insert_resource(&iomem_resource, &bss_resource);
 
+	check_maxpa_memory();
+
 	e820_add_kernel_range();
 	trim_bios_range();
 #ifdef CONFIG_X86_32
-- 
2.14.3

* [MODERATED] [PATCH 7/8] L1TFv3 5
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (5 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 6/8] L1TFv3 0 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04 13:43   ` [MODERATED] " Michal Hocko
  2018-05-04  3:23 ` [MODERATED] [PATCH 8/8] L1TFv3 3 Andi Kleen
  2018-05-04  3:54 ` [MODERATED] Re: [PATCH 0/8] L1TFv3 4 Andi Kleen
  8 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

For the L1TF workaround we want to limit the swap file size to below
MAX_PA/2, so that the inverted higher bits of the swap offset never
point to valid memory.

Add a way for the architecture to override the swap file
size check in swapfile.c and add an x86-specific max swapfile check
function that enforces that limit.

The check is only enabled if the CPU is vulnerable to L1TF.

In VMs with 42bit MAX_PA the typical limit is 2TB now; on a
native system with 46bit PA it is 32TB. The limit
is only per individual swap file, so it's always possible
to exceed these limits with multiple swap files or
partitions.
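
The 2TB and 32TB numbers above come from the same MAX_PA/2 arithmetic,
just expressed in pages; a minimal sketch, assuming 4K pages:

#include <stdio.h>

int main(void)
{
	int page_shift = 12;			/* 4K pages */
	int phys_bits[] = { 42, 46 };		/* VM and native examples */
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long long pages =
			1ULL << (phys_bits[i] - 1 - page_shift);

		printf("%d phys bits -> %llu pages (%llu TiB per swap file)\n",
		       phys_bits[i], pages,
		       (pages << page_shift) >> 40);
	}
	return 0;
}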

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/mm/init.c       | 17 +++++++++++++++++
 include/linux/swapfile.h |  2 ++
 mm/swapfile.c            | 44 ++++++++++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fec82b577c18..9f571225f5db 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -4,6 +4,8 @@
 #include <linux/swap.h>
 #include <linux/memblock.h>
 #include <linux/bootmem.h>	/* for max_low_pfn */
+#include <linux/swapfile.h>
+#include <linux/swapops.h>
 
 #include <asm/set_memory.h>
 #include <asm/e820/api.h>
@@ -878,3 +880,18 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
 	__cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
 	__pte2cachemode_tbl[entry] = cache;
 }
+
+unsigned long max_swapfile_size(void)
+{
+	unsigned long pages;
+
+	pages = generic_max_swapfile_size();
+
+	if (boot_cpu_has(X86_BUG_L1TF)) {
+		/* Limit the swap file size to MAX_PA/2 for the L1TF workaround */
+		pages = min_t(unsigned long,
+			      1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT),
+			      pages);
+	}
+	return pages;
+}
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 06bd7b096167..e06febf62978 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 extern int try_to_unuse(unsigned int, bool, unsigned long);
+extern unsigned long generic_max_swapfile_size(void);
+extern unsigned long max_swapfile_size(void);
 
 #endif /* _LINUX_SWAPFILE_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cc2cf04d9018..413f48424194 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2909,6 +2909,33 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 	return 0;
 }
 
+
+/*
+ * Find out how many pages are allowed for a single swap
+ * device. There are two limiting factors: 1) the number
+ * of bits for the swap offset in the swp_entry_t type, and
+ * 2) the number of bits in the swap pte as defined by the
+ * different architectures. In order to find the
+ * largest possible bit mask, a swap entry with swap type 0
+ * and swap offset ~0UL is created, encoded to a swap pte,
+ * decoded to a swp_entry_t again, and finally the swap
+ * offset is extracted. This will mask all the bits from
+ * the initial ~0UL mask that can't be encoded in either
+ * the swp_entry_t or the architecture definition of a
+ * swap pte.
+ */
+unsigned long generic_max_swapfile_size(void)
+{
+	return swp_offset(pte_to_swp_entry(
+			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+}
+
+/* Can be overridden by an architecture for additional checks. */
+__weak unsigned long max_swapfile_size(void)
+{
+	return generic_max_swapfile_size();
+}
+
 static unsigned long read_swap_header(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					struct inode *inode)
@@ -2944,22 +2971,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	p->cluster_next = 1;
 	p->cluster_nr = 0;
 
-	/*
-	 * Find out how many pages are allowed for a single swap
-	 * device. There are two limiting factors: 1) the number
-	 * of bits for the swap offset in the swp_entry_t type, and
-	 * 2) the number of bits in the swap pte as defined by the
-	 * different architectures. In order to find the
-	 * largest possible bit mask, a swap entry with swap type 0
-	 * and swap offset ~0UL is created, encoded to a swap pte,
-	 * decoded to a swp_entry_t again, and finally the swap
-	 * offset is extracted. This will mask all the bits from
-	 * the initial ~0UL mask that can't be encoded in either
-	 * the swp_entry_t or the architecture definition of a
-	 * swap pte.
-	 */
-	maxpages = swp_offset(pte_to_swp_entry(
-			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+	maxpages = max_swapfile_size();
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
 		pr_warn("Empty swap-file\n");
-- 
2.14.3

* [MODERATED] [PATCH 8/8] L1TFv3 3
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (6 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 7/8] L1TFv3 5 Andi Kleen
@ 2018-05-04  3:23 ` Andi Kleen
  2018-05-04 14:19   ` [MODERATED] " Andi Kleen
  2018-05-04 22:15   ` Dave Hansen
  2018-05-04  3:54 ` [MODERATED] Re: [PATCH 0/8] L1TFv3 4 Andi Kleen
  8 siblings, 2 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:23 UTC (permalink / raw)
  To: speck

For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
page table entry. This sets the high bits in the CPU's address space,
thus making sure an unmapped entry does not point to valid
cached memory.

Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such a high mapping was exposed to an unprivileged
user they could attack low memory by setting such a mapping to
PROT_NONE. This could happen through a special device driver
which is not access protected. Normal /dev/mem is of course
access protected.

To avoid this we forbid PROT_NONE mappings or mprotect for high MMIO
mappings.

Valid page mappings are allowed because the system is then unsafe
anyway.

We don't expect users to commonly use PROT_NONE on MMIO. But
to minimize any impact here we only do this if the mapping actually
refers to a high MMIO address (defined as the MAX_PA-1 bit being set),
and we also skip the check for root.

For mmaps this is straightforward and can be handled in vm_insert_pfn
and in remap_pfn_range().

For mprotect it's a bit trickier. At the point where we're looking at the
actual PTEs a lot of state has already been changed and would be difficult
to undo on an error. Since this is an uncommon case we use a separate
early page table walk pass for MMIO PROT_NONE mappings that
checks for this condition early. For non-MMIO and non-PROT_NONE mappings
there are no changes.
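
From user space the new failure mode would look roughly like this. The
device node and mapping size are hypothetical; the only behaviour the patch
actually adds is the EACCES on the PROT_NONE transition for a high MMIO
PFN as non-root:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical driver exposing a high MMIO BAR for mmap */
	int fd = open("/dev/some_mmio_device", O_RDWR);
	if (fd < 0)
		return 1;

	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return 1;
	}

	/* expected to fail with EACCES for an unprivileged caller */
	if (mprotect(p, 4096, PROT_NONE))
		printf("mprotect: %s\n", strerror(errno));

	munmap(p, 4096);
	close(fd);
	return 0;
}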

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h |  4 ++++
 arch/x86/mm/mmap.c             | 21 ++++++++++++++++++
 include/asm-generic/pgtable.h  | 12 +++++++++++
 mm/memory.c                    | 37 ++++++++++++++++++++++---------
 mm/mprotect.c                  | 49 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f811e3257e87..338897c3b36f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1333,6 +1333,10 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
 	return __pte_access_permitted(pud_val(pud), write);
 }
 
+#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
+extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
+static inline bool arch_has_pfn_modify_check(void) { return true; }
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 48c591251600..d7e5083ec5dd 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
 
 	return phys_addr_valid(addr + count - 1);
 }
+
+/*
+ * Only allow root to set high MMIO mappings to PROT_NONE.
+ * This prevents an unpriv. user to set them to PROT_NONE and invert
+ * them, then pointing to valid memory for L1TF speculation.
+ */
+bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+	if (!boot_cpu_has(X86_BUG_L1TF))
+		return true;
+	if ((pgprot_val(prot) & (_PAGE_PRESENT|_PAGE_PROTNONE)) !=
+	    _PAGE_PROTNONE)
+		return true;
+	/* If it's real memory always allow */
+	if (pfn_valid(pfn))
+		return true;
+	if ((pfn & (1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT))) &&
+	    !capable(CAP_SYS_ADMIN))
+		return false;
+	return true;
+}
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f59639afaa39..0ecc1197084b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1097,4 +1097,16 @@ static inline void init_espfix_bsp(void) { }
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
+static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+	return true;
+}
+
+static inline bool arch_has_pfn_modify_check(void)
+{
+	return false;
+}
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464e0fd2..fe497cecd2ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1891,6 +1891,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
 
+	if (!pfn_modify_allowed(pfn, pgprot))
+		return -EACCES;
+
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
 	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
@@ -1926,6 +1929,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, pfn);
 
+	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
+		return -EACCES;
+
 	/*
 	 * If we don't have pte special, then we have to use the pfn_valid()
 	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1973,6 +1979,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	int err = 0;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1980,12 +1987,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	do {
 		BUG_ON(!pte_none(*pte));
+		if (!pfn_modify_allowed(pfn, prot)) {
+			err = -EACCES;
+			break;
+		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
-	return 0;
+	return err;
 }
 
 static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
@@ -1994,6 +2005,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 {
 	pmd_t *pmd;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pmd = pmd_alloc(mm, pud, addr);
@@ -2002,9 +2014,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
-		if (remap_pte_range(mm, pmd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pte_range(mm, pmd, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -2015,6 +2028,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 {
 	pud_t *pud;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pud = pud_alloc(mm, p4d, addr);
@@ -2022,9 +2036,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (remap_pmd_range(mm, pud, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pmd_range(mm, pud, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
@@ -2035,6 +2050,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 {
 	p4d_t *p4d;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	p4d = p4d_alloc(mm, pgd, addr);
@@ -2042,9 +2058,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (remap_pud_range(mm, p4d, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pud_range(mm, p4d, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 625608bc8962..6d331620b9e5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -306,6 +306,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	return pages;
 }
 
+static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
+			       unsigned long next, struct mm_walk *walk)
+{
+	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+		0 : -EACCES;
+}
+
+static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
+				   unsigned long addr, unsigned long next,
+				   struct mm_walk *walk)
+{
+	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+		0 : -EACCES;
+}
+
+static int prot_none_test(unsigned long addr, unsigned long next,
+			  struct mm_walk *walk)
+{
+	return 0;
+}
+
+static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
+			   unsigned long end, unsigned long newflags)
+{
+	pgprot_t new_pgprot = vm_get_page_prot(newflags);
+	struct mm_walk prot_none_walk = {
+		.pte_entry = prot_none_pte_entry,
+		.hugetlb_entry = prot_none_hugetlb_entry,
+		.test_walk = prot_none_test,
+		.mm = current->mm,
+		.private = &new_pgprot,
+	};
+
+	return walk_page_range(start, end, &prot_none_walk);
+}
+
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	unsigned long start, unsigned long end, unsigned long newflags)
@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 		return 0;
 	}
 
+	/*
+	 * Do PROT_NONE PFN permission checks here when we can still
+	 * bail out without undoing a lot of state. This is a rather
+	 * uncommon case, so doesn't need to be very optimized.
+	 */
+	if (arch_has_pfn_modify_check() &&
+	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
+	    (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
+		error = prot_none_walk(vma, start, end, newflags);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * If we make a private mapping writable we increase our commit;
 	 * but (without finer accounting) cannot reduce our commit if we
-- 
2.14.3

* [MODERATED] Re: [PATCH 0/8] L1TFv3 4
  2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
                   ` (7 preceding siblings ...)
  2018-05-04  3:23 ` [MODERATED] [PATCH 8/8] L1TFv3 3 Andi Kleen
@ 2018-05-04  3:54 ` Andi Kleen
  8 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04  3:54 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 88 bytes --]


Also here's a "git format-patch --stdout" mbox attached for easier applying.

-Andi



[-- Attachment #2: m --]
[-- Type: text/plain, Size: 38120 bytes --]

From 0c2eb2235d5476b216693f1e9ec8394d58af20b3 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 3 May 2018 08:35:42 -0700
Subject: [PATCH 1/8] x86, l1tf: Increase 32bit PAE __PHYSICAL_PAGE_MASK
Status: RO
Content-Length: 1575
Lines: 43

On 32bit PAE the max PTE mask is currently set to 44 bit because that is
the limit imposed by 32bit unsigned long PFNs in the VMs.

The L1TF PROT_NONE protection code uses the PTE masks to determine
what bits to invert to make sure the higher bits are set for unmapped
entries to prevent L1TF speculation attacks against EPT inside guests.

But our inverted mask has to match the host, and the host is likely
64bit and may use more than 43 bits of memory. We want to set
all possible bits to be safe here.

So increase the mask on 32bit PAE to 52 to match 64bit. The real
limit is still 44 bits but outside the inverted PTEs these
higher bits are set, so a bigger masks don't cause any problems.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/page_32_types.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index aa30c3241ea7..0d5c739eebd7 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -29,8 +29,13 @@
 #define N_EXCEPTION_STACKS 1
 
 #ifdef CONFIG_X86_PAE
-/* 44=32+12, the limit we can fit into an unsigned long pfn */
-#define __PHYSICAL_MASK_SHIFT	44
+/*
+ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
+ * but we need the full mask to make sure inverted PROT_NONE
+ * entries have all the host bits set in a guest.
+ * The real limit is still 44 bits.
+ */
+#define __PHYSICAL_MASK_SHIFT	52
 #define __VIRTUAL_MASK_SHIFT	32
 
 #else  /* !CONFIG_X86_PAE */
-- 
2.14.3


From 1bef0e393f925379b76cb689bfb3fdbfc052e716 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 27 Apr 2018 09:06:34 -0700
Subject: [PATCH 2/8] x86, l1tf: Protect swap entries against L1TF
Status: RO
Content-Length: 4505
Lines: 108

With L1 terminal fault the CPU speculates into unmapped PTEs, and
resulting side effects allow to read the memory the PTE is pointing
too, if its values are still in the L1 cache.

For swapped out pages Linux uses unmapped PTEs and stores a swap entry
into them.

We need to make sure the swap entry is not pointing to valid memory,
which requires setting higher bits (between bit 36 and bit 45) that
are inside the CPUs physical address space, but outside any real
memory.

To do this we invert the offset to make sure the higher bits are always
set, as long as the swap file is not too big.

Here's a patch that switches the order of "type" and
"offset" in the x86-64 encoding, in addition to doing the binary 'not' on
the offset.

That means that now the offset is bits 9-58 in the page table, and that
the offset is in the bits that hardware generally doesn't care about.

That, in turn, means that if you have a desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, you still have
to have 30 bits of offset actually *in use* until bit 39 ends up being
clear.

So that's 4 terabyte of swap space (because the offset is counted in
pages, so 30 bits of offset is 42 bits of actual coverage). With bigger
physical addressing, that obviously grows further, until you hit the limit
of the offset (at 50 bits of offset - 62 bits of actual swap file
coverage).

Note there is no workaround for 32bit !PAE, or on systems which
have more than MAX_PA/2 memory. The later case is very unlikely
to happen on real systems.

[updated description and minor tweaks by AK]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/include/asm/pgtable_64.h | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 877bc27718ae..593c3cf259dd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -286,20 +286,34 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
+ *
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
-#define SWP_TYPE_BITS 5
-/* Place the offset above the type: */
-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
+#define SWP_TYPE_BITS		5
+
+#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)
+
+/* We always extract/encode the offset by shifting it all the way up, and then down again */
+#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
-					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
-#define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (SWP_TYPE_FIRST_BIT)) \
-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
+/* Extract the high bits for type */
+#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
+
+/* Shift up (to get rid of type), then down to get value */
+#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+
+/*
+ * Shift the offset up "too far" by TYPE bits, then down again
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
+ */
+#define __swp_entry(type, offset) ((swp_entry_t) { \
+	(~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+	| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
+
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
-- 
2.14.3


From 15668abaeb581d8d0ff089eb7643999752b10118 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 27 Apr 2018 09:47:37 -0700
Subject: [PATCH 3/8] x86, l1tf: Protect PROT_NONE PTEs against speculation
Status: RO
Content-Length: 7878
Lines: 244

We also need to protect PTEs that are set to PROT_NONE against
L1TF speculation attacks.

This is important inside guests, because L1TF speculation
bypasses physical page remapping. While the VM has its own
migitations preventing leaking data from other VMs into
the guest, this would still risk leaking the wrong page
inside the current guest.

This uses the same technique as Linus' swap entry patch:
while an entry is is in PROTNONE state we invert the
complete PFN part part of it. This ensures that the
the highest bit will point to non existing memory.

The invert is done by pte/pmd/pud_modify and pfn/pmd/pud_pte for
PROTNONE and pte/pmd/pud_pfn undo it.

We assume that noone tries to touch the PFN part of
a PTE without using these primitives.

This doesn't handle the case that MMIO is on the top
of the CPU physical memory. If such an MMIO region
was exposed by an unpriviledged driver for mmap
it would be possible to attack some real memory.
However this situation is all rather unlikely.

For 32bit non PAE we don't try inversion because
there are really not enough bits to protect anything.

Q: Why does the guest need to be protected when the
HyperVisor already has L1TF mitigations?
A: Here's an example:
You have physical pages 1 2. They get mapped into a guest as
GPA 1 -> PA 2
GPA 2 -> PA 1
through EPT.

The L1TF speculation ignores the EPT remapping.

Now the guest kernel maps GPA 1 to process A and GPA 2 to process B,
and they belong to different users and should be isolated.

A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping
and gets read access to the underlying physical page. Which
in this case points to PA 2, so it can read process B's data,
if it happened to be in L1.

So we broke isolation inside the guest.

There's nothing the hypervisor can do about this. This
mitigation has to be done in the guest.

v2: Use new helper to generate XOR mask to invert (Linus)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/include/asm/pgtable-2level.h | 12 ++++++++++
 arch/x86/include/asm/pgtable-3level.h |  2 ++
 arch/x86/include/asm/pgtable-invert.h | 28 ++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h        | 44 ++++++++++++++++++++++++-----------
 arch/x86/include/asm/pgtable_64.h     |  2 ++
 5 files changed, 75 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/pgtable-invert.h

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 685ffe8a0eaf..9e9d4ce4a2cc 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,4 +95,16 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { (pte).pte_low })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
 
+/* No inverted PFNs on 2 level page tables */
+
+static inline u64 protnone_mask(u64 val)
+{
+	return 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+	return val;
+}
+
 #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f24df59c40b2..76ab26a99e6e 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -295,4 +295,6 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 	return pte;
 }
 
+#include <asm/pgtable-invert.h>
+
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable-invert.h b/arch/x86/include/asm/pgtable-invert.h
new file mode 100644
index 000000000000..457097f5f457
--- /dev/null
+++ b/arch/x86/include/asm/pgtable-invert.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_PGTABLE_INVERT_H
+#define _ASM_PGTABLE_INVERT_H 1
+
+#ifndef __ASSEMBLY__
+
+/* Get a mask to xor with the page table entry to get the correct pfn. */
+static inline u64 protnone_mask(u64 val)
+{
+	return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE ?
+		~0ull : 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+	/*
+	 * When a PTE transitions from NONE to !NONE or vice-versa
+	 * invert the PFN part to stop speculation.
+	 * pte_pfn undoes this when needed.
+	 */
+	if ((oldval & _PAGE_PROTNONE) != (val & _PAGE_PROTNONE))
+		val = (val & ~mask) | (~val & mask);
+	return val;
+}
+
+#endif /* __ASSEMBLY__ */
+
+#endif
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5f49b4ff0c24..f811e3257e87 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
 	return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
+/* Entries that were set to PROT_NONE are inverted */
+
+static inline u64 protnone_mask(u64 val);
+
 static inline unsigned long pte_pfn(pte_t pte)
 {
-	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
+	unsigned long pfn = pte_val(pte);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
-	return (pmd_val(pmd) & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
+	unsigned long pfn = pmd_val(pmd);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & pmd_pfn_mask(pmd)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long pud_pfn(pud_t pud)
 {
-	return (pud_val(pud) & pud_pfn_mask(pud)) >> PAGE_SHIFT;
+	unsigned long pfn = pud_val(pud);
+	pfn ^= protnone_mask(pfn);
+	return (pfn & pud_pfn_mask(pud)) >> PAGE_SHIFT;
 }
 
 static inline unsigned long p4d_pfn(p4d_t p4d)
@@ -545,25 +555,33 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
 
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pte(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PTE_PFN_MASK;
+	return __pte(pfn | check_pgprot(pgprot));
 }
 
 static inline pmd_t pfn_pmd(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pmd(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PHYSICAL_PMD_PAGE_MASK;
+	return __pmd(pfn | check_pgprot(pgprot));
 }
 
 static inline pud_t pfn_pud(unsigned long page_nr, pgprot_t pgprot)
 {
-	return __pud(((phys_addr_t)page_nr << PAGE_SHIFT) |
-		     check_pgprot(pgprot));
+	phys_addr_t pfn = page_nr << PAGE_SHIFT;
+	pfn ^= protnone_mask(pgprot_val(pgprot));
+	pfn &= PHYSICAL_PUD_PAGE_MASK;
+	return __pud(pfn | check_pgprot(pgprot));
 }
 
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
+
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 {
-	pteval_t val = pte_val(pte);
+	pteval_t val = pte_val(pte), oldval = val;
 
 	/*
 	 * Chop off the NX bit (if present), and add the NX portion of
@@ -571,17 +589,17 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 */
 	val &= _PAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
-
+	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
 	return __pte(val);
 }
 
 static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 {
-	pmdval_t val = pmd_val(pmd);
+	pmdval_t val = pmd_val(pmd), oldval = val;
 
 	val &= _HPAGE_CHG_MASK;
 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
-
+	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
 	return __pmd(val);
 }
 
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 593c3cf259dd..ea99272ab63e 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -357,5 +357,7 @@ static inline bool gup_fast_permitted(unsigned long start, int nr_pages,
 	return true;
 }
 
+#include <asm/pgtable-invert.h>
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_PGTABLE_64_H */
-- 
2.14.3


From 704d99d6d9c1bab5f742f6db07f8b027fe4b5854 Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Mon, 23 Apr 2018 15:57:54 -0700
Subject: [PATCH 4/8] x86, l1tf: Make sure the first page is always reserved
Status: RO
Content-Length: 985
Lines: 31

The L1TF workaround doesn't make any attempt to mitigate speculative
accesses to the first physical page for zeroed PTEs. Normally
that page only contains some data from the early real mode BIOS.

I couldn't convince myself we always reserve the first page in
all configurations, so add an extra reservation call to
make sure it is really reserved. In most configurations (e.g.
with the standard reservations) it's likely a nop.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/setup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6285697b6e56..fadbd41094d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -817,6 +817,9 @@ void __init setup_arch(char **cmdline_p)
 	memblock_reserve(__pa_symbol(_text),
 			 (unsigned long)__bss_stop - (unsigned long)_text);
 
+	/* Make sure page 0 is always reserved */
+	memblock_reserve(0, PAGE_SIZE);
+
 	early_reserve_initrd();
 
 	/*
-- 
2.14.3


From 5fbd0bd04a0cc3a74afe74a88ea6178032e577df Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 27 Apr 2018 14:44:53 -0700
Subject: [PATCH 5/8] x86, l1tf: Add sysfs reporting for l1tf
Status: RO
Content-Length: 5072
Lines: 128

The L1TF core kernel workarounds are cheap and normally always enabled.
However we still want to report in sysfs whether the system is vulnerable
or mitigated. Add the necessary checks.

- We use the same checks as Meltdown to determine if the system is
vulnerable. This excludes some Atom CPUs which don't have this
problem.
- We check for the (very unlikely) memory > MAX_PA/2 case
- We check for 32bit non PAE and warn

Note this patch will likely conflict with some other workaround patches
floating around, but should be straightforward to fix.

v2: Use positive instead of negative flag for WA. Fix override
reporting.
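
For illustration, the new file then reports one of the strings added
in cpu_show_l1tf() below, e.g. on an affected system where the
workaround could be applied:

	/sys/devices/system/cpu/vulnerabilities/l1tf: Mitigated
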
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  2 ++
 arch/x86/kernel/cpu/bugs.c         | 11 +++++++++++
 arch/x86/kernel/cpu/common.c       | 10 +++++++++-
 drivers/base/cpu.c                 |  8 ++++++++
 include/linux/cpu.h                |  2 ++
 5 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d554c11e01ff..f1bfe8a37b84 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -214,6 +214,7 @@
 
 #define X86_FEATURE_USE_IBPB		( 7*32+21) /* "" Indirect Branch Prediction Barrier enabled */
 #define X86_FEATURE_USE_IBRS_FW		( 7*32+22) /* "" Use IBRS during runtime firmware calls */
+#define X86_FEATURE_L1TF_WA		( 7*32+23) /* "" L1TF workaround used */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
@@ -362,5 +363,6 @@
 #define X86_BUG_CPU_MELTDOWN		X86_BUG(14) /* CPU is affected by meltdown attack and needs kernel page table isolation */
 #define X86_BUG_SPECTRE_V1		X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2		X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
+#define X86_BUG_L1TF			X86_BUG(17) /* CPU is affected by L1 Terminal Fault */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index bfca937bdcc3..e1f67b7c5217 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -340,4 +340,15 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct device_attribute *attr, c
 		       boot_cpu_has(X86_FEATURE_USE_IBRS_FW) ? ", IBRS_FW" : "",
 		       spectre_v2_module_string());
 }
+
+ssize_t cpu_show_l1tf(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	if (!boot_cpu_has_bug(X86_BUG_L1TF))
+		return sprintf(buf, "Not affected\n");
+
+	if (boot_cpu_has(X86_FEATURE_L1TF_WA))
+		return sprintf(buf, "Mitigated\n");
+
+	return sprintf(buf, "Mitigation Unavailable\n");
+}
 #endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8a5b185735e1..09c254979483 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -989,8 +989,16 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c)
 	setup_force_cpu_cap(X86_FEATURE_ALWAYS);
 
 	if (!x86_match_cpu(cpu_no_speculation)) {
-		if (cpu_vulnerable_to_meltdown(c))
+		if (cpu_vulnerable_to_meltdown(c)) {
 			setup_force_cpu_bug(X86_BUG_CPU_MELTDOWN);
+			setup_force_cpu_bug(X86_BUG_L1TF);
+			/* Don't force so we can potentially clear it later */
+			set_cpu_cap(c, X86_FEATURE_L1TF_WA);
+#if CONFIG_PGTABLE_LEVELS == 2
+			pr_warn("Kernel not compiled for PAE. No workaround for L1TF\n");
+			setup_clear_cpu_cap(X86_FEATURE_L1TF_WA);
+#endif
+		}
 		setup_force_cpu_bug(X86_BUG_SPECTRE_V1);
 		setup_force_cpu_bug(X86_BUG_SPECTRE_V2);
 	}
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 2da998baa75c..ed7b8591d461 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -534,14 +534,22 @@ ssize_t __weak cpu_show_spectre_v2(struct device *dev,
 	return sprintf(buf, "Not affected\n");
 }
 
+ssize_t __weak cpu_show_l1tf(struct device *dev,
+				   struct device_attribute *attr, char *buf)
+{
+	return sprintf(buf, "Not affected\n");
+}
+
 static DEVICE_ATTR(meltdown, 0444, cpu_show_meltdown, NULL);
 static DEVICE_ATTR(spectre_v1, 0444, cpu_show_spectre_v1, NULL);
 static DEVICE_ATTR(spectre_v2, 0444, cpu_show_spectre_v2, NULL);
+static DEVICE_ATTR(l1tf, 0444, cpu_show_l1tf, NULL);
 
 static struct attribute *cpu_root_vulnerabilities_attrs[] = {
 	&dev_attr_meltdown.attr,
 	&dev_attr_spectre_v1.attr,
 	&dev_attr_spectre_v2.attr,
+	&dev_attr_l1tf.attr,
 	NULL
 };
 
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 7b01bc11c692..75c430046ca0 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -53,6 +53,8 @@ extern ssize_t cpu_show_spectre_v1(struct device *dev,
 				   struct device_attribute *attr, char *buf);
 extern ssize_t cpu_show_spectre_v2(struct device *dev,
 				   struct device_attribute *attr, char *buf);
+extern ssize_t cpu_show_l1tf(struct device *dev,
+				   struct device_attribute *attr, char *buf);
 
 extern __printf(4, 5)
 struct device *cpu_device_create(struct device *parent, void *drvdata,
-- 
2.14.3


From 6abad913b28fa726cc0e4d2702fda32b0a488afa Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 9 Feb 2018 10:36:15 -0800
Subject: [PATCH 6/8] x86, l1tf: Report if too much memory for L1TF workaround
Status: RO
Content-Length: 1908
Lines: 66

If the system has more than MAX_PA/2 physical memory, the
page inversion workarounds don't protect the system against
the L1TF attack anymore, because an inverted physical address
will point to valid memory.

We cannot do much here; after all, users want to use the
memory. But at least print a warning and report the system as
vulnerable in sysfs.

Note this is all extremely unlikely to happen on a real machine,
because real machines typically have far more MAX_PA than they
could ever populate with DIMMs.

Some VMs also report fairly small PAs to the guest, e.g. only 36 bits.
In this case the threshold is lower, but it applies only
to the maximum guest size.
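
To illustrate the threshold (assuming nothing beyond the physical
address widths mentioned above):

    46 bit host MAX_PA:   1ULL << (46 - 1) = 2^45 bytes = 32 TB
    36 bit guest MAX_PA:  1ULL << (36 - 1) = 2^35 bytes = 32 GB

The workaround is only disabled if e820 actually reports RAM at or
above that boundary.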

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/setup.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fadbd41094d2..4d0b8ba10cf5 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -779,7 +779,27 @@ static void __init trim_low_memory_range(void)
 {
 	memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
 }
-	
+
+static __init void check_maxpa_memory(void)
+{
+	u64 len;
+
+	if (!boot_cpu_has(X86_BUG_L1TF))
+		return;
+
+	len = (1ULL << (boot_cpu_data.x86_phys_bits - 1)) - 1;
+
+	/*
+	 * This is extremely unlikely to happen because systems near always have far
+	 * more MAX_PA than DIMM slots.
+	 */
+	if (e820__mapped_any(len, ULLONG_MAX - len,
+				     E820_TYPE_RAM)) {
+		pr_warn("System has more than MAX_PA/2 memory. Disabled L1TF workaround\n");
+		setup_clear_cpu_cap(X86_FEATURE_L1TF_WA);
+	}
+}
+
 /*
  * Dump out kernel offset information on panic.
  */
@@ -1016,6 +1036,8 @@ void __init setup_arch(char **cmdline_p)
 	insert_resource(&iomem_resource, &data_resource);
 	insert_resource(&iomem_resource, &bss_resource);
 
+	check_maxpa_memory();
+
 	e820_add_kernel_range();
 	trim_bios_range();
 #ifdef CONFIG_X86_32
-- 
2.14.3


From 9422c9bcd3e664ce0f3a5eeadd45fa9127e0b5ba Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Fri, 27 Apr 2018 15:29:17 -0700
Subject: [PATCH 7/8] x86, l1tf: Limit swap file size to MAX_PA/2
Status: RO
Content-Length: 4671
Lines: 132

For the L1TF workaround we want to limit the swap file size to below
MAX_PA/2, so that the higher bits of the swap offset inverted never
point to valid memory.

Add a way for the architecture to override the swap file
size check in swapfile.c and add an x86 specific max swapfile check
function that enforces that limit.

The check is only enabled if the CPU is vulnerable to L1TF.

In VMs with 42bit MAX_PA the typical limit is 2TB now,
on a native system with 46bit PA it is 32TB. The limit
is only per individual swap file, so it's always possible
to exceed these limits with multiple swap files or
partitions.
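
For illustration, assuming 4K pages (PAGE_SHIFT == 12), the per swap
file limit enforced by the new x86 hook works out to:

    42 bit MAX_PA:  1ULL << (42 - 1 - 12) = 2^29 pages =  2 TB
    46 bit MAX_PA:  1ULL << (46 - 1 - 12) = 2^33 pages = 32 TB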

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/mm/init.c       | 17 +++++++++++++++++
 include/linux/swapfile.h |  2 ++
 mm/swapfile.c            | 44 ++++++++++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fec82b577c18..9f571225f5db 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -4,6 +4,8 @@
 #include <linux/swap.h>
 #include <linux/memblock.h>
 #include <linux/bootmem.h>	/* for max_low_pfn */
+#include <linux/swapfile.h>
+#include <linux/swapops.h>
 
 #include <asm/set_memory.h>
 #include <asm/e820/api.h>
@@ -878,3 +880,18 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
 	__cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
 	__pte2cachemode_tbl[entry] = cache;
 }
+
+unsigned long max_swapfile_size(void)
+{
+	unsigned long pages;
+
+	pages = generic_max_swapfile_size();
+
+	if (boot_cpu_has(X86_BUG_L1TF)) {
+		/* Limit the swap file size to MAX_PA/2 for the L1TF workaround */
+		pages = min_t(unsigned long,
+			      1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT),
+			      pages);
+	}
+	return pages;
+}
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 06bd7b096167..e06febf62978 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 extern int try_to_unuse(unsigned int, bool, unsigned long);
+extern unsigned long generic_max_swapfile_size(void);
+extern unsigned long max_swapfile_size(void);
 
 #endif /* _LINUX_SWAPFILE_H */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cc2cf04d9018..413f48424194 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2909,6 +2909,33 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 	return 0;
 }
 
+
+/*
+ * Find out how many pages are allowed for a single swap
+ * device. There are two limiting factors: 1) the number
+ * of bits for the swap offset in the swp_entry_t type, and
+ * 2) the number of bits in the swap pte as defined by the
+ * different architectures. In order to find the
+ * largest possible bit mask, a swap entry with swap type 0
+ * and swap offset ~0UL is created, encoded to a swap pte,
+ * decoded to a swp_entry_t again, and finally the swap
+ * offset is extracted. This will mask all the bits from
+ * the initial ~0UL mask that can't be encoded in either
+ * the swp_entry_t or the architecture definition of a
+ * swap pte.
+ */
+unsigned long generic_max_swapfile_size(void)
+{
+	return swp_offset(pte_to_swp_entry(
+			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+}
+
+/* Can be overridden by an architecture for additional checks. */
+__weak unsigned long max_swapfile_size(void)
+{
+	return generic_max_swapfile_size();
+}
+
 static unsigned long read_swap_header(struct swap_info_struct *p,
 					union swap_header *swap_header,
 					struct inode *inode)
@@ -2944,22 +2971,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 	p->cluster_next = 1;
 	p->cluster_nr = 0;
 
-	/*
-	 * Find out how many pages are allowed for a single swap
-	 * device. There are two limiting factors: 1) the number
-	 * of bits for the swap offset in the swp_entry_t type, and
-	 * 2) the number of bits in the swap pte as defined by the
-	 * different architectures. In order to find the
-	 * largest possible bit mask, a swap entry with swap type 0
-	 * and swap offset ~0UL is created, encoded to a swap pte,
-	 * decoded to a swp_entry_t again, and finally the swap
-	 * offset is extracted. This will mask all the bits from
-	 * the initial ~0UL mask that can't be encoded in either
-	 * the swp_entry_t or the architecture definition of a
-	 * swap pte.
-	 */
-	maxpages = swp_offset(pte_to_swp_entry(
-			swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+	maxpages = max_swapfile_size();
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
 		pr_warn("Empty swap-file\n");
-- 
2.14.3


From 62b4a73ea2fd6453e0aa3d7493ce8a46b5bb13ca Mon Sep 17 00:00:00 2001
From: Andi Kleen <ak@linux.intel.com>
Date: Thu, 3 May 2018 16:39:51 -0700
Subject: [PATCH 8/8] mm, l1tf: Disallow non privileged high MMIO PROT_NONE
 mappings
Status: RO
Content-Length: 9404
Lines: 292

For L1TF, PROT_NONE mappings are protected by inverting the PFN in the
page table entry. This sets the high bits in the CPU's address space,
thus making sure an unmapped entry does not point to valid
cached memory.

Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such a high mapping was exposed to an unprivileged
user they could attack low memory by setting such a mapping to
PROT_NONE. This could happen through a special device driver
which is not access protected. Normal /dev/mem is of course
access protected.

To avoid this we forbid PROT_NONE mappings or mprotect for high MMIO
mappings.

Valid page mappings are allowed because the system is then unsafe
anyway.

We don't expect users to commonly use PROT_NONE on MMIO. But
to minimize any impact here we only do this if the mapping actually
refers to a high MMIO address (defined as the MAX_PA-1 bit being set),
and also skip the check for root.

For mmaps this is straightforward and can be handled in vm_insert_pfn()
and in remap_pfn_range().

For mprotect it's a bit trickier. At the point where we're looking at the
actual PTEs a lot of state has already been changed and would be difficult
to undo on an error. Since this is an uncommon case we use a separate
early page table walk pass for MMIO PROT_NONE mappings that
checks for this condition early. For non MMIO and non PROT_NONE
mappings there are no changes.
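
A minimal user space sketch of the mprotect() case (the device node is
hypothetical and stands in for any driver that exposes an MMIO range
above MAX_PA/2 through mmap):

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            /* Hypothetical device with a BAR above MAX_PA/2 */
            int fd = open("/dev/highmmio", O_RDWR);
            void *p;

            if (fd < 0)
                    return 1;
            p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;
            /* Expected to fail with EACCES for a non root user now */
            if (mprotect(p, 4096, PROT_NONE) < 0)
                    printf("mprotect: %s\n", strerror(errno));
            munmap(p, 4096);
            close(fd);
            return 0;
    }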

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h |  4 ++++
 arch/x86/mm/mmap.c             | 21 ++++++++++++++++++
 include/asm-generic/pgtable.h  | 12 +++++++++++
 mm/memory.c                    | 37 ++++++++++++++++++++++---------
 mm/mprotect.c                  | 49 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f811e3257e87..338897c3b36f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1333,6 +1333,10 @@ static inline bool pud_access_permitted(pud_t pud, bool write)
 	return __pte_access_permitted(pud_val(pud), write);
 }
 
+#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
+extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
+static inline bool arch_has_pfn_modify_check(void) { return true; }
+
 #include <asm-generic/pgtable.h>
 #endif	/* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
index 48c591251600..d7e5083ec5dd 100644
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
 
 	return phys_addr_valid(addr + count - 1);
 }
+
+/*
+ * Only allow root to set high MMIO mappings to PROT_NONE.
+ * This prevents an unprivileged user from setting them to PROT_NONE and
+ * inverting them, then pointing to valid memory for L1TF speculation.
+ */
+bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+	if (!boot_cpu_has(X86_BUG_L1TF))
+		return true;
+	if ((pgprot_val(prot) & (_PAGE_PRESENT|_PAGE_PROTNONE)) !=
+	    _PAGE_PROTNONE)
+		return true;
+	/* If it's real memory always allow */
+	if (pfn_valid(pfn))
+		return true;
+	if ((pfn & (1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT))) &&
+	    !capable(CAP_SYS_ADMIN))
+		return false;
+	return true;
+}
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index f59639afaa39..0ecc1197084b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1097,4 +1097,16 @@ static inline void init_espfix_bsp(void) { }
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
+static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+	return true;
+}
+
+static inline bool arch_has_pfn_modify_check(void)
+{
+	return false;
+}
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
diff --git a/mm/memory.c b/mm/memory.c
index 01f5464e0fd2..fe497cecd2ab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1891,6 +1891,9 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	if (addr < vma->vm_start || addr >= vma->vm_end)
 		return -EFAULT;
 
+	if (!pfn_modify_allowed(pfn, pgprot))
+		return -EACCES;
+
 	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
 	ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
@@ -1926,6 +1929,9 @@ static int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 
 	track_pfn_insert(vma, &pgprot, pfn);
 
+	if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
+		return -EACCES;
+
 	/*
 	 * If we don't have pte special, then we have to use the pfn_valid()
 	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1973,6 +1979,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	int err = 0;
 
 	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
@@ -1980,12 +1987,16 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	arch_enter_lazy_mmu_mode();
 	do {
 		BUG_ON(!pte_none(*pte));
+		if (!pfn_modify_allowed(pfn, prot)) {
+			err = -EACCES;
+			break;
+		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
-	return 0;
+	return err;
 }
 
 static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
@@ -1994,6 +2005,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 {
 	pmd_t *pmd;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pmd = pmd_alloc(mm, pud, addr);
@@ -2002,9 +2014,10 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
-		if (remap_pte_range(mm, pmd, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pte_range(mm, pmd, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -2015,6 +2028,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 {
 	pud_t *pud;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	pud = pud_alloc(mm, p4d, addr);
@@ -2022,9 +2036,10 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (remap_pmd_range(mm, pud, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pmd_range(mm, pud, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
@@ -2035,6 +2050,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 {
 	p4d_t *p4d;
 	unsigned long next;
+	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
 	p4d = p4d_alloc(mm, pgd, addr);
@@ -2042,9 +2058,10 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (remap_pud_range(mm, p4d, addr, next,
-				pfn + (addr >> PAGE_SHIFT), prot))
-			return -ENOMEM;
+		err = remap_pud_range(mm, p4d, addr, next,
+				pfn + (addr >> PAGE_SHIFT), prot);
+		if (err)
+			return err;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 625608bc8962..6d331620b9e5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -306,6 +306,42 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
 	return pages;
 }
 
+static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
+			       unsigned long next, struct mm_walk *walk)
+{
+	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+		0 : -EACCES;
+}
+
+static int prot_none_hugetlb_entry(pte_t *pte, unsigned long hmask,
+				   unsigned long addr, unsigned long next,
+				   struct mm_walk *walk)
+{
+	return pfn_modify_allowed(pte_pfn(*pte), *(pgprot_t *)(walk->private)) ?
+		0 : -EACCES;
+}
+
+static int prot_none_test(unsigned long addr, unsigned long next,
+			  struct mm_walk *walk)
+{
+	return 0;
+}
+
+static int prot_none_walk(struct vm_area_struct *vma, unsigned long start,
+			   unsigned long end, unsigned long newflags)
+{
+	pgprot_t new_pgprot = vm_get_page_prot(newflags);
+	struct mm_walk prot_none_walk = {
+		.pte_entry = prot_none_pte_entry,
+		.hugetlb_entry = prot_none_hugetlb_entry,
+		.test_walk = prot_none_test,
+		.mm = current->mm,
+		.private = &new_pgprot,
+	};
+
+	return walk_page_range(start, end, &prot_none_walk);
+}
+
 int
 mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	unsigned long start, unsigned long end, unsigned long newflags)
@@ -323,6 +359,19 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 		return 0;
 	}
 
+	/*
+	 * Do PROT_NONE PFN permission checks here when we can still
+	 * bail out without undoing a lot of state. This is a rather
+	 * uncommon case, so doesn't need to be very optimized.
+	 */
+	if (arch_has_pfn_modify_check() &&
+	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
+	    (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
+		error = prot_none_walk(vma, start, end, newflags);
+		if (error)
+			return error;
+	}
+
 	/*
 	 * If we make a private mapping writable we increase our commit;
 	 * but (without finer accounting) cannot reduce our commit if we
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 3/8] L1TFv3 1
  2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
@ 2018-05-04  3:55   ` Linus Torvalds
  2018-05-04 13:42   ` Michal Hocko
  2018-05-07 12:38   ` Vlastimil Babka
  2 siblings, 0 replies; 29+ messages in thread
From: Linus Torvalds @ 2018-05-04  3:55 UTC (permalink / raw)
  To: speck



On Thu, 3 May 2018, speck for Andi Kleen wrote:
> 
> We also need to protect PTEs that are set to PROT_NONE against 
> L1TF speculation attacks.

Thanks, this looks better to me. Maybe it's because I got to pee in the 
snow and make my mark, though, so somebody else should maybe also look at 
this.

             Linus

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 1/8] L1TFv3 8
  2018-05-04  3:23 ` [MODERATED] [PATCH 1/8] L1TFv3 8 Andi Kleen
@ 2018-05-04 13:42   ` Michal Hocko
  2018-05-04 14:07     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 13:42 UTC (permalink / raw)
  To: speck

On Thu 03-05-18 20:23:22, speck for Andi Kleen wrote:
[...]
>  #ifdef CONFIG_X86_PAE
> -/* 44=32+12, the limit we can fit into an unsigned long pfn */
> -#define __PHYSICAL_MASK_SHIFT	44
> +/*
> + * This is beyond the 44 bit limit imposed by the 32bit long pfns,
> + * but we need the full mask to make sure inverted PROT_NONE
> + * entries have all the host bits set in a guest.
> + * The real limit is still 44 bits.

I cannot find where that limit is enforced.

> + */
> +#define __PHYSICAL_MASK_SHIFT	52
>  #define __VIRTUAL_MASK_SHIFT	32
>  
>  #else  /* !CONFIG_X86_PAE */

Other than that this makes sense.
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 3/8] L1TFv3 1
  2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
  2018-05-04  3:55   ` [MODERATED] " Linus Torvalds
@ 2018-05-04 13:42   ` Michal Hocko
  2018-05-07 12:38   ` Vlastimil Babka
  2 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 13:42 UTC (permalink / raw)
  To: speck

On Thu 03-05-18 20:23:24, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86, l1tf: Protect PROT_NONE PTEs against speculation
> 
> We also need to protect PTEs that are set to PROT_NONE against
> L1TF speculation attacks.
> 
> This is important inside guests, because L1TF speculation
> bypasses physical page remapping. While the VM has its own
> migitations preventing leaking data from other VMs into
> the guest, this would still risk leaking the wrong page
> inside the current guest.
> 
> This uses the same technique as Linus' swap entry patch:
> while an entry is in PROTNONE state we invert the
> complete PFN part of it. This ensures that the
> highest bit will point to non existing memory.
> 
> The invert is done by pte/pmd/pud_modify and pfn/pmd/pud_pte for
> PROTNONE and pte/pmd/pud_pfn undo it.
> 
> We assume that noone tries to touch the PFN part of
> a PTE without using these primitives.
> 
> This doesn't handle the case that MMIO is on the top
> of the CPU physical memory. If such an MMIO region
> was exposed by an unprivileged driver for mmap
> it would be possible to attack some real memory.
> However this situation is all rather unlikely.
> 
> For 32bit non PAE we don't try inversion because
> there are really not enough bits to protect anything.
> 
> Q: Why does the guest need to be protected when the
> HyperVisor already has L1TF mitigations?
> A: Here's an example:
> You have physical pages 1 2. They get mapped into a guest as
> GPA 1 -> PA 2
> GPA 2 -> PA 1
> through EPT.
> 
> The L1TF speculation ignores the EPT remapping.
> 
> Now the guest kernel maps GPA 1 to process A and GPA 2 to process B,
> and they belong to different users and should be isolated.
> 
> A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping
> and gets read access to the underlying physical page. Which
> in this case points to PA 2, so it can read process B's data,
> if it happened to be in L1.
> 
> So we broke isolation inside the guest.
> 
> There's nothing the hypervisor can do about this. This
> mitigation has to be done in the guest.
> 
> v2: Use new helper to generate XOR mask to invert (Linus)
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks for extending the changelog. The implementation is different but
my ack still applies. I like how this ended up being smaller than
the original patch.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 7/8] L1TFv3 5
  2018-05-04  3:23 ` [MODERATED] [PATCH 7/8] L1TFv3 5 Andi Kleen
@ 2018-05-04 13:43   ` Michal Hocko
  2018-05-04 14:11     ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 13:43 UTC (permalink / raw)
  To: speck

On Thu 03-05-18 20:23:28, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86, l1tf: Limit swap file size to MAX_PA/2
> 
> For the L1TF workaround we want to limit the swap file size to below
> MAX_PA/2, so that the higher bits of the swap offset inverted never
> point to valid memory.
> 
> Add a way for the architecture to override the swap file
> size check in swapfile.c and add an x86 specific max swapfile check
> function that enforces that limit.
> 
> The check is only enabled if the CPU is vulnerable to L1TF.
> 
> In VMs with 42bit MAX_PA the typical limit is 2TB now,
> on a native system with 46bit PA it is 32TB. The limit
> is only per individual swap file, so it's always possible
> to exceed these limits with multiple swap files or
> partitions.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>
still applies. Just in case the previous post got lost in the noise, I
repeat that I would prefer if we issued a pr_warn about a truncated swap
file. It is quite a stretch to expect such large swap storage, but
let's keep users aware of the fact just in case.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 1/8] L1TFv3 8
  2018-05-04 13:42   ` [MODERATED] " Michal Hocko
@ 2018-05-04 14:07     ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-04 14:07 UTC (permalink / raw)
  To: speck

On Fri, May 04, 2018 at 03:42:17PM +0200, speck for Michal Hocko wrote:
> On Thu 03-05-18 20:23:22, speck for Andi Kleen wrote:
> [...]
> >  #ifdef CONFIG_X86_PAE
> > -/* 44=32+12, the limit we can fit into an unsigned long pfn */
> > -#define __PHYSICAL_MASK_SHIFT	44
> > +/*
> > + * This is beyond the 44 bit limit imposed by the 32bit long pfns,
> > + * but we need the full mask to make sure inverted PROT_NONE
> > + * entries have all the host bits set in a guest.
> > + * The real limit is still 44 bits.
> 
> I cannot find where is that limit enforced.

I don't think there is an explicit check anywhere, but it would just silently
wrap everywhere PFNs are used.

But likely it would fail to allocate such a large mem_map in lowmem, so 
I doubt it could ever happen.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 7/8] L1TFv3 5
  2018-05-04 13:43   ` [MODERATED] " Michal Hocko
@ 2018-05-04 14:11     ` Andi Kleen
  2018-05-04 14:21       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04 14:11 UTC (permalink / raw)
  To: speck

> still applies. Just in case the previous post got lost in the noise I
> repeat that I would prefer if we issued pr_want about truncated swap
> file. It is quite a stretch to expect such a large swap storage but
> let's keep users aware of the fact just in case.

The warning is already there in swapfile.c

        last_page = swap_header->info.last_page;
        if (last_page > maxpages) {
                pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
                        maxpages << (PAGE_SHIFT - 10),
                        last_page << (PAGE_SHIFT - 10));
        }

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04  3:23 ` [MODERATED] [PATCH 8/8] L1TFv3 3 Andi Kleen
@ 2018-05-04 14:19   ` Andi Kleen
  2018-05-04 14:34     ` Michal Hocko
  2018-05-04 22:15   ` Dave Hansen
  1 sibling, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04 14:19 UTC (permalink / raw)
  To: speck


BTW this one is a somewhat tricky and new patch. Would be good if VM people
could take a good look at the assumptions and the logic. Thanks.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 7/8] L1TFv3 5
  2018-05-04 14:11     ` Andi Kleen
@ 2018-05-04 14:21       ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 14:21 UTC (permalink / raw)
  To: speck

On Fri 04-05-18 07:11:20, speck for Andi Kleen wrote:
> > still applies. Just in case the previous post got lost in the noise I
> > repeat that I would prefer if we issued pr_want about truncated swap
> > file. It is quite a stretch to expect such a large swap storage but
> > let's keep users aware of the fact just in case.
> 
> The warning is already there in swapfile.c
> 
>         last_page = swap_header->info.last_page;
>         if (last_page > maxpages) {
>                 pr_warn("Truncating oversized swap area, only using %luk out of %luk\n",
>                         maxpages << (PAGE_SHIFT - 10),
>                         last_page << (PAGE_SHIFT - 10));
>         }

right you are, I've missed that one.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04 14:19   ` [MODERATED] " Andi Kleen
@ 2018-05-04 14:34     ` Michal Hocko
  2018-05-04 15:53       ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 14:34 UTC (permalink / raw)
  To: speck

On Fri 04-05-18 07:19:51, speck for Andi Kleen wrote:
> 
> BTW this one is a somewhat tricky and new patch. Would be good if VM people
> could take a good look at the assumptions and the logic. Thanks.

I have to confess that it is not entirely clear to me how any BIOS can
put MMIO that high and have things still keep working, but in principle the
patch looks reasonable to me. I have yet to double check that you have
covered all the pfn remapping APIs to catch all potential offenders.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04 14:34     ` Michal Hocko
@ 2018-05-04 15:53       ` Andi Kleen
  2018-05-04 16:26         ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-04 15:53 UTC (permalink / raw)
  To: speck

On Fri, May 04, 2018 at 04:34:24PM +0200, speck for Michal Hocko wrote:
> On Fri 04-05-18 07:19:51, speck for Andi Kleen wrote:
> > 
> > BTW this one is a somewhat tricky and new patch. Would be good if VM people
> > could take a good look at the assumptions and the logic. Thanks.
> 
> I have to confess that it is not entirely clear to me how any bios can
> put MMIO that high and things still keep working but in principle, the

How? It can of course put any mapping there that doesn't have another
address restriction. Devices with large mappings usually don't have
restrictions smaller than the MAX_PA size of the CPU (46 bits), so it's
an arbitrary choice.

Mappings with restrictions will of course be in the 4GB hole.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04 15:53       ` Andi Kleen
@ 2018-05-04 16:26         ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2018-05-04 16:26 UTC (permalink / raw)
  To: speck

On Fri 04-05-18 08:53:54, speck for Andi Kleen wrote:
> On Fri, May 04, 2018 at 04:34:24PM +0200, speck for Michal Hocko wrote:
> > On Fri 04-05-18 07:19:51, speck for Andi Kleen wrote:
> > > 
> > > BTW this one is a somewhat tricky and new patch. Would be good if VM people
> > > could take a good look at the assumptions and the logic. Thanks.
> > 
> > I have to confess that it is not entirely clear to me how any bios can
> > put MMIO that high and things still keep working but in principle, the
> 
> How? It can of course put any mapping there that doesn't have another
> address restriction. Devices with large mappings usually don't have
> restrictions smaller than the MAX_PA size of the CPU (46 bits), so it's
> an arbitrary choice.

Well, I would expect that most pfns would get masked by
__PHYSICAL_MASK_SHIFT. But then I am not really that familiar with MMIO,
to be honest.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04  3:23 ` [MODERATED] [PATCH 8/8] L1TFv3 3 Andi Kleen
  2018-05-04 14:19   ` [MODERATED] " Andi Kleen
@ 2018-05-04 22:15   ` Dave Hansen
  2018-05-05  3:55     ` Andi Kleen
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Hansen @ 2018-05-04 22:15 UTC (permalink / raw)
  To: speck

> +bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
> +{
> +	if (!boot_cpu_has(X86_BUG_L1TF))
> +		return true;
> +	if ((pgprot_val(prot) & (_PAGE_PRESENT|_PAGE_PROTNONE)) !=
> +	    _PAGE_PROTNONE)
> +		return true;

Just bikeshedding: it would be nice to have all the

	(_PAGE_PROTNONE | _PAGE_PRESENT)) == _PAGE_PROTNONE;

checks consolidated into a single helper, but this adds like the 6th
site that does this, so it's not just this code.
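
Something like this would cover them all (name and the raw pteval_t
argument are just a strawman):

	static inline bool pteval_protnone(pteval_t val)
	{
		return (val & (_PAGE_PRESENT | _PAGE_PROTNONE)) == _PAGE_PROTNONE;
	}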

> +	/* If it's real memory always allow */
> +	if (pfn_valid(pfn))
> +		return true;
> +	if ((pfn & (1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT))) &&
> +	    !capable(CAP_SYS_ADMIN))
> +		return false;

The CAP_SYS_ADMIN kinda surprised me here.  Was there a reason we don't
just do this up front in the function?  Seems like we'd always allow pfn
modification if CAP_SYS_ADMIN.

Also, that PFN calculation is dying for a helper.
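
E.g. (again just a strawman name):

	/* Does the PFN have the MAX_PA/2 physical address bit set? */
	static inline bool pfn_above_l1tf_limit(unsigned long pfn)
	{
		return pfn & (1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT));
	}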

> +	/*
> +	 * Do PROT_NONE PFN permission checks here when we can still
> +	 * bail out without undoing a lot of state. This is a rather
> +	 * uncommon case, so doesn't need to be very optimized.
> +	 */
> +	if (arch_has_pfn_modify_check() &&
> +	    (vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
> +	    (newflags & (VM_READ|VM_WRITE|VM_EXEC)) == 0) {
> +		error = prot_none_walk(vma, start, end, newflags);
> +		if (error)
> +			return error;
> +	}

Otherwise it looks OK to me.  The need for arch_has_pfn_modify_check()
wasn't obvious until the end, but it all came together.  This seems like
a pretty reasonable compromise for something we think is really rare.

Although, I do wonder: should we put a WARN_ONCE() or something else in
here to help us see if *anybody* does this?  Or even a VM counter?  Or
do we just wait for the bug reports from somebody doing frequent 1-page
PROT_NONEs in a 100GB VMA?  :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 8/8] L1TFv3 3
  2018-05-04 22:15   ` Dave Hansen
@ 2018-05-05  3:55     ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2018-05-05  3:55 UTC (permalink / raw)
  To: speck

On Fri, May 04, 2018 at 03:15:34PM -0700, speck for Dave Hansen wrote:
> > +	/* If it's real memory always allow */
> > +	if (pfn_valid(pfn))
> > +		return true;
> > +	if ((pfn & (1ULL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT))) &&
> > +	    !capable(CAP_SYS_ADMIN))
> > +		return false;
> 
> The CAP_SYS_ADMIN kinda surprised me here.  Was there a reason we don't
> just do this up front in the function?  Seems like we'd always allow pfn
> modification if CAP_SYS_ADMIN.

In most cases it's useless because the earlier checks reject it.
I tried to order the checks by the likelihood that they reject.

> Otherwise it looks OK to me.  The need for arch_has_pfn_modify_check()
> wasn't obvious until the end, but it all came together.  This seems like
> a pretty reasonable compromise for something we think is really rare.
> 
> Although, I do wonder: should we put a WARN_ONCE() or something else in
> here to help us see if *anybody* does this?  Or even a VM counter?  Or

If it's really a problem it would show up in perf.

> do we just wait for the bug reports from somebody doing frequent 1-page
> PROT_NONEs in a 100GB VMA?  :)

I doubt it is actually noticeably slower than something open coded.
Even though there are some callbacks, modern CPUs are really good at
running them very fast.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 2/8] L1TFv3 7
  2018-05-04  3:23 ` [MODERATED] [PATCH 2/8] L1TFv3 7 Andi Kleen
@ 2018-05-07 11:45   ` Vlastimil Babka
  0 siblings, 0 replies; 29+ messages in thread
From: Vlastimil Babka @ 2018-05-07 11:45 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 2154 bytes --]

On 05/04/2018 05:23 AM, speck for Andi Kleen wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Subject:  x86, l1tf: Protect swap entries against L1TF
> 
> With L1 terminal fault the CPU speculates into unmapped PTEs, and the
> resulting side effects allow reading the memory the PTE is pointing
> to, if its contents are still in the L1 cache.
> 
> For swapped out pages Linux uses unmapped PTEs and stores a swap entry
> into them.
> 
> We need to make sure the swap entry is not pointing to valid memory,
> which requires setting higher bits (between bit 36 and bit 45) that
> are inside the CPUs physical address space, but outside any real
> memory.
> 
> To do this we invert the offset to make sure the higher bits are always
> set, as long as the swap file is not too big.
> 
> Here's a patch that switches the order of "type" and
> "offset" in the x86-64 encoding, in addition to doing the binary 'not' on
> the offset.
> 
> That means that now the offset is bits 9-58 in the page table, and that
> the offset is in the bits that hardware generally doesn't care about.

      ^ type

> That, in turn, means that if you have a desktop chip with only 40 bits of
> physical addressing, now that the offset starts at bit 9, you still have
> to have 30 bits of offset actually *in use* until bit 39 ends up being
> clear.
> 
> So that's 4 terabyte of swap space (because the offset is counted in
> pages, so 30 bits of offset is 42 bits of actual coverage). With bigger
> physical addressing, that obviously grows further, until you hit the limit
> of the offset (at 50 bits of offset - 62 bits of actual swap file
> coverage).
> 
> Note there is no workaround for 32bit !PAE, or on systems which
> have more than MAX_PA/2 memory. The latter case is very unlikely
> to happen on real systems.
> 
> [updated description and minor tweaks by AK]
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Tested-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 3/8] L1TFv3 1
  2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
  2018-05-04  3:55   ` [MODERATED] " Linus Torvalds
  2018-05-04 13:42   ` Michal Hocko
@ 2018-05-07 12:38   ` Vlastimil Babka
  2018-05-07 13:41     ` Andi Kleen
  2 siblings, 1 reply; 29+ messages in thread
From: Vlastimil Babka @ 2018-05-07 12:38 UTC (permalink / raw)
  To: speck

[-- Attachment #1: Type: text/plain, Size: 1333 bytes --]

On 05/04/2018 05:23 AM, speck for Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> Subject:  x86, l1tf: Protect PROT_NONE PTEs against speculation
> 
> We also need to protect PTEs that are set to PROT_NONE against
> L1TF speculation attacks.
> 
> This is important inside guests, because L1TF speculation
> bypasses physical page remapping. While the VM has its own
> migitations preventing leaking data from other VMs into
> the guest, this would still risk leaking the wrong page
> inside the current guest.
> 
> This uses the same technique as Linus' swap entry patch:
> while an entry is in PROTNONE state we invert the
> complete PFN part of it. This ensures that the
> highest bit will point to non existing memory.
> 
> The invert is done by pte/pmd/pud_modify and pfn/pmd/pud_pte for

There's no pud_modify() AFAICS.

...

> v2: Use new helper to generate XOR mask to invert (Linus)
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

I was a bit worried whether this would still work with NUMA balancing
disabled, but it seems we set the _PAGE_PROTNONE bit even in that
case. Only pte_protnone() and pmd_protnone() always return 0 in that
case, but the code here doesn't use those.

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 3/8] L1TFv3 1
  2018-05-07 12:38   ` Vlastimil Babka
@ 2018-05-07 13:41     ` Andi Kleen
  2018-05-07 18:01       ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-07 13:41 UTC (permalink / raw)
  To: speck

> I was a bit worried whether this would still work with NUMA balancing
> disabled, but it seems we set the _PAGE_PROTNONE bit even in that
> case. Only pte_protnone() and pmd_protnone() always return 0 in that
> case, but the code here doesn't use those.

Yes pte/pmd_protnone semantics are a total mess. Numa balancing
really should have used a differently named function for this
instead of this terrible ifdef hack.

That's why I'm open coding everything, to avoid this bogosity.

Should be really cleaned up at some point, but I think that's out
of scope for this patch kit.

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/8] L1TFv3 1
  2018-05-07 13:41     ` Andi Kleen
@ 2018-05-07 18:01       ` Thomas Gleixner
  2018-05-07 18:21         ` [MODERATED] " Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Gleixner @ 2018-05-07 18:01 UTC (permalink / raw)
  To: speck

On Mon, 7 May 2018, speck for Andi Kleen wrote:
> > I was a bit worried whether this would still work with NUMA balancing
> > disabled, but it seems we set the _PAGE_PROTNONE bit even in that
> > case. Only pte_protnone() and pmd_protnone() always return 0 in that
> > case, but the code here doesn't use those.
> 
> Yes pte/pmd_protnone semantics are a total mess. Numa balancing
> really should have used a differently named function for this
> instead of this terrible ifdef hack.
> 
> That's why I'm open coding everything, to avoid this bogosity.
> 
> Should be really cleaned up at some point, but I think that's out
> of scope for this patch kit.

IOW, you know it's a mess and because you can't be bothered to clean it up,
you add open coded crap to it, so others have more work to clean it up later.

That's exactly the attitude which leads to unmaintainable and undebuggable
code over time.

So no, please clean it up first and then add these new bits to it in a sane
way.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [MODERATED] Re: [PATCH 3/8] L1TFv3 1
  2018-05-07 18:01       ` Thomas Gleixner
@ 2018-05-07 18:21         ` Andi Kleen
  2018-05-07 20:03           ` Thomas Gleixner
  0 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2018-05-07 18:21 UTC (permalink / raw)
  To: speck

> IOW, you know it's a mess and because you can't be bothered to clean it up,

> you add open coded crap to it, so others have more work to clean it up later.

It's not about being bothered, it's about keeping a clean and simple 
and self contained and straight forward patchkit that only does one thing
and can be easily understood without diving into all kinds of unrelated stuff.

If I add a numa balancing dependency the cognitive load for every reviewer
and later applier will go up by at least an order of magnitude,
with very little benefit.

I know doing patchkits like this is out of fashion, but I'm old school
in that regard.

It's only two users that could be replaced with pte_protnone() anyways,
you're being overly dramatic.

The third user is custom and couldn't be expressed with pte_protnone().

-Andi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/8] L1TFv3 1
  2018-05-07 18:21         ` [MODERATED] " Andi Kleen
@ 2018-05-07 20:03           ` Thomas Gleixner
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Gleixner @ 2018-05-07 20:03 UTC (permalink / raw)
  To: speck

On Mon, 7 May 2018, speck for Andi Kleen wrote:

> > IOW, you know it's a mess and because you can't be bothered to clean it up,
> 
> > you add open coded crap to it, so others have more work to clean it up later.
> 
> It's not about being bothered, it's about keeping a clean and simple 
> and self contained and straight forward patchkit that only does one thing
> and can be easily understood without diving into all kinds of unrelated stuff.

Sorry, that's not unrelated. If stuff needs a cleanup so that new things
can be added in a better way, then it is related.

> If I add a numa balancing dependency the cognitive load for every reviewer
> and later applier will go up by at least an order of magnitude,
> with very little benefit.

Who is being dramatic here? As long as the patches are self contained, clean
and simple it's not a burden at all. Definitely not by an order of magnitude.

You really don't have to tell me how to do code refactoring and I've never
heard a complaint from any reviewer or maintainer about an extra step or
two when the end result is cleaner.

Nor do you have to tell me about maintainer burden. I prefer a well done
patch series which has the extra one or two steps any time over stuff which
is duct taped on top of the existing mess.

> I know doing patchkits like this is out of fashion, but I'm old school
> in that regard.

It's about time that you arrive in the 21st century and figure out that
your beloved old school was closed long ago due to lack of pupils.

> It's only two users that could be replaced with pte_protnone() anyways,
> you're being overly dramatic.

I'm not dramatic, but I care about readable and maintainable code.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-05-07 20:03 UTC | newest]

Thread overview: 29+ messages
2018-05-04  3:23 [MODERATED] [PATCH 0/8] L1TFv3 4 Andi Kleen
2018-05-04  3:23 ` [MODERATED] [PATCH 1/8] L1TFv3 8 Andi Kleen
2018-05-04 13:42   ` [MODERATED] " Michal Hocko
2018-05-04 14:07     ` Andi Kleen
2018-05-04  3:23 ` [MODERATED] [PATCH 2/8] L1TFv3 7 Andi Kleen
2018-05-07 11:45   ` [MODERATED] " Vlastimil Babka
2018-05-04  3:23 ` [MODERATED] [PATCH 3/8] L1TFv3 1 Andi Kleen
2018-05-04  3:55   ` [MODERATED] " Linus Torvalds
2018-05-04 13:42   ` Michal Hocko
2018-05-07 12:38   ` Vlastimil Babka
2018-05-07 13:41     ` Andi Kleen
2018-05-07 18:01       ` Thomas Gleixner
2018-05-07 18:21         ` [MODERATED] " Andi Kleen
2018-05-07 20:03           ` Thomas Gleixner
2018-05-04  3:23 ` [MODERATED] [PATCH 4/8] L1TFv3 6 Andi Kleen
2018-05-04  3:23 ` [MODERATED] [PATCH 5/8] L1TFv3 2 Andi Kleen
2018-05-04  3:23 ` [MODERATED] [PATCH 6/8] L1TFv3 0 Andi Kleen
2018-05-04  3:23 ` [MODERATED] [PATCH 7/8] L1TFv3 5 Andi Kleen
2018-05-04 13:43   ` [MODERATED] " Michal Hocko
2018-05-04 14:11     ` Andi Kleen
2018-05-04 14:21       ` Michal Hocko
2018-05-04  3:23 ` [MODERATED] [PATCH 8/8] L1TFv3 3 Andi Kleen
2018-05-04 14:19   ` [MODERATED] " Andi Kleen
2018-05-04 14:34     ` Michal Hocko
2018-05-04 15:53       ` Andi Kleen
2018-05-04 16:26         ` Michal Hocko
2018-05-04 22:15   ` Dave Hansen
2018-05-05  3:55     ` Andi Kleen
2018-05-04  3:54 ` [MODERATED] Re: [PATCH 0/8] L1TFv3 4 Andi Kleen
