* [MODERATED] [PATCH 1/6] Patch 1
@ 2018-04-25  3:29 Andi Kleen
  2018-04-25 15:51 ` [MODERATED] " Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25  3:29 UTC (permalink / raw)
  To: speck; +Cc: Andi Kleen

Intel CPUs can speculatively reference the address in a page table
entry with the present bit clear. This can allow data in the L1 cache
to be accessed using a gadget similar to Spectre v2.

Linux has three cases where PTEs can be non-present:
- Empty page table entries (referencing the 4K page at phys address 0).
  This page doesn't contain interesting data, so it is not mitigated.
  The same applies to page table entries that are temporarily cleared
  to prevent races.
- The page is currently swapped out or being migrated/poisoned.
- The virtual address range is set to PROT_NONE using mprotect.

This patch addresses the second case. The page is swapped out and
the PTE has been replaced with a swap entry.  It could also
contain a migration or poison entry, which have the same format.

The swap file offset, interpreted as a physical address, could point to
real memory that happens to be in the L1 cache and would then be open to
this side channel.

Fill all bits from MAX_PA-1 upwards with ones. This forces the CPU to
reference an unpopulated but architecturally supported memory area,
which stops the speculation. In principle only the MAX_PA-1 bit is
needed, but filling all higher bits keeps the trick that swapon() uses
to determine the maximum swap file size from __swp_entry() working.
It's also slightly safer if a VM reports an incorrect MAX_PA to
a guest.
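
For illustration, and assuming a CPU that reports 46 physical address
bits, the mask set up below works out to

	__swp_stop_mask = -1ULL << (46 - 1);	/* 0xffffe00000000000 */

i.e. bits 45..63 of an unmapped PTE are forced to one.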

This limits the maximum size of swap files to 3.5TB.

The workaround is only possible on 64-bit and on 32-bit with PAE. On
non-PAE kernels it would require limiting memory to less than 2GB, which
is likely not practical. So systems without PAE remain vulnerable.

There are no user options to enable/disable the workaround because it
has no noticeable performance impact. However, it is automatically
disabled if the system has more than 46 physical address bits or
reports RDCL_NO.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/pgtable-3level.h |  5 ++--
 arch/x86/include/asm/pgtable.h        |  3 +++
 arch/x86/include/asm/pgtable_64.h     | 21 ++++++++++++----
 arch/x86/kernel/cpu/bugs.c            | 47 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/init.c                    |  3 +++
 5 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index f24df59c40b2..d247cdca105d 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -243,8 +243,9 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 /* Encode and de-code a swap entry */
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
 #define __swp_type(x)			(((x).val) & 0x1f)
-#define __swp_offset(x)			((x).val >> 5)
-#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5})
+#define __swp_offset(x)			(((x).val & ~__swp_stop_mask) >> 5)
+#define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5 |\
+					 __swp_stop_mask})
 #define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
 #define __swp_entry_to_pte(x)		((pte_t){ { .pte_high = (x).val } })
 
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5f49b4ff0c24..9f1280bb7e20 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -185,6 +185,8 @@ static inline int pte_special(pte_t pte)
 	return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
+extern u64 __swp_stop_mask;
+
 static inline unsigned long pte_pfn(pte_t pte)
 {
 	return (pte_val(pte) & PTE_PFN_MASK) >> PAGE_SHIFT;
@@ -635,6 +637,7 @@ static inline int is_new_memtype_allowed(u64 paddr, unsigned long size,
 
 pmd_t *populate_extra_pmd(unsigned long vaddr);
 pte_t *populate_extra_pte(unsigned long vaddr);
+
 #endif	/* __ASSEMBLY__ */
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 877bc27718ae..76ed3ef49f53 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -271,9 +271,17 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 /*
  * Encode and de-code a swap entry
  *
- * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
- * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * |     ...					   | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
+ * |     ...					   |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
+ * |MAXPA+2->63|SAVED12|1| OFFSET (14->MAXPA-2) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ *
+ * MAXPA is the highest PA bit reported by CPUID
+ * We set a 1 stop bit at the highest MAXPA bit to prevent speculation.
+ * Also PS(bit 7) must be always 0.
+ *
+ * SAVED12 is a copy of the original value of the MAXPA-1 stop bit
+ * and a marker bit that the saved copy contains valid data.
+ * The bits above are filled with ones.
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -296,10 +304,13 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 
 #define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
 					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
+
+#define __swp_offset(x)			(((x).val & ~__swp_stop_mask) \
+					  >> SWP_OFFSET_FIRST_BIT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
+					 | ((offset) << SWP_OFFSET_FIRST_BIT) \
+					 | __swp_stop_mask})
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0a1f319c69fe..6aaee4ce8842 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -29,6 +29,7 @@
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
 static void __init spec_ctrl_save_msr(void);
+static void __init l1tf_mitigation(struct cpuinfo_x86 *c);
 
 void __init check_bugs(void)
 {
@@ -85,6 +86,8 @@ void __init check_bugs(void)
 	if (!direct_gbpages)
 		set_memory_4k((unsigned long)__va(0), 1);
 #endif
+
+	l1tf_mitigation(&boot_cpu_data);
 }
 
 /* The kernel command line selection */
@@ -569,3 +572,47 @@ ssize_t cpu_show_spec_store_bypass(struct device *dev, struct device_attribute *
 	return cpu_show_common(dev, attr, buf, X86_BUG_SPEC_STORE_BYPASS);
 }
 #endif
+
+/*
+ * Note there is not a lot of motivation to disable the L1TF
+ * workaround, as it is very cheap. But there are a few
+ * corner cases where it can be disabled, so disable
+ * it also when not needed.
+ */
+static bool cpu_needs_l1tf(struct cpuinfo_x86 *c)
+{
+	u64 ia32_cap = 0;
+
+	if (cpu_has(c, X86_FEATURE_ARCH_CAPABILITIES))
+		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, ia32_cap);
+
+	if (ia32_cap & ARCH_CAP_RDCL_NO)
+		return false;
+
+	if (c->x86_phys_bits > 46)
+		return false;
+
+	/* Add a check for MKTME here */
+
+	return true;
+}
+
+/*
+ * Workaround for L1 terminal fault speculation
+ * (CVE-2018-3620)
+ *
+ * For unmapped PTEs set all bits from MAX_PA-1 to top to stop
+ * speculation
+ *
+ * We only really need the MAX_PA-1 bit to address the L1
+ * terminal fault, but if we set all above too the swap file
+ * size check in swapon() limits the swap size correctly.
+ *
+ * Note this overwrites NX, which may need to be restored
+ * later.
+ */
+static __init void l1tf_mitigation(struct cpuinfo_x86 *c)
+{
+	if (cpu_needs_l1tf(c))
+		__swp_stop_mask = (-1ULL) << (c->x86_phys_bits - 1);
+}
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fec82b577c18..e4a10bbdc53a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -71,6 +71,9 @@ static unsigned long __initdata pgt_buf_start;
 static unsigned long __initdata pgt_buf_end;
 static unsigned long __initdata pgt_buf_top;
 
+u64 __swp_stop_mask __read_mostly;
+EXPORT_SYMBOL(__swp_stop_mask);
+
 static unsigned long min_pfn_mapped;
 
 static bool __initdata can_use_brk_pgt = true;
-- 
2.15.0

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25  3:29 [MODERATED] [PATCH 1/6] Patch 1 Andi Kleen
@ 2018-04-25 15:51 ` Linus Torvalds
  2018-04-25 16:06   ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 15:51 UTC (permalink / raw)
  To: speck



On Tue, 24 Apr 2018, speck for Andi Kleen wrote:
> 
> This patch addresses the second case. The page is swapped out and
> the PTE has been replaced with a swap entry.  It could also
> contain a migration or poison entry, which have the same format.

NAK NAK NAK.

Why is this doing the idiotic __swp_stop_mask, when I told multiple people 
not to do that, and when I've already seen the patch (from Michal Hocko?) 
that did the much simpler approach of unconditionally just inverting all 
the 'offset' bits.

So all you do is add a single bit-flip in the offset encoding/decoding:

  #define __swp_offset(x) (~(x).val >> SWP_OFFSET_FIRST_BIT)
                          ^^^

and

  #define __swp_entry(type, offset) ((swp_entry_t) { \
    ..
                 | (~(offset) << SWP_OFFSET_FIRST_BIT) })
                   ^^^


or something very close to that. None of this garbage "different values 
for different uarchitectures and cases"

I have already seen the correct patch on this list, why is this stupid 
garbage patch still floating around? Especially since I already mentioned 
the unconditional approach at the meeting at Intel originally.

There is absolutely zero reason to do anything more complicated and 
fragile afaik.

                Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 15:51 ` [MODERATED] " Linus Torvalds
@ 2018-04-25 16:06   ` Andi Kleen
  2018-04-25 17:25     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 16:06 UTC (permalink / raw)
  To: speck

On Wed, Apr 25, 2018 at 08:51:09AM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Tue, 24 Apr 2018, speck for Andi Kleen wrote:
> > 
> > This patch addresses the second case. The page is swapped out and
> > the PTE has been replaced with a swap entry.  It could also
> > contain a migration or poison entry, which have the same format.
> 
> NAK NAK NAK.
> 
> Why is this doing the idiotic __swp_stop_mask, when I told multiple people 
> not to do that, and when I've already seen the patch (from Michal Hocko?) 
> that did the much simpler approach of unconditionally just inverting all 
> the 'offset' bits.

We looked at it, but your invert approach is broken if there is any MMIO
space between MAX_PA/2 ... MAX_PA that is ever mapped to ring 3. 

And we cannot rule that out.

In this case the inverted bit would start pointing to valid memory, 
so everything would become attackable.

So yes the more complicated patches are needed.

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 16:06   ` Andi Kleen
@ 2018-04-25 17:25     ` Linus Torvalds
  2018-04-25 17:36       ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 17:25 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> We looked at it, but your invert approach is broken if there is any MMIO
> space between MAX_PA/2 ... MAX_PA that is ever mapped to ring 3. 
> 
> And we cannot rule that out.

What?

That's complete garbage. MMIO space is irrelevant, since it's not even in 
the cache. 

And even if some crazy platform does make it cacheable (I assume people 
are thinking nvdimms or something), your patch is no better, since it has 
the exact same issue. It sets that EXACT SAME __swp_stop_mask bit, at 
MAX_PA/2.

Christ. Stop this idiocy.

                Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 17:25     ` Linus Torvalds
@ 2018-04-25 17:36       ` Andi Kleen
  2018-04-25 18:00         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 17:36 UTC (permalink / raw)
  To: speck

On Wed, Apr 25, 2018 at 10:25:59AM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > 
> > We looked at it, but your invert approach is broken if there is any MMIO
> > space between MAX_PA/2 ... MAX_PA that is ever mapped to ring 3. 
> > 
> > And we cannot rule that out.
> 
> What?
> 
> That's complete garbage. MMIO space is irrelevant, since it's not even in 
> the cache. 

If the MAX_PA-1 bit is inverted the address is not pointing to MMIO space anymore,
but likely to some real cached memory.

Let's say you have MMIO at (1ULL<<45) + 10MB. You invert the bits and the PA
points to phys 10MB
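
In hex, assuming MAX_PA = 46 so the top physical bit is bit 45:

	(1ULL << 45) + (10 << 20)	/* = 0x2000_00a0_0000, the MMIO PA */
	flip bit 45			/* = 0x0000_00a0_0000, 10MB of RAM */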

BTW there are ways to make it work, but it would likely require
forbidding PROT_NONE on MMIO space. If you're ok with a change like
this, which could potentially break existing applications, it's
possible. But it does carry that risk.

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 17:36       ` Andi Kleen
@ 2018-04-25 18:00         ` Linus Torvalds
  2018-04-25 18:11           ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 18:00 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> If the MAX_PA-1 bit is inverted the address is not pointing to MMIO space anymore,
> but likely to some real cached memory.

What the hell are you blathering about?

This is all only about the SWAP ENTRY. 

> Let's say you have MMIO at (1ULL<<45) + 10MB. You invert the bits and the PA
> points to phys 10MB

No.

The swap entry has absolutely NOTHING to do with any MMIO physical 
address. We do not touch those AT ALL.

The only thing it affects is the "offset" of a swap entry. And honestly, 
if you have offsets with the high bits set, you're already broken, since 
it's not even guaranteed to fit in the architecture-specific swap entry.

So the swap offset will have the high bits clear, and we'll invert them 
when creating the arch-specific entry, and thus the PTE will have the high 
bits set.
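
To make that concrete (ignoring the type bits, and assuming
SWP_OFFSET_FIRST_BIT is 14): an offset of 5 gets stored as

	~5ul << 14	/* == 0xfffffffffffe8000 */

so the high physical address bits all end up set, pointing well above
any real memory, and ~(x).val >> 14 gives the 5 back when decoding.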

And maybe architecturally x86 doesn't _require_ that to be MMIO, but the 
PC platform sure as hell does in practice. So it won't be cached now. 
Unless you have some really really odd nvdimm setup or similar, at which 
point YOUR UNNECESSARILY COMPLEX PATCH HAS THE SAME EXACT ISSUE.

              Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 18:00         ` Linus Torvalds
@ 2018-04-25 18:11           ` Andi Kleen
  2018-04-25 18:26             ` Thomas Gleixner
  2018-04-25 18:30             ` [MODERATED] " Linus Torvalds
  0 siblings, 2 replies; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 18:11 UTC (permalink / raw)
  To: speck

On Wed, Apr 25, 2018 at 11:00:57AM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > 
> > If the MAX_PA-1 bit is inverted the address is not pointing to MMIO space anymore,
> > but likely to some real cached memory.
> 
> What the hell are you blathering about?
> 
> This is all only about the SWAP ENTRY. 

Ok. My patchkit handles both mprotect and swap entries. You're right 
we don't need it for swap entries and could use the inversion
there.

I was talking about mprotect though.

We need to handle mprotect (and potentially also non-lazy-fault mmap
PROT_NONE), because if we're in a guest, setting something to PROT_NONE
allows EPT to be bypassed temporarily, so you could suddenly see the
values of some other guest pages.

What's your opinion on that case?

-Andi

* Re: [PATCH 1/6] Patch 1
  2018-04-25 18:11           ` Andi Kleen
@ 2018-04-25 18:26             ` Thomas Gleixner
  2018-04-25 18:30             ` [MODERATED] " Linus Torvalds
  1 sibling, 0 replies; 21+ messages in thread
From: Thomas Gleixner @ 2018-04-25 18:26 UTC (permalink / raw)
  To: speck

On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> On Wed, Apr 25, 2018 at 11:00:57AM -0700, speck for Linus Torvalds wrote:
> > On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > > 
> > > If the MAX_PA-1 bit is inverted the address is not pointing to MMIO space anymore,
> > > but likely to some real cached memory.
> > 
> > What the hell are you blathering about?
> > 
> > This is all only about the SWAP ENTRY. 
> 
> Ok. My patchkit handles both mprotect and swap entries. You're right 
> we don't need it for swap entries and could use the inversion
> there.
> 
> I was talking about mprotect though.
> 
> We need to handle mprotect (and potentially also non lazy fault mmap
> PROT_NONE) because if we're in a guest setting something to PROT_NONE
> allows to bypass EPT temporarily, so you could suddenly see the values of
> some other guest pages.
>
> What's your opinion on that case?

The guest case is helpless anyway because rogue guests can set their PTEs
to whatever they want. You need to fix that at the host level, i.e. no HT
and L1D flush on vmenter.

Thanks,

	tglx

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 18:11           ` Andi Kleen
  2018-04-25 18:26             ` Thomas Gleixner
@ 2018-04-25 18:30             ` Linus Torvalds
  2018-04-25 18:51               ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 18:30 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> Ok. My patchkit handles both mprotect and swap entries. You're right 
> we don't need it for swap entries and could use the inversion
> there.
> 
> I was talking about mprotect though.

So I really want that to be handled separately. 

I also am not convinced that it *can* be handled. If a user has 
permissions to mprotect high MMIO, then we're simply fundamentally out of 
bits that the hardware cares about!

So please handle the swap case separately, because the swap case is easy 
and obvious, and has no downsides.

The mprotect case is *completely* different.

In particular, for mprotect, we have several different situations:

 - on native hardware we simply don't care. Whoever has the mmap already 
   could just access it without that PROT_NONE.

 - in a virtual environment, the host simply needs to flush its caches 
   before entering vmx mode, and now there is nothing sensitive for the 
   guest to access. All it can read is its own caches.

So the mprotect case is simply fundamentally not very interesting for 99% 
of all people. Somebody like Amazon doesn't care about leaking data 
_inside_ the VM - plus it's hard to do anyway because the guest needs to 
also try to figure out the virtual mapping and has only physically mapped 
cached data to go by.

Now, can we make it *harder* for a guest to get to that cached data? Yes 
we can. But honestly, if you need root access (inside the guest) to then 
use PROT_NONE on some MMIO mapping to access some particular physical 
addresses that *may* be cached (in the guest), then what the hell is the 
leak? You already control the guest at that point.

So the PROT_NONE case just doesn't seem to be a security issue. It's a 
security issue *RIGHT NOW* because of the lack of cache flush at VM entry, 
but that's a separate issue. Once the cache flush is there, the only thing 
that leaks is guest data anyway, and it only leaks to the already trusted 
user in the guest.

Now, that said, there are other things we can do to just make it harder 
to mis-use this. In particular, we could fairly easily say:

 - PROT_NONE on a shared mapping means that we just flush and unmap all 
   the pages *entirely* (we leave the vma alone, and just rely on faulting 
   them back in after you set it back to something that isn't PROT_NONE)

Does that perhaps need some extra work? Yeah, maybe. But it would be a 
clean solution. And notice that it's not really a big security issue due 
to the above, but maybe it would make it more palatable to then say "ok, 
we won't flush the L1 cache on VM entry, because we trust the guest OS to 
have special PROT_NONE logic".

That might be a big deal to Amazon, for example. That's assuming Amazon 
controls the guest OS in the first place? I don't know their setup.

And at no point do we need that special bit in the PROT_NONE mapping.

See?

(And yes, I might be missing some detail, but I _really_ hate that 
"special bit" patch. It fundamentally should not be needed).

                  Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 18:30             ` [MODERATED] " Linus Torvalds
@ 2018-04-25 18:51               ` Andi Kleen
  2018-04-25 20:15                 ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 18:51 UTC (permalink / raw)
  To: speck

>  - in a virtual environment, the host simply needs to flush its caches 
>    before entering vmx mode, and now there is nothing sensitive for the 
>    guest to access. All it can read is its own caches.

This doesn't help unfortunately because it could also leak data inside
the guest. If skipping EPT causes the PA to point to some other page
inside the same guest you can leak that data. And that other page might
be owned by the kernel or by some other process.

If a guest uses most of the memory in the system, that's quite likely.

There was also another case, brought up by Google ChromeOS, where they
control/trust the guest kernel and want to rely on it not accessing
data outside the current guest, so that flush mitigations in the VMM are
not needed.
That one is a bit more dubious, but I guess it's also not completely broken.

> Now, that said, we can have other things we can do to just make it harder 
> to mis-use this. In particular, we could fairly easily say:
> 
>  - PROT_NONE on a shared mapping means that we just flush and unmap all 
>    the pages *entirely* (we leave the vma alone, and just rely on faulting 
>    them back in after you set it back to something that isn't PROT_NONE)

It doesn't need to be a shared mapping; this can happen with any mapping.

> 
> (And yes, I might be missing some detail, but I _really_ hate that 
> "special bit" patch. It fundamentally should not be needed).

I don't think anyone likes it; we just couldn't find better solutions.

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 18:51               ` Andi Kleen
@ 2018-04-25 20:15                 ` Linus Torvalds
  2018-04-25 21:19                   ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 20:15 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> This doesn't help unfortunately because it could also leak data inside
> the guest. If skipping EPT causes the PA to point to some other page
> inside the same guest you can leak that data. And that other page might
> be owned by the kernel or by some other process.

Do you even read what I write?

THAT'S EXACTLY WHAT I TALKED ABOUT IN THE REST OF THE EMAIL.

> There was also another case, brought up by Google ChromeOS, where they
> control/trust the guest kernel and want to rely on it from accessing
> data outside the current guest, so not needing flush mitigations in the VMM. 
> That one is a bit more dubious, but I guess it's also not completely broken.

And I mentioned this exact case too. But pointed out that there is NO WAY 
IN HELL that your patch will fix it either but that there are other 
possible alternatives to mitigate things if you trust the guest OS.

Of course, if you trust the guest, then why the hell are you even doing 
virtualization in the first place? 

                Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 20:15                 ` Linus Torvalds
@ 2018-04-25 21:19                   ` Andi Kleen
  2018-04-25 22:35                     ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 21:19 UTC (permalink / raw)
  To: speck

On Wed, Apr 25, 2018 at 01:15:35PM -0700, speck for Linus Torvalds wrote:
> On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > 
> > This doesn't help unfortunately because it could also leak data inside
> > the guest. If skipping EPT causes the PA to point to some other page
> > inside the same guest you can leak that data. And that other page might
> > be owned by the kernel or by some other process.
> 
> Do you even read what I write?
> 
> THAT'S EXACTLY WHAT I TALKED ABOUT IN THE REST OF THE EMAIL.

Ok so I reread what you wrote. I think you're saying it's not a problem
because it's too hard to know what the other page is?

I can think of various ways around this:

Assume the attacker process owns most of the memory and
the attacked process is very small. It fills its own memory
with a known pattern. Then it checks against that pattern.
If it's not the pattern, it's someone else's.

Or it does the mprotect attack on a lot of different pages
that it cycles through, and tries on each of them, looking
for some known pattern?

Or the attacker uses memory pressure to force another 
process to cycle through a lot of memory, and always
retries in between until it sees some known pattern.

Considering all these cases, do you still say that mprotect does
not need to be mitigated?

It would seem very risky to me.

BTW people like Amazon actually care a lot about
security inside their guests.

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 21:19                   ` Andi Kleen
@ 2018-04-25 22:35                     ` Linus Torvalds
  2018-04-25 23:12                       ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 22:35 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> Ok so I reread what you wrote. I think you're saying it's not a problem
> because it's too hard to know what the other page is?

I do think that's one big issue.

But no, the real issue is that 

 (a) somebody who runs virtual guests can't generally trust them *anyway*, 
so anything we do is pointless for that case. The host is protected, and 
the guest protections are separate.

 (b) if we want to protect guests against leaks, we need other models than 
the one you had anyway.

> Considering all these cases, do you still say that mprotect does
> not need to be mitigated?

I'm saying that your patch is *not* the way to mitigate it anyway. 

The mprotect mitigation has absolutely *nothing* to do with the swap entry 
mitigation, and we can - and should - do it not just separately but 
entirely differently.

So first off, split that up, and do the TRIVIAL TWO-LINER that was already 
posted (by Hocko?) for swap entry mitigation. Nothing else. 

Then, start looking at PROT_NONE. There are two cases:

 - actual real file mappings

   We can use the same trick for PROT_NONE as we did for swap entries: 
   just invert all the high bits. None of this "how many bits do I have".

 - the /dev/mem kind of mappings (not necessarily through /dev/mem, they 
   may well be created via remap_pfn_range() in other ways)

   These are always shared. It's not entirely clear that anybody even uses 
   PROT_NONE on them. We may be able to just disallow PROT_NONE entirely, 
   or we can just zap the mapping instead of trying to "save" it, since 
   there really isn't anything to save.

See? 

I absolutely detest the "let's pick one bit and treat it specially". It's 
fragile garbage, and it generates bad code to boot. We likely have much 
better options available to us.

               Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 22:35                     ` Linus Torvalds
@ 2018-04-25 23:12                       ` Andi Kleen
  2018-04-25 23:21                         ` Linus Torvalds
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 23:12 UTC (permalink / raw)
  To: speck

> So first off, split that up, and do the TRIVIAL TWO-LINER that was already 
> posted (by Hocko?) for swap entry mitigation. Nothing else. 

I haven't seen Hocko's patch, but I assume it just inverts the complete 
swap entry except P while it is stored.

Two more corner cases:

- How about the case when you have more than MAX_PA/2 physical
memory? For this case we would need a flag to disable the
invert (and clear the reporting), otherwise it will be attackable.

Frankly this is fairly unlikely, because systems usually have far more
MAX_PA than they can populate with DIMMs, except possibly for SGI UV.
It would still be better to handle that case.

- And how about someone adding a swap file that is larger than
MAX_PA/2? This would also be attackable with the invert approach.
I can re-add a check to swapon() for that.

My patch kit also made sure that swapon() rejects a file that is too
large, otherwise it would be attackable even when inverted.
This also needs to be re-added, which will require checking phys_bits
again. Is that ok with you?

> 
> Then, start looking at PROT_NONE. There are two cases:
> 
>  - actual real file mappings
> 
>    We can use the same trick for PROT_NONE as we did for swap entries: 
>    just invert all the high bits. None of this "how many bits do I have".

Ok.

> 
>  - the /dev/mem kind of mappings (not necessarily through /dev/mem, they 
>    may well be remap_pfn_range() other ways
> 
>    These are always shared. It's not entirely clear that anybody even uses 
>    PROT_NONE on them. We may be able to just disallow PROT_NONE entirely, 
>    or we can just zap the mapping instead of trying to "save" it, since 
>    there really isn't anythign to save.

Ok. So disallow PROT_NONE until someone complains.

I assume we want to disallow PROT_NONE for any VM_PFNMAP vma. 

I don't think it makes sense for any other shared mapping?
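
Something like this in the mprotect path, purely as a sketch (where
exactly to hook it is still open, and the check is untested):

	/* reject PROT_NONE on pfn-remapped vmas */
	if ((vma->vm_flags & VM_PFNMAP) && prot == PROT_NONE)
		return -EACCES;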

BTW there's also the case of a device driver that does not use lazy
faulting but calls remap_pfn_range() in its ->mmap function, where the
mmap is done with PROT_NONE and it remaps memory, not MMIO. I still
need to audit the tree to see if that can happen. It may be safer to
forbid that case somewhere in the VM as well.

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 23:12                       ` Andi Kleen
@ 2018-04-25 23:21                         ` Linus Torvalds
  2018-04-25 23:39                           ` Andi Kleen
  2018-04-26 13:59                           ` Michal Hocko
  0 siblings, 2 replies; 21+ messages in thread
From: Linus Torvalds @ 2018-04-25 23:21 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> I haven't see Hocko's patch, but I assume it just inverts the complete 
> swap entry except P while it is stored.

No, it only inverted the offset. That's the easiest thing.

> Two more corner cases:
> 
> - How about the case when you have more than MAX_PA/2 physical
> memory? For this case we would need a flag to disable the
> invert (and clear the reporting), otherwise it will be attackable.

No. In some random theoretical situation that doesn't matter, maybe. 

But it's completely not attackable, because you need to swap out literally 
terabytes of memory, and you have no control over what the allocations 
will even be.

So forget about it. We're not going to add any complexity over an attack 
that is not realistic.

> >    These are always shared. It's not entirely clear that anybody even uses 
> >    PROT_NONE on them. We may be able to just disallow PROT_NONE entirely, 
> >    or we can just zap the mapping instead of trying to "save" it, since 
> >    there really isn't anythign to save.
> 
> Ok. So disallow PROT_NONE until someone complains.

It's an option. I actually would prefer the "just zap the mapping and 
populate it again" model, but the "just disallow" might be simpler and 
might be acceptable. But it _does_ have the potential of people finding it 
a regression. Maybe people play games with PROT_NONE. User space often 
does really really odd things.
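
The "zap" variant would be roughly (sketch only, the exact spot in the
mprotect path is to be worked out):

	/* PROT_NONE on a pfn mapping: drop the ptes, refault them later */
	if (vma->vm_flags & VM_PFNMAP)
		zap_page_range(vma, start, end - start);

relying on faults to repopulate things once the protection is changed
back to something accessible, as above.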

                Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 23:21                         ` Linus Torvalds
@ 2018-04-25 23:39                           ` Andi Kleen
  2018-04-26  3:22                             ` Linus Torvalds
  2018-04-26 13:59                           ` Michal Hocko
  1 sibling, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2018-04-25 23:39 UTC (permalink / raw)
  To: speck

On Wed, Apr 25, 2018 at 04:21:24PM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > 
> > I haven't see Hocko's patch, but I assume it just inverts the complete 
> > swap entry except P while it is stored.
> 
> No, it only inverted offset. That's the easiest thing.
> 
> > Two more corner cases:
> > 
> > - How about the case when you have more than MAX_PA/2 physical
> > memory? For this case we would need a flag to disable the
> > invert (and clear the reporting), otherwise it will be attackable.
> 
> No. In some random theoretical situation that doesn't matter, maybe. 
> 
> But it's completely not attackable, because you need to swap out literally 
> terabytes of memory, and you have no control over what the allocations 
> will even be.

For the >MAX_PA/2 case it would be any swapout, right?
Because an inverted offset still points to valid memory then.

We could actually not have any check for this in the VM code
(because it's vulnerable anyway), but only invert the BUG bit so that
sysfs reports the vulnerability. 

That would be localized purely in the initialization code.
Would that be ok?
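
Roughly, and assuming we add an X86_BUG_L1TF flag for the sysfs
reporting (the name is made up, this is just a sketch):

	/* more RAM than MAX_PA/2: the stop bit / invert trick can't cover it */
	if (max_pfn > (1UL << (boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT)))
		setup_force_cpu_bug(X86_BUG_L1TF);
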
> 
> So forget about it. We're not going to add any complexity over an attack 
> that is not realistic.

That's for the >3.5TB swapfile case? So you don't want the
check in swapon, correct?
> 
> > >    These are always shared. It's not entirely clear that anybody even uses 
> > >    PROT_NONE on them. We may be able to just disallow PROT_NONE entirely, 
> > >    or we can just zap the mapping instead of trying to "save" it, since 
> > >    there really isn't anythign to save.
> > 
> > Ok. So disallow PROT_NONE until someone complains.
> 
> It's an option. I actually would prefer the "just zap the mapping and 
> populate it again" model, but the "just disallow" might be simpler and 

Ok. I will see if that's easily implementable. Otherwise forbid PROT_NONE.

> might be acceptable. But it _does_ have the potential of people finding it 
> a regression. Maybe people play games with PROT_NONE. User space often 
> does really really odd things.

Thanks,

-Andi

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 23:39                           ` Andi Kleen
@ 2018-04-26  3:22                             ` Linus Torvalds
  2018-04-26  3:39                               ` Jon Masters
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-26  3:22 UTC (permalink / raw)
  To: speck



On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> 
> For the >MAX_PA/2 case it would be any swapout, right?

It would have to be a pretty damn big offset, at the very least. You hit 
it when you hit swap offset MAX_PA >> SWP_OFFSET_FIRST_BIT.

I think SWP_OFFSET_FIRST_BIT is 14. It could be smaller, but sadly we 
can't use the Dirty/Accessed bit in not-present entries because of other 
errata, so we avoid some of the low bits that would otherwise be useful.

So if we have Xeon with 46 bits of physical addressing, we're talking 
about hitting it when you hit an offset on the order of (1 << 30). Even 
with desktop chips (what, 40 bits physical?), you have to have offsets 
on the order of (1 << 25).

So (plus 12 bits for the page size) your swap file has to be 1 << 37 bytes 
in size, so you have to swap out 128 GB of data before you can even hit 
that last bit.

Maybe I am off by one or two orders-of-2 or so, but it should be in the 
ballpark. You literally have to page out tens or hundreds of gigabytes of 
memory to hit an interesting swap entry, and then as an attacker you won't 
even have control over how it's done, so you are going to have to really 
work at it to _use_ those things. 
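
Back of the envelope for the desktop case, assuming 40 physical bits,
SWP_OFFSET_FIRST_BIT == 14 and 4K pages:

	offset needed = 1UL << (40 - 1 - 14)	/* top PA bit of the pte goes clear */
	swap in use   = offset << 12		/* 1 << 37 bytes = 128 GB */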

And that's on a _desktop_ chip. On a Xeon, the physical address space is 
what - 44 bits? 48 bits? ARK is being singularly unhelpful here. Then 
we're talking terabytes of swap space. 

Note: not terabytes allocated. Terabytes _used_.

But yes, we could perhaps limit swap space size or something. Or maybe 
just warn people.

And if we really care (I obviously don't think we should), then we could 
just move the *type* to the high bits of the page table entry, so we'd start 
the offset bits much earlier (at bit 9). So now you'd get 5 extra bits 
before you even hit that MAX_PA case, so on a desktop chip you'd already 
hit the "you have to have a terabyte of swap in use to even get there".

That sounds really trivial to do, in fact. So another 2 lines of code or 
so (and the code generation shouldn't really change - it's just switching 
how you shift the "offset" vs the "type" bits around).

I don't know. Maybe I'm missing something. But it *already* sounds 
impossible to use in practice,  and I pretty much guarantee that if you 
need terabytes of swap space in use on a desktop (and another 6-8 bits of 
physical addressing on the Xeons?) there is absolutely no way people will 
have swap offsets big enough to ever hit the MAX_PA/2 bit.

             Linus

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-26  3:22                             ` Linus Torvalds
@ 2018-04-26  3:39                               ` Jon Masters
  0 siblings, 0 replies; 21+ messages in thread
From: Jon Masters @ 2018-04-26  3:39 UTC (permalink / raw)
  To: speck

On 04/25/2018 11:22 PM, speck for Linus Torvalds wrote:

> I don't know. Maybe I'm missing something. But it *already* sounds 
> impossible to use in practice,  and I pretty much guarantee that if you 
> need terabytes of swap space in use on a desktop (and another 6-8 bits of 
> physical addressing on the Xeons?) there is absolutely no way people will 
> have swap offsets big enough to ever hit the MAX_PA/2 bit.

A bunch of us (including Amazon) discussed this limitation (MAX_PA/2)
and it seemed somewhat academic that you'd be able to pull off an attack
at that point reliably. There doesn't seem to be any downside to adding
a vulnerable warning in sysfs in that case, which was also suggested.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop


* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-25 23:21                         ` Linus Torvalds
  2018-04-25 23:39                           ` Andi Kleen
@ 2018-04-26 13:59                           ` Michal Hocko
  2018-04-26 17:14                             ` Linus Torvalds
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Hocko @ 2018-04-26 13:59 UTC (permalink / raw)
  To: speck

[Sorry for being late in this discussion. I was conferencing and still
on the way back at the airport.]

On Wed 25-04-18 16:21:24, speck for Linus Torvalds wrote:
> 
> 
> On Wed, 25 Apr 2018, speck for Andi Kleen wrote:
> > 
> > I haven't see Hocko's patch, but I assume it just inverts the complete 
> > swap entry except P while it is stored.
> 
> No, it only inverted offset. That's the easiest thing.

Here is the patch for your reference. I posted it to this mailing
list earlier. I will try to wrap my head around the mprotect part and
help review whatever you come up with.
---
From 7b03455455e1152988b2a295a917c0641f531fb0 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Tue, 10 Apr 2018 14:10:42 +0200
Subject: [PATCH] mm, swap, x86: make sure high bits of the swap offset are set

Intel platforms have a bug where L1 cache contents can speculatively
be used to load the content referenced by !present entries. This allows
certain side channel attacks. We have several different classes of
!present pages. Unmapped memory clears the whole pte, so it is a
non-issue. mprotect and NUMA hints refer to an existing pfn which cannot
be tweaked by an attacker into a different privilege domain. So we are
left with swap entries, which encode the swap offset and might therefore
conflict with an existing pfn. Obfuscate those entries by inverting the
bits in the swap offset, which sets all the high bits and _should_ stop
the speculation, as the result then refers to the maximum addressable
memory on all Intel platforms.

Well, this doesn't solve the problem for very large offsets (1<<30 on
uarchs with 44b addressing), but those should be out of any practical
attack space.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/include/asm/pgtable_64.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 1149d2112b2e..213c15b2e168 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -299,10 +299,10 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
 
 #define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
 					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
+#define __swp_offset(x)			(~(x).val >> SWP_OFFSET_FIRST_BIT)
 #define __swp_entry(type, offset)	((swp_entry_t) { \
 					 ((type) << (SWP_TYPE_FIRST_BIT)) \
-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
+					 | (~(offset) << SWP_OFFSET_FIRST_BIT) })
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })
-- 
2.16.3

-- 
Michal Hocko
SUSE Labs

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-26 13:59                           ` Michal Hocko
@ 2018-04-26 17:14                             ` Linus Torvalds
  2018-04-27  0:05                               ` Andi Kleen
  0 siblings, 1 reply; 21+ messages in thread
From: Linus Torvalds @ 2018-04-26 17:14 UTC (permalink / raw)
  To: speck



On Thu, 26 Apr 2018, speck for Michal Hocko wrote:
> 
> Here is the patch for your reference. 

So here's a _slightly_ larger patch that switches the order of "type" and 
"offset" in the x86-64 encoding, in addition to doing the binary 'not' on 
the offset.

That means that now the offset is bits 9-58 in the page table, and that 
the offset is in the bits that hardware generally doesn't care about.

That, in turn, means that if you have a desktop chip with only 40 bits of 
physical addressing, now that the offset starts at bit 9, you still have 
to have 30 bits of offset actually *in use* until bit 39 ends up being 
clear.

So that's 4 terabytes of swap space (because the offset is counted in 
pages, so 30 bits of offset is 42 bits of actual coverage). With bigger 
physical addressing, that obviously grows further, until you hit the limit 
of the offset (at 50 bits of offset - 62 bits of actual swap file 
coverage).

NOTE NOTE NOTE! This all built for me, but maybe I got the shifting wrong 
and maybe it's completely broken due to some silly mistake on my part. 
Think of it as an RFC patch that needs testing and looking at. The changes 
to the comment are probably the most important part.

Anybody willing to test and/or double-check my math/logic?

                    Linus
---

 arch/x86/include/asm/pgtable_64.h | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 877bc27718ae..3e4584ba5231 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) |  OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -286,20 +286,30 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
  *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
+ *
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
-#define SWP_TYPE_BITS 5
-/* Place the offset above the type: */
-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
+#define SWP_TYPE_BITS		5
+
+#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)
+
+/* We always extract/encode the offset by shifting it all the way up, and then down again */
+#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)			(((x).val >> (SWP_TYPE_FIRST_BIT)) \
-					 & ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)			((x).val >> SWP_OFFSET_FIRST_BIT)
-#define __swp_entry(type, offset)	((swp_entry_t) { \
-					 ((type) << (SWP_TYPE_FIRST_BIT)) \
-					 | ((offset) << SWP_OFFSET_FIRST_BIT) })
+/* Extract the high bits for type */
+#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
+
+/* Shift up (to get rid of type), then down to get value */
+#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+
+/* Shift the offset up "too far" by TYPE bits, then down again */
+#define __swp_entry(type, offset) ((swp_entry_t) { \
+	(~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+	| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
+
 #define __pte_to_swp_entry(pte)		((swp_entry_t) { pte_val((pte)) })
 #define __pmd_to_swp_entry(pmd)		((swp_entry_t) { pmd_val((pmd)) })
 #define __swp_entry_to_pte(x)		((pte_t) { .pte = (x).val })

* [MODERATED] Re: [PATCH 1/6] Patch 1
  2018-04-26 17:14                             ` Linus Torvalds
@ 2018-04-27  0:05                               ` Andi Kleen
  0 siblings, 0 replies; 21+ messages in thread
From: Andi Kleen @ 2018-04-27  0:05 UTC (permalink / raw)
  To: speck

> That, in turn, means that if you have a desktop chip with only 40 bits of 
> physical addressing, now that the offset starts at bit 9, you still have 
> to have 30 bits of offset actually *in use* until bit 39 ends up being 
> clear.

Here are the cases for modern Intel CPUs:

    CPU                                             MAX_PA  MAX_PA/2
    Nehalem Client                                  39      >=0.25TB
    Nehalem Server                                  44      >=8TB
    SandyBridge/IvyBridge/Haswell/Broadwell/Skylake 46      >=32TB

In general anything newer has 46 bits.

> Anybody willing to test and/or double-check my math/logic?

I did some tests and it works and seems to do the right thing. I'll
use this patch in the next version of the patchkit. Thanks.

-Andi
