From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from mail.linutronix.de (146.0.238.70:993) by
 crypto-ml.lab.linutronix.de with IMAP4-SSL for ; 26 Jun 2018 12:01:28 -0000
Received: from mx2.suse.de ([195.135.220.15]) by Galois.linutronix.de with
 esmtps (TLS1.0:DHE_RSA_CAMELLIA_256_CBC_SHA1:256) (Exim 4.80)
 (envelope-from ) id 1fXmf0-0008Dy-JQ for speck@linutronix.de;
 Tue, 26 Jun 2018 14:01:27 +0200
Received: from relay1.suse.de (charybdis-ext-too.suse.de [195.135.220.254])
 by mx2.suse.de (Postfix) with ESMTP id 89D68AF1F for ;
 Tue, 26 Jun 2018 12:01:18 +0000 (UTC)
Subject: [MODERATED] Re: [PATCH 8/8] L1TFv8 6
References: <20180614150632.E064C61183@crypto-ml.lab.linutronix.de>
 <4ad5c4d2-7721-729e-3af6-6c8ed84dda9f@suse.cz>
 <260fce1e-c5fe-cace-56a8-a83c2a41f115@suse.cz>
 <20180622165652.GX30690@tassilo.jf.intel.com>
 <20180625203154.GB19456@tassilo.jf.intel.com>
From: Vlastimil Babka
Message-ID: <59926fa8-c26b-b5e0-6817-55d6921ed2fd@suse.cz>
Date: Tue, 26 Jun 2018 14:01:18 +0200
MIME-Version: 1.0
In-Reply-To: <20180625203154.GB19456@tassilo.jf.intel.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: 

On 06/25/2018 10:31 PM, speck for Andi Kleen wrote:
> On Mon, Jun 25, 2018 at 09:04:34AM +0200, speck for Vlastimil Babka wrote:
>>> Seems ugly and complicated. Perhaps it's better to just sacrifice the
>>> three bits. Doubt anyone will really need it anyway, especially not on
>>> 32bit systems.
>>
>> What three bits? You seem to be confusing this with my previous fix for
>> the 64bit max swap size, but this is something quite different.
>
> You're right.
>
>> Before this patch, the PAE code did not flip the offset bits and used
>> the high pte word. That means bits 32-36 for the type, 37-63 for the
>> offset. The lower word was zeroed, so systems with 4GB or less memory
>> should be safe; from 4GB to 128GB the swap type controls the
>> "vulnerable" memory locations, and above that the offset does as well.
>> Is it correct that the 32bit PAE HW phys limit is 64GB, but in a
>> virtualized 32bit PAE guest on 64bit HW (which you were concerned
>> about) that limit doesn't apply?
>
> AFAIK it never applies on modern systems.
>
>> Now if we put the swap entry in the lower word starting with bit 9
>> (like 64bit), with 5 bits of type we have 18 bits left for the swap
>> offset. That's just 1GB. In the high word we have bits to avoid for
>> L1TF (40 to 51 at least?), so that's even worse. IMHO we have to use
>> the whole 64bit entry then, which is what the patch does.
>
> Ok. Thanks.

Here's an updated patch with a changelog; it has also been tested.

----8<----
>From 94b19f2277984594eda826a315cb49d6be5375b5 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka
Date: Fri, 22 Jun 2018 17:39:33 +0200
Subject: [PATCH] x86/speculation/l1tf: Protect PAE swap entries against L1TF

The PAE 3-level paging code currently doesn't mitigate L1TF by flipping
the offset bits, and uses the high PTE word, thus bits 32-36 for type,
37-63 for offset. The lower word is zeroed, thus systems with less than
4GB memory are safe. With 4GB to 128GB the swap type selects the memory
locations vulnerable to L1TF; with even more memory, the swap offset
also influences the address. This might be a problem with 32bit PAE
guests running on large 64bit hosts.

By continuing to keep the whole swap entry in either the high or the low
32bit word of the PTE we would limit the swap size too much. Thus this
patch uses the whole PAE PTE with the same layout as the 64bit version
does.
The macros just become a bit tricky since they assume the arch-dependent
swp_entry_t to be 32bit.

Signed-off-by: Vlastimil Babka
---
 arch/x86/include/asm/pgtable-3level.h | 35 +++++++++++++++++++++++++--
 arch/x86/mm/init.c                    |  2 +-
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 76ab26a99e6e..a1d9ab21f8ea 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -241,12 +241,43 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
 #endif
 
 /* Encode and de-code a swap entry */
+#define SWP_TYPE_BITS		5
+
+#define SWP_OFFSET_FIRST_BIT	(_PAGE_BIT_PROTNONE + 1)
+
+/* We always extract/encode the offset by shifting it all the way up, and then down again */
+#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
+
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
 #define __swp_type(x)			(((x).val) & 0x1f)
 #define __swp_offset(x)			((x).val >> 5)
 #define __swp_entry(type, offset)	((swp_entry_t){(type) | (offset) << 5})
-#define __pte_to_swp_entry(pte)		((swp_entry_t){ (pte).pte_high })
-#define __swp_entry_to_pte(x)		((pte_t){ { .pte_high = (x).val } })
+
+/*
+ * Normally, __swp_entry() converts from arch-independent swp_entry_t to
+ * arch-dependent swp_entry_t, and __swp_entry_to_pte() just stores the result
+ * to pte. But here we have 32bit swp_entry_t and 64bit pte, and need to use the
+ * whole 64 bits. Thus, we shift the "real" arch-dependent conversion to
+ * __swp_entry_to_pte() through the following helper macro based on 64bit
+ * __swp_entry().
+ */
+#define __swp_pteval_entry(type, offset) ((pteval_t) { \
+	(~(pteval_t)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+	| ((pteval_t)(type) << (64-SWP_TYPE_BITS)) })
+
+#define __swp_entry_to_pte(x)	((pte_t){ .pte = \
+		__swp_pteval_entry(__swp_type(x), __swp_offset(x)) })
+/*
+ * Analogically, __pte_to_swp_entry() doesn't just extract the arch-dependent
+ * swp_entry_t, but also has to convert it from 64bit to the 32bit
+ * intermediate representation, using the following macros based on 64bit
+ * __swp_type() and __swp_offset().
+ */
+#define __pteval_swp_type(x) ((unsigned long)((x).pte >> (64 - SWP_TYPE_BITS)))
+#define __pteval_swp_offset(x) ((unsigned long)(~((x).pte) << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT))
+
+#define __pte_to_swp_entry(pte)	(__swp_entry(__pteval_swp_type(pte), \
+					     __pteval_swp_offset(pte)))
 
 #define gup_get_pte gup_get_pte
 /*
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c0870df32b2d..862191ed3d6e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -896,7 +896,7 @@ unsigned long max_swapfile_size(void)
 	 * We encode swap offsets also with 3 bits below those for pfn
 	 * which makes the usable limit higher.
 	 */
-#ifdef CONFIG_X86_64
+#if CONFIG_PGTABLE_LEVELS > 2
 	l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
 #endif
 	pages = min_t(unsigned long, l1tf_limit, pages);
-- 
2.17.1
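
P.S. For anyone who wants to sanity-check the resulting bit layout without
building a kernel, below is a small userspace sketch that mirrors the new
macros. It is not part of the patch: pteval_t is stood in by uint64_t, and
SWP_OFFSET_FIRST_BIT is hard-coded to 9 on the assumption that
_PAGE_BIT_PROTNONE is bit 8 as in the x86 headers. It just shows that the
type lands in the top 5 bits, the offset is stored inverted in bits 9-58,
and both round-trip through encode/decode.

/*
 * Userspace sketch only (not kernel code): mirrors the PAE swap
 * encode/decode macros from the patch to illustrate the bit layout.
 * Assumes SWP_OFFSET_FIRST_BIT == 9, i.e. _PAGE_BIT_PROTNONE == 8.
 */
#include <stdio.h>
#include <stdint.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* assumed _PAGE_BIT_PROTNONE + 1 */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

typedef uint64_t pteval_t;

/* type goes to the top 5 bits, offset is stored inverted in bits 9..58 */
static pteval_t swp_pteval_entry(unsigned int type, uint64_t offset)
{
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       ((pteval_t)type << (64 - SWP_TYPE_BITS));
}

static unsigned int pteval_swp_type(pteval_t pte)
{
	return pte >> (64 - SWP_TYPE_BITS);
}

static uint64_t pteval_swp_offset(pteval_t pte)
{
	return ~pte << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	unsigned int type = 1;
	uint64_t offset = 0x1234;
	pteval_t pte = swp_pteval_entry(type, offset);

	/*
	 * Because the offset bits are stored inverted, a small offset leaves
	 * the bits overlapping the physical frame number mostly set to 1, so
	 * a speculative L1TF load would target an address far above any
	 * populated memory.
	 */
	printf("pte    = 0x%016llx\n", (unsigned long long)pte);
	printf("type   = %u\n", pteval_swp_type(pte));
	printf("offset = 0x%llx\n", (unsigned long long)pteval_swp_offset(pte));
	return 0;
}

Compiled with a plain C compiler on a 64bit host, this prints a pte whose
high 5 bits hold the type and whose middle bits are mostly 1s, and recovers
the original type and offset unchanged.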