* [MODERATED] [PATCH] SPTE masking
@ 2018-08-08 23:21 Jim Mattson
  2018-08-09  2:57 ` [MODERATED] " Andi Kleen
  2018-08-09  9:25 ` Paolo Bonzini
  0 siblings, 2 replies; 22+ messages in thread
From: Jim Mattson @ 2018-08-08 23:21 UTC (permalink / raw)
  To: speck

[PATCH] kvm: x86: Set highest physical address bit in non-present/reserved SPTEs

Always set the upper-most supported physical address bit to 1 for SPTEs
that are marked as non-present or reserved, to make them unusable for
L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
(We do not need to mark PTEs that are completely 0 as physical page 0
is already reserved.)

This allows mitigation of L1TF without disabling hyper-threading by using
shadow paging mode instead of EPT.

Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
---
 arch/x86/kvm/mmu.c | 25 ++++++++++++++++++++-----
 arch/x86/kvm/x86.c |  8 ++++++--
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a44e568363a46..ef967785e056a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -221,6 +221,9 @@ static const u64 shadow_acc_track_saved_bits_mask = PT64_EPT_READABLE_MASK |
 						    PT64_EPT_EXECUTABLE_MASK;
 static const u64 shadow_acc_track_saved_bits_shift = PT64_SECOND_AVAIL_BITS_SHIFT;
 
+/* This mask must be set on all non-zero Non-Present or Reserved SPTEs */
+static u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
+
 static void mmu_spte_set(u64 *sptep, u64 spte);
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value)
@@ -308,9 +311,12 @@ static void mark_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, u64 gfn,
 {
 	unsigned int gen = kvm_current_mmio_generation(vcpu);
 	u64 mask = generation_mmio_spte_mask(gen);
+	u64 gpa = gfn << PAGE_SHIFT;
 
 	access &= ACC_WRITE_MASK | ACC_USER_MASK;
-	mask |= shadow_mmio_value | access | gfn << PAGE_SHIFT;
+	mask |= shadow_mmio_value | shadow_nonpresent_or_rsvd_mask;
+	mask |= access | gpa;
+	mask |= (gpa & shadow_nonpresent_or_rsvd_mask) << 1;
 
 	trace_mark_mmio_spte(sptep, gfn, access, gen);
 	mmu_spte_set(sptep, mask);
@@ -323,8 +329,13 @@ static bool is_mmio_spte(u64 spte)
 
 static gfn_t get_mmio_spte_gfn(u64 spte)
 {
-	u64 mask = generation_mmio_spte_mask(MMIO_GEN_MASK) | shadow_mmio_mask;
-	return (spte & ~mask) >> PAGE_SHIFT;
+	u64 mask = generation_mmio_spte_mask(MMIO_GEN_MASK) | shadow_mmio_mask |
+		   shadow_nonpresent_or_rsvd_mask;
+	u64 gpa = spte & ~mask;
+
+	gpa |= (spte >> 1) & shadow_nonpresent_or_rsvd_mask;
+
+	return gpa >> PAGE_SHIFT;
 }
 
 static unsigned get_mmio_spte_access(u64 spte)
@@ -381,7 +392,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mask_ptes);
 
-static void kvm_mmu_clear_all_pte_masks(void)
+static void kvm_mmu_reset_all_pte_masks(void)
 {
 	shadow_user_mask = 0;
 	shadow_accessed_mask = 0;
@@ -391,6 +402,10 @@ static void kvm_mmu_clear_all_pte_masks(void)
 	shadow_mmio_mask = 0;
 	shadow_present_mask = 0;
 	shadow_acc_track_mask = 0;
+
+	if (boot_cpu_data.x86_phys_bits < 51)
+		shadow_nonpresent_or_rsvd_mask =
+			1ull << (boot_cpu_data.x86_phys_bits - 1);
 }
 
 static int is_cpuid_PSE36(void)
@@ -5500,7 +5515,7 @@ int kvm_mmu_module_init(void)
 {
 	int ret = -ENOMEM;
 
-	kvm_mmu_clear_all_pte_masks();
+	kvm_mmu_reset_all_pte_masks();
 
 	pte_list_desc_cache = kmem_cache_create("pte_list_desc",
 					    sizeof(struct pte_list_desc),
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a5caa5e5480ca..60e102adf80be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6503,8 +6503,12 @@ static void kvm_set_mmio_spte_mask(void)
 	 * Set the reserved bits and the present bit of an paging-structure
 	 * entry to generate page fault with PFER.RSV = 1.
 	 */
-	 /* Mask the reserved physical address bits. */
-	mask = rsvd_bits(maxphyaddr, 51);
+
+	/*
+	 * Mask the uppermost physical address bit, which would be reserved as
+	 * long as the supported physical address width is less than 52.
+	 */
+	mask = 1ull << 51;
 
 	/* Set the present bit. */
 	mask |= 1ull;
-- 
2.18.0.597.ga71716f1ad-goog


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-08 23:21 [MODERATED] [PATCH] SPTE masking Jim Mattson
@ 2018-08-09  2:57 ` Andi Kleen
  2018-08-09  9:24   ` Paolo Bonzini
  2018-08-09  9:25 ` Paolo Bonzini
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2018-08-09  2:57 UTC (permalink / raw)
  To: speck

On Wed, Aug 08, 2018 at 04:21:14PM -0700, speck for Jim Mattson wrote:
> [PATCH] kvm: x86: Set highest physical address bit in non-present/reserved SPTEs
> 
> Always set the upper-most supported physical address bit to 1 for SPTEs
> that are marked as non-present or reserved, to make them unusable for
> L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.

L1TF only works for cached memory. 

Are you concerned about cacheable MMIO?

I didn't think it could happen.

-Andi


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09  2:57 ` [MODERATED] " Andi Kleen
@ 2018-08-09  9:24   ` Paolo Bonzini
  2018-08-09 17:43     ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09  9:24 UTC (permalink / raw)
  To: speck


On 09/08/2018 04:57, speck for Andi Kleen wrote:
>> [PATCH] kvm: x86: Set highest physical address bit in non-present/reserved SPTEs
>>
>> Always set the upper-most supported physical address bit to 1 for SPTEs
>> that are marked as non-present or reserved, to make them unusable for
>> L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
> L1TF only works for cached memory. 
> 
> Are you concerned about cacheable MMIO?

No, he's concerned that KVM stores information in SPTEs that point to
guest MMIO (i.e. emulated devices), and that information is
guest-controlled.  But that would only apply to processors with
MAXPHYADDR=52.
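
For reference, a condensed sketch of what mark_mmio_spte() stores before
the patch above is applied; it shows why the SPTE contents are
guest-controlled (the guest chooses which gfn it accesses, and that gfn
lands directly in the not-present SPTE):

/*
 * Condensed from mark_mmio_spte() in the patch above; illustrative only,
 * not a literal copy of the function.
 */
u64 spte = shadow_mmio_value |                  /* marks the SPTE as MMIO       */
	   generation_mmio_spte_mask(gen) |     /* MMIO generation bits         */
	   access |                             /* ACC_WRITE_MASK/ACC_USER_MASK */
	   (gfn << PAGE_SHIFT);                 /* guest-controlled address     */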

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-08 23:21 [MODERATED] [PATCH] SPTE masking Jim Mattson
  2018-08-09  2:57 ` [MODERATED] " Andi Kleen
@ 2018-08-09  9:25 ` Paolo Bonzini
  2018-08-09  9:33   ` Andrew Cooper
  1 sibling, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09  9:25 UTC (permalink / raw)
  To: speck


On 09/08/2018 01:21, speck for Jim Mattson wrote:
> [PATCH] kvm: x86: Set highest physical address bit in non-present/reserved SPTEs
> 
> Always set the upper-most supported physical address bit to 1 for SPTEs
> that are marked as non-present or reserved, to make them unusable for
> L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
> (We do not need to mark PTEs that are completely 0 as physical page 0
> is already reserved.)
> 
> This allows mitigation of L1TF without disabling hyper-threading by using
> shadow paging mode instead of EPT.

I don't understand why the big patch is needed.  MMIO SPTEs already have a mask
applied that includes the top bit on all processors that have MAXPHYADDR<52.
I would hope that all processors with MAXPHYADDR=52 will have the bug fixed
(and AFAIK none are being sold right now), but in any case something like

        if (maxphyaddr == 52) {
                kvm_mmu_set_mmio_spte_mask((1ull << 51) | 1, 1ull << 51);
		return;
        }

in kvm_set_mmio_spte_mask should do, or alternatively the nicer patch after
my signature (untested and unthought).

Paolo


diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6529,29 +6529,25 @@ static unsigned long kvm_get_guest_ip(void)
 
 static void kvm_set_mmio_spte_mask(void)
 {
-	u64 mask;
+	u64 mask, value;
 	int maxphyaddr = boot_cpu_data.x86_phys_bits;
 
 	/*
 	 * Set the reserved bits and the present bit of an paging-structure
 	 * entry to generate page fault with PFER.RSV = 1.
 	 */
-	 /* Mask the reserved physical address bits. */
-	mask = rsvd_bits(maxphyaddr, 51);
+	mask = value = PT_PRESENT_MASK | (1ull << 51);
 
-	/* Set the present bit. */
-	mask |= 1ull;
-
-#ifdef CONFIG_X86_64
-	/*
-	 * If reserved bit is not supported, clear the present bit to disable
-	 * mmio page fault.
-	 */
-	if (maxphyaddr == 52)
-		mask &= ~1ull;
-#endif
+	if (maxphyaddr == 52) {
+		/*
+		 * If reserved bit is not supported, clear the present bit to disable
+		 * mmio page fault.  Leave the topmost bit set to separate MMIO sptes
+		 * from other nonpresent sptes, and to protect against the L1TF bug.
+		 */
+		value &= ~PT_PRESENT_MASK;
+	}
 
-	kvm_mmu_set_mmio_spte_mask(mask, mask);
+	kvm_mmu_set_mmio_spte_mask(mask, value);
 }
 
 #ifdef CONFIG_X86_64



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09  9:25 ` Paolo Bonzini
@ 2018-08-09  9:33   ` Andrew Cooper
  2018-08-09 10:01     ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-08-09  9:33 UTC (permalink / raw)
  To: speck


On 09/08/18 10:25, speck for Paolo Bonzini wrote:
> On 09/08/2018 01:21, speck for Jim Mattson wrote:
>> [PATCH] kvm: x86: Set highest physical address bit in non-present/reserved SPTEs
>>
>> Always set the upper-most supported physical address bit to 1 for SPTEs
>> that are marked as non-present or reserved, to make them unusable for
>> L1TF attacks from the guest. Currently, this just applies to MMIO SPTEs.
>> (We do not need to mark PTEs that are completely 0 as physical page 0
>> is already reserved.)
>>
>> This allows mitigation of L1TF without disabling hyper-threading by using
>> shadow paging mode instead of EPT.
> I don't understand why the big patch is needed.  MMIO SPTEs already have a mask
> applied that includes the top bit on all processors that have MAXPHYADDR<52.
> I would hope that all processors with MAXPHYADDR=52 will have the bug fixed
> (and AFAIK none are being sold right now), but in any case something like
>
>         if (maxphyaddr == 52) {
>                 kvm_mmu_set_mmio_spte_mask((1ull << 51) | 1, 1ull << 51);
> 		return;
>         }
>
> in kvm_set_mmio_spte_mask should do, or alternatively the nicer patch after
> my signature (untested and unthought).

Setting bit 51 doesn't mitigate L1TF on any current processor.

You need to set an address bit that is inside L1D-maxphysaddr and
isn't cacheable on the current system.

Attached is my patch for doing this generally in Xen, along with some
safety heuristics for nesting.  In Xen, we need to audit each PTE a PV
guest tries to write, and the bottom line safety check for that is:

static inline bool is_l1tf_safe_maddr(intpte_t pte)
{
    paddr_t maddr = pte & l1tf_addr_mask;

    return maddr == 0 || maddr >= l1tf_safe_maddr;
}
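
As a minimal sketch of how a caller might use this check (the function
name and error handling below are illustrative, not the actual Xen audit
path):

/*
 * Hypothetical caller: reject a not-present PTE whose address bits could
 * leak via L1TF.  Illustrative only.
 */
static int check_nonpresent_pte(intpte_t pte)
{
    if ( pte & _PAGE_PRESENT )
        return 0;                  /* present entries are audited elsewhere */

    if ( !is_l1tf_safe_maddr(pte) )
        return -EINVAL;            /* could alias cacheable memory under L1TF */

    return 0;
}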

~Andrew

[-- Attachment #2: 0001-x86-spec-ctrl-Calculate-safe-PTE-addresses-for-L1TF-.patch --]
[-- Type: text/x-patch, Size: 11798 bytes --]

From 2e4a4617d484fd3b44abdc77013d559a02c1ed10 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Wed, 25 Jul 2018 12:10:19 +0000
Subject: [PATCH] x86/spec-ctrl: Calculate safe PTE addresses for L1TF
 mitigations

Safe PTE addresses for L1TF mitigations are ones which are within the L1D
address width (may be wider than reported in CPUID), and above the highest
cacheable RAM/NVDIMM/BAR/etc.

All logic here is best-effort heuristics, which should in practice be fine for
most hardware.  Future work will see about disentangling the SRAT handling
further, as well as having L0 pass this information down to lower levels when
virtualised.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2:
 * New
v3:
 * Adjust heuristics to be mostly safe for the high end server case, and
   heterogeneous migration cases.  Extend the comments a lot to explain what
   is going on.
v4:
 * Fold EFI and SRAT adjustments.
 * Expand safety comment.
---
 xen/arch/x86/setup.c            |  12 ++++
 xen/arch/x86/spec_ctrl.c        | 154 ++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/srat.c             |   8 ++-
 xen/common/efi/boot.c           |  12 ++++
 xen/include/asm-x86/spec_ctrl.h |   7 ++
 5 files changed, 191 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 8301de8..7c86b9a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -912,6 +912,18 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     /* Sanitise the raw E820 map to produce a final clean version. */
     max_page = raw_max_page = init_e820(memmap_type, &e820_raw);
 
+    if ( !efi_enabled(EFI_BOOT) )
+    {
+        /*
+         * Supplement the heuristics in l1tf_calculations() by assuming that
+         * anything referenced in the E820 may be cacheable.
+         */
+        l1tf_safe_maddr =
+            max(l1tf_safe_maddr,
+                ROUNDUP(e820_raw.map[e820_raw.nr_map - 1].addr +
+                        e820_raw.map[e820_raw.nr_map - 1].size, PAGE_SIZE));
+    }
+
     /* Create a temporary copy of the E820 map. */
     memcpy(&boot_e820, &e820, sizeof(e820));
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 32a4ea6..abe3785 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -50,6 +50,10 @@ bool __initdata bsp_delay_spec_ctrl;
 uint8_t __read_mostly default_xen_spec_ctrl;
 uint8_t __read_mostly default_spec_ctrl_flags;
 
+paddr_t __read_mostly l1tf_addr_mask, __read_mostly l1tf_safe_maddr;
+static bool __initdata cpu_has_bug_l1tf;
+static unsigned int __initdata l1d_maxphysaddr;
+
 static int __init parse_bti(const char *s)
 {
     const char *ss;
@@ -420,6 +424,154 @@ static bool __init should_use_eager_fpu(void)
     }
 }
 
+/* Calculate whether this CPU is vulnerable to L1TF. */
+static __init void l1tf_calculations(uint64_t caps)
+{
+    bool hit_default = false;
+
+    l1d_maxphysaddr = paddr_bits;
+
+    /* L1TF is only known to affect Intel Family 6 processors at this time. */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 )
+    {
+        switch ( boot_cpu_data.x86_model )
+        {
+            /*
+             * Core processors since at least Penryn are vulnerable.
+             */
+        case 0x17: /* Penryn */
+        case 0x1d: /* Dunnington */
+            cpu_has_bug_l1tf = true;
+            break;
+
+        case 0x1f: /* Auburndale / Havendale */
+        case 0x1e: /* Nehalem */
+        case 0x1a: /* Nehalem EP */
+        case 0x2e: /* Nehalem EX */
+        case 0x25: /* Westmere */
+        case 0x2c: /* Westmere EP */
+        case 0x2f: /* Westmere EX */
+            cpu_has_bug_l1tf = true;
+            l1d_maxphysaddr = 44;
+            break;
+
+        case 0x2a: /* SandyBridge */
+        case 0x2d: /* SandyBridge EP/EX */
+        case 0x3a: /* IvyBridge */
+        case 0x3e: /* IvyBridge EP/EX */
+        case 0x3c: /* Haswell */
+        case 0x3f: /* Haswell EX/EP */
+        case 0x45: /* Haswell D */
+        case 0x46: /* Haswell H */
+        case 0x3d: /* Broadwell */
+        case 0x47: /* Broadwell H */
+        case 0x4f: /* Broadwell EP/EX */
+        case 0x56: /* Broadwell D */
+        case 0x4e: /* Skylake M */
+        case 0x55: /* Skylake X */
+        case 0x5e: /* Skylake D */
+        case 0x66: /* Cannonlake */
+        case 0x67: /* Cannonlake? */
+        case 0x8e: /* Kabylake M */
+        case 0x9e: /* Kabylake D */
+            cpu_has_bug_l1tf = true;
+            l1d_maxphysaddr = 46;
+            break;
+
+            /*
+             * Atom processors are not vulnerable.
+             */
+        case 0x1c: /* Pineview */
+        case 0x26: /* Lincroft */
+        case 0x27: /* Penwell */
+        case 0x35: /* Cloverview */
+        case 0x36: /* Cedarview */
+        case 0x37: /* Baytrail / Valleyview (Silvermont) */
+        case 0x4d: /* Avaton / Rangely (Silvermont) */
+        case 0x4c: /* Cherrytrail / Brasswell */
+        case 0x4a: /* Merrifield */
+        case 0x5a: /* Moorefield */
+        case 0x5c: /* Goldmont */
+        case 0x5f: /* Denverton */
+        case 0x7a: /* Gemini Lake */
+            break;
+
+            /*
+             * Knights processors are not vulnerable.
+             */
+        case 0x57: /* Knights Landing */
+        case 0x85: /* Knights Mill */
+            break;
+
+        default:
+            /* Defer printk() until we've accounted for RDCL_NO. */
+            hit_default = true;
+            cpu_has_bug_l1tf = true;
+            break;
+        }
+    }
+
+    /* Any processor advertising RDCL_NO should be not vulnerable to L1TF. */
+    if ( caps & ARCH_CAPABILITIES_RDCL_NO )
+        cpu_has_bug_l1tf = false;
+
+    if ( cpu_has_bug_l1tf && hit_default )
+        printk("Unrecognised CPU model %#x - assuming vulnerable to L1TF\n",
+               boot_cpu_data.x86_model);
+
+    /*
+     * L1TF safe address heuristics.  These apply to the real hardware we are
+     * running on, and are best-effort-only if Xen is virtualised.
+     *
+     * The address mask which the L1D cache uses, which might be wider than
+     * the CPUID-reported maxphysaddr.
+     */
+    l1tf_addr_mask = ((1ul << l1d_maxphysaddr) - 1) & PAGE_MASK;
+
+    /*
+     * To be safe, l1tf_safe_maddr must be above the highest cacheable entity
+     * in system physical address space.  However, to preserve space for
+     * paged-out metadata, it should be as low as possible above the highest
+     * cacheable address, so as to require fewer high-order bits being set.
+     *
+     * These heuristics are based on some guesswork to improve the likelihood
+     * of safety in the common case, accounting for the fact that Linux's
+     * behaviour of mitigating L1TF by inverting all address bits in a
+     * non-present PTE.
+     *
+     * - If L1D is wider than CPUID (Nehalem and later mobile/desktop/low end
+     *   server), setting any address bit beyond CPUID maxphysaddr guarantees
+     *   to make the PTE safe.  This case doesn't require all the high-order
+     *   bits being set, and doesn't require any other source of information
+     *   for safety.
+     *
+     * - If L1D is the same as CPUID (Pre-Nehalem, or high end server), we
+     *   must sacrifice high order bits from the real address space for
+     *   safety.  Therefore, make a blind guess that there is nothing
+     *   cacheable in the top quarter of physical address space.
+     *
+     *   It is exceedingly unlikely for machines to be populated with this
+     *   much RAM (likely 512G on pre-Nehalem, 16T on Nehalem/Westmere, 64T on
+     *   Sandybridge and later) due to the sheer volume of DIMMs this would
+     *   actually take.
+     *
+     *   However, it is possible to find machines this large, so the "top
+     *   quarter" guess is supplemented to push the limit higher if references
+     *   to cacheable mappings (E820/SRAT/EFI/etc) are found above the top
+     *   quarter boundary.
+     *
+     *   Finally, this top quarter guess gives us a good chance of being safe
+     *   when running virtualised (and the CPUID maxphysaddr hasn't been
+     *   levelled for heterogeneous migration safety), where the safety
+     *   consideration is still in terms of host details, but all E820/etc
+     *   information is in terms of guest physical layout.
+     */
+    l1tf_safe_maddr = max(l1tf_safe_maddr, ((l1d_maxphysaddr > paddr_bits)
+                                            ? (1ul << paddr_bits)
+                                            : (3ul << (paddr_bits - 2))));
+}
+
 #define OPT_XPTI_DEFAULT  0xff
 uint8_t __read_mostly opt_xpti = OPT_XPTI_DEFAULT;
 
@@ -626,6 +778,8 @@ void __init init_speculation_mitigations(void)
     else
         setup_clear_cpu_cap(X86_FEATURE_NO_XPTI);
 
+    l1tf_calculations(caps);
+
     print_details(thunk, caps);
 
     /*
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 166eb44..2d70b45 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -20,6 +20,7 @@
 #include <xen/pfn.h>
 #include <asm/e820.h>
 #include <asm/page.h>
+#include <asm/spec_ctrl.h>
 
 static struct acpi_table_slit *__read_mostly acpi_slit;
 
@@ -284,6 +285,11 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 	if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
 		return;
 
+	start = ma->base_address;
+	end = start + ma->length;
+	/* Supplement the heuristics in l1tf_calculations(). */
+	l1tf_safe_maddr = max(l1tf_safe_maddr, ROUNDUP(end, PAGE_SIZE));
+
 	if (num_node_memblks >= NR_NODE_MEMBLKS)
 	{
 		dprintk(XENLOG_WARNING,
@@ -292,8 +298,6 @@ acpi_numa_memory_affinity_init(const struct acpi_srat_mem_affinity *ma)
 		return;
 	}
 
-	start = ma->base_address;
-	end = start + ma->length;
 	pxm = ma->proximity_domain;
 	if (srat_rev < 2)
 		pxm &= 0xff;
diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c
index 1464520..2f49731 100644
--- a/xen/common/efi/boot.c
+++ b/xen/common/efi/boot.c
@@ -1382,6 +1382,8 @@ efi_start(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
 
 #ifndef CONFIG_ARM /* TODO - runtime service support */
 
+#include <asm/spec_ctrl.h>
+
 static bool __initdata efi_map_uc;
 
 static int __init parse_efi_param(const char *s)
@@ -1497,6 +1499,16 @@ void __init efi_init_memory(void)
                desc->PhysicalStart, desc->PhysicalStart + len - 1,
                desc->Type, desc->Attribute);
 
+        if ( (desc->Attribute & (EFI_MEMORY_WB | EFI_MEMORY_WT)) ||
+             (efi_bs_revision >= EFI_REVISION(2, 5) &&
+              (desc->Attribute & EFI_MEMORY_WP)) )
+        {
+            /* Supplement the heuristics in l1tf_calculations(). */
+            l1tf_safe_maddr =
+                max(l1tf_safe_maddr,
+                    ROUNDUP(desc->PhysicalStart + len, PAGE_SIZE));
+        }
+
         if ( !efi_enabled(EFI_RS) ||
              (!(desc->Attribute & EFI_MEMORY_RUNTIME) &&
               (!map_bs ||
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index 5b40afb..872b494 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -38,6 +38,13 @@ extern uint8_t opt_xpti;
 #define OPT_XPTI_DOM0  0x01
 #define OPT_XPTI_DOMU  0x02
 
+/*
+ * The L1D address mask, which might be wider than reported in CPUID, and the
+ * system physical address above which there are believed to be no cacheable
+ * memory regions, thus unable to leak data via the L1TF vulnerability.
+ */
+extern paddr_t l1tf_addr_mask, l1tf_safe_maddr;
+
 static inline void init_shadow_spec_ctrl_state(void)
 {
     struct cpu_info *info = get_cpu_info();
-- 
2.1.4



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09  9:33   ` Andrew Cooper
@ 2018-08-09 10:01     ` Paolo Bonzini
  2018-08-09 10:47       ` Andrew Cooper
  2018-08-09 20:14       ` Jim Mattson
  0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09 10:01 UTC (permalink / raw)
  To: speck


On 09/08/2018 11:33, speck for Andrew Cooper wrote:
>>
>> in kvm_set_mmio_spte_mask should do, or alternatively the nicer patch after
>> my signature (untested and unthought).
> Setting bit 51 doesn't mitigate L1TF on any current processor.
> 
> You need to set an address bit that is inside L1D-maxphysaddr and
> isn't cacheable on the current system.

KVM currently sets all reserved bits, so it's safe as long as
l1d-maxphyaddr > cpuid-maxphyaddr.

What remains is pre-Nehalem or high-end servers where L1D-maxphyaddr =
cpuid-maxphyaddr.  So we could split the guest gfn in two so that

bit 47-51            = bits MPA-5..MPA-1 of guest physical address
bit maxphyaddr-5..46 = all ones, or look it up using e820
bit 12..maxphyaddr-6 = bits 12..MPA-6 of guest physical address.

assuming all processors with maxphyaddr > 46 are safe.
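
Taken literally, and assuming cpuid-maxphyaddr = 46 for concreteness,
that split could look something like the sketch below (illustrative
only, not a tested patch; encode_nonpresent_addr() is a made-up helper):

#include <stdint.h>
typedef uint64_t u64;                   /* stand-in for the kernel's u64 */

#define GUEST_MPA 46                    /* cpuid-maxphyaddr assumed here */

static u64 encode_nonpresent_addr(u64 gpa)
{
	u64 lo   = gpa & ((1ull << (GUEST_MPA - 5)) - 1);  /* bits 12..MPA-6 stay in place */
	u64 hi   = gpa >> (GUEST_MPA - 5);                 /* bits MPA-5..MPA-1 of the gpa */
	u64 ones = ((1ull << 47) - 1) ^ ((1ull << (GUEST_MPA - 5)) - 1); /* bits MPA-5..46 set */

	return lo | ones | (hi << 47);                     /* displaced bits stored in 47..51 */
}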

> and the CPUID maxphysaddr hasn't been
> levelled for heterogeneous migration safety

I don't know about Xen PV, but when using EPT you cannot do that, the
maxphyaddr is not virtualizable (obviously not to guest-maxphyaddr >
host-maxphyaddr, but guest-maxphyaddr < host-maxphyaddr cannot be
emulated either).

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 10:01     ` Paolo Bonzini
@ 2018-08-09 10:47       ` Andrew Cooper
  2018-08-09 11:13         ` Paolo Bonzini
  2018-08-09 20:14       ` Jim Mattson
  1 sibling, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-08-09 10:47 UTC (permalink / raw)
  To: speck

On 09/08/18 11:01, speck for Paolo Bonzini wrote:
> On 09/08/2018 11:33, speck for Andrew Cooper wrote:
>>> in kvm_set_mmio_spte_mask should do, or alternatively the nicer patch after
>>> my signature (untested and unthought).
>> Setting bit 51 doesn't mitigate L1TF on any current processor.
>>
>> You need to set an address bit that is inside L1D-maxphysaddr and
>> isn't cacheable on the current system.
> KVM currently sets all reserved bits, so it's safe as long as
> l1d-maxphyaddr > cpuid-maxphyaddr.
>
> What remains is pre-Nehalem or high-end servers where L1D-maxphyaddr =
> cpuid-maxphyaddr.  So we could split the guest gfn in two so that
>
> bit 47-51            = bits MPA-5..MPA-1 of guest physical address
> bit maxphyaddr-5..46 = all ones, or look it up using e820
> bit 12..maxphyaddr-6 = bits 12..MPA-6 of guest physical address.
>
> assuming all processors with maxphyaddr > 46 are safe.

The equivalent path in Xen sets all the upper 32 bits, which makes this
trick safe on any system which doesn't have cacheable mappings within
the top 4G of the physical address space.  I've got some fixes pending
for posting once I'm
done with the more urgent L1TF bits.

>
>> and the CPUID maxphysaddr hasn't been
>> levelled for heterogeneous migration safety
> I don't know about Xen PV, but when using EPT you cannot do that, the
> maxphyaddr is not virtualizable (obviously not to guest-maxphyaddr >
> host-maxphyaddr, but guest-maxphyaddr < host-maxphyaddr cannot be
> emulated either).

There is nothing wrong with telling a guest it has maxphysaddr smaller
than the real maxphysaddr.  Just like CPUID feature levelling, it says
"don't go playing there".

No software is permitted to rely on the behaviour of reserved bits.

~Andrew


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 10:47       ` Andrew Cooper
@ 2018-08-09 11:13         ` Paolo Bonzini
  2018-08-09 11:46           ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09 11:13 UTC (permalink / raw)
  To: speck


On 09/08/2018 12:47, speck for Andrew Cooper wrote:
>>> and the CPUID maxphysaddr hasn't been
>>> levelled for heterogeneous migration safety
>> I don't know about Xen PV, but when using EPT you cannot do that, the
>> maxphyaddr is not virtualizable (obviously not to guest-maxphyaddr >
>> host-maxphyaddr, but guest-maxphyaddr < host-maxphyaddr cannot be
>> emulated either).
> There is nothing wrong with telling a guest it has maxphysaddr smaller
> than the real maxphysaddr.  Just like CPUID feature levelling, it says
> "don't go playing there".
> 
> No software is permitted to rely on the behaviour of reserved bits.

That's just not true for page tables.  Bits maxphyaddr:51 are documented
to generate a page fault with the reserved bits set.  In the future the
behavior may change (unlikely) but it would be keyed against e.g. a new
CR4 bit.

In fact, Intel has been stashing new functionality in previously ignored
bits, of course keying the interpretation on bits from CR4 (e.g.
protection keys) or VMCS execution controls (e.g. EPT mode-based
execution control aka XS/XU).

I tried emulating guestphysaddr < hostphysaddr in KVM, but generating
the reserved bits page fault from EPT violations doesn't work.  If the
host processor thinks the bits are not reserved and generates e.g. a
present-but-not-writable fault, no EPT violation happens and the guest
will get an unexpected page fault error code.  This can cause the
guest to malfunction.

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 11:13         ` Paolo Bonzini
@ 2018-08-09 11:46           ` Andrew Cooper
  2018-08-09 11:54             ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-08-09 11:46 UTC (permalink / raw)
  To: speck

On 09/08/18 12:13, speck for Paolo Bonzini wrote:
> On 09/08/2018 12:47, speck for Andrew Cooper wrote:
>>>> and the CPUID maxphysaddr hasn't been
>>>> levelled for heterogeneous migration safety
>>> I don't know about Xen PV, but when using EPT you cannot do that, the
>>> maxphyaddr is not virtualizable (obviously not to guest-maxphyaddr >
>>> host-maxphyaddr, but guest-maxphyaddr < host-maxphyaddr cannot be
>>> emulated either).
>> There is nothing wrong with telling a guest it has maxphysaddr smaller
>> than the real maxphysaddr.  Just like CPUID feature levelling, it says
>> "don't go playing there".
>>
>> No software is permitted to rely on the behaviour of reserved bits.
> That's just not true for page tables.  Bits maxphyaddr:51 are documented
> to generate a page fault with the reserved bits set.  In the future the
> behavior may change (unlikely) but it would be keyed against e.g. a new
> CR4 bit.
>
> In fact, Intel has been stashing new functionality in previously ignored
> bits, of course keying the interpretation on bits from CR4 (e.g.
> protection keys) or VMCS execution controls (e.g. EPT mode-based
> execution control aka XS/XU).
>
> I tried emulating guestphysaddr < hostphysaddr in KVM, but generating
> the reserved bits page fault from EPT violations doesn't work.  If the
> host processor thinks the bits are not reserved and generates e.g. a
> present-but-not-writable fault, no EPT violation happens

Sounds like you've got a KVM bug here.  Either an EPT misconfig or
violation is guaranteed to occur here, because you won't have allowed an
EPT mapping at a guest physical address above guest maxphysaddr, would you?

Even with a strict interpretation of bits 51:maxphysaddr, I don't see any
technical limitations with emulating it correctly.  Making this work in
Xen is on my todo list (once I've figured out exactly how Intel and
AMD's implementation of SMAP differ, so I can fix the software pagewalk
to match the hardware it is running on).

~Andrew


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 11:46           ` Andrew Cooper
@ 2018-08-09 11:54             ` Paolo Bonzini
  2018-08-09 14:01               ` Andrew Cooper
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09 11:54 UTC (permalink / raw)
  To: speck


On 09/08/2018 13:46, speck for Andrew Cooper wrote:
>>
>> I tried emulating guestphysaddr < hostphysaddr in KVM, but generating
>> the reserved bits page fault from EPT violations doesn't work.  If the
>> host processor thinks the bits are not reserved and generates e.g. a
>> present-but-not-writable fault, no EPT violation happens
> Sounds like you've got a KVM bug here.  Either an EPT misconfig or
> violation is guaranteed to occur here, because you won't have allowed an
> EPT mapping at a guest physical address above guest maxphysaddr, would you?

The problem is that no access occurs at all in the case that becomes buggy.

For example, say host mpa = 46, guest mpa = 40, and you write to a page
table with PTE.P and PTE.40 set, but PTE.W cleared.  You'll get a page
fault with error code present/writable, because the host does not think
PTE.40 must be zero.  On a real processor you'd have gotten a reserved
bit page fault.  Likewise for a CPL=3 access where PTE.U=0, etc.
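
In page-fault error-code terms (bit 0 = P, bit 1 = W, bit 3 = RSVD), the
difference in that example is roughly the following (illustrative
constants, matching the architectural PFEC layout):

#define PFEC_P    (1u << 0)
#define PFEC_W    (1u << 1)
#define PFEC_RSVD (1u << 3)

unsigned int expected_on_real_hw    = PFEC_P | PFEC_W | PFEC_RSVD; /* reserved-bit #PF  */
unsigned int observed_when_emulated = PFEC_P | PFEC_W;             /* write-protect #PF */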

I know because I tried just yesterday :) and as soon as I saw the
problem I remembered having tried at least once more in the past.  AFAIU
Intel is also convinced that it should be possible, so maybe I'm missing
something.  But I don't see a workaround.

>> Even with a strict interpretation of bits 51:maxphysaddr, I don't see any
> technical limitations with emulating it correctly.  Making this work in
> Xen is on my todo list (once I've figured out exactly how Intel and
> AMD's implementation of SMAP differ, so I can fix the software pagewalk
> to match the hardware it is running on).

Interesting, do you have pointers to the differences?

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 11:54             ` Paolo Bonzini
@ 2018-08-09 14:01               ` Andrew Cooper
  2018-08-09 15:00                 ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-08-09 14:01 UTC (permalink / raw)
  To: speck

On 09/08/18 12:54, speck for Paolo Bonzini wrote:
> On 09/08/2018 13:46, speck for Andrew Cooper wrote:
>>> I tried emulating guestphysaddr < hostphysaddr in KVM, but generating
>>> the reserved bits page fault from EPT violations doesn't work.  If the
>>> host processor thinks the bits are not reserved and generates e.g. a
>>> present-but-not-writable fault, no EPT violation happens
>> Sounds like you've got a KVM bug here.  Either an EPT misconfig or
>> violation is guaranteed to occur here, because you won't have allowed an
>> EPT mapping at a guest physical address above guest maxphysaddr, would you?
> The problem is that no access occurs at all in the case that becomes buggy.
>
> For example, say host mpa = 46, guest mpa = 40, and you write to a page
> table with PTE.P and PTE.40 set, but PTE.W cleared.  You'll get a page
> fault with error code present/writable, because the host does not think
> PTE.40 must be zero.  On a real processor you'd have gotten a reserved
> bit page fault.  Likewise for a CPL=3 access where PTE.U=0, etc.
>
> I know because I tried just yesterday :) and as soon as I saw the
> problem I remembered having tried at least once more in the past.  AFAIU
> Intel is also convinced that it should be possible, so maybe I'm missing
> something.  But I don't see a workaround.

Oh lovely :(

Yes - architecturally, there is no memory access when a #PF occurs, so
an EPT-based vmexit won't occur.

It would be interesting to hear an architect's take on reserved pagefault
bits, but I expect "don't do that then" might be the answer.

>
>> Even with a strict interpretation of bits 51:maxphysaddr, I don't see any
>> technical limitations with emulating it correctly.  Making this work in
>> Xen is on my todo list (once I've figured out exactly how Intel and
>> AMD's implementation of SMAP differ, so I can fix the software pagewalk
>> to match the hardware it is running on).
> Interesting, do you have pointers to the differences?

There is one case to do with implicit accesses which is definitely
different.  There is no Implicit access flag in the error code (i.e.
insufficient architectural state), which leads to the following corner case:

https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/arch/x86/mm/shadow/multi.c;h=021ae252e47f2cf43ee1aaea9508105ee855d771;hb=4b60c40659b34b6577a6bc91eb4115458a0e425f#l2977

From what I understand, Intel's behaviour used to be sensible (from a
hypervisor point of view) in an earlier version of the SMAP spec, and
AMD implemented that version.  Then, a later version of the SMAP spec
inverted the behaviour for the case of a supervisor implicit access to
a user mapping.

AMD raised erratum 1053 "When SMAP is Enabled and EFLAGS.AC is Set, the
Processor Will Fail to Page Fault on an Implicit Supervisor Access to a
User Page", but are considering whether to not even fix it in future
silicon as I pointed out that this is better behaviour from software's
point of view.

Beyond that, there are further differences in the determination of access
rights, which I have yet to track down.  This work is on pause while I'm
L1TFing.

~Andrew


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 14:01               ` Andrew Cooper
@ 2018-08-09 15:00                 ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-09 15:00 UTC (permalink / raw)
  To: speck


On 09/08/2018 16:01, speck for Andrew Cooper wrote:
>>
>> I know because I tried just yesterday :) and as soon as I saw the
>> problem I remembered having tried at least once more in the past.  AFAIU
>> Intel is also convinced that it should be possible, so maybe I'm missing
>> something.  But I don't see a workaround.
> Oh lovely :(
> 
> Yes - architecturally, there is no memory access when a #PF occurs, so
> an EPT-based vmexit won't occur.

Well, I suppose I could just trap all #PF, or even use the PFEC mask to
restrict those to "present" page faults, when guestphysaddr <
hostphysaddr.  But it's extra complexity for bad reasons, and also it
would give lots of cache misses because of walking the page tables.

Paolo

> It would be interesting to hear an architect's take on reserved pagefault
> bits, but I expect "don't do that then" might be the answer.
> 




* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09  9:24   ` Paolo Bonzini
@ 2018-08-09 17:43     ` Andi Kleen
  2018-08-10  7:55       ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2018-08-09 17:43 UTC (permalink / raw)
  To: speck

> But that would only apply to processors with
> MAXPHYADDR=52.

We don't expect those to be vulnerable to L1TF.

So patches are not needed I think.

-Andi


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 10:01     ` Paolo Bonzini
  2018-08-09 10:47       ` Andrew Cooper
@ 2018-08-09 20:14       ` Jim Mattson
  1 sibling, 0 replies; 22+ messages in thread
From: Jim Mattson @ 2018-08-09 20:14 UTC (permalink / raw)
  To: speck

On 09/08/18 11:01, speck for Paolo Bonzini wrote:
> On 09/08/2018 11:33, speck for Andrew Cooper wrote:
>> Setting bit 51 doesn't mitigate L1TF on any current processor.
>>
>> You need to set an address bit that is inside L1D-maxphysaddr and
>> isn't cacheable on the current system.
> 
> KVM currently sets all reserved bits, so it's safe as long as
> l1d-maxphyaddr > cpuid-maxphyaddr.
> 
> What remains is pre-Nehalem or high-end servers where L1D-maxphyaddr =
> cpuid-maxphyaddr.  So we could split the guest gfn in two so that
>
> bit 47-51            = bits MPA-5..MPA-1 of guest physical address
> bit maxphyaddr-5..46 = all ones, or look it up using e820
> bit 12..maxphyaddr-6 = bits 12..MPA-6 of guest physical address.
>
> assuming all processors with maxphyaddr > 46 are safe.

This is sort of what this patch was trying to do, except that it was
only setting a single bit (maxphyaddr-1) to 1 instead of 5 bits (and
storing the displaced gfn bit in bit maxphyaddr rather than in bit
51). Setting more bits would mean that it also works on machines with
somewhat more physical RAM than half of the maximum supported.  If that is
important, then we can modify the patch to always set the 5 highest
supported address bits to 1.
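
A sketch of that variant, modelled on kvm_mmu_reset_all_pte_masks() from
the patch at the top of the thread (untested; the shift applied to the
gfn in mark_mmio_spte() would grow from 1 to 5 accordingly):

/*
 * Untested sketch: set the five highest supported physical address bits
 * in non-present/reserved SPTEs instead of just the top one.  The
 * displaced gfn bits then need to fit below bit 52.
 */
if (boot_cpu_data.x86_phys_bits <= 52 - 5)
	shadow_nonpresent_or_rsvd_mask =
		rsvd_bits(boot_cpu_data.x86_phys_bits - 5,
			  boot_cpu_data.x86_phys_bits - 1);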

>
>> and the CPUID maxphysaddr hasn't been
>> levelled for heterogeneous migration safety
>
> I don't know about Xen PV, but when using EPT you cannot do that, the
> maxphyaddr is not virtualizable (obviously not to guest-maxphyaddr >
> host-maxphyaddr, but guest-maxphyaddr < host-maxphyaddr cannot be
> emulated either).
>
> Paolo


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-09 17:43     ` Andi Kleen
@ 2018-08-10  7:55       ` Paolo Bonzini
  2018-08-10 15:59         ` Jim Mattson
  2018-08-10 17:23         ` Andi Kleen
  0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-10  7:55 UTC (permalink / raw)
  To: speck


On 09/08/2018 19:43, speck for Andi Kleen wrote:
>> But that would only apply to processors with
>> MAXPHYADDR=52.
> We don't expect those to be vulnerable to L1TF.
> 
> So patches are not needed I think.

Looks like they are still needed on processors with cache-maxphyaddr =
l1tf-maxphyaddr.

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10  7:55       ` Paolo Bonzini
@ 2018-08-10 15:59         ` Jim Mattson
  2018-08-10 17:23         ` Andi Kleen
  1 sibling, 0 replies; 22+ messages in thread
From: Jim Mattson @ 2018-08-10 15:59 UTC (permalink / raw)
  To: speck

2018-08-10 0:55 GMT-07:00 speck for Paolo Bonzini <speck@linutronix.de>:
> Looks like they are still needed on processors with cache-maxphyaddr = l1tf-maxphyaddr.

Right. If none of the reserved bits actually feed the L1D$
lookup, then something akin to Junaid's patch is
necessary. Junaid's patch only works, however, if there isn't any
cacheable memory with the highest supported physical address bit
set.
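
A hypothetical sanity check for that constraint might look like the
following (illustrative only; max_pfn stands in for the highest RAM pfn
reported by the e820 map):

/*
 * Illustrative only: warn if cacheable RAM reaches the address bit that
 * the non-present SPTE marking relies on being unpopulated.
 */
u64 guard = 1ull << (boot_cpu_data.x86_phys_bits - 1);

if (((u64)max_pfn << PAGE_SHIFT) >= guard)
	pr_warn("L1TF: RAM present above bit %d; marking non-present SPTEs "
		"with that bit is not sufficient\n",
		boot_cpu_data.x86_phys_bits - 1);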


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10  7:55       ` Paolo Bonzini
  2018-08-10 15:59         ` Jim Mattson
@ 2018-08-10 17:23         ` Andi Kleen
  2018-08-10 17:32           ` Linus Torvalds
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2018-08-10 17:23 UTC (permalink / raw)
  To: speck

On Fri, Aug 10, 2018 at 09:55:49AM +0200, speck for Paolo Bonzini wrote:
> On 09/08/2018 19:43, speck for Andi Kleen wrote:
> >> But that would only apply to processors with
> >> MAXPHYADDR=52.
> > We don't expect those to be vulnerable to L1TF.
> > 
> > So patches are not needed I think.
> 
> Looks like they are still needed on processors with cache-maxphyaddr =
> l1tf-maxphyaddr.

Not sure I follow your definitions. What is l1tf-maxphyaddr?

Normally we have the maxphyaddr reported by CPUID, and then we
have a larger maxphyaddr which could exist because VMs are lying
(up to 46 bits).

-Andi


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10 17:23         ` Andi Kleen
@ 2018-08-10 17:32           ` Linus Torvalds
  2018-08-10 17:45             ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2018-08-10 17:32 UTC (permalink / raw)
  To: speck



On Fri, 10 Aug 2018, speck for Andi Kleen wrote:
> 
> Normally we have the maxphyaddr reported by CPUID, and then we
> have a larger maxphyaddr which could exist because VMs are lying
> (up to 46 bits).

.. or because the CPU itself is lying. Didn't we have some CPU versions 
that reported a different maxphyaddr than the CPU internally actually had? 
I'm pretty sure I saw a patch that listed specific CPU versions with the 
"correction" to the reported size..

           Linus


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10 17:32           ` Linus Torvalds
@ 2018-08-10 17:45             ` Andi Kleen
  2018-08-10 18:37               ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2018-08-10 17:45 UTC (permalink / raw)
  To: speck

On Fri, Aug 10, 2018 at 10:32:07AM -0700, speck for Linus Torvalds wrote:
> 
> 
> On Fri, 10 Aug 2018, speck for Andi Kleen wrote:
> > 
> > Normally we have the maxphyaddr reported by CPUID, and then we
> > have a larger maxphyaddr which could exist because VMs are lying
> > (up to 46 bits).
> 
> .. or because the CPU itself is lying. Didn't we have some CPU versions 
> that reported a different maxphyaddr than the CPU internally actually had? 
> I'm pretty sure I saw a patch that listed specific CPU versions with the 
> "correction" to the reported size..

Yes, some client parts report a lower maxphyaddr than what the cache
internally uses, because their external bus supports less.
But in all cases it's <= 46 bits.

Still not clear how that applies to Paolo's terminology.

-Andi


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10 17:45             ` Andi Kleen
@ 2018-08-10 18:37               ` Paolo Bonzini
  2018-08-10 19:17                 ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-10 18:37 UTC (permalink / raw)
  To: speck


On 10/08/2018 19:45, speck for Andi Kleen wrote:
> On Fri, Aug 10, 2018 at 10:32:07AM -0700, speck for Linus Torvalds wrote:
>> On Fri, 10 Aug 2018, speck for Andi Kleen wrote:
>>> Not sure I follow your definitions. What is l1tf-maxphyaddr?
>>>
>>> Normally we have the maxphyaddr reported by CPUID, and then we
>>> have a larger maxphyaddr which could exist because VMs are lying
>>> (up to 46 bits).
>>
>> .. or because the CPU itself is lying. Didn't we have some CPU versions 
>> that reported a different maxphyaddr than the CPU internally actually had? 
>> I'm pretty sure I saw a patch that listed specific CPU versions with the 
>> "correction" to the reported size..
> 
> Yes, some client parts report a lower maxphyaddr than what the cache
> internally uses, because their external bus supports less.
> But in all cases it's <= 46 bits.
> 
> Still not clear how that applies to Paolo's terminology.

l1tf-maxphyaddr is the real maxphyaddr, cpuid-maxphyaddr is the lower
fake one reported by the CPU.

If they are equal, you need to ensure that your nonpresent PTEs use
unused/uncacheable portions of the address space.  If they are not
equal, you can use some of those reserved bits that exist in the cache
but not in the PTEs.
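
In code, the choice described above amounts to something like the
following sketch (every name here is a hypothetical placeholder, not an
existing KVM or Xen interface):

#include <stdint.h>

/* Hypothetical helper: pick the bits to stuff into non-present PTEs. */
static uint64_t choose_nonpresent_stuffing(int l1tf_maxphyaddr,
					   int cpuid_maxphyaddr,
					   uint64_t lowest_uncacheable_paddr)
{
	if (l1tf_maxphyaddr > cpuid_maxphyaddr)
		/* A PTE-reserved bit that still feeds the cache lookup exists. */
		return 1ull << cpuid_maxphyaddr;

	/* Otherwise point the entry at a known unused/uncacheable range. */
	return lowest_uncacheable_paddr;
}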

Thanks,

Paolo



* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10 18:37               ` Paolo Bonzini
@ 2018-08-10 19:17                 ` Andi Kleen
  2018-08-12 10:57                   ` Paolo Bonzini
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2018-08-10 19:17 UTC (permalink / raw)
  To: speck

> l1tf-maxphyaddr is the real maxphyaddr, cpuid-maxphyaddr is the lower
> fake one reported by the CPU.

It's not fake, it's what the bus supports.

> 
> If they are equal, you need to ensure that your nonpresent PTEs use
> unused/uncacheable portions of the address space.  If they are not
> equal, you can use some of those reserved bits that exist in the cache
> but not in the PTEs.

Right, but we always use up to 46 because we have to anyway, because
of the lying VMs.  And with 46 we're always safe on all current parts,
because it's guaranteed to be >= l1tf-maxphyaddr.

-Andi


* [MODERATED] Re: [PATCH] SPTE masking
  2018-08-10 19:17                 ` Andi Kleen
@ 2018-08-12 10:57                   ` Paolo Bonzini
  0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2018-08-12 10:57 UTC (permalink / raw)
  To: speck


On 10/08/2018 21:17, speck for Andi Kleen wrote:
>> l1tf-maxphyaddr is the real maxphyaddr, cpuid-maxphyaddr is the lower
>> fake one reported by the CPU.
> 
> It's not fake, it's what the bus supports.
> 
>>
>> If they are equal, you need to ensure that your nonpresent PTEs use
>> unused/uncacheable portions of the address space.  If they are not
>> equal, you can use some of those reserved bits that exist in the cache
>> but not in the PTEs.
> 
> Right, but we always use upto 46 because we have to anyways because
> of the lying VMs.  And with 46 we're always safe on all current parts,
> because it's guaranteed to be >= l1tf-maxphyaddr.

For KVM the problem is the opposite: if l1tf-maxphyaddr ==
cpuid-maxphyaddr you can end up with a cacheable frame number in invalid
PTEs.

Perhaps the simplest fix is to also do inversion.  I'll send a patch
next week after everything is public.
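
For reference, the inversion idea in SPTE terms might look roughly like
this (modelled on the bare-metal L1TF PTE inversion; a sketch only, not
the actual follow-up patch):

/*
 * Rough sketch: flip the physical-address bits of a non-present SPTE so
 * that a small, cacheable guest frame number decodes to an address near
 * the top of the supported range, which is assumed not to be cacheable.
 */
static u64 mark_spte_l1tf_safe(u64 spte)
{
	u64 addr_mask = rsvd_bits(PAGE_SHIFT, boot_cpu_data.x86_phys_bits - 1);

	return spte ^ addr_mask;        /* invert only the address field */
}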

Paolo


