* [PATCH v2 0/2] x86: initial simplistic Meltdown mitigation
@ 2018-01-15 11:01 Jan Beulich
  2018-01-15 11:06 ` [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-15 11:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

1: x86: Meltdown band-aid against malicious 64-bit PV guests
2: x86: allow Meltdown band-aid to be disabled

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Addressing review comments for patch 1 (see there) and new
    patch 2.




* [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-15 11:01 [PATCH v2 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
@ 2018-01-15 11:06 ` Jan Beulich
  2018-01-15 18:23   ` Andrew Cooper
  2018-01-15 11:07 ` [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
  2018-01-16 15:16 ` [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-15 11:06 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.
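
For illustration only - the real implementation is in the assembly entry/exit
paths further down - the per-CPU page table handling amounts to roughly the
following C-like pseudocode. Here copy_guest_root_slots() is a made-up name
for the rep-movsq copies, switch_cr3() stands for the write_cr3 assembler
macro added in asm_defns.h, and the negative-xen_cr3 handling used by the
nestable (interrupt/exception) entry points is omitted:

    /* Exit to 64-bit PV guest context (restore_all_guest): */
    get_cpu_info()->xen_cr3 = read_cr3();       /* value to switch back to */
    copy_guest_root_slots(this_cpu(root_pgt));  /* guest L4 slots plus the
                                                   SH_LINEAR_PT one */
    switch_cr3(get_cpu_info()->pv_cr3);         /* per-CPU, minimal L4 */

    /* (Re-)entry into Xen (system call / interrupt / exception paths): */
    if ( get_cpu_info()->xen_cr3 )              /* zero: nothing to restore */
    {
        switch_cr3(get_cpu_info()->xen_cr3);
        get_cpu_info()->xen_cr3 = 0;
    }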

The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.

There is certainly much room for improvement here, especially regarding
performance - first and foremost suppressing all the negative effects on
AMD systems. But in the interest of backportability (including to really
old hypervisors, which may not even have alternative patching), any such
improvements are being left out here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Use sensible numbers for the register classification constants also
    for the low 8 registers. Insert 'r' into their names to make their
    purpose more clear. Use "sub" instead of "add" when adding an
    immediate of 128. Defer sync IPI in do_mmu_update() until the end of
    the main loop. Use a flush IPI there instead of an event check one. Make
    setup functions return a proper error code. Use better suited local
    variables in clone_mapping(). Add comment to new struct cpu_info
    fields.
---
This also wants Andrew's "[PATCH RFC 11/44] x86/pt-shadow: Always set
_PAGE_ACCESSED on L4e updates".
---
Backporting notes:
- This needs f9eb74789a ("x86/entry: Remove support for partial
  cpu_user_regs frames") as a prereq, due to the uses of %r14 and %r15.
  But that's intended to be backported anyway (for Spectre/SP2).
- The use of "root" instead of "l4" here is mainly to not make 5-level
  page table additions any harder. In backports "l4" should probably be
  preferred.
---
Follow-up notes:
- use alternatives patching for fully suppressing performance effects
  when disabled
- check whether running globally with CR4.PGE clear helps overall
  performance
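
For reference, the write_cr3 assembler macro introduced below (asm_defns.h)
obtains a full TLB flush, including global entries, by temporarily clearing
CR4.PGE around the CR3 write. In C terms this is roughly the following
(illustrative sketch only; the function name is made up, while read_cr4(),
write_cr4(), and X86_CR4_PGE are the existing definitions). Running with
CR4.PGE clear altogether, as per the second note above, would presumably
avoid the toggling at the price of losing global mappings for Xen's own
translations:

    static void write_cr3_flush_global(unsigned long cr3)
    {
        unsigned long cr4 = read_cr4();

        /*
         * Clearing PGE flushes global TLB entries, which a plain CR3
         * write would leave in place.
         */
        write_cr4(cr4 & ~X86_CR4_PGE);
        asm volatile ( "mov %0, %%cr3" :: "r" (cr3) : "memory" );
        write_cr4(cr4);                          /* restores CR4.PGE */
    }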

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1511,6 +1511,9 @@ void paravirt_ctxt_switch_to(struct vcpu
 {
     unsigned long cr4;
 
+    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
+        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
         write_cr4(cr4);
@@ -1682,6 +1685,8 @@ void context_switch(struct vcpu *prev, s
 
     ASSERT(local_irq_is_enabled());
 
+    get_cpu_info()->xen_cr3 = 0;
+
     cpumask_copy(&dirty_mask, next->vcpu_dirty_cpumask);
     /* Allow at most one CPU at a time to be dirty. */
     ASSERT(cpumask_weight(&dirty_mask) <= 1);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3520,6 +3520,7 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
+    bool sync_guest = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -3683,6 +3684,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        sync_guest = true;
                     break;
 
                 case PGT_writable_page:
@@ -3787,6 +3790,24 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
+    if ( sync_guest )
+    {
+        /*
+         * Force other vCPU-s of the affected guest to pick up L4 entry
+         * changes (if any). Issue a flush IPI with empty operation mask to
+         * facilitate this (including ourselves waiting for the IPI to
+         * actually have arrived). Utilize the fact that FLUSH_VA_VALID is
+         * meaningless without FLUSH_CACHE, but will allow to pass the no-op
+         * check in flush_area_mask().
+         */
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+
+        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
+        if ( !cpumask_empty(mask) )
+            flush_area_mask(mask, ZERO_BLOCK_PTR, FLUSH_VA_VALID);
+    }
+
     perfc_add(num_page_updates, i);
 
  out:
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -327,6 +327,9 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->xen_cr3 = 0;
+    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
+
     load_system_tables();
 
     /* Full exception support from here on in. */
@@ -633,6 +636,187 @@ void cpu_exit_clear(unsigned int cpu)
     set_cpu_state(CPU_STATE_DEAD);
 }
 
+static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
+{
+    unsigned long linear = (unsigned long)ptr, pfn;
+    unsigned int flags;
+    l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
+                         l3_table_offset(linear);
+    l2_pgentry_t *pl2e;
+    l1_pgentry_t *pl1e;
+
+    if ( linear < DIRECTMAP_VIRT_START )
+        return 0;
+
+    flags = l3e_get_flags(*pl3e);
+    ASSERT(flags & _PAGE_PRESENT);
+    if ( flags & _PAGE_PSE )
+    {
+        pfn = (l3e_get_pfn(*pl3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
+              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
+        flags &= ~_PAGE_PSE;
+    }
+    else
+    {
+        pl2e = l3e_to_l2e(*pl3e) + l2_table_offset(linear);
+        flags = l2e_get_flags(*pl2e);
+        ASSERT(flags & _PAGE_PRESENT);
+        if ( flags & _PAGE_PSE )
+        {
+            pfn = (l2e_get_pfn(*pl2e) & ~((1UL << PAGETABLE_ORDER) - 1)) |
+                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
+            flags &= ~_PAGE_PSE;
+        }
+        else
+        {
+            pl1e = l2e_to_l1e(*pl2e) + l1_table_offset(linear);
+            flags = l1e_get_flags(*pl1e);
+            if ( !(flags & _PAGE_PRESENT) )
+                return 0;
+            pfn = l1e_get_pfn(*pl1e);
+        }
+    }
+
+    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
+    {
+        pl3e = alloc_xen_pagetable();
+        if ( !pl3e )
+            return -ENOMEM;
+        clear_page(pl3e);
+        l4e_write(&rpt[root_table_offset(linear)],
+                  l4e_from_paddr(__pa(pl3e), __PAGE_HYPERVISOR));
+    }
+    else
+        pl3e = l4e_to_l3e(rpt[root_table_offset(linear)]);
+
+    pl3e += l3_table_offset(linear);
+
+    if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+    {
+        pl2e = alloc_xen_pagetable();
+        if ( !pl2e )
+            return -ENOMEM;
+        clear_page(pl2e);
+        l3e_write(pl3e, l3e_from_paddr(__pa(pl2e), __PAGE_HYPERVISOR));
+    }
+    else
+    {
+        ASSERT(!(l3e_get_flags(*pl3e) & _PAGE_PSE));
+        pl2e = l3e_to_l2e(*pl3e);
+    }
+
+    pl2e += l2_table_offset(linear);
+
+    if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
+    {
+        pl1e = alloc_xen_pagetable();
+        if ( !pl1e )
+            return -ENOMEM;
+        clear_page(pl1e);
+        l2e_write(pl2e, l2e_from_paddr(__pa(pl1e), __PAGE_HYPERVISOR));
+    }
+    else
+    {
+        ASSERT(!(l2e_get_flags(*pl2e) & _PAGE_PSE));
+        pl1e = l2e_to_l1e(*pl2e);
+    }
+
+    pl1e += l1_table_offset(linear);
+
+    if ( l1e_get_flags(*pl1e) & _PAGE_PRESENT )
+    {
+        ASSERT(l1e_get_pfn(*pl1e) == pfn);
+        ASSERT(l1e_get_flags(*pl1e) == flags);
+    }
+    else
+        l1e_write(pl1e, l1e_from_pfn(pfn, flags));
+
+    return 0;
+}
+
+DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
+
+static int setup_cpu_root_pgt(unsigned int cpu)
+{
+    root_pgentry_t *rpt = alloc_xen_pagetable();
+    unsigned int off;
+    int rc;
+
+    if ( !rpt )
+        return -ENOMEM;
+
+    clear_page(rpt);
+    per_cpu(root_pgt, cpu) = rpt;
+
+    rpt[root_table_offset(RO_MPT_VIRT_START)] =
+        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
+    /* SH_LINEAR_PT inserted together with guest mappings. */
+    /* PERDOMAIN inserted during context switch. */
+    rpt[root_table_offset(XEN_VIRT_START)] =
+        idle_pg_table[root_table_offset(XEN_VIRT_START)];
+
+    /* Install direct map page table entries for stack, IDT, and TSS. */
+    for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
+        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
+
+    if ( !rc )
+        rc = clone_mapping(idt_tables[cpu], rpt);
+    if ( !rc )
+        rc = clone_mapping(&per_cpu(init_tss, cpu), rpt);
+
+    return rc;
+}
+
+static void cleanup_cpu_root_pgt(unsigned int cpu)
+{
+    root_pgentry_t *rpt = per_cpu(root_pgt, cpu);
+    unsigned int r;
+
+    if ( !rpt )
+        return;
+
+    per_cpu(root_pgt, cpu) = NULL;
+
+    for ( r = root_table_offset(DIRECTMAP_VIRT_START);
+          r < root_table_offset(HYPERVISOR_VIRT_END); ++r )
+    {
+        l3_pgentry_t *l3t;
+        unsigned int i3;
+
+        if ( !(root_get_flags(rpt[r]) & _PAGE_PRESENT) )
+            continue;
+
+        l3t = l4e_to_l3e(rpt[r]);
+
+        for ( i3 = 0; i3 < L3_PAGETABLE_ENTRIES; ++i3 )
+        {
+            l2_pgentry_t *l2t;
+            unsigned int i2;
+
+            if ( !(l3e_get_flags(l3t[i3]) & _PAGE_PRESENT) )
+                continue;
+
+            ASSERT(!(l3e_get_flags(l3t[i3]) & _PAGE_PSE));
+            l2t = l3e_to_l2e(l3t[i3]);
+
+            for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; ++i2 )
+            {
+                if ( !(l2e_get_flags(l2t[i2]) & _PAGE_PRESENT) )
+                    continue;
+
+                ASSERT(!(l2e_get_flags(l2t[i2]) & _PAGE_PSE));
+                free_xen_pagetable(l2e_to_l1e(l2t[i2]));
+            }
+
+            free_xen_pagetable(l2t);
+        }
+
+        free_xen_pagetable(l3t);
+    }
+
+    free_xen_pagetable(rpt);
+}
+
 static void cpu_smpboot_free(unsigned int cpu)
 {
     unsigned int order, socket = cpu_to_socket(cpu);
@@ -671,6 +855,8 @@ static void cpu_smpboot_free(unsigned in
             free_domheap_page(mfn_to_page(mfn));
     }
 
+    cleanup_cpu_root_pgt(cpu);
+
     order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
     free_xenheap_pages(per_cpu(gdt_table, cpu), order);
 
@@ -727,6 +913,11 @@ static int cpu_smpboot_alloc(unsigned in
     set_ist(&idt_tables[cpu][TRAP_nmi],           IST_NONE);
     set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
 
+    rc = setup_cpu_root_pgt(cpu);
+    if ( rc )
+        goto out;
+    rc = -ENOMEM;
+
     for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
           i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
         if ( cpu_online(i) && cpu_to_node(i) == node )
@@ -786,6 +977,8 @@ static struct notifier_block cpu_smpboot
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
+    int rc;
+
     register_cpu_notifier(&cpu_smpboot_nfb);
 
     mtrr_aps_sync_begin();
@@ -799,6 +992,11 @@ void __init smp_prepare_cpus(unsigned in
 
     stack_base[0] = stack_start;
 
+    rc = setup_cpu_root_pgt(0);
+    if ( rc )
+        panic("Error %d setting up PV root page table\n", rc);
+    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+
     set_nr_sockets();
 
     socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
@@ -867,6 +1065,8 @@ void __init smp_prepare_boot_cpu(void)
 #if NR_CPUS > 2 * BITS_PER_LONG
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
+
+    get_cpu_info()->xen_cr3 = 0;
 }
 
 static void
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -137,6 +137,8 @@ void __dummy__(void)
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
+    OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
+    OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -199,6 +199,17 @@ ENTRY(cstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+UNLIKELY_START(nz, cstar_cr3)
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+UNLIKELY_END(cstar_cr3)
+
         GET_CURRENT(bx)
         movq  VCPU_domain(%rbx),%rcx
         cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
 /* %rbx: struct vcpu, interrupts disabled */
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
+
+        /* Copy guest mappings and switch to per-CPU root page table. */
+        mov   %cr3, %r9
+        GET_STACK_END(dx)
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        movabs $PADDR_MASK & PAGE_MASK, %rsi
+        movabs $DIRECTMAP_VIRT_START, %rcx
+        mov   %rdi, %rax
+        and   %rsi, %rdi
+        and   %r9, %rsi
+        add   %rcx, %rdi
+        add   %rcx, %rsi
+        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
+        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
+        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
+        rep movsq
+        mov   $ROOT_PAGETABLE_ENTRIES - \
+               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
+        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
+                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
+        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
+                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
+        rep movsq
+        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        write_cr3 rax, rdi, rsi
+
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -71,6 +97,18 @@ iret_exit_to_guest:
         ALIGN
 /* No special register assumptions. */
 restore_all_xen:
+        /*
+         * Check whether we need to switch to the per-CPU page tables, in
+         * case we return to late PV exit code (from an NMI or #MC).
+         */
+        GET_STACK_END(ax)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
+        test  %rdx, %rdx
+UNLIKELY_START(g, exit_cr3)
+        write_cr3 rax, rdi, rsi
+UNLIKELY_END(exit_cr3)
+
         RESTORE_ALL adj=8
         iretq
 
@@ -100,7 +138,18 @@ ENTRY(lstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-        GET_CURRENT(bx)
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Llstar_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Llstar_cr3_okay:
+
+        __GET_CURRENT(bx)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
 
@@ -192,7 +241,18 @@ GLOBAL(sysenter_eflags_saved)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-        GET_CURRENT(bx)
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Lsyse_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Lsyse_cr3_okay:
+
+        __GET_CURRENT(bx)
         cmpb  $0,VCPU_sysenter_disables_events(%rbx)
         movq  VCPU_sysenter_addr(%rbx),%rax
         setne %cl
@@ -228,13 +288,23 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
 
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+UNLIKELY_START(nz, int80_cr3)
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+UNLIKELY_END(int80_cr3)
+
         cmpb  $0,untrusted_msi(%rip)
 UNLIKELY_START(ne, msi_check)
         movl  $0x80,%edi
         call  check_for_unexpected_msi
 UNLIKELY_END(msi_check)
 
-        GET_CURRENT(bx)
+        __GET_CURRENT(bx)
 
         /* Check that the callback is non-null. */
         leaq  VCPU_int80_bounce(%rbx),%rdx
@@ -391,9 +461,27 @@ ENTRY(dom_crash_sync_extable)
 
 ENTRY(common_interrupt)
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .Lintr_cr3_okay
+        jns   .Lintr_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.Lintr_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        xor   %ecx, %ecx
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        testb $3, UREGS_cs(%rsp)
+        cmovnz %rcx, %r15
+.Lintr_cr3_okay:
+
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp ret_from_intr
 
 /* No special register assumptions. */
@@ -411,6 +499,23 @@ ENTRY(page_fault)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .Lxcpt_cr3_okay
+        jns   .Lxcpt_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.Lxcpt_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        xor   %ecx, %ecx
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        testb $3, UREGS_cs(%rsp)
+        cmovnz %rcx, %r15
+.Lxcpt_cr3_okay:
+
 handle_exception_saved:
         GET_CURRENT(bx)
         testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
@@ -475,6 +580,7 @@ handle_exception_saved:
         leaq  exception_table(%rip),%rdx
         PERFC_INCR(exceptions, %rax, %rbx)
         callq *(%rdx,%rax,8)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
@@ -507,6 +613,7 @@ exception_with_ints_disabled:
         rep;  movsq                     # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp   restore_all_xen           # return to fixup code
 
 /* No special register assumptions. */
@@ -585,6 +692,17 @@ ENTRY(double_fault)
         movl  $TRAP_double_fault,4(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
+        test  %rbx, %rbx
+        jz    .Ldblf_cr3_okay
+        jns   .Ldblf_cr3_load
+        neg   %rbx
+.Ldblf_cr3_load:
+        write_cr3 rbx, rdi, rsi
+.Ldblf_cr3_okay:
+
         movq  %rsp,%rdi
         call  do_double_fault
         BUG   /* do_double_fault() shouldn't return. */
@@ -603,10 +721,28 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
 handle_ist_exception:
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .List_cr3_okay
+        jns   .List_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.List_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+.List_cr3_okay:
+
         CR4_PV32_RESTORE
         testb $3,UREGS_cs(%rsp)
         jz    1f
-        /* Interrupted guest context. Copy the context to stack bottom. */
+        /*
+         * Interrupted guest context. Clear the restore value for xen_cr3
+         * and copy the context to stack bottom.
+         */
+        xor   %r15, %r15
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -616,6 +752,7 @@ handle_ist_exception:
         movzbl UREGS_entry_vector(%rsp),%eax
         leaq  exception_table(%rip),%rdx
         callq *(%rdx,%rax,8)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
 
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -93,9 +93,30 @@ void ret_from_intr(void);
         UNLIKELY_DONE(mp, tag);   \
         __UNLIKELY_END(tag)
 
+        .equ .Lrax, 0
+        .equ .Lrcx, 1
+        .equ .Lrdx, 2
+        .equ .Lrbx, 3
+        .equ .Lrsp, 4
+        .equ .Lrbp, 5
+        .equ .Lrsi, 6
+        .equ .Lrdi, 7
+        .equ .Lr8,  8
+        .equ .Lr9,  9
+        .equ .Lr10, 10
+        .equ .Lr11, 11
+        .equ .Lr12, 12
+        .equ .Lr13, 13
+        .equ .Lr14, 14
+        .equ .Lr15, 15
+
 #define STACK_CPUINFO_FIELD(field) (1 - CPUINFO_sizeof + CPUINFO_##field)
 #define GET_STACK_END(reg)                        \
+        .if .Lr##reg > 8;                         \
+        movq $STACK_SIZE-1, %r##reg;              \
+        .else;                                    \
         movl $STACK_SIZE-1, %e##reg;              \
+        .endif;                                   \
         orq  %rsp, %r##reg
 
 #define GET_CPUINFO_FIELD(field, reg)             \
@@ -177,6 +198,15 @@ void ret_from_intr(void);
 #define ASM_STAC ASM_AC(STAC)
 #define ASM_CLAC ASM_AC(CLAC)
 
+.macro write_cr3 val:req, tmp1:req, tmp2:req
+        mov   %cr4, %\tmp1
+        mov   %\tmp1, %\tmp2
+        and   $~X86_CR4_PGE, %\tmp1
+        mov   %\tmp1, %cr4
+        mov   %\val, %cr3
+        mov   %\tmp2, %cr4
+.endm
+
 #define CR4_PV32_RESTORE                                           \
         667: ASM_NOP5;                                             \
         .pushsection .altinstr_replacement, "ax";                  \
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -41,6 +41,18 @@ struct cpu_info {
     struct vcpu *current_vcpu;
     unsigned long per_cpu_offset;
     unsigned long cr4;
+    /*
+     * Of the two following fields the latter is being set to the CR3 value
+     * to be used on the given pCPU for loading whenever 64-bit PV guest
+     * context is being entered. The value never changes once set.
+     * The former is the value to restore when re-entering Xen, if any. IOW
+     * its value being zero means there's nothing to restore. However, its
+     * value can also be negative, indicating to the exit-to-Xen code that
+     * restoring is not necessary, but allowing any nested entry code paths
+     * to still know the value to put back into CR3.
+     */
+    unsigned long xen_cr3;
+    unsigned long pv_cr3;
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -462,6 +462,7 @@ extern idt_entry_t idt_table[];
 extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(struct tss_struct, init_tss);
+DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
 extern void init_int80_direct_trap(struct vcpu *v);
 
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -24,8 +24,8 @@
 /* These are architectural limits. Current CPUs support only 40-bit phys. */
 #define PADDR_BITS              52
 #define VADDR_BITS              48
-#define PADDR_MASK              ((1UL << PADDR_BITS)-1)
-#define VADDR_MASK              ((1UL << VADDR_BITS)-1)
+#define PADDR_MASK              ((_AC(1,UL) << PADDR_BITS)-1)
+#define VADDR_MASK              ((_AC(1,UL) << VADDR_BITS)-1)
 
 #define VADDR_TOP_BIT           (1UL << (VADDR_BITS - 1))
 #define CANONICAL_MASK          (~0UL & ~VADDR_MASK)
@@ -107,6 +107,7 @@ typedef l4_pgentry_t root_pgentry_t;
       : (((_s) < ROOT_PAGETABLE_FIRST_XEN_SLOT) ||  \
          ((_s) > ROOT_PAGETABLE_LAST_XEN_SLOT)))
 
+#define root_table_offset         l4_table_offset
 #define root_get_pfn              l4e_get_pfn
 #define root_get_flags            l4e_get_flags
 #define root_get_intpte           l4e_get_intpte




* [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-15 11:01 [PATCH v2 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
  2018-01-15 11:06 ` [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
@ 2018-01-15 11:07 ` Jan Beulich
  2018-01-15 18:26   ` Andrew Cooper
  2018-01-16 12:12   ` George Dunlap
  2018-01-16 15:16 ` [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
  2 siblings, 2 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-15 11:07 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

First of all, we don't need it on AMD systems. Additionally, allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve the
intended effect - while we likely want that, it will be a later follow-up.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1849,6 +1849,15 @@ In the case that x2apic is in use, this
 clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
+### xpti
+> `= <boolean>`
+
+> Default: `false` on AMD hardware
+> Default: `true` everywhere else
+
+Override default selection of whether to isolate 64-bit PV guest page
+tables.
+
 ### xsave
 > `= <boolean>`
 
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1511,8 +1511,10 @@ void paravirt_ctxt_switch_to(struct vcpu
 {
     unsigned long cr4;
 
-    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+    if ( this_cpu(root_pgt) )
+        this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
+            l4e_from_page(v->domain->arch.perdomain_l3_pg,
+                          __PAGE_HYPERVISOR_RW);
 
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3685,7 +3685,7 @@ long do_mmu_update(
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                     if ( !rc )
-                        sync_guest = true;
+                        sync_guest = this_cpu(root_pgt);
                     break;
 
                 case PGT_writable_page:
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -328,7 +328,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
+    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
 
     load_system_tables();
 
@@ -734,14 +734,20 @@ static int clone_mapping(const void *ptr
     return 0;
 }
 
+static __read_mostly int8_t opt_xpti = -1;
+boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
-    root_pgentry_t *rpt = alloc_xen_pagetable();
+    root_pgentry_t *rpt;
     unsigned int off;
     int rc;
 
+    if ( !opt_xpti )
+        return 0;
+
+    rpt = alloc_xen_pagetable();
     if ( !rpt )
         return -ENOMEM;
 
@@ -992,10 +998,13 @@ void __init smp_prepare_cpus(unsigned in
 
     stack_base[0] = stack_start;
 
+    if ( opt_xpti < 0 )
+        opt_xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
     rc = setup_cpu_root_pgt(0);
     if ( rc )
         panic("Error %d setting up PV root page table\n", rc);
-    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+    if ( per_cpu(root_pgt, 0) )
+        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
 
     set_nr_sockets();
 
@@ -1067,6 +1076,7 @@ void __init smp_prepare_boot_cpu(void)
 #endif
 
     get_cpu_info()->xen_cr3 = 0;
+    get_cpu_info()->pv_cr3 = 0;
 }
 
 static void
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -46,6 +46,7 @@ restore_all_guest:
         movabs $DIRECTMAP_VIRT_START, %rcx
         mov   %rdi, %rax
         and   %rsi, %rdi
+        jz    .Lrag_keep_cr3
         and   %r9, %rsi
         add   %rcx, %rdi
         add   %rcx, %rsi
@@ -62,6 +63,7 @@ restore_all_guest:
         rep movsq
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         write_cr3 rax, rdi, rsi
+.Lrag_keep_cr3:
 
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)





* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-15 11:06 ` [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
@ 2018-01-15 18:23   ` Andrew Cooper
  2018-01-16  7:46     ` Jan Beulich
  2018-01-16  9:33     ` Jan Beulich
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Cooper @ 2018-01-15 18:23 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 15/01/18 11:06, Jan Beulich wrote:
> This is a very simplistic change limiting the amount of memory a running
> 64-bit PV guest has mapped (and hence available for attacking): Only the
> mappings of stack, IDT, and TSS are being cloned from the direct map
> into per-CPU page tables. Guest controlled parts of the page tables are
> being copied into those per-CPU page tables upon entry into the guest.
> Cross-vCPU synchronization of top level page table entry changes is
> being effected by forcing other active vCPU-s of the guest into the
> hypervisor.
>
> The change to context_switch() isn't strictly necessary, but there's no
> reason to keep switching page tables once a PV guest is being scheduled
> out.
>
> There is certainly much room for improvement here, especially regarding
> performance - first and foremost suppressing all the negative effects on
> AMD systems. But in the interest of backportability (including to really
> old hypervisors, which may not even have alternative patching), any such
> improvements are being left out here.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Use sensible numbers for the register classification constants also
>     for the low 8 registers. Insert 'r' into their names to make their
>     purpose more clear. Use "sub" instead of "add" when adding an
>     immediate of 128. Defer sync IPI in do_mmu_update() until the end of
>     the main loop. Use a flush IPI there instead of an event check one. Make
>     setup functions return a proper error code. Use better suited local
>     variables in clone_mapping(). Add comment to new struct cpu_info
>     fields.
> ---
> This also wants Andrew's "[PATCH RFC 11/44] x86/pt-shadow: Always set
> _PAGE_ACCESSED on L4e updates".

I've cleaned this patch up and committed it in preparation.

http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=bd61fe94bee0556bc2f64999a4a8315b93f90f21

> ---
> Backporting notes:
> - This needs f9eb74789a ("x86/entry: Remove support for partial
>   cpu_user_regs frames") as a prereq, due to the uses of %r14 and %r15.
>   But that's intended to be backported anyway (for Spectre/SP2).
> - The use of "root" instead of "l4" here is mainly to not make 5-level
>   page table additions any harder. In backports "l4" should probably be
>   preferred.
> ---
> Follow-up notes:
> - use alternatives patching for fully suppressing performance effects
>   when disabled
> - check whether running globally with CR4.PGE clear helps overall
>   performance
>
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1511,6 +1511,9 @@ void paravirt_ctxt_switch_to(struct vcpu
>  {
>      unsigned long cr4;
>  
> +    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
> +        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
> +
>      cr4 = pv_guest_cr4_to_real_cr4(v);
>      if ( unlikely(cr4 != read_cr4()) )
>          write_cr4(cr4);
> @@ -1682,6 +1685,8 @@ void context_switch(struct vcpu *prev, s
>  
>      ASSERT(local_irq_is_enabled());
>  
> +    get_cpu_info()->xen_cr3 = 0;
> +
>      cpumask_copy(&dirty_mask, next->vcpu_dirty_cpumask);
>      /* Allow at most one CPU at a time to be dirty. */
>      ASSERT(cpumask_weight(&dirty_mask) <= 1);
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3520,6 +3520,7 @@ long do_mmu_update(
>      struct vcpu *curr = current, *v = curr;
>      struct domain *d = v->domain, *pt_owner = d, *pg_owner;
>      mfn_t map_mfn = INVALID_MFN;
> +    bool sync_guest = false;
>      uint32_t xsm_needed = 0;
>      uint32_t xsm_checked = 0;
>      int rc = put_old_guest_table(curr);
> @@ -3683,6 +3684,8 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> +                    if ( !rc )
> +                        sync_guest = true;
>                      break;
>  
>                  case PGT_writable_page:
> @@ -3787,6 +3790,24 @@ long do_mmu_update(
>      if ( va )
>          unmap_domain_page(va);
>  
> +    if ( sync_guest )
> +    {
> +        /*
> +         * Force other vCPU-s of the affected guest to pick up L4 entry
> +         * changes (if any). Issue a flush IPI with empty operation mask to
> +         * facilitate this (including ourselves waiting for the IPI to
> +         * actually have arrived). Utilize the fact that FLUSH_VA_VALID is
> +         * meaningless without FLUSH_CACHE, but will allow to pass the no-op
> +         * check in flush_area_mask().
> +         */
> +        unsigned int cpu = smp_processor_id();
> +        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> +
> +        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
> +        if ( !cpumask_empty(mask) )
> +            flush_area_mask(mask, ZERO_BLOCK_PTR, FLUSH_VA_VALID);
> +    }
> +
>      perfc_add(num_page_updates, i);
>  
>   out:
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -327,6 +327,9 @@ void start_secondary(void *unused)
>       */
>      spin_debug_disable();
>  
> +    get_cpu_info()->xen_cr3 = 0;
> +    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
> +
>      load_system_tables();
>  
>      /* Full exception support from here on in. */
> @@ -633,6 +636,187 @@ void cpu_exit_clear(unsigned int cpu)
>      set_cpu_state(CPU_STATE_DEAD);
>  }
>  
> +static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
> +{
> +    unsigned long linear = (unsigned long)ptr, pfn;
> +    unsigned int flags;
> +    l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
> +                         l3_table_offset(linear);
> +    l2_pgentry_t *pl2e;
> +    l1_pgentry_t *pl1e;
> +
> +    if ( linear < DIRECTMAP_VIRT_START )
> +        return 0;
> +
> +    flags = l3e_get_flags(*pl3e);
> +    ASSERT(flags & _PAGE_PRESENT);
> +    if ( flags & _PAGE_PSE )
> +    {
> +        pfn = (l3e_get_pfn(*pl3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
> +              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
> +        flags &= ~_PAGE_PSE;
> +    }
> +    else
> +    {
> +        pl2e = l3e_to_l2e(*pl3e) + l2_table_offset(linear);
> +        flags = l2e_get_flags(*pl2e);
> +        ASSERT(flags & _PAGE_PRESENT);
> +        if ( flags & _PAGE_PSE )
> +        {
> +            pfn = (l2e_get_pfn(*pl2e) & ~((1UL << PAGETABLE_ORDER) - 1)) |
> +                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
> +            flags &= ~_PAGE_PSE;
> +        }
> +        else
> +        {
> +            pl1e = l2e_to_l1e(*pl2e) + l1_table_offset(linear);
> +            flags = l1e_get_flags(*pl1e);
> +            if ( !(flags & _PAGE_PRESENT) )
> +                return 0;
> +            pfn = l1e_get_pfn(*pl1e);
> +        }
> +    }
> +
> +    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
> +    {
> +        pl3e = alloc_xen_pagetable();
> +        if ( !pl3e )
> +            return -ENOMEM;
> +        clear_page(pl3e);
> +        l4e_write(&rpt[root_table_offset(linear)],
> +                  l4e_from_paddr(__pa(pl3e), __PAGE_HYPERVISOR));
> +    }
> +    else
> +        pl3e = l4e_to_l3e(rpt[root_table_offset(linear)]);
> +
> +    pl3e += l3_table_offset(linear);
> +
> +    if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
> +    {
> +        pl2e = alloc_xen_pagetable();
> +        if ( !pl2e )
> +            return -ENOMEM;
> +        clear_page(pl2e);
> +        l3e_write(pl3e, l3e_from_paddr(__pa(pl2e), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l3e_get_flags(*pl3e) & _PAGE_PSE));
> +        pl2e = l3e_to_l2e(*pl3e);
> +    }
> +
> +    pl2e += l2_table_offset(linear);
> +
> +    if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
> +    {
> +        pl1e = alloc_xen_pagetable();
> +        if ( !pl1e )
> +            return -ENOMEM;
> +        clear_page(pl1e);
> +        l2e_write(pl2e, l2e_from_paddr(__pa(pl1e), __PAGE_HYPERVISOR));
> +    }
> +    else
> +    {
> +        ASSERT(!(l2e_get_flags(*pl2e) & _PAGE_PSE));
> +        pl1e = l2e_to_l1e(*pl2e);
> +    }
> +
> +    pl1e += l1_table_offset(linear);
> +
> +    if ( l1e_get_flags(*pl1e) & _PAGE_PRESENT )
> +    {
> +        ASSERT(l1e_get_pfn(*pl1e) == pfn);
> +        ASSERT(l1e_get_flags(*pl1e) == flags);
> +    }
> +    else
> +        l1e_write(pl1e, l1e_from_pfn(pfn, flags));
> +
> +    return 0;
> +}
> +
> +DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
> +
> +static int setup_cpu_root_pgt(unsigned int cpu)
> +{
> +    root_pgentry_t *rpt = alloc_xen_pagetable();
> +    unsigned int off;
> +    int rc;
> +
> +    if ( !rpt )
> +        return -ENOMEM;
> +
> +    clear_page(rpt);
> +    per_cpu(root_pgt, cpu) = rpt;
> +
> +    rpt[root_table_offset(RO_MPT_VIRT_START)] =
> +        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
> +    /* SH_LINEAR_PT inserted together with guest mappings. */
> +    /* PERDOMAIN inserted during context switch. */
> +    rpt[root_table_offset(XEN_VIRT_START)] =
> +        idle_pg_table[root_table_offset(XEN_VIRT_START)];
> +
> +    /* Install direct map page table entries for stack, IDT, and TSS. */
> +    for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
> +        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
> +
> +    if ( !rc )
> +        rc = clone_mapping(idt_tables[cpu], rpt);
> +    if ( !rc )
> +        rc = clone_mapping(&per_cpu(init_tss, cpu), rpt);
> +
> +    return rc;
> +}
> +
> +static void cleanup_cpu_root_pgt(unsigned int cpu)
> +{
> +    root_pgentry_t *rpt = per_cpu(root_pgt, cpu);
> +    unsigned int r;
> +
> +    if ( !rpt )
> +        return;
> +
> +    per_cpu(root_pgt, cpu) = NULL;
> +
> +    for ( r = root_table_offset(DIRECTMAP_VIRT_START);
> +          r < root_table_offset(HYPERVISOR_VIRT_END); ++r )
> +    {
> +        l3_pgentry_t *l3t;
> +        unsigned int i3;
> +
> +        if ( !(root_get_flags(rpt[r]) & _PAGE_PRESENT) )
> +            continue;
> +
> +        l3t = l4e_to_l3e(rpt[r]);
> +
> +        for ( i3 = 0; i3 < L3_PAGETABLE_ENTRIES; ++i3 )
> +        {
> +            l2_pgentry_t *l2t;
> +            unsigned int i2;
> +
> +            if ( !(l3e_get_flags(l3t[i3]) & _PAGE_PRESENT) )
> +                continue;
> +
> +            ASSERT(!(l3e_get_flags(l3t[i3]) & _PAGE_PSE));
> +            l2t = l3e_to_l2e(l3t[i3]);
> +
> +            for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; ++i2 )
> +            {
> +                if ( !(l2e_get_flags(l2t[i2]) & _PAGE_PRESENT) )
> +                    continue;
> +
> +                ASSERT(!(l2e_get_flags(l2t[i2]) & _PAGE_PSE));
> +                free_xen_pagetable(l2e_to_l1e(l2t[i2]));
> +            }
> +
> +            free_xen_pagetable(l2t);
> +        }
> +
> +        free_xen_pagetable(l3t);
> +    }
> +
> +    free_xen_pagetable(rpt);
> +}
> +
>  static void cpu_smpboot_free(unsigned int cpu)
>  {
>      unsigned int order, socket = cpu_to_socket(cpu);
> @@ -671,6 +855,8 @@ static void cpu_smpboot_free(unsigned in
>              free_domheap_page(mfn_to_page(mfn));
>      }
>  
> +    cleanup_cpu_root_pgt(cpu);
> +
>      order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
>      free_xenheap_pages(per_cpu(gdt_table, cpu), order);
>  
> @@ -727,6 +913,11 @@ static int cpu_smpboot_alloc(unsigned in
>      set_ist(&idt_tables[cpu][TRAP_nmi],           IST_NONE);
>      set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
>  
> +    rc = setup_cpu_root_pgt(cpu);
> +    if ( rc )
> +        goto out;
> +    rc = -ENOMEM;
> +
>      for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
>            i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
>          if ( cpu_online(i) && cpu_to_node(i) == node )
> @@ -786,6 +977,8 @@ static struct notifier_block cpu_smpboot
>  
>  void __init smp_prepare_cpus(unsigned int max_cpus)
>  {
> +    int rc;
> +
>      register_cpu_notifier(&cpu_smpboot_nfb);
>  
>      mtrr_aps_sync_begin();
> @@ -799,6 +992,11 @@ void __init smp_prepare_cpus(unsigned in
>  
>      stack_base[0] = stack_start;
>  
> +    rc = setup_cpu_root_pgt(0);
> +    if ( rc )
> +        panic("Error %d setting up PV root page table\n", rc);
> +    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
> +
>      set_nr_sockets();
>  
>      socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
> @@ -867,6 +1065,8 @@ void __init smp_prepare_boot_cpu(void)
>  #if NR_CPUS > 2 * BITS_PER_LONG
>      per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
>  #endif
> +
> +    get_cpu_info()->xen_cr3 = 0;
>  }
>  
>  static void
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -137,6 +137,8 @@ void __dummy__(void)
>      OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
>      OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
>      OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
> +    OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
> +    OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
>      DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
>      BLANK();
>  
> --- a/xen/arch/x86/x86_64/compat/entry.S
> +++ b/xen/arch/x86/x86_64/compat/entry.S
> @@ -199,6 +199,17 @@ ENTRY(cstar_enter)
>          pushq $0
>          movl  $TRAP_syscall, 4(%rsp)
>          SAVE_ALL
> +
> +        GET_STACK_END(bx)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
> +        neg   %rcx
> +UNLIKELY_START(nz, cstar_cr3)
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +        neg   %rcx
> +        write_cr3 rcx, rdi, rsi
> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +UNLIKELY_END(cstar_cr3)

These UNLIKELY()s aren't correct.  It will depend on hardware and
command line setting as to whether we expect to update cr3.

Furthermore, they will complicate splitting the early entry code away
from the main .text section for a full isolation implementation.

For now, I'd drop them and have a simple jz .Lskip.


Also, can we collect these together into macros, rather than
opencoding?  We seem to have 3 distinct variations.

> +
>          GET_CURRENT(bx)
>          movq  VCPU_domain(%rbx),%rcx
>          cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
>  /* %rbx: struct vcpu, interrupts disabled */
>  restore_all_guest:
>          ASSERT_INTERRUPTS_DISABLED
> +
> +        /* Copy guest mappings and switch to per-CPU root page table. */
> +        mov   %cr3, %r9
> +        GET_STACK_END(dx)
> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
> +        movabs $PADDR_MASK & PAGE_MASK, %rsi
> +        movabs $DIRECTMAP_VIRT_START, %rcx
> +        mov   %rdi, %rax
> +        and   %rsi, %rdi
> +        and   %r9, %rsi
> +        add   %rcx, %rdi
> +        add   %rcx, %rsi
> +        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
> +        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
> +        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
> +        rep movsq
> +        mov   $ROOT_PAGETABLE_ENTRIES - \
> +               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
> +        rep movsq
> +        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
> +        write_cr3 rax, rdi, rsi
> +

Can we possibly move this up into C?  For this simplistic algorithm it
is ok in ASM, but if we want to do any optimisations to avoid the 4k
memcpy (generation count hidden somewhere in page_info?), ASM is quickly
going to become unwieldy.
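
As a rough illustration (untested sketch; populate_pv_root_pgt() is an
invented name, and it would need calling from the exit-to-guest path with
the guest L4 suitably mapped), the C equivalent of the copying above would
be something like:

    static void populate_pv_root_pgt(root_pgentry_t *rpt,
                                     const root_pgentry_t *guest_l4)
    {
        unsigned int i;

        /* Guest-controlled slots below the Xen range. */
        for ( i = 0; i < ROOT_PAGETABLE_FIRST_XEN_SLOT; ++i )
            rpt[i] = guest_l4[i];

        /* Guest-controlled slots above the Xen range. */
        for ( i = ROOT_PAGETABLE_LAST_XEN_SLOT + 1;
              i < ROOT_PAGETABLE_ENTRIES; ++i )
            rpt[i] = guest_l4[i];

        /* The shadow linear slot travels with the guest mappings. */
        rpt[root_table_offset(SH_LINEAR_PT_VIRT_START)] =
            guest_l4[root_table_offset(SH_LINEAR_PT_VIRT_START)];
    }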

Another optimisation I found made a massive difference for the KAISER
series was to have an MRU cache of 4 pagetables, so in-guest syscalls
don't result in any copying as we pass in and out of Xen.

>          RESTORE_ALL
>          testw $TRAP_syscall,4(%rsp)
>          jz    iret_exit_to_guest
> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>          ALIGN
>  /* No special register assumptions. */
>  restore_all_xen:
> +        /*
> +         * Check whether we need to switch to the per-CPU page tables, in
> +         * case we return to late PV exit code (from an NMI or #MC).
> +         */
> +        GET_STACK_END(ax)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
> +        test  %rdx, %rdx
> +UNLIKELY_START(g, exit_cr3)

cmp or ne ?

> +        write_cr3 rax, rdi, rsi
> +UNLIKELY_END(exit_cr3)
> +
>          RESTORE_ALL adj=8
>          iretq
>  
> @@ -100,7 +138,18 @@ ENTRY(lstar_enter)
>          pushq $0
>          movl  $TRAP_syscall, 4(%rsp)
>          SAVE_ALL
> -        GET_CURRENT(bx)
> +
> +        GET_STACK_END(bx)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
> +        neg   %rcx
> +        jz    .Llstar_cr3_okay
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +        neg   %rcx
> +        write_cr3 rcx, rdi, rsi
> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +.Llstar_cr3_okay:
> +
> +        __GET_CURRENT(bx)
>          testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
>          jz    switch_to_kernel
>  
> @@ -192,7 +241,18 @@ GLOBAL(sysenter_eflags_saved)
>          pushq $0
>          movl  $TRAP_syscall, 4(%rsp)
>          SAVE_ALL
> -        GET_CURRENT(bx)
> +
> +        GET_STACK_END(bx)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
> +        neg   %rcx
> +        jz    .Lsyse_cr3_okay
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +        neg   %rcx
> +        write_cr3 rcx, rdi, rsi
> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +.Lsyse_cr3_okay:
> +
> +        __GET_CURRENT(bx)
>          cmpb  $0,VCPU_sysenter_disables_events(%rbx)
>          movq  VCPU_sysenter_addr(%rbx),%rax
>          setne %cl
> @@ -228,13 +288,23 @@ ENTRY(int80_direct_trap)
>          movl  $0x80, 4(%rsp)
>          SAVE_ALL
>  
> +        GET_STACK_END(bx)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
> +        neg   %rcx
> +UNLIKELY_START(nz, int80_cr3)
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +        neg   %rcx
> +        write_cr3 rcx, rdi, rsi
> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
> +UNLIKELY_END(int80_cr3)
> +
>          cmpb  $0,untrusted_msi(%rip)
>  UNLIKELY_START(ne, msi_check)
>          movl  $0x80,%edi
>          call  check_for_unexpected_msi
>  UNLIKELY_END(msi_check)
>  
> -        GET_CURRENT(bx)
> +        __GET_CURRENT(bx)
>  
>          /* Check that the callback is non-null. */
>          leaq  VCPU_int80_bounce(%rbx),%rdx
> @@ -391,9 +461,27 @@ ENTRY(dom_crash_sync_extable)
>  
>  ENTRY(common_interrupt)
>          SAVE_ALL CLAC
> +
> +        GET_STACK_END(14)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
> +        mov   %rcx, %r15
> +        neg   %rcx
> +        jz    .Lintr_cr3_okay
> +        jns   .Lintr_cr3_load
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +        neg   %rcx
> +.Lintr_cr3_load:
> +        write_cr3 rcx, rdi, rsi
> +        xor   %ecx, %ecx
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +        testb $3, UREGS_cs(%rsp)
> +        cmovnz %rcx, %r15
> +.Lintr_cr3_okay:
> +
>          CR4_PV32_RESTORE
>          movq %rsp,%rdi
>          callq do_IRQ
> +        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>          jmp ret_from_intr
>  
>  /* No special register assumptions. */
> @@ -411,6 +499,23 @@ ENTRY(page_fault)
>  /* No special register assumptions. */
>  GLOBAL(handle_exception)
>          SAVE_ALL CLAC
> +
> +        GET_STACK_END(14)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
> +        mov   %rcx, %r15
> +        neg   %rcx
> +        jz    .Lxcpt_cr3_okay
> +        jns   .Lxcpt_cr3_load
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +        neg   %rcx
> +.Lxcpt_cr3_load:
> +        write_cr3 rcx, rdi, rsi
> +        xor   %ecx, %ecx
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +        testb $3, UREGS_cs(%rsp)
> +        cmovnz %rcx, %r15
> +.Lxcpt_cr3_okay:
> +
>  handle_exception_saved:
>          GET_CURRENT(bx)
>          testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
> @@ -475,6 +580,7 @@ handle_exception_saved:
>          leaq  exception_table(%rip),%rdx
>          PERFC_INCR(exceptions, %rax, %rbx)
>          callq *(%rdx,%rax,8)
> +        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>          testb $3,UREGS_cs(%rsp)
>          jz    restore_all_xen
>          leaq  VCPU_trap_bounce(%rbx),%rdx
> @@ -507,6 +613,7 @@ exception_with_ints_disabled:
>          rep;  movsq                     # make room for ec/ev
>  1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
>          movq  %rax,UREGS_kernel_sizeof(%rsp)
> +        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>          jmp   restore_all_xen           # return to fixup code
>  
>  /* No special register assumptions. */
> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>          movl  $TRAP_double_fault,4(%rsp)
>          /* Set AC to reduce chance of further SMAP faults */
>          SAVE_ALL STAC
> +
> +        GET_STACK_END(bx)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
> +        test  %rbx, %rbx
> +        jz    .Ldblf_cr3_okay
> +        jns   .Ldblf_cr3_load
> +        neg   %rbx
> +.Ldblf_cr3_load:
> +        write_cr3 rbx, rdi, rsi
> +.Ldblf_cr3_okay:
> +

It is moderately common for there to be cascade faults in #DF.  This
would be better if it were the general IST switch.

~Andrew

>          movq  %rsp,%rdi
>          call  do_double_fault
>          BUG   /* do_double_fault() shouldn't return. */
> @@ -603,10 +721,28 @@ ENTRY(nmi)
>          movl  $TRAP_nmi,4(%rsp)
>  handle_ist_exception:
>          SAVE_ALL CLAC
> +
> +        GET_STACK_END(14)
> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
> +        mov   %rcx, %r15
> +        neg   %rcx
> +        jz    .List_cr3_okay
> +        jns   .List_cr3_load
> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +        neg   %rcx
> +.List_cr3_load:
> +        write_cr3 rcx, rdi, rsi
> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
> +.List_cr3_okay:
> +
>          CR4_PV32_RESTORE
>          testb $3,UREGS_cs(%rsp)
>          jz    1f
> -        /* Interrupted guest context. Copy the context to stack bottom. */
> +        /*
> +         * Interrupted guest context. Clear the restore value for xen_cr3
> +         * and copy the context to stack bottom.
> +         */
> +        xor   %r15, %r15
>          GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
>          movq  %rsp,%rsi
>          movl  $UREGS_kernel_sizeof/8,%ecx
> @@ -616,6 +752,7 @@ handle_ist_exception:
>          movzbl UREGS_entry_vector(%rsp),%eax
>          leaq  exception_table(%rip),%rdx
>          callq *(%rdx,%rax,8)
> +        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
>          cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
>          jne   ret_from_intr
>  
> --- a/xen/include/asm-x86/asm_defns.h
> +++ b/xen/include/asm-x86/asm_defns.h
> @@ -93,9 +93,30 @@ void ret_from_intr(void);
>          UNLIKELY_DONE(mp, tag);   \
>          __UNLIKELY_END(tag)
>  
> +        .equ .Lrax, 0
> +        .equ .Lrcx, 1
> +        .equ .Lrdx, 2
> +        .equ .Lrbx, 3
> +        .equ .Lrsp, 4
> +        .equ .Lrbp, 5
> +        .equ .Lrsi, 6
> +        .equ .Lrdi, 7
> +        .equ .Lr8,  8
> +        .equ .Lr9,  9
> +        .equ .Lr10, 10
> +        .equ .Lr11, 11
> +        .equ .Lr12, 12
> +        .equ .Lr13, 13
> +        .equ .Lr14, 14
> +        .equ .Lr15, 15
> +
>  #define STACK_CPUINFO_FIELD(field) (1 - CPUINFO_sizeof + CPUINFO_##field)
>  #define GET_STACK_END(reg)                        \
> +        .if .Lr##reg > 8;                         \
> +        movq $STACK_SIZE-1, %r##reg;              \
> +        .else;                                    \
>          movl $STACK_SIZE-1, %e##reg;              \
> +        .endif;                                   \
>          orq  %rsp, %r##reg
>  
>  #define GET_CPUINFO_FIELD(field, reg)             \
> @@ -177,6 +198,15 @@ void ret_from_intr(void);
>  #define ASM_STAC ASM_AC(STAC)
>  #define ASM_CLAC ASM_AC(CLAC)
>  
> +.macro write_cr3 val:req, tmp1:req, tmp2:req
> +        mov   %cr4, %\tmp1
> +        mov   %\tmp1, %\tmp2
> +        and   $~X86_CR4_PGE, %\tmp1
> +        mov   %\tmp1, %cr4
> +        mov   %\val, %cr3
> +        mov   %\tmp2, %cr4
> +.endm
> +
>  #define CR4_PV32_RESTORE                                           \
>          667: ASM_NOP5;                                             \
>          .pushsection .altinstr_replacement, "ax";                  \
> --- a/xen/include/asm-x86/current.h
> +++ b/xen/include/asm-x86/current.h
> @@ -41,6 +41,18 @@ struct cpu_info {
>      struct vcpu *current_vcpu;
>      unsigned long per_cpu_offset;
>      unsigned long cr4;
> +    /*
> +     * Of the two following fields the latter is being set to the CR3 value
> +     * to be used on the given pCPU for loading whenever 64-bit PV guest
> +     * context is being entered. The value never changes once set.
> +     * The former is the value to restore when re-entering Xen, if any. IOW
> +     * its value being zero means there's nothing to restore. However, its
> +     * value can also be negative, indicating to the exit-to-Xen code that
> +     * restoring is not necessary, but allowing any nested entry code paths
> +     * to still know the value to put back into CR3.
> +     */
> +    unsigned long xen_cr3;
> +    unsigned long pv_cr3;
>      /* get_stack_bottom() must be 16-byte aligned */
>  };
>  
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -462,6 +462,7 @@ extern idt_entry_t idt_table[];
>  extern idt_entry_t *idt_tables[];
>  
>  DECLARE_PER_CPU(struct tss_struct, init_tss);
> +DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
>  
>  extern void init_int80_direct_trap(struct vcpu *v);
>  
> --- a/xen/include/asm-x86/x86_64/page.h
> +++ b/xen/include/asm-x86/x86_64/page.h
> @@ -24,8 +24,8 @@
>  /* These are architectural limits. Current CPUs support only 40-bit phys. */
>  #define PADDR_BITS              52
>  #define VADDR_BITS              48
> -#define PADDR_MASK              ((1UL << PADDR_BITS)-1)
> -#define VADDR_MASK              ((1UL << VADDR_BITS)-1)
> +#define PADDR_MASK              ((_AC(1,UL) << PADDR_BITS)-1)
> +#define VADDR_MASK              ((_AC(1,UL) << VADDR_BITS)-1)
>  
>  #define VADDR_TOP_BIT           (1UL << (VADDR_BITS - 1))
>  #define CANONICAL_MASK          (~0UL & ~VADDR_MASK)
> @@ -107,6 +107,7 @@ typedef l4_pgentry_t root_pgentry_t;
>        : (((_s) < ROOT_PAGETABLE_FIRST_XEN_SLOT) ||  \
>           ((_s) > ROOT_PAGETABLE_LAST_XEN_SLOT)))
>  
> +#define root_table_offset         l4_table_offset
>  #define root_get_pfn              l4e_get_pfn
>  #define root_get_flags            l4e_get_flags
>  #define root_get_intpte           l4e_get_intpte
>
>



* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-15 11:07 ` [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
@ 2018-01-15 18:26   ` Andrew Cooper
  2018-01-16  8:12     ` Jan Beulich
  2018-01-16 12:12   ` George Dunlap
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2018-01-15 18:26 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 15/01/18 11:07, Jan Beulich wrote:
> First of all we don't need it on AMD systems. Additionally allow its use
> to be controlled by command line option. For best backportability, this
> intentionally doesn't use alternative instruction patching to achieve
> the intended effect - while we likely want it, this will be a later
> follow-up.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: New.
>
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -1849,6 +1849,15 @@ In the case that x2apic is in use, this
>  clustered mode.  The default, given no hint from the **FADT**, is cluster
>  mode.
>  
> +### xpti
> +> `= <boolean>`
> +
> +> Default: `false` on AMD hardware
> +> Default: `true` everywhere else

This is fine for now, but any further isolation is going to have to be
unconditional, or possibly a compile-time choice, but maintaining that
going forwards is going to be truly horrible.

> +
> +Override default selection of whether to isolate 64-bit PV guest page
> +tables.

This wants a

** WARNING: Not yet a complete isolation implementation, but better than
nothing. **

or similar, just to avoid giving the people the impression that it is
complete.  Perhaps also a similar warning in the patch 1 commit message?

> +
>  ### xsave
>  > `= <boolean>`
>  
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1511,8 +1511,10 @@ void paravirt_ctxt_switch_to(struct vcpu
>  {
>      unsigned long cr4;
>  
> -    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
> -        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
> +    if ( this_cpu(root_pgt) )
> +        this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
> +            l4e_from_page(v->domain->arch.perdomain_l3_pg,
> +                          __PAGE_HYPERVISOR_RW);

This would be cleaner and have better generated code by pulling
this_cpu(root_pgt) into a local variable.
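
Something along these lines, presumably - just a sketch of the suggested
shape, not the actual follow-up:

    root_pgentry_t *rpt = this_cpu(root_pgt);  /* read the per-CPU pointer once */

    if ( rpt )
        rpt[root_table_offset(PERDOMAIN_VIRT_START)] =
            l4e_from_page(v->domain->arch.perdomain_l3_pg,
                          __PAGE_HYPERVISOR_RW);
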

>  
>      cr4 = pv_guest_cr4_to_real_cr4(v);
>      if ( unlikely(cr4 != read_cr4()) )
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3685,7 +3685,7 @@ long do_mmu_update(
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>                      if ( !rc )
> -                        sync_guest = true;
> +                        sync_guest = this_cpu(root_pgt);

This is quite deceptive, as it is actually sync_guest =
this_cpu(root_pgt) != NULL;

It would probably be cleaner to export opt_xpti and use that, rather
than imply things based on root_pgt.  It would certainly be more
efficient code.
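
Roughly, as a sketch only (assuming opt_xpti, currently static in
smpboot.c, were given external linkage and a declaration in some suitable
header - which is not the case today):

    extern int8_t opt_xpti;  /* assumed export, illustrative only */

                    if ( !rc )
                        sync_guest = opt_xpti;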

>                      break;
>  
>                  case PGT_writable_page:
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -328,7 +328,7 @@ void start_secondary(void *unused)
>      spin_debug_disable();
>  
>      get_cpu_info()->xen_cr3 = 0;
> -    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
> +    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
>  
>      load_system_tables();
>  
> @@ -734,14 +734,20 @@ static int clone_mapping(const void *ptr
>      return 0;
>  }
>  
> +static __read_mostly int8_t opt_xpti = -1;
> +boolean_param("xpti", opt_xpti);
>  DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
>  
>  static int setup_cpu_root_pgt(unsigned int cpu)
>  {
> -    root_pgentry_t *rpt = alloc_xen_pagetable();
> +    root_pgentry_t *rpt;
>      unsigned int off;
>      int rc;
>  
> +    if ( !opt_xpti )
> +        return 0;
> +
> +    rpt = alloc_xen_pagetable();
>      if ( !rpt )
>          return -ENOMEM;
>  
> @@ -992,10 +998,13 @@ void __init smp_prepare_cpus(unsigned in
>  
>      stack_base[0] = stack_start;
>  
> +    if ( opt_xpti < 0 )
> +        opt_xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;

Newline?

>      rc = setup_cpu_root_pgt(0);
>      if ( rc )
>          panic("Error %d setting up PV root page table\n", rc);
> -    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
> +    if ( per_cpu(root_pgt, 0) )
> +        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
>  
>      set_nr_sockets();
>  
> @@ -1067,6 +1076,7 @@ void __init smp_prepare_boot_cpu(void)
>  #endif
>  
>      get_cpu_info()->xen_cr3 = 0;
> +    get_cpu_info()->pv_cr3 = 0;
>  }
>  
>  static void
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -46,6 +46,7 @@ restore_all_guest:
>          movabs $DIRECTMAP_VIRT_START, %rcx
>          mov   %rdi, %rax
>          and   %rsi, %rdi
> +        jz    .Lrag_keep_cr3

What is rag?  What about .L_skip_cr3_reload ?

~Andrew

>          and   %r9, %rsi
>          add   %rcx, %rdi
>          add   %rcx, %rsi
> @@ -62,6 +63,7 @@ restore_all_guest:
>          rep movsq
>          mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>          write_cr3 rax, rdi, rsi
> +.Lrag_keep_cr3:
>  
>          RESTORE_ALL
>          testw $TRAP_syscall,4(%rsp)
>
>
>



* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-15 18:23   ` Andrew Cooper
@ 2018-01-16  7:46     ` Jan Beulich
  2018-01-16 11:51       ` Andrew Cooper
  2018-01-16  9:33     ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16  7:46 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
> On 15/01/18 11:06, Jan Beulich wrote:
>> --- a/xen/arch/x86/x86_64/compat/entry.S
>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>> @@ -199,6 +199,17 @@ ENTRY(cstar_enter)
>>          pushq $0
>>          movl  $TRAP_syscall, 4(%rsp)
>>          SAVE_ALL
>> +
>> +        GET_STACK_END(bx)
>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
>> +        neg   %rcx
>> +UNLIKELY_START(nz, cstar_cr3)
>> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>> +        neg   %rcx
>> +        write_cr3 rcx, rdi, rsi
>> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>> +UNLIKELY_END(cstar_cr3)
> 
> These UNLIKELY()s aren't correct.  It will depend on hardware and
> command line setting as to whether we expect to update cr3.

Why are they not correct? What 64-bit kernels do you know that
use the CSTAR entry method in PV mode? Afaik this was in
existence for a range of kernel versions only in our forward port.
The INT80 path is perhaps indeed more questionable in this regard.

> Furthermore, they will complicate splitting the early entry code away
> from the main .text section for a full isolation implementation.
> 
> For now, I'd drop them and have a simple jz .Lskip.

I will replace them (on the entry paths; I think the one on the exit-
to-Xen path is valid) to please you and in the interest of forward
progress; maybe this will even slightly help backporting to really old
Xen versions, where the UNLIKELY* constructs don't exist yet.

> Also, can we collect these together into macros, rather than
> opencoding?  We seem to have 3 distinct variations.

I had considered that (following the model you use in the SP2
series), but decided against it not the least because of the
dependent but placement-wise separated code additions to
restore original values. Plus again this might be a hindrance of
backporting to really old Xen (which then typically will also be
built on really old tool chains) - as you certainly recall old gas
had quite a few issues with macro handling.

>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
>>  /* %rbx: struct vcpu, interrupts disabled */
>>  restore_all_guest:
>>          ASSERT_INTERRUPTS_DISABLED
>> +
>> +        /* Copy guest mappings and switch to per-CPU root page table. */
>> +        mov   %cr3, %r9
>> +        GET_STACK_END(dx)
>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
>> +        movabs $PADDR_MASK & PAGE_MASK, %rsi
>> +        movabs $DIRECTMAP_VIRT_START, %rcx
>> +        mov   %rdi, %rax
>> +        and   %rsi, %rdi
>> +        and   %r9, %rsi
>> +        add   %rcx, %rdi
>> +        add   %rcx, %rsi
>> +        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
>> +        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
>> +        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
>> +        rep movsq
>> +        mov   $ROOT_PAGETABLE_ENTRIES - \
>> +               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
>> +        rep movsq
>> +        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>> +        write_cr3 rax, rdi, rsi
>> +
> 
> Can we possibly move this up into C?  For this simplistic algorithm it
> is ok in ASM, but if we want to do any optimisations to avoid the 4k
> memcpy (generation count hidden somewhere in page_info?), ASM is going to
> quickly become unwieldy.

I'd prefer to move it into C when it really becomes necessary. Also
you don't properly qualify "this" - for example, I'd rather not move
the write_cr3 invocation into C, yet the placement of your comment
suggests to do so.

> Another optimisation I found made a massive difference for the KAISER
> series was to have an MRU cache of 4 pagetables, so in-guest syscalls
> don't result in any copying as we pass in and out of Xen.

As said elsewhere - optimization can come later. Plus - is avoiding the
copying at _any_ time actually correct, considering possible racing
L4 entry updates on another vCPU?

>> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>>          ALIGN
>>  /* No special register assumptions. */
>>  restore_all_xen:
>> +        /*
>> +         * Check whether we need to switch to the per-CPU page tables, in
>> +         * case we return to late PV exit code (from an NMI or #MC).
>> +         */
>> +        GET_STACK_END(ax)
>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
>> +        test  %rdx, %rdx
>> +UNLIKELY_START(g, exit_cr3)
> 
> cmp or ne ?

"ne" (or really "nz" when used with "test") is outright wrong - we
want to skip the restore when the value is zero _or_ negative.
What's wrong with "jg" and "test" in combination? There simply is
no "jnsz" (other than e.g. "jnbe"). "cmp" against zero could be
used here, but why would I use the larger instruction when "test"
does?
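
In C terms the predicate being encoded is simply the following (a
standalone illustration of the flag logic, not Xen code):

    /*
     * "test %rdx,%rdx; jg" means: restore only if the saved xen_cr3 value
     * is strictly positive.  TEST clears OF, so the SF==OF part of "jg"
     * reduces to SF==0, leaving "ZF==0 && SF==0".
     */
    #include <stdbool.h>

    static bool cr3_restore_needed(long xen_cr3)
    {
        /* zero: nothing to restore; negative: restore deliberately skipped */
        return xen_cr3 > 0;
    }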

>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>          movl  $TRAP_double_fault,4(%rsp)
>>          /* Set AC to reduce chance of further SMAP faults */
>>          SAVE_ALL STAC
>> +
>> +        GET_STACK_END(bx)
>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>> +        test  %rbx, %rbx
>> +        jz    .Ldblf_cr3_okay
>> +        jns   .Ldblf_cr3_load
>> +        neg   %rbx
>> +.Ldblf_cr3_load:
>> +        write_cr3 rbx, rdi, rsi
>> +.Ldblf_cr3_okay:
>> +
> 
> It is moderately common for there to be cascade faults in #DF.  This
> would be better if it were the general IST switch.

Based on the issues I had with #DF occurring while debugging this,
I've decided to keep the code here as simple as possible without
being incorrect: There's no point looking at the incoming CR3.
There's no point in trying to avoid nested faults (including
subsequent #DF) restoring CR3. There's also no point in retaining
the value for later restoring here, as we never return. In fact, as
mentioned elsewhere, we should imo indeed consider unilaterally
switching to idle_pg_table[] here.

Jan


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-15 18:26   ` Andrew Cooper
@ 2018-01-16  8:12     ` Jan Beulich
  2018-01-16 13:20       ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16  8:12 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 15.01.18 at 19:26, <andrew.cooper3@citrix.com> wrote:
> On 15/01/18 11:07, Jan Beulich wrote:
>> --- a/docs/misc/xen-command-line.markdown
>> +++ b/docs/misc/xen-command-line.markdown
>> @@ -1849,6 +1849,15 @@ In the case that x2apic is in use, this
>>  clustered mode.  The default, given no hint from the **FADT**, is cluster
>>  mode.
>>  
>> +### xpti
>> +> `= <boolean>`
>> +
>> +> Default: `false` on AMD hardware
>> +> Default: `true` everywhere else
> 
> This is fine for now, but any further isolation is going to have to be
> unconditional, or possibly a compile-time choice, but maintaining that
> going forwards is going to be truly horrible.
> 
>> +
>> +Override default selection of whether to isolate 64-bit PV guest page
>> +tables.
> 
> This wants a
> 
> ** WARNING: Not yet a complete isolation implementation, but better than
> nothing. **
> 
> or similar, just to avoid giving the people the impression that it is
> complete.  Perhaps also a similar warning in the patch 1 commit message?

I've added the warning text here as suggested, and a respective
remark to patch 1's description. That said, under the current guidelines
of when we would consider an information leak a security issue, I
think isolation is sufficiently complete: No parts of the direct map
not pertaining to the current guest are being exposed, which
implies that no parts of the Xen heap not pertaining to the current
guest are. Leaking control page contents of other guests as well
as leaking the bits and pieces of the Xen image is not an issue by
itself.

IOW I'm not convinced such a warning is - strictly from a "would
this need an XSA on its own" pov - entirely appropriate.

Furthermore the command line option really exists to _disable_
the isolation (its ability to enable it as well is more of a side
effect, though perhaps a useful one in the niche case of running Xen
itself virtualized while being given the impression of running on
AMD hardware, but actually running on Intel), at which point the
warning is completely irrelevant.

One thing we may want/need to consider (but which is orthogonal
to the changes here) is to overwrite the part of the hypervisor
stack which isn't already being overwritten during context switch.

>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -1511,8 +1511,10 @@ void paravirt_ctxt_switch_to(struct vcpu
>>  {
>>      unsigned long cr4;
>>  
>> -    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
>> -        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
>> +    if ( this_cpu(root_pgt) )
>> +        this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
>> +            l4e_from_page(v->domain->arch.perdomain_l3_pg,
>> +                          __PAGE_HYPERVISOR_RW);
> 
> This would be cleaner and have better generated code by pulling
> this_cpu(root_pgt) into a local variable.

Done.

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -3685,7 +3685,7 @@ long do_mmu_update(
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>                      if ( !rc )
>> -                        sync_guest = true;
>> +                        sync_guest = this_cpu(root_pgt);
> 
> This is quite deceptive, as it is actually sync_guest =
> this_cpu(root_pgt) != NULL;

We omit such "!= NULL" / "!= 0" elsewhere too.

> It would probably be cleaner to export opt_xpti and use that, rather
> than imply things based on root_pgt.  It would certainly be more
> efficient code.

Again I had considered this and decided against: Other reasons to
have NULL in here may arise, so it felt better to key the decision
off of the actual fact that it depends on, rather than that one's
pre-condition.

>> @@ -992,10 +998,13 @@ void __init smp_prepare_cpus(unsigned in
>>  
>>      stack_base[0] = stack_start;
>>  
>> +    if ( opt_xpti < 0 )
>> +        opt_xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
> 
> Newline?

I've added one, but imo it hides that the assignment is a pre-condition
for ...

>>      rc = setup_cpu_root_pgt(0);

... this call (i.e. it makes it less obvious that re-ordering here is not an
option).

>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -46,6 +46,7 @@ restore_all_guest:
>>          movabs $DIRECTMAP_VIRT_START, %rcx
>>          mov   %rdi, %rax
>>          and   %rsi, %rdi
>> +        jz    .Lrag_keep_cr3
> 
> What is rag?  What about .L_skip_cr3_reload ?

Look at the function name: It's the acronym of "restore_all_guest".
I specifically wanted to avoid any unspecific names like the one
you suggest, and short of a better idea I've used this one (to
parallel infixes like "cstar" used elsewhere).

Jan


* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-15 18:23   ` Andrew Cooper
  2018-01-16  7:46     ` Jan Beulich
@ 2018-01-16  9:33     ` Jan Beulich
  2018-01-16 11:56       ` Andrew Cooper
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16  9:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
> On 15/01/18 11:06, Jan Beulich wrote:
>> This also wants Andrew's "[PATCH RFC 11/44] x86/pt-shadow: Always set
>> _PAGE_ACCESSED on L4e updates".
> 
> I've cleaned this patch up and committed it in preparation.
> 
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=bd61fe94bee0556bc2f64999a4a8315b93f90f21

Only now that I'm doing backports thereof I notice an oddity with
32-bit guest handling: Why would you set the accessed bit in that
case? The L4 is an internal thing there, and hence by us knowing
that we don't care, this is unnecessary (but of course also not
wrong). I'll do the 4.9 and older backports according to that
observation (making for slightly less of a code change).

Jan



* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16  7:46     ` Jan Beulich
@ 2018-01-16 11:51       ` Andrew Cooper
  2018-01-16 12:33         ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 11:51 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/01/18 07:46, Jan Beulich wrote:
>>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
>> On 15/01/18 11:06, Jan Beulich wrote:
>>> --- a/xen/arch/x86/x86_64/compat/entry.S
>>> +++ b/xen/arch/x86/x86_64/compat/entry.S
>>> @@ -199,6 +199,17 @@ ENTRY(cstar_enter)
>>>          pushq $0
>>>          movl  $TRAP_syscall, 4(%rsp)
>>>          SAVE_ALL
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
>>> +        neg   %rcx
>>> +UNLIKELY_START(nz, cstar_cr3)
>>> +        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +        neg   %rcx
>>> +        write_cr3 rcx, rdi, rsi
>>> +        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
>>> +UNLIKELY_END(cstar_cr3)
>> These UNLIKELY()s aren't correct.  It will depend on hardware and
>> command line setting as to whether we expect to update cr3.
> Why are they not correct? What 64-bit kernels do you know that
> use the CSTAR entry method in PV mode?

None, but I also don't see you excluding 32bit PV guests, which means we
do take the unlikely path all the time in general.

>  Afaik this was in
> existence for a range of kernel versions only in our forward port.
> The INT80 path is perhaps indeed more questionable in this regard.
>
>> Furthermore, they will complicate splitting the early entry code away
>> from the main .text section for a full isolation implementation.
>>
>> For now, I'd drop them and have a simple jz .Lskip.
> I will replace them (on the entry paths; I think the one on the exit-
> to-Xen path is valid) to please you and in the interest of forward
> progress; maybe this will even slightly help backporting to really old
> Xen versions, where the UNLIKELY* constructs don't exist yet.
>
>> Also, can we collect these together into macros, rather than
>> opencoding?  We seem to have 3 distinct variations.
> I had considered that (following the model you use in the SP2
> series), but decided against it not the least because of the
> dependent but placement-wise separated code additions to
> restore original values. Plus again this might be a hindrance of
> backporting to really old Xen (which then typically will also be
> built on really old tool chains) - as you certainly recall old gas
> had quite a few issues with macro handling.

There is nothing special in these macros though?  I found the SP2-style
far easier to backport because they are a single slot-in line.

Anyway, I'm not overly fussed, but I have a

>
>>> --- a/xen/arch/x86/x86_64/entry.S
>>> +++ b/xen/arch/x86/x86_64/entry.S
>>> @@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
>>>  /* %rbx: struct vcpu, interrupts disabled */
>>>  restore_all_guest:
>>>          ASSERT_INTERRUPTS_DISABLED
>>> +
>>> +        /* Copy guest mappings and switch to per-CPU root page table. */
>>> +        mov   %cr3, %r9
>>> +        GET_STACK_END(dx)
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
>>> +        movabs $PADDR_MASK & PAGE_MASK, %rsi
>>> +        movabs $DIRECTMAP_VIRT_START, %rcx
>>> +        mov   %rdi, %rax
>>> +        and   %rsi, %rdi
>>> +        and   %r9, %rsi
>>> +        add   %rcx, %rdi
>>> +        add   %rcx, %rsi
>>> +        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
>>> +        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
>>> +        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
>>> +        rep movsq
>>> +        mov   $ROOT_PAGETABLE_ENTRIES - \
>>> +               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
>>> +        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
>>> +                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
>>> +        rep movsq
>>> +        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
>>> +        write_cr3 rax, rdi, rsi
>>> +
>> Can we possibly move this up into C?  For this simplistic algorithm it
>> is ok in ASM, but if we want to do any optimisations to avoid the 4k
>> memcpy (generation count hidden somewhere in page_info?), ASM is going to
>> quickly become unwieldy.
> I'd prefer to move it into C when it really becomes necessary. Also
> you don't properly qualify "this" - for example, I'd rather not move
> the write_cr3 invocation into C, yet the placement of your comment
> suggests to do so.

I meant the whole hunk, not write_cr3 itself.

>
>> Another optimisation I found made a massive difference for the KAISER
>> series was to have an MRU cache of 4 pagetables, so in-guest syscalls
>> don't result in any copying as we pass in and out of Xen.
> As said elsewhere - optimization can come later. Plus - is avoiding the
> copying at _any_ time actually correct, considering possible racing
> L4 entry updates on another vCPU?

Later - indeed.

The common behaviour is that L4's don't really change after fork(), so
having a generation count on them will allow skipping of the memcpy() in
most cases.

You can't skip the check, but the amount of logic used to avoid a 4k
memcpy can be quite substantial and still be a performance win.
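
A rough sketch of the idea, with entirely hypothetical names (neither the
structure nor the helper exists today):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Bump a per-L4 generation count on every guest L4e write; remember
     * what was last copied into this pCPU's root table, and skip the 4k
     * copy when nothing has changed.  The check itself still runs on
     * every exit to guest context.
     */
    struct l4_copy_state {
        unsigned long mfn;        /* guest L4 the last copy was taken from */
        uint64_t generation;      /* its generation count at that time */
    };

    static bool l4_copy_needed(const struct l4_copy_state *last,
                               unsigned long l4_mfn, uint64_t l4_gen)
    {
        return last->mfn != l4_mfn || last->generation != l4_gen;
    }
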

>
>>> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>>>          ALIGN
>>>  /* No special register assumptions. */
>>>  restore_all_xen:
>>> +        /*
>>> +         * Check whether we need to switch to the per-CPU page tables, in
>>> +         * case we return to late PV exit code (from an NMI or #MC).
>>> +         */
>>> +        GET_STACK_END(ax)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
>>> +        test  %rdx, %rdx
>>> +UNLIKELY_START(g, exit_cr3)
>> cmp or ne ?
> "ne" (or really "nz" when used with "test") is outright wrong - we
> want to skip the restore when the value is zero _or_ negative.
> What's wrong with "jg" and "test" in combination? There simply is
> no "jnsz" (other than e.g. "jnbe"). "cmp" against zero could be
> used here, but why would I use the larger instruction when "test"
> does?

greater or less than is not commonly related to the test instruction,
which is why this looks wrong to a reviewer.

A comment of /* Ideally jnsz, but jg will have to do */ would go a very
long way.

I've double checked the logic and I agree with your conclusions, but the
only reason this works is because test unconditionally zeroes the
overflow flag.

>
>>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>>          movl  $TRAP_double_fault,4(%rsp)
>>>          /* Set AC to reduce chance of further SMAP faults */
>>>          SAVE_ALL STAC
>>> +
>>> +        GET_STACK_END(bx)
>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>>> +        test  %rbx, %rbx
>>> +        jz    .Ldblf_cr3_okay
>>> +        jns   .Ldblf_cr3_load
>>> +        neg   %rbx
>>> +.Ldblf_cr3_load:
>>> +        write_cr3 rbx, rdi, rsi
>>> +.Ldblf_cr3_okay:
>>> +
>> It is moderately common for there to be cascade faults in #DF.  This
>> would be better if it were the general IST switch.
> Based on the issues I had with #DF occurring while debugging this,
> I've decided to keep the code here as simple as possible without
> being incorrect: There's no point looking at the incoming CR3.
> There's no point in trying to avoid nested faults (including
> subsequent #DF) restoring CR3. There's also no point in retaining
> the value for later restoring here, as we never return. In fact, as
> mentioned elsewhere, we should imo indeed consider unilaterally
> switching to idle_pg_table[] here.

Ok.

~Andrew

>
> Jan



* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16  9:33     ` Jan Beulich
@ 2018-01-16 11:56       ` Andrew Cooper
  2018-01-16 12:25         ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 11:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/01/18 09:33, Jan Beulich wrote:
>>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
>> On 15/01/18 11:06, Jan Beulich wrote:
>>> This also wants Andrew's "[PATCH RFC 11/44] x86/pt-shadow: Always set
>>> _PAGE_ACCESSED on L4e updates".
>> I've cleaned this patch up and committed it in preparation.
>>
>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=bd61fe94bee0556bc2f64999a4a8315b93f90f21
> Only now that I'm doing backports thereof I notice an oddity with
> 32-bit guest handling: Why would you set the accessed bit in that
> case? The L4 is an internal thing there, and hence by us knowing
> that we don't care, this is unnecessary (but of course also not
> wrong). I'll do the 4.9 and older backports according to that
> observation (making for slightly less of a code change).

There are no paths leading to adjust_guest_l4e() for 32bit PV guests. 
The only L4 handling which exists is constructing the monitor table,
which writes the PTEs directly.

~Andrew


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-15 11:07 ` [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
  2018-01-15 18:26   ` Andrew Cooper
@ 2018-01-16 12:12   ` George Dunlap
  2018-01-16 12:21     ` Juergen Gross
  2018-01-16 12:35     ` Jan Beulich
  1 sibling, 2 replies; 29+ messages in thread
From: George Dunlap @ 2018-01-16 12:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper

On Mon, Jan 15, 2018 at 11:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
> First of all we don't need it on AMD systems. Additionally allow its use
> to be controlled by command line option. For best backportability, this
> intentionally doesn't use alternative instruction patching to achieve
> the intended effect - while we likely want it, this will be a later
> follow-up.

Is it worth making it optional to apply to dom0?  In most cases, if an
attacker can manage to get userspace on dom0, they should be able to
take over the whole system anyway; turning it off on dom0 to get
better performance seems like a policy decision that administrators
might reasonably make.

 -George


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 12:12   ` George Dunlap
@ 2018-01-16 12:21     ` Juergen Gross
  2018-01-16 12:39       ` George Dunlap
  2018-01-16 12:35     ` Jan Beulich
  1 sibling, 1 reply; 29+ messages in thread
From: Juergen Gross @ 2018-01-16 12:21 UTC (permalink / raw)
  To: George Dunlap, Jan Beulich; +Cc: xen-devel, Andrew Cooper

On 16/01/18 13:12, George Dunlap wrote:
> On Mon, Jan 15, 2018 at 11:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> First of all we don't need it on AMD systems. Additionally allow its use
>> to be controlled by command line option. For best backportability, this
>> intentionally doesn't use alternative instruction patching to achieve
>> the intended effect - while we likely want it, this will be a later
>> follow-up.
> 
> Is it worth making it optional to apply to dom0?  In most cases, if an
> attacker can manage to get userspace on dom0, they should be able to
> take over the whole system anyway; turning it off on dom0 to get
> better performance seems like a policy decision that administrators
> might reasonably make.

You are implying here that Linux is insecure: why does userspace access
allow you to take over the machine? I can see that being true for root
access, but not for any other unprivileged user account.


Juergen


* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 11:56       ` Andrew Cooper
@ 2018-01-16 12:25         ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 12:25 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 16.01.18 at 12:56, <andrew.cooper3@citrix.com> wrote:
> On 16/01/18 09:33, Jan Beulich wrote:
>>>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
>>> On 15/01/18 11:06, Jan Beulich wrote:
>>>> This also wants Andrew's "[PATCH RFC 11/44] x86/pt-shadow: Always set
>>>> _PAGE_ACCESSED on L4e updates".
>>> I've cleaned this patch up and committed it in preparation.
>>>
>>> http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=bd61fe94bee0556bc2f64999a4a8315b93f90f21
>> Only now that I'm doing backports thereof I notice an oddity with
>> 32-bit guest handling: Why would you set the accessed bit in that
>> case? The L4 is an internal thing there, and hence by us knowing
>> that we don't care, this is unnecessary (but of course also not
>> wrong). I'll do the 4.9 and older backports according to that
>> observation (making for slightly less of a code change).
> 
> There are no paths leading to adjust_guest_l4e() for 32bit PV guests. 
> The only L4 handling which exists is constructing the monitor table,
> which writes the PTEs directly.

Yeah, I did realize this after writing. The check could probably
go away altogether.

Jan



* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 11:51       ` Andrew Cooper
@ 2018-01-16 12:33         ` Jan Beulich
  2018-01-16 13:26           ` Andrew Cooper
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 12:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 16.01.18 at 12:51, <andrew.cooper3@citrix.com> wrote:
> On 16/01/18 07:46, Jan Beulich wrote:
>>>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
>>> Also, can we collect these together into macros, rather than
>>> opencoding?  We seem to have 3 distinct variations.
>> I had considered that (following the model you use in the SP2
>> series), but decided against it not the least because of the
>> dependent but placement-wise separated code additions to
>> restore original values. Plus again this might be a hindrance of
>> backporting to really old Xen (which then typically will also be
>> built on really old tool chains) - as you certainly recall old gas
>> had quite a few issues with macro handling.
> 
> There is nothing special in these macros though?  I found the SP2-style
> far easier to backport because they are a single slot-in line.

I've just found the patch here needing a change in register use
in 4.7 - such would be a little harder with pre-cooked macros,
> especially when they don't allow specifying all registers to be
used (including ones for temporaries).

> Anyway, I'm not overly fussed, but I have a

Unfinished sentence?

>>>> @@ -71,6 +97,18 @@ iret_exit_to_guest:
>>>>          ALIGN
>>>>  /* No special register assumptions. */
>>>>  restore_all_xen:
>>>> +        /*
>>>> +         * Check whether we need to switch to the per-CPU page tables, in
>>>> +         * case we return to late PV exit code (from an NMI or #MC).
>>>> +         */
>>>> +        GET_STACK_END(ax)
>>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
>>>> +        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
>>>> +        test  %rdx, %rdx
>>>> +UNLIKELY_START(g, exit_cr3)
>>> cmp or ne ?
>> "ne" (or really "nz" when used with "test") is outright wrong - we
>> want to skip the restore when the value is zero _or_ negative.
>> What's wrong with "jg" and "test" in combination? There simply is
>> no "jnsz" (other than e.g. "jnbe"). "cmp" against zero could be
>> used here, but why would I use the larger instruction when "test"
>> does?
> 
> greater or less than is not commonly related to the test instruction,
> which is why this looks wrong to a reviewer.
> 
> A comment of /* Ideally jnsz, but jg will have to do */ would go a very
> long way.

Added.

> I've double checked the logic and I agree with your conclusions, but the
> only reason this works is because test unconditionally zeroes the
> overflow flag.

Because, well, there is no overflow possible for AND (and hence
TEST), OR, and XOR.

>>>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>>>          movl  $TRAP_double_fault,4(%rsp)
>>>>          /* Set AC to reduce chance of further SMAP faults */
>>>>          SAVE_ALL STAC
>>>> +
>>>> +        GET_STACK_END(bx)
>>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>>>> +        test  %rbx, %rbx
>>>> +        jz    .Ldblf_cr3_okay
>>>> +        jns   .Ldblf_cr3_load
>>>> +        neg   %rbx
>>>> +.Ldblf_cr3_load:
>>>> +        write_cr3 rbx, rdi, rsi
>>>> +.Ldblf_cr3_okay:
>>>> +
>>> It is moderately common for there to be cascade faults in #DF.  This
>>> would be better if it were the general IST switch.
>> Based on the issues I had with #DF occurring while debugging this,
>> I've decided to keep the code here as simple as possible without
>> being incorrect: There's no point looking at the incoming CR3.
>> There's no point in trying to avoid nested faults (including
>> subsequent #DF) restoring CR3. There's also no point in retaining
>> the value for later restoring here, as we never return. In fact, as
>> mentioned elsewhere, we should imo indeed consider unilaterally
>> switching to idle_pg_table[] here.
> 
> Ok.

"Ok" to which parts of this - keeping the code simple, switching to
idle_pg_table[] (perhaps in a follow-up patch), or both?

Jan



* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 12:12   ` George Dunlap
  2018-01-16 12:21     ` Juergen Gross
@ 2018-01-16 12:35     ` Jan Beulich
  2018-01-16 12:40       ` George Dunlap
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 12:35 UTC (permalink / raw)
  To: George Dunlap; +Cc: Andrew Cooper, xen-devel

>>> On 16.01.18 at 13:12, <dunlapg@umich.edu> wrote:
> On Mon, Jan 15, 2018 at 11:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> First of all we don't need it on AMD systems. Additionally allow its use
>> to be controlled by command line option. For best backportability, this
>> intentionally doesn't use alternative instruction patching to achieve
>> the intended effect - while we likely want it, this will be a later
>> follow-up.
> 
> Is it worth making it optional to apply to dom0?  In most cases, if an
> attacker can manage to get userspace on dom0, they should be able to
> take over the whole system anyway; turning it off on dom0 to get
> better performance seems like a policy decision that administrators
> might reasonably make.

Irrespective of Jürgen's reply (which I agree with) this would be an
option, but I'd prefer to fold this into the stage 2 activities (if we
really want it in the first place).

Jan


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 12:21     ` Juergen Gross
@ 2018-01-16 12:39       ` George Dunlap
  0 siblings, 0 replies; 29+ messages in thread
From: George Dunlap @ 2018-01-16 12:39 UTC (permalink / raw)
  To: Juergen Gross; +Cc: xen-devel, Jan Beulich, Andrew Cooper

On Tue, Jan 16, 2018 at 12:21 PM, Juergen Gross <jgross@suse.com> wrote:
> On 16/01/18 13:12, George Dunlap wrote:
>> On Mon, Jan 15, 2018 at 11:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>> First of all we don't need it on AMD systems. Additionally allow its use
>>> to be controlled by command line option. For best backportability, this
>>> intentionally doesn't use alternative instruction patching to achieve
>>> the intended effect - while we likely want it, this will be a later
>>> follow-up.
>>
>> Is it worth making it optional to apply to dom0?  In most cases, if an
>> attacker can manage to get userspace on dom0, they should be able to
>> take over the whole system anyway; turning it off on dom0 to get
>> better performance seems like a policy decision that administrators
>> might reasonably make.
>
> You are implying here that Linux is insecure: why does userspace access
> allow you to take over the machine? I can see that being true for root
> access, but not for any other unprivileged user account.

Well first of all, go look at my talk about local root exploits in
Linux -- usually there are a few found every month.

But let's ignore that for a minute.  Consider a "typical"
recommended dom0 setup:
- Dom0 on a separate network
- Nothing running on dom0 except Xen-related services, and toolstack

How would an attacker get userspace access on such a host at all?
- Attack a devicemodel running in dom0
- Attack a backend running in the kernel
- Attack xenstore

We don't yet have a deprivileged QEMU, so at the moment an attack on
any of these will already give you full control of the system.

Obviously this wouldn't be appropriate to all systems; but it could be
appropriate to a fair number of them.

 -George


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 12:35     ` Jan Beulich
@ 2018-01-16 12:40       ` George Dunlap
  0 siblings, 0 replies; 29+ messages in thread
From: George Dunlap @ 2018-01-16 12:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

On Tue, Jan 16, 2018 at 12:35 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 16.01.18 at 13:12, <dunlapg@umich.edu> wrote:
>> On Mon, Jan 15, 2018 at 11:07 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>> First of all we don't need it on AMD systems. Additionally allow its use
>>> to be controlled by command line option. For best backportability, this
>>> intentionally doesn't use alternative instruction patching to achieve
>>> the intended effect - while we likely want it, this will be a later
>>> follow-up.
>>
>> Is it worth making it optional to apply to dom0?  In most cases, if an
>> attacker can manage to get userspace on dom0, they should be able to
>> take over the whole system anyway; turning it off on dom0 to get
>> better performance seems like a policy decision that administrators
>> might reasonably make.
>
> Irrespective of Jürgen's reply (which I agree with) this would be an
> option, but I'd prefer to fold this into the stage 2 activities (if we
> really want it in the first place).

That sounds reasonable.

 -George


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16  8:12     ` Jan Beulich
@ 2018-01-16 13:20       ` Andrew Cooper
  2018-01-16 13:51         ` Jan Beulich
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 13:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/01/18 08:12, Jan Beulich wrote:
>>>> On 15.01.18 at 19:26, <andrew.cooper3@citrix.com> wrote:
>> On 15/01/18 11:07, Jan Beulich wrote:
>>> --- a/docs/misc/xen-command-line.markdown
>>> +++ b/docs/misc/xen-command-line.markdown
>>> @@ -1849,6 +1849,15 @@ In the case that x2apic is in use, this
>>>  clustered mode.  The default, given no hint from the **FADT**, is cluster
>>>  mode.
>>>  
>>> +### xpti
>>> +> `= <boolean>`
>>> +
>>> +> Default: `false` on AMD hardware
>>> +> Default: `true` everywhere else
>> This is fine for now, but any further isolation is going to have to be
>> unconditional, or possibly a compile-time choice, but maintaining that
>> going forwards is going to be truly horrible.
>>
>>> +
>>> +Override default selection of whether to isolate 64-bit PV guest page
>>> +tables.
>> This wants a
>>
>> ** WARNING: Not yet a complete isolation implementation, but better than
>> nothing. **
>>
>> or similar, just to avoid giving the people the impression that it is
>> complete.  Perhaps also a similar warning in the patch 1 commit message?
> I've added the warning text here as suggested, and a respective
> remark to patch 1's description, albeit under the current guidelines
> of when we would consider an information leak a security issue, I
> think isolation is sufficiently complete: No parts of the direct map
> not pertaining to the current guest are being exposed, which
> implies that no parts of the Xen heap not pertaining to the current
> guest are. Leaking control page contents of other guests as well
> as leaking the bits and pieces of the Xen image is not an issue by
> itself.
>
> IOW I'm not convinced such a warning is - strictly from a "would
> this need an XSA on its own" pov - entirely appropriate.

The isolation is definitely not complete.  Amongst other things, remote
stacks are in view of an attacker, which is why my KAISER-prereq series
pushes for the fully isolated per-pcpu range.

We do not, under any circumstances, want to give the impression that
this patch makes them completely immune to leaks, but improvements on
this status quo shouldn't need an XSA.

~Andrew


* Re: [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 12:33         ` Jan Beulich
@ 2018-01-16 13:26           ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 13:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On 16/01/18 12:33, Jan Beulich wrote:
>>
>>>>>> On 15.01.18 at 19:23, <andrew.cooper3@citrix.com> wrote:
>>>>  can we collect these together into macros, rather than
>>>> opencoding?  We seem to have 3 distinct variations.
>>> I had considered that (following the model you use in the SP2
>>> series), but decided against it not the least because of the
>>> dependent but placement-wise separated code additions to
>>> restore original values. Plus again this might be a hindrance of
>>> backporting to really old Xen (which then typically will also be
>>> built on really old tool chains) - as you certainly recall old gas
>>> had quite a few issues with macro handling.
>> There is nothing special in these macros though?  I found the SP2-style
>> far easier to backport because they are a single slot-in line.
> I've just found the patch here needing a change in register use
> in 4.7 - such would be a little harder with pre-cooked macros,
> especially when they don't allow specifying all registers to be
> used (including ones for temporaries).
>
>> Anyway, I'm not overly fussed, but I have a
> Unfinished sentence?

Oops.  "I have a feeling that the duplication will lead to subtle bugs
when we need to change things, and inevitably miss one."

>
>>>>> @@ -585,6 +692,17 @@ ENTRY(double_fault)
>>>>>          movl  $TRAP_double_fault,4(%rsp)
>>>>>          /* Set AC to reduce chance of further SMAP faults */
>>>>>          SAVE_ALL STAC
>>>>> +
>>>>> +        GET_STACK_END(bx)
>>>>> +        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
>>>>> +        test  %rbx, %rbx
>>>>> +        jz    .Ldblf_cr3_okay
>>>>> +        jns   .Ldblf_cr3_load
>>>>> +        neg   %rbx
>>>>> +.Ldblf_cr3_load:
>>>>> +        write_cr3 rbx, rdi, rsi
>>>>> +.Ldblf_cr3_okay:
>>>>> +
>>>> It is moderately common for there to be cascade faults in #DF.  This
>>>> would be better if it were the general IST switch.
>>> Based on the issues I had with #DF occurring while debugging this,
>>> I've decided to keep the code here as simple as possible without
>>> being incorrect: There's no point looking at the incoming CR3.
>>> There's no point in trying to avoid nested faults (including
>>> subsequent #DF) restoring CR3. There's also no point in retaining
>>> the value for later restoring here, as we never return. In fact, as
>>> mentioned elsewhere, we should imo indeed consider unilaterally
>>> switching to idle_pg_table[] here.
>> Ok.
> "Ok" to which parts of this - keeping the code simple, switching to
> idle_pg_table[] (perhaps in a follow-up patch), or both?

Keeping the code as it is now, and adjusting in a follow-up.  Thinking
about things, idle_pg_table[], while nice, is incompatible with a more
complete isolation strategy.

~Andrew


* Re: [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 13:20       ` Andrew Cooper
@ 2018-01-16 13:51         ` Jan Beulich
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 13:51 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

>>> On 16.01.18 at 14:20, <andrew.cooper3@citrix.com> wrote:
> The isolation is definitely not complete.  Amongst other things, remote
> stacks are in view of an attacker, which is why my KAISER-prereq series
> pushes for the fully isolated per-pcpu range.

How are remote stacks visible? The local page tables contain only
mappings of the local CPU's stack.

Jan



* [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation
  2018-01-15 11:01 [PATCH v2 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
  2018-01-15 11:06 ` [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
  2018-01-15 11:07 ` [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
@ 2018-01-16 15:16 ` Jan Beulich
  2018-01-16 15:21   ` [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
  2018-01-16 15:22   ` [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
  2 siblings, 2 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 15:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

1: x86: Meltdown band-aid against malicious 64-bit PV guests
2: x86: allow Meltdown band-aid to be disabled

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Addressing review comments (see individual patches).
v2: Addressing review comments for patch 1 (see there) and new
    patch 2.




* [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 15:16 ` [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
@ 2018-01-16 15:21   ` Jan Beulich
  2018-01-16 16:05     ` Andrew Cooper
  2018-01-16 17:28     ` Andy Smith
  2018-01-16 15:22   ` [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
  1 sibling, 2 replies; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 15:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.

The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.

This isn't providing full isolation yet, but it should be covering all
pieces of information the exposure of which would otherwise require an XSA.

There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Add remark in description as to not doing full isolation here. Drop
    use of UNLIKELY_*() from CSTAR and INT80 entry paths.
v2: Use sensible numbers for the register classification constants also
    for the low 8 registers. Insert 'r' into their names to make their
    purpose more clear. Use "sub" instead of "add" when adding an
    immediate of 128. Defer sync IPI in do_mmu_update() until the end of
    the main loop. Use flush IPI there instead event check one. Make
    setup functions return a proper error code. Use better suited local
    variables in clone_mapping(). Add comment to new struct cpu_info
    fields.
---
Backporting notes:
- This needs f9eb74789a ("x86/entry: Remove support for partial
  cpu_user_regs frames") as a prereq, due to the uses of %r14 and %r15.
  But that's intended to be backported anyway (for Spectre/SP2).
- This further needs bd61fe94be ("x86/mm: Always set _PAGE_ACCESSED on
  L4e updates") as a prereq.
- The use of "root" instead of "l4" here is mainly to not make 5-level
  page table additions any harder. In backports "l4" should probably be
  preferred.
---
Follow-up notes:
- use alternatives patching for fully suppressing performance effects
  when disabled
- check whether running globally with CR4.PGE clear helps overall
  performance

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1511,6 +1511,9 @@ void paravirt_ctxt_switch_to(struct vcpu
 {
     unsigned long cr4;
 
+    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
+        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
         write_cr4(cr4);
@@ -1682,6 +1685,8 @@ void context_switch(struct vcpu *prev, s
 
     ASSERT(local_irq_is_enabled());
 
+    get_cpu_info()->xen_cr3 = 0;
+
     cpumask_copy(&dirty_mask, next->vcpu_dirty_cpumask);
     /* Allow at most one CPU at a time to be dirty. */
     ASSERT(cpumask_weight(&dirty_mask) <= 1);
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3520,6 +3520,7 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
+    bool sync_guest = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -3683,6 +3684,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        sync_guest = true;
                     break;
 
                 case PGT_writable_page:
@@ -3787,6 +3790,24 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
+    if ( sync_guest )
+    {
+        /*
+         * Force other vCPU-s of the affected guest to pick up L4 entry
+         * changes (if any). Issue a flush IPI with empty operation mask to
+         * facilitate this (including ourselves waiting for the IPI to
+         * actually have arrived). Utilize the fact that FLUSH_VA_VALID is
+         * meaningless without FLUSH_CACHE, but will allow to pass the no-op
+         * check in flush_area_mask().
+         */
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+
+        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
+        if ( !cpumask_empty(mask) )
+            flush_area_mask(mask, ZERO_BLOCK_PTR, FLUSH_VA_VALID);
+    }
+
     perfc_add(num_page_updates, i);
 
  out:
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -327,6 +327,9 @@ void start_secondary(void *unused)
      */
     spin_debug_disable();
 
+    get_cpu_info()->xen_cr3 = 0;
+    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
+
     load_system_tables();
 
     /* Full exception support from here on in. */
@@ -633,6 +636,187 @@ void cpu_exit_clear(unsigned int cpu)
     set_cpu_state(CPU_STATE_DEAD);
 }
 
+static int clone_mapping(const void *ptr, root_pgentry_t *rpt)
+{
+    unsigned long linear = (unsigned long)ptr, pfn;
+    unsigned int flags;
+    l3_pgentry_t *pl3e = l4e_to_l3e(idle_pg_table[root_table_offset(linear)]) +
+                         l3_table_offset(linear);
+    l2_pgentry_t *pl2e;
+    l1_pgentry_t *pl1e;
+
+    if ( linear < DIRECTMAP_VIRT_START )
+        return 0;
+
+    flags = l3e_get_flags(*pl3e);
+    ASSERT(flags & _PAGE_PRESENT);
+    if ( flags & _PAGE_PSE )
+    {
+        pfn = (l3e_get_pfn(*pl3e) & ~((1UL << (2 * PAGETABLE_ORDER)) - 1)) |
+              (PFN_DOWN(linear) & ((1UL << (2 * PAGETABLE_ORDER)) - 1));
+        flags &= ~_PAGE_PSE;
+    }
+    else
+    {
+        pl2e = l3e_to_l2e(*pl3e) + l2_table_offset(linear);
+        flags = l2e_get_flags(*pl2e);
+        ASSERT(flags & _PAGE_PRESENT);
+        if ( flags & _PAGE_PSE )
+        {
+            pfn = (l2e_get_pfn(*pl2e) & ~((1UL << PAGETABLE_ORDER) - 1)) |
+                  (PFN_DOWN(linear) & ((1UL << PAGETABLE_ORDER) - 1));
+            flags &= ~_PAGE_PSE;
+        }
+        else
+        {
+            pl1e = l2e_to_l1e(*pl2e) + l1_table_offset(linear);
+            flags = l1e_get_flags(*pl1e);
+            if ( !(flags & _PAGE_PRESENT) )
+                return 0;
+            pfn = l1e_get_pfn(*pl1e);
+        }
+    }
+
+    if ( !(root_get_flags(rpt[root_table_offset(linear)]) & _PAGE_PRESENT) )
+    {
+        pl3e = alloc_xen_pagetable();
+        if ( !pl3e )
+            return -ENOMEM;
+        clear_page(pl3e);
+        l4e_write(&rpt[root_table_offset(linear)],
+                  l4e_from_paddr(__pa(pl3e), __PAGE_HYPERVISOR));
+    }
+    else
+        pl3e = l4e_to_l3e(rpt[root_table_offset(linear)]);
+
+    pl3e += l3_table_offset(linear);
+
+    if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+    {
+        pl2e = alloc_xen_pagetable();
+        if ( !pl2e )
+            return -ENOMEM;
+        clear_page(pl2e);
+        l3e_write(pl3e, l3e_from_paddr(__pa(pl2e), __PAGE_HYPERVISOR));
+    }
+    else
+    {
+        ASSERT(!(l3e_get_flags(*pl3e) & _PAGE_PSE));
+        pl2e = l3e_to_l2e(*pl3e);
+    }
+
+    pl2e += l2_table_offset(linear);
+
+    if ( !(l2e_get_flags(*pl2e) & _PAGE_PRESENT) )
+    {
+        pl1e = alloc_xen_pagetable();
+        if ( !pl1e )
+            return -ENOMEM;
+        clear_page(pl1e);
+        l2e_write(pl2e, l2e_from_paddr(__pa(pl1e), __PAGE_HYPERVISOR));
+    }
+    else
+    {
+        ASSERT(!(l2e_get_flags(*pl2e) & _PAGE_PSE));
+        pl1e = l2e_to_l1e(*pl2e);
+    }
+
+    pl1e += l1_table_offset(linear);
+
+    if ( l1e_get_flags(*pl1e) & _PAGE_PRESENT )
+    {
+        ASSERT(l1e_get_pfn(*pl1e) == pfn);
+        ASSERT(l1e_get_flags(*pl1e) == flags);
+    }
+    else
+        l1e_write(pl1e, l1e_from_pfn(pfn, flags));
+
+    return 0;
+}
+
+DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
+
+static int setup_cpu_root_pgt(unsigned int cpu)
+{
+    root_pgentry_t *rpt = alloc_xen_pagetable();
+    unsigned int off;
+    int rc;
+
+    if ( !rpt )
+        return -ENOMEM;
+
+    clear_page(rpt);
+    per_cpu(root_pgt, cpu) = rpt;
+
+    rpt[root_table_offset(RO_MPT_VIRT_START)] =
+        idle_pg_table[root_table_offset(RO_MPT_VIRT_START)];
+    /* SH_LINEAR_PT inserted together with guest mappings. */
+    /* PERDOMAIN inserted during context switch. */
+    rpt[root_table_offset(XEN_VIRT_START)] =
+        idle_pg_table[root_table_offset(XEN_VIRT_START)];
+
+    /* Install direct map page table entries for stack, IDT, and TSS. */
+    for ( off = rc = 0; !rc && off < STACK_SIZE; off += PAGE_SIZE )
+        rc = clone_mapping(__va(__pa(stack_base[cpu])) + off, rpt);
+
+    if ( !rc )
+        rc = clone_mapping(idt_tables[cpu], rpt);
+    if ( !rc )
+        rc = clone_mapping(&per_cpu(init_tss, cpu), rpt);
+
+    return rc;
+}
+
+static void cleanup_cpu_root_pgt(unsigned int cpu)
+{
+    root_pgentry_t *rpt = per_cpu(root_pgt, cpu);
+    unsigned int r;
+
+    if ( !rpt )
+        return;
+
+    per_cpu(root_pgt, cpu) = NULL;
+
+    for ( r = root_table_offset(DIRECTMAP_VIRT_START);
+          r < root_table_offset(HYPERVISOR_VIRT_END); ++r )
+    {
+        l3_pgentry_t *l3t;
+        unsigned int i3;
+
+        if ( !(root_get_flags(rpt[r]) & _PAGE_PRESENT) )
+            continue;
+
+        l3t = l4e_to_l3e(rpt[r]);
+
+        for ( i3 = 0; i3 < L3_PAGETABLE_ENTRIES; ++i3 )
+        {
+            l2_pgentry_t *l2t;
+            unsigned int i2;
+
+            if ( !(l3e_get_flags(l3t[i3]) & _PAGE_PRESENT) )
+                continue;
+
+            ASSERT(!(l3e_get_flags(l3t[i3]) & _PAGE_PSE));
+            l2t = l3e_to_l2e(l3t[i3]);
+
+            for ( i2 = 0; i2 < L2_PAGETABLE_ENTRIES; ++i2 )
+            {
+                if ( !(l2e_get_flags(l2t[i2]) & _PAGE_PRESENT) )
+                    continue;
+
+                ASSERT(!(l2e_get_flags(l2t[i2]) & _PAGE_PSE));
+                free_xen_pagetable(l2e_to_l1e(l2t[i2]));
+            }
+
+            free_xen_pagetable(l2t);
+        }
+
+        free_xen_pagetable(l3t);
+    }
+
+    free_xen_pagetable(rpt);
+}
+
 static void cpu_smpboot_free(unsigned int cpu)
 {
     unsigned int order, socket = cpu_to_socket(cpu);
@@ -671,6 +855,8 @@ static void cpu_smpboot_free(unsigned in
             free_domheap_page(mfn_to_page(mfn));
     }
 
+    cleanup_cpu_root_pgt(cpu);
+
     order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
     free_xenheap_pages(per_cpu(gdt_table, cpu), order);
 
@@ -727,6 +913,11 @@ static int cpu_smpboot_alloc(unsigned in
     set_ist(&idt_tables[cpu][TRAP_nmi],           IST_NONE);
     set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
 
+    rc = setup_cpu_root_pgt(cpu);
+    if ( rc )
+        goto out;
+    rc = -ENOMEM;
+
     for ( stub_page = 0, i = cpu & ~(STUBS_PER_PAGE - 1);
           i < nr_cpu_ids && i <= (cpu | (STUBS_PER_PAGE - 1)); ++i )
         if ( cpu_online(i) && cpu_to_node(i) == node )
@@ -786,6 +977,8 @@ static struct notifier_block cpu_smpboot
 
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
+    int rc;
+
     register_cpu_notifier(&cpu_smpboot_nfb);
 
     mtrr_aps_sync_begin();
@@ -799,6 +992,11 @@ void __init smp_prepare_cpus(unsigned in
 
     stack_base[0] = stack_start;
 
+    rc = setup_cpu_root_pgt(0);
+    if ( rc )
+        panic("Error %d setting up PV root page table\n", rc);
+    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+
     set_nr_sockets();
 
     socket_cpumask = xzalloc_array(cpumask_t *, nr_sockets);
@@ -867,6 +1065,8 @@ void __init smp_prepare_boot_cpu(void)
 #if NR_CPUS > 2 * BITS_PER_LONG
     per_cpu(scratch_cpumask, cpu) = &scratch_cpu0mask;
 #endif
+
+    get_cpu_info()->xen_cr3 = 0;
 }
 
 static void
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -137,6 +137,8 @@ void __dummy__(void)
     OFFSET(CPUINFO_processor_id, struct cpu_info, processor_id);
     OFFSET(CPUINFO_current_vcpu, struct cpu_info, current_vcpu);
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
+    OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
+    OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -199,6 +199,17 @@ ENTRY(cstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Lcstar_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Lcstar_cr3_okay:
+
         GET_CURRENT(bx)
         movq  VCPU_domain(%rbx),%rcx
         cmpb  $0,DOMAIN_is_32bit_pv(%rcx)
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -37,6 +37,32 @@ ENTRY(switch_to_kernel)
 /* %rbx: struct vcpu, interrupts disabled */
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
+
+        /* Copy guest mappings and switch to per-CPU root page table. */
+        mov   %cr3, %r9
+        GET_STACK_END(dx)
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rdx), %rdi
+        movabs $PADDR_MASK & PAGE_MASK, %rsi
+        movabs $DIRECTMAP_VIRT_START, %rcx
+        mov   %rdi, %rax
+        and   %rsi, %rdi
+        and   %r9, %rsi
+        add   %rcx, %rdi
+        add   %rcx, %rsi
+        mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
+        mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
+        mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)
+        rep movsq
+        mov   $ROOT_PAGETABLE_ENTRIES - \
+               ROOT_PAGETABLE_LAST_XEN_SLOT - 1, %ecx
+        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
+                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rsi
+        sub   $(ROOT_PAGETABLE_FIRST_XEN_SLOT - \
+                ROOT_PAGETABLE_LAST_XEN_SLOT - 1) * 8, %rdi
+        rep movsq
+        mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
+        write_cr3 rax, rdi, rsi
+
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -71,6 +97,22 @@ iret_exit_to_guest:
         ALIGN
 /* No special register assumptions. */
 restore_all_xen:
+        /*
+         * Check whether we need to switch to the per-CPU page tables, in
+         * case we return to late PV exit code (from an NMI or #MC).
+         */
+        GET_STACK_END(ax)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
+        test  %rdx, %rdx
+        /*
+         * Ideally the condition would be "nsz", but such doesn't exist,
+         * so "g" will have to do.
+         */
+UNLIKELY_START(g, exit_cr3)
+        write_cr3 rax, rdi, rsi
+UNLIKELY_END(exit_cr3)
+
         RESTORE_ALL adj=8
         iretq
 
@@ -100,7 +142,18 @@ ENTRY(lstar_enter)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-        GET_CURRENT(bx)
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Llstar_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Llstar_cr3_okay:
+
+        __GET_CURRENT(bx)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
 
@@ -192,7 +245,18 @@ GLOBAL(sysenter_eflags_saved)
         pushq $0
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
-        GET_CURRENT(bx)
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Lsyse_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Lsyse_cr3_okay:
+
+        __GET_CURRENT(bx)
         cmpb  $0,VCPU_sysenter_disables_events(%rbx)
         movq  VCPU_sysenter_addr(%rbx),%rax
         setne %cl
@@ -228,13 +292,23 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
 
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
+        neg   %rcx
+        jz    .Lint80_cr3_okay
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+        neg   %rcx
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%rbx)
+.Lint80_cr3_okay:
+
         cmpb  $0,untrusted_msi(%rip)
 UNLIKELY_START(ne, msi_check)
         movl  $0x80,%edi
         call  check_for_unexpected_msi
 UNLIKELY_END(msi_check)
 
-        GET_CURRENT(bx)
+        __GET_CURRENT(bx)
 
         /* Check that the callback is non-null. */
         leaq  VCPU_int80_bounce(%rbx),%rdx
@@ -391,9 +465,27 @@ ENTRY(dom_crash_sync_extable)
 
 ENTRY(common_interrupt)
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .Lintr_cr3_okay
+        jns   .Lintr_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.Lintr_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        xor   %ecx, %ecx
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        testb $3, UREGS_cs(%rsp)
+        cmovnz %rcx, %r15
+.Lintr_cr3_okay:
+
         CR4_PV32_RESTORE
         movq %rsp,%rdi
         callq do_IRQ
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp ret_from_intr
 
 /* No special register assumptions. */
@@ -411,6 +503,23 @@ ENTRY(page_fault)
 /* No special register assumptions. */
 GLOBAL(handle_exception)
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .Lxcpt_cr3_okay
+        jns   .Lxcpt_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.Lxcpt_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        xor   %ecx, %ecx
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        testb $3, UREGS_cs(%rsp)
+        cmovnz %rcx, %r15
+.Lxcpt_cr3_okay:
+
 handle_exception_saved:
         GET_CURRENT(bx)
         testb $X86_EFLAGS_IF>>8,UREGS_eflags+1(%rsp)
@@ -475,6 +584,7 @@ handle_exception_saved:
         leaq  exception_table(%rip),%rdx
         PERFC_INCR(exceptions, %rax, %rbx)
         callq *(%rdx,%rax,8)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
@@ -507,6 +617,7 @@ exception_with_ints_disabled:
         rep;  movsq                     # make room for ec/ev
 1:      movq  UREGS_error_code(%rsp),%rax # ec/ev
         movq  %rax,UREGS_kernel_sizeof(%rsp)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         jmp   restore_all_xen           # return to fixup code
 
 /* No special register assumptions. */
@@ -585,6 +696,17 @@ ENTRY(double_fault)
         movl  $TRAP_double_fault,4(%rsp)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
+
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
+        test  %rbx, %rbx
+        jz    .Ldblf_cr3_okay
+        jns   .Ldblf_cr3_load
+        neg   %rbx
+.Ldblf_cr3_load:
+        write_cr3 rbx, rdi, rsi
+.Ldblf_cr3_okay:
+
         movq  %rsp,%rdi
         call  do_double_fault
         BUG   /* do_double_fault() shouldn't return. */
@@ -603,10 +725,28 @@ ENTRY(nmi)
         movl  $TRAP_nmi,4(%rsp)
 handle_ist_exception:
         SAVE_ALL CLAC
+
+        GET_STACK_END(14)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
+        mov   %rcx, %r15
+        neg   %rcx
+        jz    .List_cr3_okay
+        jns   .List_cr3_load
+        mov   %rcx, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+        neg   %rcx
+.List_cr3_load:
+        write_cr3 rcx, rdi, rsi
+        movq  $0, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
+.List_cr3_okay:
+
         CR4_PV32_RESTORE
         testb $3,UREGS_cs(%rsp)
         jz    1f
-        /* Interrupted guest context. Copy the context to stack bottom. */
+        /*
+         * Interrupted guest context. Clear the restore value for xen_cr3
+         * and copy the context to stack bottom.
+         */
+        xor   %r15, %r15
         GET_CPUINFO_FIELD(guest_cpu_user_regs,di)
         movq  %rsp,%rsi
         movl  $UREGS_kernel_sizeof/8,%ecx
@@ -616,6 +756,7 @@ handle_ist_exception:
         movzbl UREGS_entry_vector(%rsp),%eax
         leaq  exception_table(%rip),%rdx
         callq *(%rdx,%rax,8)
+        mov   %r15, STACK_CPUINFO_FIELD(xen_cr3)(%r14)
         cmpb  $TRAP_nmi,UREGS_entry_vector(%rsp)
         jne   ret_from_intr
 
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -93,9 +93,30 @@ void ret_from_intr(void);
         UNLIKELY_DONE(mp, tag);   \
         __UNLIKELY_END(tag)
 
+        .equ .Lrax, 0
+        .equ .Lrcx, 1
+        .equ .Lrdx, 2
+        .equ .Lrbx, 3
+        .equ .Lrsp, 4
+        .equ .Lrbp, 5
+        .equ .Lrsi, 6
+        .equ .Lrdi, 7
+        .equ .Lr8,  8
+        .equ .Lr9,  9
+        .equ .Lr10, 10
+        .equ .Lr11, 11
+        .equ .Lr12, 12
+        .equ .Lr13, 13
+        .equ .Lr14, 14
+        .equ .Lr15, 15
+
 #define STACK_CPUINFO_FIELD(field) (1 - CPUINFO_sizeof + CPUINFO_##field)
 #define GET_STACK_END(reg)                        \
+        .if .Lr##reg > 8;                         \
+        movq $STACK_SIZE-1, %r##reg;              \
+        .else;                                    \
         movl $STACK_SIZE-1, %e##reg;              \
+        .endif;                                   \
         orq  %rsp, %r##reg
 
 #define GET_CPUINFO_FIELD(field, reg)             \
@@ -177,6 +198,15 @@ void ret_from_intr(void);
 #define ASM_STAC ASM_AC(STAC)
 #define ASM_CLAC ASM_AC(CLAC)
 
+.macro write_cr3 val:req, tmp1:req, tmp2:req
+        mov   %cr4, %\tmp1
+        mov   %\tmp1, %\tmp2
+        and   $~X86_CR4_PGE, %\tmp1
+        mov   %\tmp1, %cr4
+        mov   %\val, %cr3
+        mov   %\tmp2, %cr4
+.endm
+
 #define CR4_PV32_RESTORE                                           \
         667: ASM_NOP5;                                             \
         .pushsection .altinstr_replacement, "ax";                  \
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -41,6 +41,18 @@ struct cpu_info {
     struct vcpu *current_vcpu;
     unsigned long per_cpu_offset;
     unsigned long cr4;
+    /*
+     * Of the two following fields the latter is being set to the CR3 value
+     * to be used on the given pCPU for loading whenever 64-bit PV guest
+     * context is being entered. The value never changes once set.
+     * The former is the value to restore when re-entering Xen, if any. IOW
+     * its value being zero means there's nothing to restore. However, its
+     * value can also be negative, indicating to the exit-to-Xen code that
+     * restoring is not necessary, but allowing any nested entry code paths
+     * to still know the value to put back into CR3.
+     */
+    unsigned long xen_cr3;
+    unsigned long pv_cr3;
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -462,6 +462,7 @@ extern idt_entry_t idt_table[];
 extern idt_entry_t *idt_tables[];
 
 DECLARE_PER_CPU(struct tss_struct, init_tss);
+DECLARE_PER_CPU(root_pgentry_t *, root_pgt);
 
 extern void init_int80_direct_trap(struct vcpu *v);
 
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -24,8 +24,8 @@
 /* These are architectural limits. Current CPUs support only 40-bit phys. */
 #define PADDR_BITS              52
 #define VADDR_BITS              48
-#define PADDR_MASK              ((1UL << PADDR_BITS)-1)
-#define VADDR_MASK              ((1UL << VADDR_BITS)-1)
+#define PADDR_MASK              ((_AC(1,UL) << PADDR_BITS)-1)
+#define VADDR_MASK              ((_AC(1,UL) << VADDR_BITS)-1)
 
 #define VADDR_TOP_BIT           (1UL << (VADDR_BITS - 1))
 #define CANONICAL_MASK          (~0UL & ~VADDR_MASK)
@@ -107,6 +107,7 @@ typedef l4_pgentry_t root_pgentry_t;
       : (((_s) < ROOT_PAGETABLE_FIRST_XEN_SLOT) ||  \
          ((_s) > ROOT_PAGETABLE_LAST_XEN_SLOT)))
 
+#define root_table_offset         l4_table_offset
 #define root_get_pfn              l4e_get_pfn
 #define root_get_flags            l4e_get_flags
 #define root_get_intpte           l4e_get_intpte




^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 15:16 ` [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
  2018-01-16 15:21   ` [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
@ 2018-01-16 15:22   ` Jan Beulich
  2018-01-16 16:07     ` Andrew Cooper
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Beulich @ 2018-01-16 15:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper

First of all we don't need it on AMD systems. Additionally allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve the
intended effect - while we likely want that, it is left for a later
follow-up.
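
For illustration, the band-aid would then be disabled via the hypervisor
command line, e.g. in a GRUB2 menu entry (shown only as an example; any
other hypervisor options simply go on the same line):

    multiboot2 /boot/xen.gz xpti=false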

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Add warning about incomplete implementation to command line option
    doc. Use local variable in paravirt_ctxt_switch_to().
v2: New.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -1849,6 +1849,18 @@ In the case that x2apic is in use, this
 clustered mode.  The default, given no hint from the **FADT**, is cluster
 mode.
 
+### xpti
+> `= <boolean>`
+
+> Default: `false` on AMD hardware
+> Default: `true` everywhere else
+
+Override default selection of whether to isolate 64-bit PV guest page
+tables.
+
+** WARNING: Not yet a complete isolation implementation, but better than
+nothing. **
+
 ### xsave
 > `= <boolean>`
 
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1509,10 +1509,13 @@ void paravirt_ctxt_switch_from(struct vc
 
 void paravirt_ctxt_switch_to(struct vcpu *v)
 {
+    root_pgentry_t *root_pgt = this_cpu(root_pgt);
     unsigned long cr4;
 
-    this_cpu(root_pgt)[root_table_offset(PERDOMAIN_VIRT_START)] =
-        l4e_from_page(v->domain->arch.perdomain_l3_pg, __PAGE_HYPERVISOR_RW);
+    if ( root_pgt )
+        root_pgt[root_table_offset(PERDOMAIN_VIRT_START)] =
+            l4e_from_page(v->domain->arch.perdomain_l3_pg,
+                          __PAGE_HYPERVISOR_RW);
 
     cr4 = pv_guest_cr4_to_real_cr4(v);
     if ( unlikely(cr4 != read_cr4()) )
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3685,7 +3685,7 @@ long do_mmu_update(
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
                     if ( !rc )
-                        sync_guest = true;
+                        sync_guest = this_cpu(root_pgt);
                     break;
 
                 case PGT_writable_page:
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -328,7 +328,7 @@ void start_secondary(void *unused)
     spin_debug_disable();
 
     get_cpu_info()->xen_cr3 = 0;
-    get_cpu_info()->pv_cr3 = __pa(this_cpu(root_pgt));
+    get_cpu_info()->pv_cr3 = this_cpu(root_pgt) ? __pa(this_cpu(root_pgt)) : 0;
 
     load_system_tables();
 
@@ -734,14 +734,20 @@ static int clone_mapping(const void *ptr
     return 0;
 }
 
+static __read_mostly int8_t opt_xpti = -1;
+boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
-    root_pgentry_t *rpt = alloc_xen_pagetable();
+    root_pgentry_t *rpt;
     unsigned int off;
     int rc;
 
+    if ( !opt_xpti )
+        return 0;
+
+    rpt = alloc_xen_pagetable();
     if ( !rpt )
         return -ENOMEM;
 
@@ -992,10 +998,14 @@ void __init smp_prepare_cpus(unsigned in
 
     stack_base[0] = stack_start;
 
+    if ( opt_xpti < 0 )
+        opt_xpti = boot_cpu_data.x86_vendor != X86_VENDOR_AMD;
+
     rc = setup_cpu_root_pgt(0);
     if ( rc )
         panic("Error %d setting up PV root page table\n", rc);
-    get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
+    if ( per_cpu(root_pgt, 0) )
+        get_cpu_info()->pv_cr3 = __pa(per_cpu(root_pgt, 0));
 
     set_nr_sockets();
 
@@ -1067,6 +1077,7 @@ void __init smp_prepare_boot_cpu(void)
 #endif
 
     get_cpu_info()->xen_cr3 = 0;
+    get_cpu_info()->pv_cr3 = 0;
 }
 
 static void
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -46,6 +46,7 @@ restore_all_guest:
         movabs $DIRECTMAP_VIRT_START, %rcx
         mov   %rdi, %rax
         and   %rsi, %rdi
+        jz    .Lrag_keep_cr3
         and   %r9, %rsi
         add   %rcx, %rdi
         add   %rcx, %rsi
@@ -62,6 +63,7 @@ restore_all_guest:
         rep movsq
         mov   %r9, STACK_CPUINFO_FIELD(xen_cr3)(%rdx)
         write_cr3 rax, rdi, rsi
+.Lrag_keep_cr3:
 
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 15:21   ` [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
@ 2018-01-16 16:05     ` Andrew Cooper
  2018-01-16 17:28     ` Andy Smith
  1 sibling, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 16:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 16/01/18 15:21, Jan Beulich wrote:
> --- a/xen/include/asm-x86/x86_64/page.h
> +++ b/xen/include/asm-x86/x86_64/page.h
> @@ -24,8 +24,8 @@
>  /* These are architectural limits. Current CPUs support only 40-bit phys. */
>  #define PADDR_BITS              52
>  #define VADDR_BITS              48
> -#define PADDR_MASK              ((1UL << PADDR_BITS)-1)
> -#define VADDR_MASK              ((1UL << VADDR_BITS)-1)
> +#define PADDR_MASK              ((_AC(1,UL) << PADDR_BITS)-1)
> +#define VADDR_MASK              ((_AC(1,UL) << VADDR_BITS)-1)

Sorry to only notice these now, but please add spaces as you're adjusting these.
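
I.e. presumably something along the lines of (just spelling out the
requested style, not taken from any committed version):

    #define PADDR_MASK              ((_AC(1, UL) << PADDR_BITS) - 1)
    #define VADDR_MASK              ((_AC(1, UL) << VADDR_BITS) - 1)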

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

>  
>  #define VADDR_TOP_BIT           (1UL << (VADDR_BITS - 1))
>  #define CANONICAL_MASK          (~0UL & ~VADDR_MASK)
>



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled
  2018-01-16 15:22   ` [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
@ 2018-01-16 16:07     ` Andrew Cooper
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 16:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

On 16/01/18 15:22, Jan Beulich wrote:
> First of all we don't need it on AMD systems. Additionally allow its use
> to be controlled by command line option. For best backportability, this
> intentionally doesn't use alternative instruction patching to achieve
> the intended effect - while we likely want it, this will be later
> follow-up.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 15:21   ` [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
  2018-01-16 16:05     ` Andrew Cooper
@ 2018-01-16 17:28     ` Andy Smith
  2018-01-16 18:02       ` George Dunlap
                         ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Andy Smith @ 2018-01-16 17:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

Hi Jan,

On Tue, Jan 16, 2018 at 08:21:52AM -0700, Jan Beulich wrote:
> This is a very simplistic change limiting the amount of memory a running
> 64-bit PV guest has mapped (and hence available for attacking): Only the
> mappings of stack, IDT, and TSS are being cloned from the direct map
> into per-CPU page tables.

Can this be used with Comet/Vixen to further protect PV guests? i.e.
if the shim hypervisor has these changes then will it also limit
what a process in the PV guest can see in that shim hypervisor,
which therefore protects its own guest kernel a bit too?

Thanks,
Andy


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 17:28     ` Andy Smith
@ 2018-01-16 18:02       ` George Dunlap
  2018-01-16 18:13       ` Andrew Cooper
  2018-01-16 19:02       ` Wei Liu
  2 siblings, 0 replies; 29+ messages in thread
From: George Dunlap @ 2018-01-16 18:02 UTC (permalink / raw)
  To: Andy Smith; +Cc: xen-devel, Jan Beulich

On Tue, Jan 16, 2018 at 5:28 PM, Andy Smith <andy@strugglers.net> wrote:
> Hi Jan,
>
> On Tue, Jan 16, 2018 at 08:21:52AM -0700, Jan Beulich wrote:
>> This is a very simplistic change limiting the amount of memory a running
>> 64-bit PV guest has mapped (and hence available for attacking): Only the
>> mappings of stack, IDT, and TSS are being cloned from the direct map
>> into per-CPU page tables.
>
> Can this be used with Comet/Vixen to further protect PV guests? i.e.
> if the shim hypervisor has these changes then will it also limit
> what a process in the PV guest can see in that shim hypervisor,
> which therefore protects its own guest kernel a bit too?

Technically, yes, it should.

However,
 1) It should be unnecessary.  If you're running PV with the
"bandaid", you should be reasonably safe without using the shim.
 2) The shim adds nearly 40% overhead in my worst-case tests, and so
does the bandaid.  Together I think your performance would be pretty
terrible.

 -George


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 17:28     ` Andy Smith
  2018-01-16 18:02       ` George Dunlap
@ 2018-01-16 18:13       ` Andrew Cooper
  2018-01-16 19:02       ` Wei Liu
  2 siblings, 0 replies; 29+ messages in thread
From: Andrew Cooper @ 2018-01-16 18:13 UTC (permalink / raw)
  To: Andy Smith, Jan Beulich; +Cc: xen-devel

On 16/01/18 17:28, Andy Smith wrote:
> Hi Jan,
>
> On Tue, Jan 16, 2018 at 08:21:52AM -0700, Jan Beulich wrote:
>> This is a very simplistic change limiting the amount of memory a running
>> 64-bit PV guest has mapped (and hence available for attacking): Only the
>> mappings of stack, IDT, and TSS are being cloned from the direct map
>> into per-CPU page tables.
> Can this be used with Comet/Vixen to further protect PV guests? i.e.
> if the shim hypervisor has these changes then will it also limit
> what a process in the PV guest can see in that shim hypervisor,
> which therefore protects its own guest kernel a bit too?

Yes.

~Andrew


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests
  2018-01-16 17:28     ` Andy Smith
  2018-01-16 18:02       ` George Dunlap
  2018-01-16 18:13       ` Andrew Cooper
@ 2018-01-16 19:02       ` Wei Liu
  2 siblings, 0 replies; 29+ messages in thread
From: Wei Liu @ 2018-01-16 19:02 UTC (permalink / raw)
  To: Andy Smith; +Cc: xen-devel, Wei Liu, Jan Beulich

On Tue, Jan 16, 2018 at 05:28:40PM +0000, Andy Smith wrote:
> Hi Jan,
> 
> On Tue, Jan 16, 2018 at 08:21:52AM -0700, Jan Beulich wrote:
> > This is a very simplistic change limiting the amount of memory a running
> > 64-bit PV guest has mapped (and hence available for attacking): Only the
> > mappings of stack, IDT, and TSS are being cloned from the direct map
> > into per-CPU page tables.
> 
> Can this be used with Comet/Vixen to further protect PV guests? i.e.
> if the shim hypervisor has these changes then will it also limit
> what a process in the PV guest can see in that shim hypervisor,
> which therefore protects its own guest kernel a bit too?
> 

Yes, but please be warned that the guest is very very slow. I don't
think XPTI + shim is very usable at this stage.

If you're interested in trying that out, check out the staging branch and
build a shim from there.
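
Roughly (a sketch only - the exact shim build steps may differ depending
on the state of the staging tree at the time):

    git clone https://xenbits.xen.org/git-http/xen.git
    cd xen
    git checkout staging
    make -C xen -j$(nproc)     # hypervisor with the XPTI band-aid
    # the PV shim itself is built as part of the firmware bits under
    # tools/firmware/ - see the Comet series / tree README for details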

Wei.


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2018-01-16 19:03 UTC | newest]

Thread overview: 29+ messages
2018-01-15 11:01 [PATCH v2 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
2018-01-15 11:06 ` [PATCH v2 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
2018-01-15 18:23   ` Andrew Cooper
2018-01-16  7:46     ` Jan Beulich
2018-01-16 11:51       ` Andrew Cooper
2018-01-16 12:33         ` Jan Beulich
2018-01-16 13:26           ` Andrew Cooper
2018-01-16  9:33     ` Jan Beulich
2018-01-16 11:56       ` Andrew Cooper
2018-01-16 12:25         ` Jan Beulich
2018-01-15 11:07 ` [PATCH v2 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
2018-01-15 18:26   ` Andrew Cooper
2018-01-16  8:12     ` Jan Beulich
2018-01-16 13:20       ` Andrew Cooper
2018-01-16 13:51         ` Jan Beulich
2018-01-16 12:12   ` George Dunlap
2018-01-16 12:21     ` Juergen Gross
2018-01-16 12:39       ` George Dunlap
2018-01-16 12:35     ` Jan Beulich
2018-01-16 12:40       ` George Dunlap
2018-01-16 15:16 ` [PATCH v3 0/2] x86: initial simplistic Meltdown mitigation Jan Beulich
2018-01-16 15:21   ` [PATCH v3 1/2] x86: Meltdown band-aid against malicious 64-bit PV guests Jan Beulich
2018-01-16 16:05     ` Andrew Cooper
2018-01-16 17:28     ` Andy Smith
2018-01-16 18:02       ` George Dunlap
2018-01-16 18:13       ` Andrew Cooper
2018-01-16 19:02       ` Wei Liu
2018-01-16 15:22   ` [PATCH v3 2/2] x86: allow Meltdown band-aid to be disabled Jan Beulich
2018-01-16 16:07     ` Andrew Cooper
