* [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations
@ 2015-08-06 16:45 Ben Catterall
From: Ben Catterall @ 2015-08-06 16:45 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
	Ben Catterall

Hi all,

I have a working base for this and would appreciate feedback at this point to
evaluate if it is moving in the right direction.

Many thanks in advance,
Ben

The aim of this work is to create a proof-of-concept to establish whether it is
feasible to move certain Xen operations into a deprivileged context, to
mitigate the impact of a bug or compromise in those areas. Examples would be
x86_emulate, or virtual device emulation which is not done in QEMU for
performance reasons.

This patch series contains the underlying support mechanisms for this mode,
which include:
 - Setting up the necessary monitor page table entries for the deprivileged
   code, data and stack regions
 - Moving into and out of this mode
 - Handling system calls from this mode
 - Trapping exceptions taken whilst in this mode


Performance testing
-------------------
Performance testing indicates that the overhead for this deprivileged mode is
approximately 25%. This overhead is the cost of moving into deprivileged mode
and then fully back out of deprivileged mode.

I performed 100000 writes to a single I/O port on a 2.2GHz Intel Xeon
E5-2407 0 processor. This was done from a Python script running inside the HVM
guest (Debian Jessie), timed using time.time(). Each write was trapped to
cause a vmexit, and the time for each write was calculated. These experiments
were repeated. Note that only the host and this HVM guest (both Debian Jessie)
were running during the experiments.

20e-6 seconds was the average time for performing the write without the
      deprivileged code running.
25e-6 seconds was the average time for performing the write with an entry and
      exit from deprivileged mode.
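
For reference, a minimal C equivalent of the guest-side timing loop (the
original measurement used a Python script; the port number and iteration
count here are illustrative assumptions):

    #include <stdio.h>
    #include <time.h>
    #include <sys/io.h>   /* ioperm(), outb(); requires root in the guest */

    #define PORT  0x80    /* assumed port; any I/O port that traps will do */
    #define ITERS 100000

    int main(void)
    {
        struct timespec t0, t1;

        if ( ioperm(PORT, 1, 1) )          /* gain access to the port */
            return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for ( int i = 0; i < ITERS; i++ )
            outb(0, PORT);                 /* each write vmexits to Xen */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%e seconds per write\n",
               ((t1.tv_sec - t0.tv_sec) +
                (t1.tv_nsec - t0.tv_nsec) * 1e-9) / ITERS);
        return 0;
    }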

Further Work
------------
 - Support migration of vcpus between pcpus. This will likely be done by using
   a hard affinity to a pcpu and setting a 'migration pending' flag so that,
   once we return from deprivileged mode and the stack has unwound, we can
   migrate the vcpu.
   - Prevent DoS attacks on migration: a counter is needed to stop a spinning
     deprivileged mode from blocking migration indefinitely. We could count
     the number of quanta which have passed since we failed to migrate, then
     force the migration once it becomes too high (see the sketch after this
     list).

 - Add support for SVM and test on AMD processors.
   - We need to get the host MSRs for AMD SVM mode.
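
Illustrative only: a sketch of how the anti-DoS counter check might look on
the scheduling path once the vcpu has left deprivileged mode. The field names,
the threshold and the helper below are all hypothetical, not existing Xen code:

    #define DEPRIV_MIGRATE_MAX_QUANTA 8         /* hypothetical threshold */

    /* Hypothetical hook, run at the end of each scheduling quantum */
    void hvm_deprivileged_maybe_migrate(struct vcpu *v)
    {
        if ( !v->migration_pending )            /* hypothetical flag */
            return;

        /* A vcpu spinning in deprivileged mode only defers migration for
         * a bounded number of quanta before we force the issue.
         */
        if ( v->user_mode && ++v->depriv_quanta < DEPRIV_MIGRATE_MAX_QUANTA )
            return;

        v->depriv_quanta = 0;
        vcpu_migrate_to_target(v);              /* hypothetical helper */
    }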

Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>


* [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
From: Ben Catterall @ 2015-08-06 16:45 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
	Ben Catterall

This allocation function is used by the deprivileged mode initialisation code
to allocate, from the HAP page pool, the pages needed for the new page table
mappings and page frames.

Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
 xen/arch/x86/mm/hap/hap.c    | 23 +++++++++++++++++++++++
 xen/include/asm-x86/domain.h |  1 +
 xen/include/asm-x86/hap.h    |  5 +++++
 3 files changed, 29 insertions(+)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index d375c4d..593cba7 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -844,6 +844,29 @@ static const struct paging_mode hap_paging_long_mode = {
     .guest_levels           = 4
 };
 
+/************************************************/
+/*               HAP HVM functions              */
+/************************************************/
+struct page_info *hvm_alloc_page(struct domain *d)
+{
+    struct page_info *pg;
+
+    if( (pg = hap_alloc(d)) == NULL )
+    {
+        HAP_ERROR("HVM: Out of memory allocating HVM page\n");
+        domain_crash(d);
+        return NULL;
+    }
+
+    ASSERT(paging_locked_by_me(d));
+
+    /* Track our used pages */
+    page_list_add(pg, &d->arch.paging.hap.hvmlist);
+
+    return pg;
+}
+
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 0fce09e..c46a31a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -154,6 +154,7 @@ struct shadow_vcpu {
 /************************************************/
 struct hap_domain {
     struct page_list_head freelist;
+    struct page_list_head hvmlist;  /* HVM usermode page allocation */
     unsigned int      total_pages;  /* number of pages allocated */
     unsigned int      free_pages;   /* number of pages on freelists */
     unsigned int      p2m_pages;    /* number of pages allocates to p2m */
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index bd87481..0432eb5 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -48,6 +48,11 @@ int   hap_track_dirty_vram(struct domain *d,
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
 void hap_set_alloc_for_pvh_dom0(struct domain *d, unsigned long num_pages);
 
+/************************************************/
+/*               hap HVM functions              */
+/************************************************/
+
+struct page_info *hvm_alloc_page(struct domain *d);
 #endif /* XEN_HAP_H */
 
 /*
-- 
2.1.4


* [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables
From: Ben Catterall @ 2015-08-06 16:45 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
	Ben Catterall

The paging structure mappings for the deprivileged mode are added
to the monitor page table for HVM guests. The entries are generated by
walking the page tables and mapping in new pages. If a higher-level page table
mapping already exists, we attempt to traverse all the way down to a leaf page
and add the page there. Access bits are flipped as needed.

Page entries are generated for the deprivileged .text, .data and stack. The
data is copied from sections allocated by the linker. The mappings are set up
in an unused portion of the Xen virtual address space. The pages are mapped in
as user-mode accessible, with the NX bit set for the data and stack regions;
the code region is executable and read-only.

The needed pages are allocated from the HAP page pool and are deallocated when
those pool pages are deallocated (on domain destruction).

Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
 xen/arch/x86/hvm/Makefile          |   1 +
 xen/arch/x86/hvm/deprivileged.c    | 441 +++++++++++++++++++++++++++++++++++++
 xen/arch/x86/mm/hap/hap.c          |  10 +-
 xen/arch/x86/xen.lds.S             |  19 ++
 xen/include/asm-x86/config.h       |  10 +-
 xen/include/xen/hvm/deprivileged.h |  94 ++++++++
 6 files changed, 573 insertions(+), 2 deletions(-)
 create mode 100644 xen/arch/x86/hvm/deprivileged.c
 create mode 100644 xen/include/xen/hvm/deprivileged.h

diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index 794e793..bd83ba3 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -16,6 +16,7 @@ obj-y += pmtimer.o
 obj-y += quirks.o
 obj-y += rtc.o
 obj-y += save.o
+obj-y += deprivileged.o
 obj-y += stdvga.o
 obj-y += vioapic.o
 obj-y += viridian.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
new file mode 100644
index 0000000..071d900
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -0,0 +1,441 @@
+/*
+ * HVM deprivileged mode to provide support for running operations in
+ * user mode from Xen
+ */
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <xen/compiler.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+#include <asm/hvm/vmx/vmx.h>
+#include <xen/hvm/deprivileged.h>
+
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
+{
+    int ret;
+    void *p;
+    unsigned long int size;
+
+    /* Copy the .text segment for deprivileged code */
+    size = (unsigned long int)&__hvm_deprivileged_text_end -
+           (unsigned long int)&__hvm_deprivileged_text_start;
+
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)&__hvm_deprivileged_text_start,
+                              (unsigned long int)HVM_DEPRIVILEGED_TEXT_ADDR,
+                              size, 0 /* No write */);
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv .text. Code: %d", ret);
+        domain_crash(d);
+        return;
+    }
+
+    /* Copy the .data segment for ring3 code */
+    size = (unsigned long int)&__hvm_deprivileged_data_end -
+           (unsigned long int)&__hvm_deprivileged_data_start;
+
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)&__hvm_deprivileged_data_start,
+                              (unsigned long int)HVM_DEPRIVILEGED_DATA_ADDR,
+                              size, _PAGE_NX | _PAGE_RW);
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv .data. Code: %d", ret);
+        domain_crash(d);
+        return;
+    }
+
+    /* Set up the stack mappings.
+     * This is a bit of a hack: by allocating a blank area
+     * we can reuse the hvm_deprivileged_copy_pages code.
+     * The stacks are actually allocated in the per-vcpu initialisation
+     * routine.
+     */
+    size = STACK_SIZE;
+
+    p = alloc_xenheap_pages(STACK_ORDER, 0);
+    ret = hvm_deprivileged_copy_pages(d, l4e,
+                              (unsigned long int)p,
+                              (unsigned long int)HVM_DEPRIVILEGED_STACK_ADDR,
+                              size, _PAGE_NX | _PAGE_RW);
+
+    free_xenheap_pages(p, STACK_ORDER);
+
+    if( ret )
+    {
+        printk("HVM: Error when initialising depriv stack. Code: %d", ret);
+        domain_crash(d);
+        return;
+    }
+}
+
+void hvm_deprivileged_destroy(struct domain *d)
+{
+    struct page_info *pg = NULL;
+
+    /* Free up all the pages we allocated from the HAP HVM list */
+    while( (pg = page_list_remove_head(&d->arch.paging.hap.hvmlist)) )
+    {
+        d->arch.paging.hap.free_pages++;
+        page_list_add_tail(pg, &d->arch.paging.hap.freelist);
+    }
+}
+
+/* Create a copy of the data at the specified virtual address.
+ * When writing to the destination, it will walk the page table hierarchy,
+ * creating new levels as needed.
+ * new levels as needed.
+ *
+ * If we find a leaf entry in a page table (one which holds the
+ * mfn of a 4KB, 2MB, etc. page frame) which has already been
+ * mapped in, then we bail as we have a collision and this likely
+ * means a bug or the memory configuration has been changed.
+ *
+ * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
+ * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
+ * The PAGE_RW flag will be enabled for all page structure entries
+ * above the leaf page if that leaf page has PAGE_RW set. This is needed to
+ * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
+ *
+ * TODO: We proceed down to L1 4KB pages and then map these in. We should
+ * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
+ * page access. When we stop would depend on size (e.g. use 2MB pages for a
+ * few MBs)
+ */
+int hvm_deprivileged_copy_pages(struct domain *d,
+                            l4_pgentry_t *l4t_base,
+                            unsigned long int src_start,
+                            unsigned long int dst_start,
+                            unsigned long int size,
+                            unsigned int l1_flags)
+{
+    struct page_info *l3t_pg; /* the destination page */
+    l3_pgentry_t *l3t_base;
+    unsigned long int l4t_idx_dst_start;
+    unsigned long int l4t_idx_dst_end;
+    unsigned int i;
+    unsigned long int flags = _PAGE_USER | _PAGE_PRESENT;
+
+    /* Leaf page needs RW? */
+    if( l1_flags & _PAGE_RW )
+        flags |= _PAGE_RW;
+
+    /* Calculate where in the destination we need pages
+     * The PML4 page table doesn't map all of virtual memory: for a
+     * 48-bit implementation it's just 512 L4 slots. We also need to
+     * know which L4 slots our entries lie in.
+     */
+    l4t_idx_dst_start = l4_table_offset(dst_start);
+    l4t_idx_dst_end   = l4_table_offset(dst_start + size) +
+        !!( ((dst_start + size) & ((1ul << L4_PAGETABLE_SHIFT) - 1)) %
+            HVM_L4_PAGE_RANGE );
+
+    for( i = l4t_idx_dst_start; i < l4t_idx_dst_end; i++ )
+    {
+        ASSERT( size >= 0 );
+
+        /* Is this an empty entry? */
+        if( !(l4e_get_intpte(l4t_base[i])) )
+        {
+            /* Allocate a new L3 table */
+            if( (l3t_pg = hvm_alloc_page(d)) == NULL )
+                return HVM_ERR_PG_ALLOC;
+
+            l3t_base = map_domain_page(_mfn(page_to_mfn(l3t_pg)));
+
+            /* Add the page into the L4 table */
+            l4t_base[i] = l4e_from_page(l3t_pg, flags);
+
+            hvm_deprivileged_copy_l3(d, l3t_base, src_start, dst_start,
+                        (size > HVM_L4_PAGE_RANGE) ? HVM_L4_PAGE_RANGE : size,
+                        l1_flags);
+
+            unmap_domain_page(l3t_base);
+        }
+        else
+        {
+            /* If there is already page information then the page has been
+             * prepared by something else
+             *
+             * If the PSE bit is set, then we can't recurse as this is
+             * a leaf page so we fail.
+             */
+            if( (l4e_get_flags(l4t_base[i]) & _PAGE_PSE) )
+            {
+                panic("HVM: Leaf page is already mapped\n");
+            }
+
+            /* We can try recursing on this and see if where we want to put our
+             * new pages is empty.
+             *
+             * We do need to flip this to be a user mode page though so that
+             * the usermode children can be accessed. This is fine as long as
+             * we preserve the access bits of any supervisor entries that are
+             * used in the leaf case.
+             */
+
+            l3t_base = map_l3t_from_l4e(l4t_base[i]);
+
+            hvm_deprivileged_copy_l3(d, l3t_base, src_start, dst_start,
+                        (size > HVM_L4_PAGE_RANGE) ? HVM_L4_PAGE_RANGE : size,
+                        l1_flags);
+
+            l4t_base[i] = l4e_from_intpte(l4e_get_intpte(l4t_base[i]) | flags);
+
+            unmap_domain_page(l3t_base);
+        }
+
+        size -= HVM_L4_PAGE_RANGE;
+        src_start += HVM_L4_PAGE_RANGE;
+        dst_start += HVM_L4_PAGE_RANGE;
+    }
+
+    return 0;
+}
+
+int hvm_deprivileged_copy_l3(struct domain *d,
+                         l3_pgentry_t *l3t_base,
+                         unsigned long int src_start,
+                         unsigned long int dst_start,
+                         unsigned long int size,
+                         unsigned int l1_flags)
+{
+    struct page_info *l2t_pg; /* the destination page */
+    l2_pgentry_t *l2t_base;
+    unsigned long int l3t_idx_dst_start;
+    unsigned long int l3t_idx_dst_end;
+    unsigned int i;
+    unsigned long int flags = _PAGE_USER | _PAGE_PRESENT;
+
+    /* Leaf page needs RW? */
+    if( l1_flags & _PAGE_RW )
+        flags |= _PAGE_RW;
+
+    /* Calculate where in the destination we need pages */
+    l3t_idx_dst_start = l3_table_offset(dst_start);
+    l3t_idx_dst_end   = l3_table_offset(dst_start + size) +
+        !!( ((dst_start + size) & ((1ul << L3_PAGETABLE_SHIFT) - 1)) %
+             HVM_L3_PAGE_RANGE );
+
+    for( i = l3t_idx_dst_start; i < l3t_idx_dst_end; i++ )
+    {
+        ASSERT( size >= 0 );
+
+        /* Is this an empty entry? */
+        if( !(l3e_get_intpte(l3t_base[i])) )
+        {
+            /* Allocate a new L2 table */
+            if( (l2t_pg = hvm_alloc_page(d)) == NULL )
+                return HVM_ERR_PG_ALLOC;
+
+            l2t_base = map_domain_page(_mfn(page_to_mfn(l2t_pg)));
+
+            /* Add the page into the L3 table */
+            l3t_base[i] = l3e_from_page(l2t_pg, flags);
+
+            hvm_deprivileged_copy_l2(d, l2t_base, src_start, dst_start,
+                        (size > HVM_L3_PAGE_RANGE) ? HVM_L3_PAGE_RANGE : size,
+                        l1_flags);
+
+            unmap_domain_page(l2t_base);
+        }
+        else
+        {
+            /* If there is already page information then the page has been
+             * prepared by something else
+             *
+             * If the PSE bit is set, then we can't recurse as this is
+             * a leaf page so we fail.
+             */
+            if( (l3e_get_flags(l3t_base[i]) & _PAGE_PSE) )
+            {
+                panic("HVM: Leaf page is already mapped\n");
+            }
+
+            /* We can try recursing on this and see if where we want to put our
+             * new pages is empty.
+             *
+             * We do need to flip this to be a user mode page though so that
+             * the usermode children can be accessed. This is fine as long as
+             * we preserve the access bits of any supervisor entries that are
+             * used in the leaf case.
+             */
+
+            l2t_base = map_l2t_from_l3e(l3t_base[i]);
+
+            hvm_deprivileged_copy_l2(d, l2t_base, src_start, dst_start,
+                        (size > HVM_L3_PAGE_RANGE) ? HVM_L3_PAGE_RANGE : size,
+                        l1_flags);
+
+            l3t_base[i] = l3e_from_intpte(l3e_get_intpte(l3t_base[i]) | flags);
+
+            unmap_domain_page(l2t_base);
+        }
+
+        size -= HVM_L3_PAGE_RANGE;
+        src_start += HVM_L3_PAGE_RANGE;
+        dst_start += HVM_L3_PAGE_RANGE;
+    }
+
+    return 0;
+}
+
+int hvm_deprivileged_copy_l2(struct domain *d,
+                         l2_pgentry_t *l2t_base,
+                         unsigned long int src_start,
+                         unsigned long int dst_start,
+                         unsigned long int size,
+                         unsigned int l1_flags)
+{
+    struct page_info *l1t_pg; /* the destination page */
+    l1_pgentry_t *l1t_base;
+    unsigned long int l2t_idx_dst_start;
+    unsigned long int l2t_idx_dst_end;
+    unsigned int i;
+    unsigned long int flags = _PAGE_USER | _PAGE_PRESENT;
+
+    /* Leaf page needs RW? */
+    if( l1_flags & _PAGE_RW )
+        flags |= _PAGE_RW;
+
+    /* Calculate where in the destination we need pages */
+    l2t_idx_dst_start = l2_table_offset(dst_start);
+    l2t_idx_dst_end   = l2_table_offset(dst_start + size) +
+        !!( ((dst_start + size) & ((1ul << L2_PAGETABLE_SHIFT) - 1)) %
+             HVM_L2_PAGE_RANGE );
+
+    for( i = l2t_idx_dst_start; i < l2t_idx_dst_end; i++ )
+    {
+        ASSERT( size >= 0 );
+
+        /* Is this an empty entry? */
+        if( !(l2e_get_intpte(l2t_base[i])) )
+        {
+            /* Allocate a new L1 table */
+            if( (l1t_pg = hvm_alloc_page(d)) == NULL )
+                return HVM_ERR_PG_ALLOC;
+
+            l1t_base = map_domain_page(_mfn(page_to_mfn(l1t_pg)));
+
+            /* Add the page into the L2 table */
+            l2t_base[i] = l2e_from_page(l1t_pg, flags);
+
+            hvm_deprivileged_copy_l1(d, l1t_base, src_start, dst_start,
+                        (size > HVM_L2_PAGE_RANGE) ? HVM_L2_PAGE_RANGE : size,
+                        l1_flags );
+
+            unmap_domain_page(l1t_base);
+        }
+        else
+        {
+            /* If there is already page information then the page has been
+             * prepared by something else
+             *
+             * If the PSE bit is set, then we can't recurse as this is
+             * a leaf page so we fail.
+             */
+            if( (l2e_get_flags(l2t_base[i]) & _PAGE_PSE) )
+            {
+                panic("HVM: Leaf page is already mapped\n");
+            }
+
+            /* We can try recursing on this and see if where we want to put our
+             * new pages is empty.
+             *
+             * We do need to flip this to be a user mode page though so that
+             * the usermode children can be accessed. This is fine as long as
+             * we preserve the access bits of any supervisor entries that are
+             * used in the leaf case.
+             */
+
+            l1t_base = map_l1t_from_l2e(l2t_base[i]);
+
+            hvm_deprivileged_copy_l1(d, l1t_base, src_start, dst_start,
+                        (size > HVM_L2_PAGE_RANGE) ? HVM_L2_PAGE_RANGE : size,
+                        l1_flags);
+
+            l2t_base[i] = l2e_from_intpte(l2e_get_intpte(l2t_base[i]) | flags);
+
+            unmap_domain_page(l1t_base);
+        }
+
+        size -= HVM_L2_PAGE_RANGE;
+        src_start += HVM_L2_PAGE_RANGE;
+        dst_start += HVM_L2_PAGE_RANGE;
+    }
+    return 0;
+}
+
+int hvm_deprivileged_copy_l1(struct domain *d,
+                         l1_pgentry_t *l1t_base,
+                         unsigned long int src_start,
+                         unsigned long int dst_start,
+                         unsigned long int size,
+                         unsigned int l1_flags)
+{
+    struct page_info *dst_pg; /* the destination page */
+    unsigned long int l1t_idx_dst_start;
+    unsigned long int l1t_idx_dst_end;
+    unsigned long int flags = _PAGE_USER | _PAGE_GLOBAL | _PAGE_PRESENT;
+    unsigned int i;
+    char *src_data;
+    char *dst_data; /* Pointer for writing into the page */
+
+    /* Calculate where in the destination we need pages */
+    l1t_idx_dst_start = l1_table_offset(dst_start);
+    l1t_idx_dst_end   = l1_table_offset(dst_start + size) +
+        !!( ((dst_start + size) & ((1ul << L1_PAGETABLE_SHIFT) - 1)) %
+             HVM_L1_PAGE_RANGE );
+
+    for( i = l1t_idx_dst_start; i < l1t_idx_dst_end; i++ )
+    {
+        ASSERT( size >= 0 );
+
+        /* Is this an empty entry? */
+        if( !(l1e_get_intpte(l1t_base[i])) )
+        {
+            /* Create a new 4KB page */
+            if( (dst_pg = hvm_alloc_page(d)) == NULL )
+                return HVM_ERR_PG_ALLOC;
+
+            /* Map in src and dst, perform the copy then add it to the
+             * L1 table
+             */
+            dst_data = map_domain_page(_mfn(page_to_mfn(dst_pg)));
+            src_data = map_domain_page(_mfn(virt_to_mfn(src_start)));
+            ASSERT( dst_data != NULL && src_data != NULL );
+
+            memcpy(dst_data, src_data,
+                   (size > HVM_4KB_PAGESIZE) ? HVM_4KB_PAGESIZE : size);
+
+            unmap_domain_page(src_data);
+            unmap_domain_page(dst_data);
+
+            l1t_base[i] = l1e_from_page(dst_pg, flags | l1_flags);
+
+            size -= HVM_4KB_PAGESIZE;
+            src_start += HVM_4KB_PAGESIZE;
+            dst_start += HVM_4KB_PAGESIZE;
+        }
+        else
+        {
+            /* If there is already page information then the page has been
+             * prepared by something else, and we can't overwrite it
+             * as this is the leaf case.
+             */
+            panic("HVM: Region already mapped: %lx\n",
+                  l1e_get_intpte(l1t_base[i]));
+        }
+    }
+    return 0;
+}
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 593cba7..abc5113 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -40,7 +40,8 @@
 #include <asm/domain.h>
 #include <xen/numa.h>
 #include <asm/hvm/nestedhvm.h>
-
+#include <xen/hvm/deprivileged.h>
+#include <asm/hvm/vmx/vmx.h>
 #include "private.h"
 
 /* Override macros from asm/page.h to make them work with mfn_t */
@@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
            &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
            ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
 
+    /* Initialise the HVM deprivileged mode feature */
+    hvm_deprivileged_init(d, l4e);
+
     /* Install the per-domain mappings for this domain */
     l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
         l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
@@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
 
     /* Put the memory back in the pool */
     hap_free(d, mmfn);
+
+    /* Destroy the HVM tables */
+    ASSERT(paging_locked_by_me(d));
+    hvm_deprivileged_destroy(d);
 }
 
 /************************************************/
diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
index 6553cff..0bfe0cf 100644
--- a/xen/arch/x86/xen.lds.S
+++ b/xen/arch/x86/xen.lds.S
@@ -50,6 +50,25 @@ SECTIONS
        _etext = .;             /* End of text section */
   } :text = 0x9090
 
+  /* HVM deprivileged mode segments
+   * Used to map the ring3 static data and .text
+   */
+
+  . = ALIGN(PAGE_SIZE);
+  .hvm_deprivileged_text : {
+      __hvm_deprivileged_text_start = . ;
+      *(.hvm_deprivileged_enhancement.text)
+      __hvm_deprivileged_text_end = . ;
+  } : text
+
+  . = ALIGN(PAGE_SIZE);
+  .hvm_deprivileged_data : {
+      __hvm_deprivileged_data_start = . ;
+      *(.hvm_deprivileged_enhancement.data)
+      __hvm_deprivileged_data_end = . ;
+  } : text
+
+  . = ALIGN(PAGE_SIZE);
   .rodata : {
        /* Bug frames table */
        . = ALIGN(4);
diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
index 3e9be83..302399d 100644
--- a/xen/include/asm-x86/config.h
+++ b/xen/include/asm-x86/config.h
@@ -186,7 +186,7 @@ extern unsigned char boot_edid_info[128];
  *  0xffff880000000000 - 0xffffff7fffffffff [119.5TB,           PML4:272-510]
  *    HVM/idle: continuation of 1:1 mapping
  *  0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes  PML4:511]
- *    HVM/idle: unused
+ *    HVM: HVM deprivileged mode 
  *
  * Compatibility guest area layout:
  *  0x0000000000000000 - 0x00000000f57fffff [3928MB,            PML4:0]
@@ -280,6 +280,14 @@ extern unsigned char boot_edid_info[128];
 #endif
 #define DIRECTMAP_VIRT_END      (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
 
+/* Slot 511: HVM deprivileged mode
+ * The virtual addresses where the .text, .data and stack should be
+ * placed.
+ */
+#define HVM_DEPRIVILEGED_TEXT_ADDR  (PML4_ADDR(511))
+#define HVM_DEPRIVILEGED_DATA_ADDR  (PML4_ADDR(511) + 0xa000000)
+#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
+
 #ifndef __ASSEMBLY__
 
 /* This is not a fixed value, just a lower limit. */
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
new file mode 100644
index 0000000..6cc803e
--- /dev/null
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -0,0 +1,94 @@
+#ifndef __X86_HVM_DEPRIVILEGED
+#define __X86_HVM_DEPRIVILEGED
+
+/* This is also included in the HVM deprivileged mode .S file */
+#ifndef __ASSEMBLY__
+
+#include <asm/page.h>
+#include <xen/lib.h>
+#include <xen/mm.h>
+#include <xen/domain_page.h>
+#include <xen/config.h>
+#include <xen/types.h>
+#include <xen/sched.h>
+#include <asm/paging.h>
+#include <asm/hap.h>
+#include <asm/paging.h>
+#include <asm-x86/page.h>
+#include <public/domctl.h>
+#include <xen/domain_page.h>
+
+/* Initialise the HVM deprivileged mode. This just sets up the general
+ * page mappings for .text and .data. It does not prepare each HVM vcpu's data
+ * or stack which needs to be done separately using
+ * hvm_deprivileged_prepare_vcpu.
+ */
+void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e);
+
+/* Free up the data used by the HVM deprivileged enhancements.
+ * This frees general page mappings. It does not destroy the per-vcpu
+ * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
+ * This method should be called after those per-vcpu destruction routines.
+ */
+void hvm_deprivileged_destroy(struct domain *d);
+
+/* Used to create a copy of the data at the specified virtual address.
+ * When writing to the destination, it will walk the page table hierarchy,
+ * creating new levels as needed, and then copy the data across.
+ */
+int hvm_deprivileged_copy_pages(struct domain *d,
+                                l4_pgentry_t *l4e_base,
+                                unsigned long int src_start,
+                                unsigned long int dst_start,
+                                unsigned long int size,
+                                unsigned int l1_flags);
+
+int hvm_deprivileged_copy_l3(struct domain *d,
+                             l3_pgentry_t *l1t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+int hvm_deprivileged_copy_l2(struct domain *d,
+                             l2_pgentry_t *l1t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+/* The leaf case of the copy. Will allocate the pages and actually copy the
+ * data.
+ */
+int hvm_deprivileged_copy_l1(struct domain *d,
+                             l1_pgentry_t *l1t_base,
+                             unsigned long int src_start,
+                             unsigned long int dst_start,
+                             unsigned long int size,
+                             unsigned int l1_flags);
+
+
+/* The segments where the user mode .text and .data are stored */
+extern unsigned long int __hvm_deprivileged_text_start;
+extern unsigned long int __hvm_deprivileged_text_end;
+extern unsigned long int __hvm_deprivileged_data_start;
+extern unsigned long int __hvm_deprivileged_data_end;
+
+#endif
+
+/* The sizes of the pages */
+#define HVM_1GB_PAGESIZE 0x40000000ul
+#define HVM_2MB_PAGESIZE 0x00200000ul
+#define HVM_4KB_PAGESIZE 0x00001000ul
+
+/* The size in bytes that a single L{1,2,3,4} entry covers.
+ * There are 512 (left shift by 9) entries in each page-structure.
+ */
+#define HVM_L4_PAGE_RANGE (HVM_1GB_PAGESIZE << 9)
+#define HVM_L3_PAGE_RANGE (HVM_2MB_PAGESIZE << 9)
+#define HVM_L2_PAGE_RANGE (HVM_4KB_PAGESIZE << 9)
+#define HVM_L1_PAGE_RANGE (HVM_4KB_PAGESIZE     )
+
+#define HVM_ERR_PG_ALLOC -1
+
+#endif
-- 
2.1.4


* [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
From: Ben Catterall @ 2015-08-06 16:45 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
	Ben Catterall

The process to switch into and out of deprivileged mode can be likened to
setjmp/longjmp.

To enter deprivileged mode, we take a copy of the stack, from the guest's
saved register frame up to the current stack pointer. This allows us to
restore the stack when we have finished the deprivileged mode operation,
meaning we can continue execution from that point. This is similar to what
happens on a context switch.

To exit deprivileged mode, we copy the stack back, replacing the current stack.
We can then continue execution from where we left off, which will unwind the
stack and free up resources. This method means that we do not need to
change any other code paths and its invocation will be transparent to callers.
This should allow the feature to be more easily deployed to different parts
of Xen.

Note that this copy of the stack is per-vcpu but it will contain per-pcpu data.
Extra work is needed to properly migrate vcpus between pcpus.

The switch to and from deprivileged mode is performed using sysret and syscall
respectively.

The return paths in entry.S have been edited so that, when we receive an
interrupt whilst in deprivileged mode, we return into that mode correctly.

A hook on the syscall handler in entry.S has also been added which handles
returning from user mode and will support deprivileged mode system calls when
these are needed.
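
As an aside, the control flow described above is loosely analogous to this
user-space sketch (plain setjmp/longjmp, purely illustrative; the real switch
uses sysret/syscall plus the stack copy described above):

    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf resume_ctx;

    static void deprivileged_op(void)
    {
        puts("in 'deprivileged' mode");  /* stands in for the ring 3 payload */
        longjmp(resume_ctx, 1);          /* stands in for the syscall return */
    }

    int main(void)
    {
        if ( setjmp(resume_ctx) == 0 )   /* stands in for saving the stack */
            deprivileged_op();           /* stands in for sysret to ring 3 */
        else
            puts("resumed on the original call path"); /* stack restored */
        return 0;
    }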

Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
 xen/arch/x86/domain.c               |  12 +++
 xen/arch/x86/hvm/Makefile           |   1 +
 xen/arch/x86/hvm/deprivileged.c     | 103 ++++++++++++++++++
 xen/arch/x86/hvm/deprivileged_asm.S | 205 ++++++++++++++++++++++++++++++++++++
 xen/arch/x86/hvm/vmx/vmx.c          |   7 ++
 xen/arch/x86/x86_64/asm-offsets.c   |   5 +
 xen/arch/x86/x86_64/entry.S         |  35 ++++++
 xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
 xen/include/xen/hvm/deprivileged.h  |  38 +++++++
 xen/include/xen/sched.h             |  18 +++-
 10 files changed, 425 insertions(+), 1 deletion(-)
 create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 045f6ff..a0e5e70 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -62,6 +62,7 @@
 #include <xen/iommu.h>
 #include <compat/vcpu.h>
 #include <asm/psr.h>
+#include <xen/hvm/deprivileged.h>
 
 DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
 DEFINE_PER_CPU(unsigned long, cr4);
@@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
     if ( has_hvm_container_domain(d) )
     {
         rc = hvm_vcpu_initialise(v);
+
+        /* Initialise HVM deprivileged mode */
+        printk("HVM initialising deprivileged mode ...");
+        hvm_deprivileged_prepare_vcpu(v);
+        printk("Done.\n");
+
         goto done;
     }
 
@@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
     vcpu_destroy_fpu(v);
 
     if ( has_hvm_container_vcpu(v) )
+    {
+        /* Destroy the deprivileged mode on this vcpu */
+        hvm_deprivileged_destroy_vcpu(v);
+
         hvm_vcpu_destroy(v);
+    }
     else
         xfree(v->arch.pv_vcpu.trap_ctxt);
 }
diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
index bd83ba3..6819886 100644
--- a/xen/arch/x86/hvm/Makefile
+++ b/xen/arch/x86/hvm/Makefile
@@ -17,6 +17,7 @@ obj-y += quirks.o
 obj-y += rtc.o
 obj-y += save.o
 obj-y += deprivileged.o
+obj-y += deprivileged_asm.o
 obj-y += stdvga.o
 obj-y += vioapic.o
 obj-y += viridian.o
diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
index 071d900..979fc69 100644
--- a/xen/arch/x86/hvm/deprivileged.c
+++ b/xen/arch/x86/hvm/deprivileged.c
@@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
     }
     return 0;
 }
+
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu.
+ */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
+{
+    struct page_info *pg;
+
+    /* TODO: clarify if this MEMF is correct */
+    /* Allocate 2^STACK_ORDER contiguous pages */
+    pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
+    if( pg == NULL )
+    {
+        panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
+        return -ENOMEM;
+    }
+
+    vcpu->stack = page_to_virt(pg);
+    vcpu->rsp = 0;
+    vcpu->user_mode = 0;
+
+    return 0;
+}
+
+/* Called on destroying each vcpu */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
+{
+    free_domheap_pages(virt_to_page(vcpu->stack), STACK_ORDER);
+}
+
+/* Called to perform a user mode operation.
+ * Execution context is saved and then we move into user mode.
+ * This method is then jumped into to restore execution context after
+ * exiting user mode.
+ */
+void hvm_deprivileged_user_mode(void)
+{
+    struct vcpu *vcpu = get_current();
+    unsigned long int efer = read_efer();
+    register unsigned long sp asm("rsp");
+
+    ASSERT( vcpu->user_mode == 0 );
+    ASSERT( vcpu->stack != 0 );
+    ASSERT( vcpu->rsp == 0 );
+
+    /* Flip the SCE bit to allow sysret/call */
+    write_efer(efer | EFER_SCE);
+
+    /* Save the msr lstar and star. Xen does lazy loading of these
+     * so we need to put the host values in and then restore the
+     * guest ones once we're done.
+     */
+    rdmsrl(MSR_LSTAR, vcpu->msr_lstar);
+    rdmsrl(MSR_STAR, vcpu->msr_star);
+    wrmsrl(MSR_LSTAR,get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR]);
+    wrmsrl(MSR_STAR, get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR]);
+
+    /* The assembly routine to handle moving into/out of deprivileged mode */
+    hvm_deprivileged_user_mode_asm();
+
+    /* If our copy failed */
+    if( unlikely(vcpu->rsp == 0) )
+    {
+        gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __FUNCTION__);
+        domain_crash_synchronous();
+    }
+
+    /* Debug info */
+    vcpu->old_msr_lstar = get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR];
+    vcpu->old_msr_star = get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR];
+    vcpu->old_rsp = sp;
+    vcpu->old_processor = smp_processor_id();
+
+    /* Restore the efer and saved msr registers */
+    write_efer(efer);
+    wrmsrl(MSR_LSTAR, vcpu->msr_lstar);
+    wrmsrl(MSR_STAR, vcpu->msr_star);
+    vcpu->user_mode = 0;
+    vcpu->rsp = 0;
+}
+
+/* Called when the user mode operation has completed
+ * Perform C-level processing on the return path
+ */
+void hvm_deprivileged_finish_user_mode(void)
+{
+    /* If we are not returning from user mode: bail */
+    ASSERT(get_current()->user_mode == 1);
+
+    hvm_deprivileged_finish_user_mode_asm();
+}
+
+void hvm_deprivileged_check_trap(const char* func_name)
+{
+    if( current->user_mode == 1 )
+    {
+        printk("HVM Deprivileged Mode: Trap whilst in user mode, %s\n",
+               func_name);
+        domain_crash_synchronous();
+    }
+}
+
+
+
diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
new file mode 100644
index 0000000..00a9e9c
--- /dev/null
+++ b/xen/arch/x86/hvm/deprivileged_asm.S
@@ -0,0 +1,205 @@
+/*
+ * HVM security enhancements assembly code
+ */
+#include <xen/config.h>
+#include <xen/errno.h>
+#include <xen/softirq.h>
+#include <asm/asm_defns.h>
+#include <asm/apicdef.h>
+#include <asm/page.h>
+#include <public/xen.h>
+#include <irq_vectors.h>
+#include <xen/hvm/deprivileged.h>
+
+/* Handles entry into the deprivileged mode and returning from this
+ * mode. This requires copying the current Xen privileged stack across
+ * to a per-vcpu buffer as we need to be able to handle interrupts and
+ * exceptions whilst in this mode. Xen is non-preemptable so our
+ * privileged mode stack would be clobbered if we did not save it.
+ *
+ * If we are entering deprivileged mode, then we use a sysret to get there.
+ * If we are returning from deprivileged mode, then we need to unwind the stack
+ * so we copy it back over the current stack so that we can return from the
+ * call path where we came in from.
+ *
+ * We're doing sort-of a long jump/set jump with copying to a stack to
+ * preserve it and allow returning code to continue executing from
+ * within this method.
+ */
+ENTRY(hvm_deprivileged_user_mode_asm)
+        /* Save our registers */
+        push   %rax
+        push   %rbx
+        push   %rcx
+        push   %rdx
+        push   %rsi
+        push   %rdi
+        push   %rbp
+        push   %r8
+        push   %r9
+        push   %r10
+        push   %r11
+        push   %r12
+        push   %r13
+        push   %r14
+        push   %r15
+        pushfq
+
+        /* Perform a near call to push rip onto the stack */
+        call   1f
+
+        /* Magic: Add to the stored rip the size of the code between
+         * label 1 and label 2. This allows us to restart execution at label 2.
+         */
+1:      addq   $2f-1b, (%rsp)
+
+        GET_CURRENT(%r8)
+        xor    %rsi, %rsi
+
+        /* The following is equivalent to
+         * (get_cpu_info() + sizeof(struct cpu_info))
+         * This gets us to the top of the stack.
+         */
+        GET_STACK_BASE(%rcx)
+        addq   $STACK_SIZE, %rcx
+
+        movq   VCPU_stack(%r8), %rdi
+
+        /* We need to copy the current stack across to our buffer
+         * Calculate the number of bytes to copy:
+         * (top of stack - current stack pointer)
+         * NOTE: We must not push any more data onto our stack after this point
+         * as it won't be saved.
+         */
+        sub    %rsp, %rcx
+
+        /* If the stack is too big, we don't do the copy: handled by caller. */
+        cmpq   $STACK_SIZE, %rcx
+        ja     3f
+
+        mov    %rsp, %rsi
+/* USER MODE ENTRY POINT */
+2:
+        /* More magic: If we came here from preparing to go into user mode,
+         * then we copy our current stack to the buffer (the lines above
+         * have setup rsi, rdi and rcx to do this).
+         *
+         * If we came here from user mode, then we movsb to copy from
+         * our buffer into our current stack so that we can continue
+         * execution from the current code point, and return back to the guest
+         * via the path we came in. rsi, rdi and rcx have been setup by the
+         * de-privileged return path for this.
+         */
+        rep    movsb
+        mov    %rsp, %rsi
+
+        GET_CURRENT(%r8)
+        movq   VCPU_user_mode(%r8), %rdx
+        movq   VCPU_rsp(%r8), %rax
+
+        /* If !user_mode  */
+        cmpq   $0, %rdx
+        jne    3f
+        cli
+
+        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
+
+        movq   $0x10200, %r11          /* RFLAGS user mode enable interrupts */
+        movq   $1, VCPU_user_mode(%r8) /* Now in user mode */
+        movq   %rsi, VCPU_rsp(%r8)     /* The rsp to restore to */
+
+        /* Stack ptr is set by user mode to avoid race conditions.
+         * See Intel manual 2 on the sysret instruction.
+         */
+        movq   $HVM_STACK_PTR, %rbx
+        sysretq                         /* Enter deprivileged mode */
+
+3:      GET_CURRENT(%r8)
+        movq   %rsi, VCPU_rsp(%r8)
+        pop    %rax    /* Pop off rip: used in a jump so still on stack */
+
+        /* Restore registers */
+        popfq
+        pop    %r15
+        pop    %r14
+        pop    %r13
+        pop    %r12
+        pop    %r11
+        pop    %r10
+        pop    %r9
+        pop    %r8
+        pop    %rbp
+        pop    %rdi
+        pop    %rsi
+        pop    %rdx
+        pop    %rcx
+        pop    %rbx
+        pop    %rax
+        ret
+
+/* Finished in user mode so return */
+ENTRY(hvm_deprivileged_finish_user_mode_asm)
+        /* The source is the copied stack in our buffer.
+         * The destination is our current stack.
+         *
+         * We need to:
+         * - Move the stack pointer to where it was before we entered
+         *   deprivileged mode.
+         * - Setup rsi, rdi and rcx as needed to perform the copy
+         * - Jump to the address held at the top of the stack which
+         *   is the user mode return address
+         */
+        cli
+        GET_CURRENT(%rbx)
+        movq   VCPU_stack(%rbx), %rsi
+        movq   VCPU_rsp(%rbx), %rdi
+
+        /* The return address that the near call pushed onto the
+         * buffer is pointed to by stack, so use that for rip.
+         */
+        movq   %rdi, %rsp
+
+        /* The following is equivalent to
+         * (get_cpu_info() + sizeof(struct cpu_info) - vcpu->rsp)
+         * This works out how many bytes we need to copy:
+         * (top of stack - bottom of stack)
+         */
+        GET_STACK_BASE(%rcx)
+        addq   $STACK_SIZE, %rcx
+        subq   %rdi, %rcx
+
+        /* Go to user mode return code */
+        jmp    *(%rsi)
+
+/* Entry point from the assembly syscall handlers */
+ENTRY(hvm_deprivileged_handle_user_mode)
+
+        /* Handle a user mode hypercall here */
+
+
+        /* We are finished in user mode */
+        call hvm_deprivileged_finish_user_mode
+
+        ret
+
+.section .hvm_deprivileged_enhancement.text,"ax"
+/* HVM deprivileged code */
+ENTRY(hvm_deprivileged_ring3)
+        /* sysret has loaded eip from rcx and rflags from r11.
+         * CS and SS have been loaded from the MSR for ring 3.
+         * We now need to switch to the user mode stack
+         */
+        /* Setup usermode stack */
+        movabs $HVM_STACK_PTR, %rsp
+
+        /* Perform user mode processing */
+
+        mov $0xf, %rcx
+1: dec  %rcx
+        cmp $0, %rcx
+        jne 1b
+
+        /* Return to ring 0 */
+        syscall
+
+.previous
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index c32d863..595b0f2 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -59,6 +59,8 @@
 #include <asm/event.h>
 #include <asm/monitor.h>
 #include <public/arch-x86/cpuid.h>
+#include <xen/hvm/deprivileged.h>
+
 
 static bool_t __initdata opt_force_ept;
 boolean_param("force-ept", opt_force_ept);
@@ -194,6 +196,10 @@ void vmx_save_host_msrs(void)
         set_bit(VMX_INDEX_MSR_ ## address, &host_msr_state->flags);     \
     } while ( 0 )
 
+struct vmx_msr_state *get_host_msr_state(void) {
+    return &this_cpu(host_msr_state);
+}
+
 static enum handler_return
 long_mode_do_msr_read(unsigned int msr, uint64_t *msr_content)
 {
@@ -272,6 +278,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
     case MSR_LSTAR:
         if ( !is_canonical_address(msr_content) )
             goto uncanonical_address;
+
         WRITE_MSR(LSTAR);
         break;
 
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 447c650..fd5de44 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -115,6 +115,11 @@ void __dummy__(void)
     OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
     BLANK();
 
+    OFFSET(VCPU_stack, struct vcpu, stack);
+    OFFSET(VCPU_rsp, struct vcpu, rsp);
+    OFFSET(VCPU_user_mode, struct vcpu, user_mode);
+    BLANK();
+
     OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 74677a2..fa9155c 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -102,6 +102,15 @@ restore_all_xen:
         RESTORE_ALL adj=8
         iretq
 
+/* Returning from user mode */
+handle_hvm_user_mode:
+
+        call hvm_deprivileged_handle_user_mode
+
+        /* Go back into user mode */
+        cli
+        jmp  restore_all_guest
+
 /*
  * When entering SYSCALL from kernel mode:
  *  %rax                            = hypercall vector
@@ -131,6 +140,14 @@ ENTRY(lstar_enter)
         testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
         jz    switch_to_kernel
 
+        /* Were we in Xen's ring 3?  */
+        push %rbx
+        GET_CURRENT(%rbx)
+        movq VCPU_user_mode(%rbx), %rbx
+        cmp  $1, %rbx
+        je   handle_hvm_user_mode
+        pop  %rbx
+
 /*hypercall:*/
         movq  %r10,%rcx
         cmpq  $NR_hypercalls,%rax
@@ -487,6 +504,13 @@ ENTRY(common_interrupt)
 /* No special register assumptions. */
 ENTRY(ret_from_intr)
         GET_CURRENT(%rbx)
+
+        /* If we are in Xen's user mode, return into it */
+        cmpq $1,VCPU_user_mode(%rbx)
+        cli
+        je    restore_all_guest
+        sti
+
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         movq  VCPU_domain(%rbx),%rax
@@ -509,6 +533,14 @@ handle_exception_saved:
         GET_CURRENT(%rbx)
         PERFC_INCR(exceptions, %rax, %rbx)
         callq *(%rdx,%rax,8)
+
+        /* If we are in Xen's user mode, return into it */
+        /* TODO: Test this path */
+        cmpq  $1,VCPU_user_mode(%rbx)
+        cli
+        je    restore_all_guest
+        sti
+
         testb $3,UREGS_cs(%rsp)
         jz    restore_all_xen
         leaq  VCPU_trap_bounce(%rbx),%rdx
@@ -664,6 +696,9 @@ handle_ist_exception:
         movl  $EVENT_CHECK_VECTOR,%edi
         call  send_IPI_self
 1:      movq  VCPU_domain(%rbx),%rax
+        /* This also handles Xen ring3 return for us.
+         * So, there is no need to explicitly do a user mode check.
+         */
         cmpb  $0,DOMAIN_is_32bit_pv(%rax)
         je    restore_all_guest
         jmp   compat_restore_all_guest
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 3fbfa44..98e269e 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -565,4 +565,6 @@ typedef struct {
     u16 eptp_index;
 } ve_info_t;
 
+struct vmx_msr_state *get_host_msr_state(void);
+
 #endif /* __ASM_X86_HVM_VMX_VMX_H__ */
diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
index 6cc803e..e42f39a 100644
--- a/xen/include/xen/hvm/deprivileged.h
+++ b/xen/include/xen/hvm/deprivileged.h
@@ -68,6 +68,37 @@ int hvm_deprivileged_copy_l1(struct domain *d,
                              unsigned int l1_flags);
 
 
+/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
+int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
+
+/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
+void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
+
+/* Called to perform a user mode operation. */
+void hvm_deprivileged_user_mode(void);
+
+/* Called when the user mode operation has completed */
+void hvm_deprivileged_finish_user_mode(void);
+
+/* Called to move into and then out of user mode. Needed for accessing
+ * assembly features.
+ */
+void hvm_deprivileged_user_mode_asm(void);
+
+/* Called on the return path to return to the correct execution point */
+void hvm_deprivileged_finish_user_mode_asm(void);
+
+/* Handle any syscalls that the user mode makes */
+void hvm_deprivileged_handle_user_mode(void);
+
+/* The ring 3 code */
+void hvm_deprivileged_ring3(void);
+
+/* Call when inside a trap that should cause a domain crash if in user mode
+ * e.g. an invalid_op is trapped whilst in user mode.
+ */
+void hvm_deprivileged_check_trap(const char* func_name);
+
 /* The segments where the user mode .text and .data are stored */
 extern unsigned long int __hvm_deprivileged_text_start;
 extern unsigned long int __hvm_deprivileged_text_end;
@@ -91,4 +122,11 @@ extern unsigned long int __hvm_deprivileged_data_end;
 
 #define HVM_ERR_PG_ALLOC -1
 
+/* The user mode stack pointer.
+ * The stack grows down, so set this to the top of the stack region. Then,
+ * as this is 0-indexed, move into the stack, not just after it.
+ * Subtract 16 bytes for correct stack alignment.
+ */
+#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
+
 #endif
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 73d3bc8..180643e 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
 
 struct waitqueue_vcpu;
 
-struct vcpu 
+struct vcpu
 {
     int              vcpu_id;
 
@@ -158,6 +158,22 @@ struct vcpu
 
     void            *sched_priv;    /* scheduler-specific data */
 
+    /* HVM deprivileged mode state */
+    void *stack;             /* Location of stack to save data onto */
+    unsigned long rsp;       /* rsp of our stack to restore our data to */
+    unsigned long user_mode; /* Are we (possibly moving into) in user mode? */
+
+    /* The MSR_LSTAR and MSR_STAR of the processor we are executing on.
+     * We need to save these because Xen does lazy loading of them.
+     */
+    unsigned long int msr_lstar; /* lstar */
+    unsigned long int msr_star;
+
+    /* Debug info */
+    unsigned long int old_rsp;
+    unsigned long int old_processor;
+    unsigned long int old_msr_lstar;
+    unsigned long int old_msr_star;
     struct vcpu_runstate_info runstate;
 #ifndef CONFIG_COMPAT
 # define runstate_guest(v) ((v)->runstate_guest)
-- 
2.1.4


* [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
From: Ben Catterall @ 2015-08-06 16:45 UTC (permalink / raw)
  To: xen-devel
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, jbeulich,
	Ben Catterall

Added trap handlers to catch exceptions such as page faults, general
protection faults, etc. These handlers crash the domain, as such exceptions
would indicate that either there is a bug in the deprivileged mode code or it
has been compromised by an attacker.

Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
---
 xen/arch/x86/mm/hap/hap.c |  9 +++++++++
 xen/arch/x86/traps.c      | 41 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index abc5113..43bde89 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v, unsigned long va,
 {
     struct domain *d = v->domain;
 
+    /* If we get a page fault whilst in HVM security user mode */
+    if( v->user_mode == 1 )
+    {
+        printk("HVM: #PF (%u:%u) whilst in user mode\n",
+                 d->domain_id, v->vcpu_id);
+        domain_crash_synchronous();
+    }
+
     HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
               d->domain_id, v->vcpu_id);
+
     domain_crash(d);
     return 0;
 }
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 9f5a6c6..19d465f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -74,6 +74,7 @@
 #include <asm/vpmu.h>
 #include <public/arch-x86/cpuid.h>
 #include <xsm/xsm.h>
+#include <xen/hvm/deprivileged.h>
 
 /*
  * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
@@ -500,6 +501,11 @@ static void do_guest_trap(
     struct trap_bounce *tb;
     const struct trap_info *ti;
 
+    /* If we take the trap whilst in HVM deprivileged mode
+     * then we should crash the domain.
+     */
+    hvm_deprivileged_check_trap(__FUNCTION__);
+
     trace_pv_trap(trapnr, regs->eip, use_error_code, regs->error_code);
 
     tb = &v->arch.pv_vcpu.trap_bounce;
@@ -619,6 +625,11 @@ static void do_trap(struct cpu_user_regs *regs, int use_error_code)
 
     if ( guest_mode(regs) )
     {
+        /* If we take the trap whilst in HVM deprivileged mode
+         * then we should crash the domain.
+         */
+        hvm_deprivileged_check_trap(__FUNCTION__);
+
         do_guest_trap(trapnr, regs, use_error_code);
         return;
     }
@@ -1072,6 +1083,11 @@ void do_invalid_op(struct cpu_user_regs *regs)
 
     if ( likely(guest_mode(regs)) )
     {
+        /* If we take the trap whilst in HVM deprivileged mode
+         * then we should crash the domain.
+         */
+        hvm_deprivileged_check_trap(__FUNCTION__);
+
         if ( !emulate_invalid_rdtscp(regs) &&
              !emulate_forced_invalid_op(regs) )
             do_guest_trap(TRAP_invalid_op, regs, 0);
@@ -1163,7 +1179,12 @@ void do_int3(struct cpu_user_regs *regs)
     {
         debugger_trap_fatal(TRAP_int3, regs);
         return;
-    } 
+    }
+
+    /* If we take the trap whilst in HVM deprivileged mode
+     * then we should crash the domain.
+     */
+    hvm_deprivileged_check_trap(__FUNCTION__);
 
     do_guest_trap(TRAP_int3, regs, 0);
 }
@@ -3231,6 +3252,11 @@ void do_general_protection(struct cpu_user_regs *regs)
     if ( !guest_mode(regs) )
         goto gp_in_kernel;
 
+    /* If we take the trap whilst in HVM deprivileged mode
+     * then we should crash the domain.
+     */
+    hvm_deprivileged_check_trap(__FUNCTION__);
+
     /*
      * Cunning trick to allow arbitrary "INT n" handling.
      * 
@@ -3490,6 +3516,11 @@ void do_device_not_available(struct cpu_user_regs *regs)
 
     BUG_ON(!guest_mode(regs));
 
+    /* If we take the trap whilst in HVM deprivileged mode
+     * then we should crash the domain.
+     */
+    hvm_deprivileged_check_trap(__FUNCTION__);
+
     vcpu_restore_fpu_lazy(curr);
 
     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
@@ -3531,6 +3562,14 @@ void do_debug(struct cpu_user_regs *regs)
 
     DEBUGGER_trap_entry(TRAP_debug, regs);
 
+    if( guest_mode(regs) )
+    {
+        /* If we take the trap whilst in HVM deprivileged mode
+         * then we should crash the domain.
+         */
+        hvm_deprivileged_check_trap(__FUNCTION__);
+    }
+
     if ( !guest_mode(regs) )
     {
         if ( regs->eflags & X86_EFLAGS_TF )
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-06 16:45 ` [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper Ben Catterall
@ 2015-08-06 19:22   ` Andrew Cooper
  2015-08-07  9:57     ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-06 19:22 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 06/08/15 17:45, Ben Catterall wrote:
> This allocation function is used by the deprivileged mode initialisation code
> to allocate pages for the new page table mappings and page frames on the HAP
> page heap.
>
> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>

This is fine for your test box, but isn't fine for systems out there
without hardware EPT/NPT support.  For older systems like that (or in
certain specific workloads), shadow paging is used instead.

This feature is applicable to any HVM domain, which means that it
shouldn't depend on HAP or shadow paging.

How much memory is allocated for the depriv area, and what exactly is
allocated in total?

I expect it isn't very much, and would suggest using
d->arch.paging.alloc_page() instead (the generic "get me some
memory accounted against the domain" helper), which looks as if it
should suffice.
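
A minimal sketch of that approach (assuming the paging-pool hooks take
the domain as their argument):

    /* Accounted against the domain's paging pool. */
    struct page_info *pg = d->arch.paging.alloc_page(d);

    if ( pg == NULL )
        return -ENOMEM;
    /* ... use the page ... */
    d->arch.paging.free_page(d, pg);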

Depending on exactly how much memory is needed, you might need to bump
the default minimum shadow pool size.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables
  2015-08-06 16:45 ` [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
@ 2015-08-06 19:52   ` Andrew Cooper
  2015-08-07 13:19     ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-06 19:52 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 06/08/15 17:45, Ben Catterall wrote:
> The paging structure mappings for the deprivileged mode are added
> to the monitor page table for HVM guests. The entries are generated by
> walking the page tables and mapping in new pages. If a higher-level page table
> mapping exists, we attempt to traverse all the way down to a leaf page and add
> the page there. Access bits are flipped as needed.
>
> The page entries are generated for deprivileged .text, .data and a stack. The
> data is copied from sections allocated by the linker. The mappings are setup in
> an unused portion of the Xen virtual address space. The pages are mapped in as
> user mode accessible, with NX bits set for the data and stack regions and the
> code region is set to be executable and read-only.
>
> The needed pages are allocated on the HAP page heap and are deallocated when
> those heap pages are deallocated (on domain destruction).
>
> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> ---
>  xen/arch/x86/hvm/Makefile          |   1 +
>  xen/arch/x86/hvm/deprivileged.c    | 441 +++++++++++++++++++++++++++++++++++++
>  xen/arch/x86/mm/hap/hap.c          |  10 +-
>  xen/arch/x86/xen.lds.S             |  19 ++
>  xen/include/asm-x86/config.h       |  10 +-
>  xen/include/xen/hvm/deprivileged.h |  94 ++++++++
>  6 files changed, 573 insertions(+), 2 deletions(-)
>  create mode 100644 xen/arch/x86/hvm/deprivileged.c
>  create mode 100644 xen/include/xen/hvm/deprivileged.h
>
> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
> index 794e793..bd83ba3 100644
> --- a/xen/arch/x86/hvm/Makefile
> +++ b/xen/arch/x86/hvm/Makefile
> @@ -16,6 +16,7 @@ obj-y += pmtimer.o
>  obj-y += quirks.o
>  obj-y += rtc.o
>  obj-y += save.o
> +obj-y += deprivileged.o

We attempt to keep objects linked in alphabetical order, where possible.

>  obj-y += stdvga.o
>  obj-y += vioapic.o
>  obj-y += viridian.o
> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
> new file mode 100644
> index 0000000..071d900
> --- /dev/null
> +++ b/xen/arch/x86/hvm/deprivileged.c
> @@ -0,0 +1,441 @@
> +/*
> + * HVM deprivileged mode to provide support for running operations in
> + * user mode from Xen
> + */
> +#include <xen/lib.h>
> +#include <xen/mm.h>
> +#include <xen/domain_page.h>
> +#include <xen/config.h>
> +#include <xen/types.h>
> +#include <xen/sched.h>
> +#include <asm/paging.h>
> +#include <xen/compiler.h>
> +#include <asm/hap.h>
> +#include <asm/paging.h>
> +#include <asm-x86/page.h>
> +#include <public/domctl.h>
> +#include <xen/domain_page.h>
> +#include <asm/hvm/vmx/vmx.h>
> +#include <xen/hvm/deprivileged.h>
> +
> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
> +{
> +    int ret;
> +    void *p;
> +    unsigned long int size;
> +
> +    /* Copy the .text segment for deprivileged code */

Why do you need to copy the deprivileged text section at all?  It is
read only and completely common.  All you should need to do is make
userspace mappings to it where it resides in the Xen linked area.

i.e. this is an operation which should be performed once at startup, not
once per domain.  You will also reduce your memory overhead by doing this.

> +    size = (unsigned long int)&__hvm_deprivileged_text_end -

Don't bother qualifying unsigned long with a further "int".  A plain
"unsigned long" is the expected declaration.

> +           (unsigned long int)&__hvm_deprivileged_text_start;
> +
> +    ret = hvm_deprivileged_copy_pages(d, l4e,
> +                              (unsigned long int)&__hvm_deprivileged_text_start,
> +                              (unsigned long int)HVM_DEPRIVILEGED_TEXT_ADDR,
> +                              size, 0 /* No write */);
> +
> +    if( ret )
> +    {
> +        printk("HVM: Error when initialising depriv .text. Code: %d", ret);
> +        domain_crash(d);
> +        return;
> +    }
> +
> +    /* Copy the .data segment for ring3 code */
> +    size = (unsigned long int)&__hvm_deprivileged_data_end -
> +           (unsigned long int)&__hvm_deprivileged_data_start;
> +
> +    ret = hvm_deprivileged_copy_pages(d, l4e,
> +                              (unsigned long int)&__hvm_deprivileged_data_start,
> +                              (unsigned long int)HVM_DEPRIVILEGED_DATA_ADDR,
> +                              size, _PAGE_NX | _PAGE_RW);

What do you expect to end up in a data section like this?

> +
> +    if( ret )
> +    {
> +        printk("HVM: Error when initialising depriv .data. Code: %d", ret);
> +        domain_crash(d);
> +        return;
> +    }
> +
> +    /* Setup the stack mappings.

Xen recommended style for multiline comments is

/*
 * Setup ...

> +     * This is a bit of a hack: by allocating a blank area
> +     * we can reuse the hvm_deprivileged_copy_pages code.
> +     * The stacks are actually allocated in the per-vcpu initialisation
> +     * routine.
> +     */
> +    size = STACK_SIZE;
> +
> +    p = alloc_xenheap_pages(STACK_ORDER, 0);
> +    ret = hvm_deprivileged_copy_pages(d, l4e,
> +                              (unsigned long int)p,
> +                              (unsigned long int)HVM_DEPRIVILEGED_STACK_ADDR,
> +                              size, _PAGE_NX | _PAGE_RW);
> +
> +    free_xenheap_pages(p, STACK_ORDER);

The userspace stacks really don't need to be this large.  Xen's primary
stack, which runs all of this code normally, is limited to 2 pages,
which is 1/4 of STACK_SIZE.

> +
> +    if( ret )
> +    {
> +        printk("HVM: Error when initialising depriv stack. Code: %d", ret);
> +        domain_crash(d);
> +        return;
> +    }
> +}
> +
> +void hvm_deprivileged_destroy(struct domain *d)
> +{
> +    struct page_info *pg = NULL;
> +
> +    /* Free up all the pages we allocated from the HAP HVM list */
> +    while( (pg = page_list_remove_head(&d->arch.paging.hap.hvmlist)) )
> +    {
> +        d->arch.paging.hap.free_pages++;
> +        page_list_add_tail(pg, &d->arch.paging.hap.freelist);
> +    }

See c/s 0174da5b7 for why loops like this are bad in general.  However,
if you use the shadow pool, this should be able to disappear.

> +}
> +
> +/* Create a copy of the data at the specified virtual address.
> + * When writing to the source, it will walk the page table hierarchy, creating
> + * new levels as needed.
> + *
> + * If we find a leaf entry in a page table (one which holds the
> + * mfn of a 4KB, 2MB, etc. page frame) which has already been
> + * mapped in, then we bail as we have a collision and this likely
> + * means a bug or the memory configuration has been changed.
> + *
> + * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
> + * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
> + * The PAGE_RW flag will be enabled for all page structure entries
> + * above the leaf page if that leaf page has PAGE_RW set. This is needed to
> + * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
> + *
> + * TODO: We proceed down to L1 4KB pages and then map these in. We should
> + * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
> + * page access. When we stop would depend on size (e.g. use 2MB pages for a
> + * few MBs)
> + */
> +int hvm_deprivileged_copy_pages(struct domain *d,
> +                            l4_pgentry_t *l4t_base,
> +                            unsigned long int src_start,
> +                            unsigned long int dst_start,
> +                            unsigned long int size,
> +                            unsigned int l1_flags)

<large snip>

Does map_pages_to_xen() not do everything you need here?  It certainly
should be fine for the .text section, and will keep the idle and monitor
tables up to date for you.
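
As an illustration only, the .text alias could then be a single call of
roughly this form (4.6-era signature; the exact flags are an
assumption):

    rc = map_pages_to_xen(HVM_DEPRIVILEGED_TEXT_ADDR,
                          virt_to_mfn(&__hvm_deprivileged_text_start),
                          size >> PAGE_SHIFT,
                          _PAGE_PRESENT | _PAGE_USER /* RX, no write */);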

Either way, it would be useful to start early on with a description of
what the new memory layout is expected to be, and which bits are global,
which are per pcpu and which are per vcpu.

>
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index 593cba7..abc5113 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -40,7 +40,8 @@
>  #include <asm/domain.h>
>  #include <xen/numa.h>
>  #include <asm/hvm/nestedhvm.h>
> -
> +#include <xen/hvm/deprivileged.h>
> +#include <asm/hvm/vmx/vmx.h>

This looks like a spurious addition.

>  #include "private.h"
>  
>  /* Override macros from asm/page.h to make them work with mfn_t */
> @@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
>             &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
>             ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
>  
> +    /* Initialise the HVM deprivileged mode feature */
> +    hvm_deprivileged_init(d, l4e);
> +
>      /* Install the per-domain mappings for this domain */
>      l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
>          l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
> @@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
>  
>      /* Put the memory back in the pool */
>      hap_free(d, mmfn);
> +
> +    /* Destroy the HVM tables */
> +    ASSERT(paging_locked_by_me(d));
> +    hvm_deprivileged_destroy(d);
>  }
>  
>  /************************************************/
> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
> index 6553cff..0bfe0cf 100644
> --- a/xen/arch/x86/xen.lds.S
> +++ b/xen/arch/x86/xen.lds.S
> @@ -50,6 +50,25 @@ SECTIONS
>         _etext = .;             /* End of text section */
>    } :text = 0x9090
>  
> +  /* HVM deprivileged mode segments
> +   * Used to map the ring3 static data and .text
> +   */
> +
> +  . = ALIGN(PAGE_SIZE);
> +  .hvm_deprivileged_text : {
> +      __hvm_deprivileged_text_start = . ;
> +      *(.hvm_deprivileged_enhancement.text)
> +      __hvm_deprivileged_text_end = . ;
> +  } : text
> +
> +  . = ALIGN(PAGE_SIZE);
> +  .hvm_deprivileged_data : {
> +      __hvm_deprivileged_data_start = . ;
> +      *(.hvm_deprivileged_enhancement.data)
> +      __hvm_deprivileged_data_end = . ;
> +  } : text
> +
> +  . = ALIGN(PAGE_SIZE);
>    .rodata : {
>         /* Bug frames table */
>         . = ALIGN(4);
> diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
> index 3e9be83..302399d 100644
> --- a/xen/include/asm-x86/config.h
> +++ b/xen/include/asm-x86/config.h
> @@ -186,7 +186,7 @@ extern unsigned char boot_edid_info[128];
>   *  0xffff880000000000 - 0xffffff7fffffffff [119.5TB,           PML4:272-510]
>   *    HVM/idle: continuation of 1:1 mapping
>   *  0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes  PML4:511]
> - *    HVM/idle: unused
> + *    HVM: HVM deprivileged mode 
>   *
>   * Compatibility guest area layout:
>   *  0x0000000000000000 - 0x00000000f57fffff [3928MB,            PML4:0]
> @@ -280,6 +280,14 @@ extern unsigned char boot_edid_info[128];
>  #endif
>  #define DIRECTMAP_VIRT_END      (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
>  
> +/* Slot 511: HVM deprivileged mode
> + * The virtual addresses where the .text, .data and stack should be
> + * placed.
> + */
> +#define HVM_DEPRIVILEGED_TEXT_ADDR  (PML4_ADDR(511))
> +#define HVM_DEPRIVILEGED_DATA_ADDR  (PML4_ADDR(511) + 0xa000000)
> +#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
> +
>  #ifndef __ASSEMBLY__
>  
>  /* This is not a fixed value, just a lower limit. */
> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
> new file mode 100644
> index 0000000..6cc803e
> --- /dev/null
> +++ b/xen/include/xen/hvm/deprivileged.h
> @@ -0,0 +1,94 @@
> +#ifndef __X86_HVM_DEPRIVILEGED
> +#define __X86_HVM_DEPRIVILEGED
> +
> +/* This is also included in the HVM deprivileged mode .S file */
> +#ifndef __ASSEMBLY__
> +
> +#include <asm/page.h>
> +#include <xen/lib.h>
> +#include <xen/mm.h>
> +#include <xen/domain_page.h>
> +#include <xen/config.h>
> +#include <xen/types.h>
> +#include <xen/sched.h>
> +#include <asm/paging.h>
> +#include <asm/hap.h>
> +#include <asm/paging.h>
> +#include <asm-x86/page.h>
> +#include <public/domctl.h>
> +#include <xen/domain_page.h>
> +
> +/* Initialise the HVM deprivileged mode. This just sets up the general
> + * page mappings for .text and .data. It does not prepare each HVM vcpu's data
> + * or stack which needs to be done separately using
> + * hvm_deprivileged_prepare_vcpu.
> + */
> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e);
> +
> +/* Free up the data used by the HVM deprivileged enhancements.
> + * This frees general page mappings. It does not destroy the per-vcpu
> + * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
> + * This method should be called after those per-vcpu destruction routines.
> + */
> +void hvm_deprivileged_destroy(struct domain *d);
> +
> +/* Used to create a copy of the data at the specified virtual address.
> + * When writing to the source, it will walk the page table hierarchy, creating
> + * new levels as needed, and then copy the data across.
> + */
> +int hvm_deprivileged_copy_pages(struct domain *d,
> +                                l4_pgentry_t *l4e_base,
> +                                unsigned long int src_start,
> +                                unsigned long int dst_start,
> +                                unsigned long int size,
> +                                unsigned int l1_flags);
> +
> +int hvm_deprivileged_copy_l3(struct domain *d,
> +                             l3_pgentry_t *l1t_base,
> +                             unsigned long int src_start,
> +                             unsigned long int dst_start,
> +                             unsigned long int size,
> +                             unsigned int l1_flags);
> +
> +int hvm_deprivileged_copy_l2(struct domain *d,
> +                             l2_pgentry_t *l1t_base,
> +                             unsigned long int src_start,
> +                             unsigned long int dst_start,
> +                             unsigned long int size,
> +                             unsigned int l1_flags);
> +
> +/* The left case of the copy. Will allocate the pages and actually copy the
> + * data.
> + */
> +int hvm_deprivileged_copy_l1(struct domain *d,
> +                             l1_pgentry_t *l1t_base,
> +                             unsigned long int src_start,
> +                             unsigned long int dst_start,
> +                             unsigned long int size,
> +                             unsigned int l1_flags);
> +
> +
> +/* The segments where the user mode .text and .data are stored */
> +extern unsigned long int __hvm_deprivileged_text_start;
> +extern unsigned long int __hvm_deprivileged_text_end;
> +extern unsigned long int __hvm_deprivileged_data_start;
> +extern unsigned long int __hvm_deprivileged_data_end;

extern unsigned long __hvm_$foo[];

Will be a more useful type to work with, and harder to accidentally misuse.
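
i.e. something along the lines of:

    extern unsigned long __hvm_deprivileged_text_start[];
    extern unsigned long __hvm_deprivileged_text_end[];

    size = (unsigned long)__hvm_deprivileged_text_end -
           (unsigned long)__hvm_deprivileged_text_start;

as the array name decays straight to the symbol's address, rather than
inviting an accidental read of whatever happens to live there.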

> +
> +#endif
> +
> +/* The sizes of the pages */
> +#define HVM_1GB_PAGESIZE 0x40000000ul
> +#define HVM_2MB_PAGESIZE 0x00200000ul
> +#define HVM_4KB_PAGESIZE 0x00001000ul
> +
> +/* The size in bytes that a single L{1,2,3,4} entry covers.
> + * There are 512 (left shift by 9) entries in each page-structure.
> + */
> +#define HVM_L4_PAGE_RANGE (HVM_1GB_PAGESIZE << 9)
> +#define HVM_L3_PAGE_RANGE (HVM_2MB_PAGESIZE << 9)
> +#define HVM_L2_PAGE_RANGE (HVM_4KB_PAGESIZE << 9)
> +#define HVM_L1_PAGE_RANGE (HVM_4KB_PAGESIZE     )

Values like this should be straight from page.h rather than redefining
here.  If page.h is missing some, that is the correct place to add them.
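
For example, a sketch reusing the existing shifts (names as in
asm-x86/x86_64/page.h):

    #define HVM_L4_PAGE_RANGE (1UL << L4_PAGETABLE_SHIFT) /* 512GB */
    #define HVM_L3_PAGE_RANGE (1UL << L3_PAGETABLE_SHIFT) /* 1GB   */
    #define HVM_L2_PAGE_RANGE (1UL << L2_PAGETABLE_SHIFT) /* 2MB   */
    #define HVM_L1_PAGE_RANGE (1UL << L1_PAGETABLE_SHIFT) /* 4KB   */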

~Andrew

> +
> +#define HVM_ERR_PG_ALLOC -1
> +
> +#endif

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-06 16:45 ` [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
@ 2015-08-06 20:55   ` Andrew Cooper
  2015-08-07 12:51     ` Ben Catterall
  2015-08-11  9:45     ` Ian Campbell
  2015-08-10  9:49   ` Tim Deegan
  1 sibling, 2 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-06 20:55 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 06/08/15 17:45, Ben Catterall wrote:
> The process to switch into and out of deprivileged mode can be likened to
> setjmp/longjmp.
>
> To enter deprivileged mode, we take a copy of the stack from the guest's
> registers up to the current stack pointer. This allows us to restore the stack
> when we have finished the deprivileged mode operation, meaning we can continue
> execution from that point. This is similar to if a context switch had happened.
>
> To exit deprivileged mode, we copy the stack back, replacing the current stack.
> We can then continue execution from where we left off, which will unwind the
> stack and free up resources. This method means that we do not need to
> change any other code paths and its invocation will be transparent to callers.
> This should allow the feature to be more easily deployed to different parts
> of Xen.
>
> Note that this copy of the stack is per-vcpu but, it will contain per-pcpu data.
> Extra work is needed to properly migrate vcpus between pcpus.

Under what circumstances do you see there being persistent state in the
depriv area between calls, given that the calls are synchronous from VM
actions?

>
> The switch to and from deprivileged mode is performed using sysret and syscall
> respectively.

I suspect we need to borrow the SS attribute workaround from Linux to
make this function reliably on AMD systems.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b

>
> The return paths in entry.S have been edited so that, when we receive an
> interrupt whilst in deprivileged mode, we return into that mode correctly.
>
> A hook on the syscall handler in entry.S has also been added which handles
> returning from user mode and will support deprivileged mode system calls when
> these are needed.
>
> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> ---
>  xen/arch/x86/domain.c               |  12 +++
>  xen/arch/x86/hvm/Makefile           |   1 +
>  xen/arch/x86/hvm/deprivileged.c     | 103 ++++++++++++++++++
>  xen/arch/x86/hvm/deprivileged_asm.S | 205 ++++++++++++++++++++++++++++++++++++
>  xen/arch/x86/hvm/vmx/vmx.c          |   7 ++
>  xen/arch/x86/x86_64/asm-offsets.c   |   5 +
>  xen/arch/x86/x86_64/entry.S         |  35 ++++++
>  xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
>  xen/include/xen/hvm/deprivileged.h  |  38 +++++++
>  xen/include/xen/sched.h             |  18 +++-
>  10 files changed, 425 insertions(+), 1 deletion(-)
>  create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
>
> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
> index 045f6ff..a0e5e70 100644
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -62,6 +62,7 @@
>  #include <xen/iommu.h>
>  #include <compat/vcpu.h>
>  #include <asm/psr.h>
> +#include <xen/hvm/deprivileged.h>
>  
>  DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>  DEFINE_PER_CPU(unsigned long, cr4);
> @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
>      if ( has_hvm_container_domain(d) )
>      {
>          rc = hvm_vcpu_initialise(v);
> +
> +        /* Initialise HVM deprivileged mode */
> +        printk("HVM initialising deprivileged mode ...");

All printk()s should have a XENLOG_$severity prefix.
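
e.g. (illustrative):

    printk(XENLOG_INFO "HVM: initialising deprivileged mode ...");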

> +        hvm_deprivileged_prepare_vcpu(v);
> +        printk("Done.\n");
> +
>          goto done;
>      }
>  
> @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
>      vcpu_destroy_fpu(v);
>  
>      if ( has_hvm_container_vcpu(v) )
> +    {
> +        /* Destroy the deprivileged mode on this vcpu */
> +        hvm_deprivileged_destroy_vcpu(v);
> +
>          hvm_vcpu_destroy(v);
> +    }
>      else
>          xfree(v->arch.pv_vcpu.trap_ctxt);
>  }
> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
> index bd83ba3..6819886 100644
> --- a/xen/arch/x86/hvm/Makefile
> +++ b/xen/arch/x86/hvm/Makefile
> @@ -17,6 +17,7 @@ obj-y += quirks.o
>  obj-y += rtc.o
>  obj-y += save.o
>  obj-y += deprivileged.o
> +obj-y += deprivileged_asm.o
>  obj-y += stdvga.o
>  obj-y += vioapic.o
>  obj-y += viridian.o
> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
> index 071d900..979fc69 100644
> --- a/xen/arch/x86/hvm/deprivileged.c
> +++ b/xen/arch/x86/hvm/deprivileged.c
> @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>      }
>      return 0;
>  }
> +
> +/* Used to prepare each vcpus data for user mode. Call for each HVM vcpu.
> + */
> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
> +{
> +    struct page_info *pg;
> +
> +    /* TODO: clarify if this MEMF is correct */
> +    /* Allocate 2^STACK_ORDER contiguous pages */
> +    pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
> +    if( pg == NULL )
> +    {
> +        panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
> +        return -ENOMEM;
> +    }
> +
> +    vcpu->stack = page_to_virt(pg);

Xen has two heaps, the xenheap and the domheap.

You may only construct pointers like this into the xenheap.  The domheap
is not guaranteed to have safe virtual mappings to.  (This code only
works because your test box isn't bigger than 5TB.  Also there is a bug
with xenheap allocations at the same point, but I need to fix that bug).

All access to domheap pages must strictly be within a
map_domain_page()/unmap() region, which sets up a safe temporary mapping.
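
i.e. a sketch of the required pattern (assuming the mfn-based
map_domain_page() signature of this era):

    void *p = map_domain_page(page_to_mfn(pg));
    /* ... access the page through p ... */
    unmap_domain_page(p);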

> +    vcpu->rsp = 0;
> +    vcpu->user_mode = 0;
> +
> +    return 0;
> +}
> +
> +/* Called on destroying each vcpu */
> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
> +{
> +    free_domheap_pages(virt_to_page(vcpu->stack), STACK_ORDER);
> +}
> +
> +/* Called to perform a user mode operation.
> + * Execution context is saved and then we move into user mode.
> + * This method is then jumped into to restore execution context after
> + * exiting user mode.
> + */
> +void hvm_deprivileged_user_mode(void)
> +{
> +    struct vcpu *vcpu = get_current();
> +    unsigned long int efer = read_efer();
> +    register unsigned long sp asm("rsp");
> +
> +    ASSERT( vcpu->user_mode == 0 );
> +    ASSERT( vcpu->stack != 0 );
> +    ASSERT( vcpu->rsp == 0 );
> +
> +    /* Flip the SCE bit to allow sysret/call */
> +    write_efer(efer | EFER_SCE);
> +
> +    /* Save the msr lstar and star. Xen does lazy loading of these
> +     * so we need to put the host values in and then restore the
> +     * guest ones once we're done.
> +     */
> +    rdmsrl(MSR_LSTAR, vcpu->msr_lstar);
> +    rdmsrl(MSR_STAR, vcpu->msr_star);
> +    wrmsrl(MSR_LSTAR,get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR]);
> +    wrmsrl(MSR_STAR, get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR]);

A partial context switch like this should be implemented as two new
hvm_ops such as hvm_op.depriv_ctxt_switch_{to,from}()

This allows you to keep the common code clean of vendor specific code.
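
A sketch of the suggested shape (hook names and placement hypothetical):

    /* In struct hvm_function_table: */
    void (*depriv_ctxt_switch_to)(struct vcpu *v);
    void (*depriv_ctxt_switch_from)(struct vcpu *v);

with the common code reduced to hvm_funcs.depriv_ctxt_switch_to(vcpu)
and hvm_funcs.depriv_ctxt_switch_from(vcpu) around the mode switch.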

> +
> +    /* The assembly routine to handle moving into/out of deprivileged mode */
> +    hvm_deprivileged_user_mode_asm();
> +
> +    /* If our copy failed */
> +    if( unlikely(vcpu->rsp == 0) )
> +    {
> +        gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __FUNCTION__);

__func__ please.  It conforms to C99 whereas __FUNCTION__ is a gnuism.

> +        domain_crash_synchronous();
> +    }
> +
> +    /* Debug info */
> +    vcpu->old_msr_lstar = get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR];
> +    vcpu->old_msr_star = get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR];
> +    vcpu->old_rsp = sp;
> +    vcpu->old_processor = smp_processor_id();
> +
> +    /* Restore the efer and saved msr registers */
> +    write_efer(efer);
> +    wrmsrl(MSR_LSTAR, vcpu->msr_lstar);
> +    wrmsrl(MSR_STAR, vcpu->msr_star);
> +    vcpu->user_mode = 0;
> +    vcpu->rsp = 0;
> +}
> +
> +/* Called when the user mode operation has completed
> + * Perform C-level processing on the return path
> + */
> +void hvm_deprivileged_finish_user_mode(void)
> +{
> +    /* If we are not returning from user mode: bail */
> +    ASSERT(get_current()->user_mode == 1);
> +
> +    hvm_deprivileged_finish_user_mode_asm();
> +}
> +
> +void hvm_deprivileged_check_trap(const char* func_name)
> +{
> +    if( current->user_mode == 1 )
> +    {
> +        printk("HVM Deprivileged Mode: Trap whilst in user mode, %s\n",
> +               func_name);
> +        domain_crash_synchronous();
> +    }
> +}
> +
> +
> +
> diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
> new file mode 100644
> index 0000000..00a9e9c
> --- /dev/null
> +++ b/xen/arch/x86/hvm/deprivileged_asm.S
> @@ -0,0 +1,205 @@
> +/*
> + * HVM security enhancements assembly code
> + */
> +#include <xen/config.h>
> +#include <xen/errno.h>
> +#include <xen/softirq.h>
> +#include <asm/asm_defns.h>
> +#include <asm/apicdef.h>
> +#include <asm/page.h>
> +#include <public/xen.h>
> +#include <irq_vectors.h>
> +#include <xen/hvm/deprivileged.h>
> +
> +/* Handles entry into the deprivileged mode and returning from this
> + * mode. This requires copying the current Xen privileged stack across
> + * to a per-vcpu buffer as we need to be able to handle interrupts and
> + * exceptions whilst in this mode. Xen is non-preemptable so our
> + * privileged mode stack would  be clobbered if we did not save it.
> + *
> + * If we are entering deprivileged mode, then we use a sysret to get there.
> + * If we are returning from deprivileged mode, then we need to unwind the stack
> + * so we copy it back over the current stack so that we can return from the
> + * call path where we came in from.
> + *
> + * We're doing sort-of a long jump/set jump with copying to a stack to
> + * preserve it and allow returning code to continue executing from
> + * within this method.
> + */
> +ENTRY(hvm_deprivileged_user_mode_asm)
> +        /* Save our registers */
> +        push   %rax
> +        push   %rbx
> +        push   %rcx
> +        push   %rdx
> +        push   %rsi
> +        push   %rdi
> +        push   %rbp
> +        push   %r8
> +        push   %r9
> +        push   %r10
> +        push   %r11
> +        push   %r12
> +        push   %r13
> +        push   %r14
> +        push   %r15
> +        pushfq
> +
> +        /* Perform a near call to push rip onto the stack */
> +        call   1f
> +
> +        /* Magic: Add to the stored rip the size of the code between
> +         * label 1 and label 2. This allows  us to restart execution at label 2.
> +         */
> +1:      addq   $2f-1b, (%rsp)
> +
> +        GET_CURRENT(%r8)
> +        xor    %rsi, %rsi
> +
> +        /* The following is equivalent to
> +         * (get_cpu_info() + sizeof(struct cpu_info))
> +         * This gets us to the top of the stack.
> +         */
> +        GET_STACK_BASE(%rcx)
> +        addq   $STACK_SIZE, %rcx
> +
> +        movq   VCPU_stack(%r8), %rdi
> +
> +        /* We need to copy the current stack across to our buffer
> +         * Calculate the number of bytes to copy:
> +         * (top of stack - current stack pointer)
> +         * NOTE: We must not push any more data onto our stack after this point
> +         * as it won't be saved.
> +         */
> +        sub    %rsp, %rcx
> +
> +        /* If the stack is too big, we don't do the copy: handled by caller. */
> +        cmpq   $STACK_SIZE, %rcx
> +        ja     3f
> +
> +        mov    %rsp, %rsi
> +/* USER MODE ENTRY POINT */
> +2:
> +        /* More magic: If we came here from preparing to go into user mode,

There is a very fine line between magic and gross hack ;)

I haven't quite decided which this is yet, but it certainly is neat, if
rather opaque.

> +         * then we copy our current stack to the buffer (the lines above
> +         * have setup rsi, rdi and rcx to do this).
> +         *
> +         * If we came here from user mode, then we movsb to copy from
> +         * our buffer into our current stack so that we can continue
> +         * execution from the current code point, and return back to the guest
> +         * via the path we came in. rsi, rdi and rcx have been setup by the
> +         * de-privileged return path for this.
> +         */
> +        rep    movsb
> +        mov    %rsp, %rsi
> +
> +        GET_CURRENT(%r8)
> +        movq   VCPU_user_mode(%r8), %rdx
> +        movq   VCPU_rsp(%r8), %rax
> +
> +        /* If !user_mode  */
> +        cmpq   $0, %rdx
> +        jne    3f
> +        cli
> +
> +        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
> +
> +        movq   $0x10200, %r11          /* RFLAGS user mode enable interrupts */

Please use $(X86_EFLAGS_IF | X86_EFLAGS_MBS) to be more clear which flags
are being set.

Also, by enabling interrupts, you need some hook to short-circuit the
scheduling softirq.  As it currently stands, a timer interrupt
interrupting depriv mode is liable to swap all your state out from under
you.

> +        movq   $1, VCPU_user_mode(%r8) /* Now in user mode */
> +        movq   %rsi, VCPU_rsp(%r8)     /* The rsp to restore to */
> +
> +        /* Stack ptr is set by user mode to avoid race conditions.

What race condition are you referring to?

> +         * See Intel manual 2 on the sysret instruction.

As a general rule, read both the Intel and the AMD manual for bits like
this.  sysret is one of the areas where implementations differ.

> +         */
> +        movq   $HVM_STACK_PTR, %rbx
> +        sysretq                         /* Enter deprivileged mode */
> +
> +3:      GET_CURRENT(%r8)
> +        movq   %rsi, VCPU_rsp(%r8)
> +        pop    %rax    /* Pop off rip: used in a jump so still on stack */
> +
> +        /* Restore registers */
> +        popfq
> +        pop    %r15
> +        pop    %r14
> +        pop    %r13
> +        pop    %r12
> +        pop    %r11
> +        pop    %r10
> +        pop    %r9
> +        pop    %r8
> +        pop    %rbp
> +        pop    %rdi
> +        pop    %rsi
> +        pop    %rdx
> +        pop    %rcx
> +        pop    %rbx
> +        pop    %rax
> +        ret
> +
> +/* Finished in user mode so return */
> +ENTRY(hvm_deprivileged_finish_user_mode_asm)
> +        /* The source is the copied stack in our buffer.
> +         * The destination is our current stack.
> +         *
> +         * We need to:
> +         * - Move the stack pointer to where it was before we entered
> +         *   deprivileged mode.
> +         * - Setup rsi, rdi and rcx as needed to perform the copy
> +         * - Jump to the address held at the top of the stack which
> +         *   is the user mode return address
> +         */
> +        cli
> +        GET_CURRENT(%rbx)
> +        movq   VCPU_stack(%rbx), %rsi
> +        movq   VCPU_rsp(%rbx), %rdi
> +
> +        /* The return address that the near call pushed onto the
> +         * buffer is pointed to by stack, so use that for rip.
> +         */
> +        movq   %rdi, %rsp
> +
> +        /* The following is equivalent to
> +         * (get_cpu_info() + sizeof(struct cpu_info) - vcpu->rsp)
> +         * This works out how many bytes we need to copy:
> +         * (top of stack - bottom of stack)
> +         */
> +        GET_STACK_BASE(%rcx)
> +        addq   $STACK_SIZE, %rcx
> +        subq   %rdi, %rcx
> +
> +        /* Go to user mode return code */
> +        jmp    *(%rsi)
> +
> +/* Entry point from the assembly syscall handlers */
> +ENTRY(hvm_deprivileged_handle_user_mode)
> +
> +        /* Handle a user mode hypercall here */
> +
> +
> +        /* We are finished in user mode */
> +        call hvm_deprivileged_finish_user_mode
> +
> +        ret
> +
> +.section .hvm_deprivileged_enhancement.text,"ax"
> +/* HVM deprivileged code */
> +ENTRY(hvm_deprivileged_ring3)
> +        /* sysret has loaded eip from rcx and rflags from r11.
> +         * CS and SS have been loaded from the MSR for ring 3.
> +         * We now need to  switch to the user mode stack
> +         */
> +        /* Setup usermode stack */
> +        movabs $HVM_STACK_PTR, %rsp
> +
> +        /* Perform user mode processing */
> +
> +        mov $0xf, %rcx
> +1: dec  %rcx
> +        cmp $0, %rcx
> +        jne 1b
> +
> +        /* Return to ring 0 */
> +        syscall
> +
> +.previous
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index c32d863..595b0f2 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -59,6 +59,8 @@
>  #include <asm/event.h>
>  #include <asm/monitor.h>
>  #include <public/arch-x86/cpuid.h>
> +#include <xen/hvm/deprivileged.h>
> +
>  
>  static bool_t __initdata opt_force_ept;
>  boolean_param("force-ept", opt_force_ept);
> @@ -194,6 +196,10 @@ void vmx_save_host_msrs(void)
>          set_bit(VMX_INDEX_MSR_ ## address, &host_msr_state->flags);     \
>      } while ( 0 )
>  
> +struct vmx_msr_state *get_host_msr_state(void) {
> +    return &this_cpu(host_msr_state);
> +}
> +
>  static enum handler_return
>  long_mode_do_msr_read(unsigned int msr, uint64_t *msr_content)
>  {
> @@ -272,6 +278,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
>      case MSR_LSTAR:
>          if ( !is_canonical_address(msr_content) )
>              goto uncanonical_address;
> +

Please avoid spurious changes like this.

>          WRITE_MSR(LSTAR);
>          break;
>  
> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
> index 447c650..fd5de44 100644
> --- a/xen/arch/x86/x86_64/asm-offsets.c
> +++ b/xen/arch/x86/x86_64/asm-offsets.c
> @@ -115,6 +115,11 @@ void __dummy__(void)
>      OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
>      BLANK();
>  
> +    OFFSET(VCPU_stack, struct vcpu, stack);
> +    OFFSET(VCPU_rsp, struct vcpu, rsp);
> +    OFFSET(VCPU_user_mode, struct vcpu, user_mode);
> +    BLANK();
> +
>      OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
>      BLANK();
>  
> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
> index 74677a2..fa9155c 100644
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -102,6 +102,15 @@ restore_all_xen:
>          RESTORE_ALL adj=8
>          iretq
>  
> +/* Returning from user mode */
> +handle_hvm_user_mode:
> +
> +        call hvm_deprivileged_handle_user_mode
> +
> +        /* Go back into user mode */
> +        cli
> +        jmp  restore_all_guest
> +
>  /*
>   * When entering SYSCALL from kernel mode:
>   *  %rax                            = hypercall vector
> @@ -131,6 +140,14 @@ ENTRY(lstar_enter)
>          testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
>          jz    switch_to_kernel
>  
> +        /* Were we in Xen's ring 3?  */
> +        push %rbx
> +        GET_CURRENT(%rbx)
> +        movq VCPU_user_mode(%rbx), %rbx
> +        cmp  $1, %rbx
> +        je   handle_hvm_user_mode
> +        pop  %rbx

No need for the movq or rbx clobber.  This entire block can be:

cmpb $1, VCPU_user_mode(%rbx)
je handle_hvm_user_mode

Similar to the $TF_kernel_mode check in context above.



> +
>  /*hypercall:*/
>          movq  %r10,%rcx
>          cmpq  $NR_hypercalls,%rax
> @@ -487,6 +504,13 @@ ENTRY(common_interrupt)
>  /* No special register assumptions. */
>  ENTRY(ret_from_intr)
>          GET_CURRENT(%rbx)
> +
> +        /* If we are in Xen's user mode, return into it */
> +        cmpq $1,VCPU_user_mode(%rbx)
> +        cli
> +        je    restore_all_guest
> +        sti
> +

None of this should be necessary - the exception frame created by
lstar_enter should cause ret_from_intr to do the correct thing.

>          testb $3,UREGS_cs(%rsp)
>          jz    restore_all_xen
>          movq  VCPU_domain(%rbx),%rax
> @@ -509,6 +533,14 @@ handle_exception_saved:
>          GET_CURRENT(%rbx)
>          PERFC_INCR(exceptions, %rax, %rbx)
>          callq *(%rdx,%rax,8)
> +
> +        /* If we are in Xen's user mode, return into it */
> +        /* TODO: Test this path */
> +        cmpq  $1,VCPU_user_mode(%rbx)
> +        cli
> +        je    restore_all_guest
> +        sti
> +
>          testb $3,UREGS_cs(%rsp)
>          jz    restore_all_xen
>          leaq  VCPU_trap_bounce(%rbx),%rdx
> @@ -664,6 +696,9 @@ handle_ist_exception:
>          movl  $EVENT_CHECK_VECTOR,%edi
>          call  send_IPI_self
>  1:      movq  VCPU_domain(%rbx),%rax
> +        /* This also handles Xen ring3 return for us.
> +         * So, there is no need to explicitly do a user mode check.
> +         */
>          cmpb  $0,DOMAIN_is_32bit_pv(%rax)
>          je    restore_all_guest
>          jmp   compat_restore_all_guest
> diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
> index 3fbfa44..98e269e 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
> @@ -565,4 +565,6 @@ typedef struct {
>      u16 eptp_index;
>  } ve_info_t;
>  
> +struct vmx_msr_state *get_host_msr_state(void);
> +
>  #endif /* __ASM_X86_HVM_VMX_VMX_H__ */
> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
> index 6cc803e..e42f39a 100644
> --- a/xen/include/xen/hvm/deprivileged.h
> +++ b/xen/include/xen/hvm/deprivileged.h
> @@ -68,6 +68,37 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>                               unsigned int l1_flags);
>  
>  
> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
> +
> +/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
> +
> +/* Called to perform a user mode operation. */
> +void hvm_deprivileged_user_mode(void);
> +
> +/* Called when the user mode operation has completed */
> +void hvm_deprivileged_finish_user_mode(void);
> +
> +/* Called to move into and then out of user mode. Needed for accessing
> + * assembly features.
> + */
> +void hvm_deprivileged_user_mode_asm(void);
> +
> +/* Called on the return path to return to the correct execution point */
> +void hvm_deprivileged_finish_user_mode_asm(void);
> +
> +/* Handle any syscalls that the user mode makes */
> +void hvm_deprivileged_handle_user_mode(void);
> +
> +/* The ring 3 code */
> +void hvm_deprivileged_ring3(void);
> +
> +/* Call when inside a trap that should cause a domain crash if in user mode
> + * e.g. an invalid_op is trapped whilst in user mode.
> + */
> +void hvm_deprivileged_check_trap(const char* func_name);
> +
>  /* The segments where the user mode .text and .data are stored */
>  extern unsigned long int __hvm_deprivileged_text_start;
>  extern unsigned long int __hvm_deprivileged_text_end;
> @@ -91,4 +122,11 @@ extern unsigned long int __hvm_deprivileged_data_end;
>  
>  #define HVM_ERR_PG_ALLOC -1
>  
> +/* The user mode stack pointer.
> + * The stack grows down so set this to the top of the stack region. Then,
> + * as this is 0-indexed, move into the stack, not just after it.
> + * Subtract 16 bytes for correct stack alignment.
> + */
> +#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
> +
>  #endif
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 73d3bc8..180643e 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
>  
>  struct waitqueue_vcpu;
>  
> -struct vcpu 
> +struct vcpu

Trailing whitespace is nasty, but we avoid inflating the patch by
dropping whitespace on lines not touched by semantic changes.

>  {
>      int              vcpu_id;
>  
> @@ -158,6 +158,22 @@ struct vcpu
>  
>      void            *sched_priv;    /* scheduler-specific data */
>  
> +    /* HVM deprivileged mode state */
> +    void *stack;             /* Location of stack to save data onto */
> +    unsigned long rsp;       /* rsp of our stack to restore our data to */
> +    unsigned long user_mode; /* Are we (possibly moving into) in user mode? */
> +
> +    /* The mstar of the processor that we are currently executing on.
> +     *  we need to save this because Xen does lazy saving of these.
> +     */
> +    unsigned long int msr_lstar; /* lstar */
> +    unsigned long int msr_star;

There should be no need to store this like this.  Follow what the
current context switching code does.

~Andrew

> +
> +    /* Debug info */
> +    unsigned long int old_rsp;
> +    unsigned long int old_processor;
> +    unsigned long int old_msr_lstar;
> +    unsigned long int old_msr_star;
>      struct vcpu_runstate_info runstate;
>  #ifndef CONFIG_COMPAT
>  # define runstate_guest(v) ((v)->runstate_guest)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-06 16:45 ` [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
@ 2015-08-06 21:24   ` Andrew Cooper
  2015-08-07 12:32     ` Ben Catterall
  2015-08-10 10:07   ` Tim Deegan
  1 sibling, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-06 21:24 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 06/08/2015 17:45, Ben Catterall wrote:
> Added trap handlers to catch exceptions such as a page fault, general
> protection fault, etc. These handlers will crash the domain as such exceptions
> would indicate that either there is a bug in deprivileged mode or it has been
> compromised by an attacker.
>
> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> ---
>  xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>  xen/arch/x86/traps.c      | 41 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index abc5113..43bde89 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v, unsigned long va,
>  {
>      struct domain *d = v->domain;
>  
> +    /* If we get a page fault whilst in HVM security user mode */
> +    if( v->user_mode == 1 )
> +    {
> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
> +                 d->domain_id, v->vcpu_id);

%pv is your friend.  Like Linux, we have custom printk formats.  In this
case, passing 'v' as a parameter to %pv will cause d$Xv$Y to be
printed.  (The example below predates %pv being introduced).
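
e.g., as a sketch:

    printk(XENLOG_G_ERR "HVM: #PF %pv whilst in user mode\n", v);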

> +        domain_crash_synchronous();

No need for _synchronous() here.  _synchronous() should only be used
when you can't safely recover.  It ends up spinning in a tight loop
waiting for the next timer interrupt, which is anything up to 30ms away.
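
i.e. the handler can log, crash and unwind, along the lines of:

    if ( v->user_mode == 1 )
    {
        printk(XENLOG_G_ERR "HVM: #PF %pv whilst in user mode\n", v);
        domain_crash(d);
        return 0; /* domain marked as dying; unwind normally */
    }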

~Andrew

> +    }
> +
>      HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
>                d->domain_id, v->vcpu_id);
> +
>      domain_crash(d);
>      return 0;
>  }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-06 19:22   ` Andrew Cooper
@ 2015-08-07  9:57     ` Ben Catterall
  2015-08-07 13:14       ` Andrew Cooper
  2015-08-10  8:50       ` Tim Deegan
  0 siblings, 2 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-07  9:57 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 06/08/15 20:22, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
>> This allocation function is used by the deprivileged mode initialisation code
>> to allocate pages for the new page table mappings and page frames on the HAP
>> page heap.
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> This is fine for your test box, but isn't fine for systems out there
> without hardware EPT/NPT support.  For older systems like that (or in
> certain specific workloads), shadow paging is used instead.
>
> This feature is applicable to any HVM domain, which means that it
> shouldn't depend on HAP or shadow paging.
>
> How much memory is allocated for the depriv area, and what exactly is
> allocated in total?
So, per-vcpu:
- a user mode stack which, from your comments in [RFC 2/4], can be 2 pages
- local data (may or may not be needed, depends on the device), which
  will be around a page or two.

Text segment: as per your comments in RFC 2/4, this will be changed to
be an alias, so no extra memory.
> I expect it isn't very much, and would suggest using
> d->arch.paging.alloc_page() instead (the generic "get me some
> memory accounted against the domain" helper), which looks as if it
> should suffice.
>
> Depending on exactly how much memory is needed, you might need to bump
> the default minimum shadow pool size.
>
> ~Andrew
Ok, will do. That will also solve the EPT/NPT problem, thanks!

Ben

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-06 21:24   ` Andrew Cooper
@ 2015-08-07 12:32     ` Ben Catterall
  2015-08-07 13:19       ` Andrew Cooper
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-07 12:32 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich



On 06/08/15 22:24, Andrew Cooper wrote:
> On 06/08/2015 17:45, Ben Catterall wrote:
>> Added trap handlers to catch exceptions such as a page fault, general
>> protection fault, etc. These handlers will crash the domain as such exceptions
>> would indicate that either there is a bug in deprivileged mode or it has been
>> compromised by an attacker.
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>> ---
>>   xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>>   xen/arch/x86/traps.c      | 41 ++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>> index abc5113..43bde89 100644
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v, unsigned long va,
>>   {
>>       struct domain *d = v->domain;
>>   
>> +    /* If we get a page fault whilst in HVM security user mode */
>> +    if( v->user_mode == 1 )
>> +    {
>> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
>> +                 d->domain_id, v->vcpu_id);
> %pv is your friend.  Like Linux, we have custom printk formats.  In this
> case, passing 'v' as a parameter to %pv will cause d$Xv$Y to be
> printed.  (The example below predates %pv being introduced).
ok, will do. thanks!
>
>> +        domain_crash_synchronous();
> No need for _synchronous() here.  _synchronous() should only be used
> when you can't safely recover.  It ends up spinning in a tight loop
> waiting for the next timer interrupt, which is anything up to 30ms away.
I'm not sure if we can safely recover from this. This will only be
triggered if there is a bug in depriv mode or if the mode has been
compromised and an attacker has tried to access unavailable memory.

From my understanding (am I missing something?): domain_crash
effectively sets flags to tell the scheduler that the domain should be
killed the next time the scheduler runs, and then returns. In which
case, if we don't do a synchronous crash, this return path would return
back into deprivileged mode, we would not have mapped in the page (as
we shouldn't), and then we would get another fault.

What do you think is the best way forward? Thanks!


> ~Andrew
>
>> +    }
>> +
>>       HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
>>                 d->domain_id, v->vcpu_id);
>> +
>>       domain_crash(d);
>>       return 0;
>>   }

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-06 20:55   ` Andrew Cooper
@ 2015-08-07 12:51     ` Ben Catterall
  2015-08-07 13:08       ` David Vrabel
  2015-08-07 14:24       ` Andrew Cooper
  2015-08-11  9:45     ` Ian Campbell
  1 sibling, 2 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-07 12:51 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich



On 06/08/15 21:55, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer. This allows us to restore the stack
>> when we have finished the deprivileged mode operation, meaning we can continue
>> execution from that point. This is similar to if a context switch had happened.
>>
>> To exit deprivileged mode, we copy the stack back, replacing the current stack.
>> We can then continue execution from where we left off, which will unwind the
>> stack and free up resources. This method means that we do not need to
>> change any other code paths and its invocation will be transparent to callers.
>> This should allow the feature to be more easily deployed to different parts
>> of Xen.
>>
>> Note that this copy of the stack is per-vcpu but, it will contain per-pcpu data.
>> Extra work is needed to properly migrate vcpus between pcpus.
>
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

I don't know if we can make these synchronous, as we need a way to
interrupt the vcpu if it's spinning for a long time; otherwise an
attacker could just spin in depriv mode and cause a DoS. With that in
mind, the scheduler may decide to migrate the vcpu whilst it's in depriv
mode, which would mean the per-pcpu data held in the stack copy is then
migrated to another pcpu incorrectly.

>
>>
>> The switch to and from deprivileged mode is performed using sysret and syscall
>> respectively.
>
> I suspect we need to borrow the SS attribute workaround from Linux to
> make this function reliably on AMD systems.
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>
Ah! ok, I'll look into this. Thanks!
>>
>> The return paths in entry.S have been edited so that, when we receive an
>> interrupt whilst in deprivileged mode, we return into that mode correctly.
>>
>> A hook on the syscall handler in entry.S has also been added which handles
>> returning from user mode and will support deprivileged mode system calls when
>> these are needed.
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>> ---
>>   xen/arch/x86/domain.c               |  12 +++
>>   xen/arch/x86/hvm/Makefile           |   1 +
>>   xen/arch/x86/hvm/deprivileged.c     | 103 ++++++++++++++++++
>>   xen/arch/x86/hvm/deprivileged_asm.S | 205 ++++++++++++++++++++++++++++++++++++
>>   xen/arch/x86/hvm/vmx/vmx.c          |   7 ++
>>   xen/arch/x86/x86_64/asm-offsets.c   |   5 +
>>   xen/arch/x86/x86_64/entry.S         |  35 ++++++
>>   xen/include/asm-x86/hvm/vmx/vmx.h   |   2 +
>>   xen/include/xen/hvm/deprivileged.h  |  38 +++++++
>>   xen/include/xen/sched.h             |  18 +++-
>>   10 files changed, 425 insertions(+), 1 deletion(-)
>>   create mode 100644 xen/arch/x86/hvm/deprivileged_asm.S
>>
>> diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
>> index 045f6ff..a0e5e70 100644
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -62,6 +62,7 @@
>>   #include <xen/iommu.h>
>>   #include <compat/vcpu.h>
>>   #include <asm/psr.h>
>> +#include <xen/hvm/deprivileged.h>
>>
>>   DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
>>   DEFINE_PER_CPU(unsigned long, cr4);
>> @@ -446,6 +447,12 @@ int vcpu_initialise(struct vcpu *v)
>>       if ( has_hvm_container_domain(d) )
>>       {
>>           rc = hvm_vcpu_initialise(v);
>> +
>> +        /* Initialise HVM deprivileged mode */
>> +        printk("HVM initialising deprivileged mode ...");
>
> All printk()s should have a XENLOG_$severity prefix.
>
will do.
>> +        hvm_deprivileged_prepare_vcpu(v);
>> +        printk("Done.\n");
>> +
>>           goto done;
>>       }
>>
>> @@ -523,7 +530,12 @@ void vcpu_destroy(struct vcpu *v)
>>       vcpu_destroy_fpu(v);
>>
>>       if ( has_hvm_container_vcpu(v) )
>> +    {
>> +        /* Destroy the deprivileged mode on this vcpu */
>> +        hvm_deprivileged_destroy_vcpu(v);
>> +
>>           hvm_vcpu_destroy(v);
>> +    }
>>       else
>>           xfree(v->arch.pv_vcpu.trap_ctxt);
>>   }
>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>> index bd83ba3..6819886 100644
>> --- a/xen/arch/x86/hvm/Makefile
>> +++ b/xen/arch/x86/hvm/Makefile
>> @@ -17,6 +17,7 @@ obj-y += quirks.o
>>   obj-y += rtc.o
>>   obj-y += save.o
>>   obj-y += deprivileged.o
>> +obj-y += deprivileged_asm.o
>>   obj-y += stdvga.o
>>   obj-y += vioapic.o
>>   obj-y += viridian.o
>> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
>> index 071d900..979fc69 100644
>> --- a/xen/arch/x86/hvm/deprivileged.c
>> +++ b/xen/arch/x86/hvm/deprivileged.c
>> @@ -439,3 +439,106 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>       }
>>       return 0;
>>   }
>> +
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu.
>> + */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu)
>> +{
>> +    struct page_info *pg;
>> +
>> +    /* TODO: clarify if this MEMF is correct */
>> +    /* Allocate 2^STACK_ORDER contiguous pages */
>> +    pg = alloc_domheap_pages(NULL, STACK_ORDER, MEMF_no_owner);
>> +    if( pg == NULL )
>> +    {
>> +        panic("HVM: Out of memory on per-vcpu deprivileged mode init.\n");
>> +        return -ENOMEM;
>> +    }
>> +
>> +    vcpu->stack = page_to_virt(pg);
>
> Xen has two heaps, the xenheap and the domheap.
>
> You may only construct pointers like this into the xenheap.  The domheap
> is not guaranteed to have safe virtual mappings to.  (This code only
> works because your test box isn't bigger than 5TB.  Also there is a bug
> with xenheap allocations at the same point, but I need to fix that bug).
>
> All access to domheap pages must strictly be within a
> map_domain_page()/unmap() region, which construct safe temporary mappings.
>
ok, I'll add these.
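So, as a rough sketch (a multi-page buffer would need each page mapped
individually, or vmap()):

    void *p = __map_domain_page(pg);  /* temporary mapping, one page */
    /* ... access the page contents ... */
    unmap_domain_page(p);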
>> +    vcpu->rsp = 0;
>> +    vcpu->user_mode = 0;
>> +
>> +    return 0;
>> +}
>> +
>> +/* Called on destroying each vcpu */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu)
>> +{
>> +    free_domheap_pages(virt_to_page(vcpu->stack), STACK_ORDER);
>> +}
>> +
>> +/* Called to perform a user mode operation.
>> + * Execution context is saved and then we move into user mode.
>> + * This method is then jumped into to restore execution context after
>> + * exiting user mode.
>> + */
>> +void hvm_deprivileged_user_mode(void)
>> +{
>> +    struct vcpu *vcpu = get_current();
>> +    unsigned long int efer = read_efer();
>> +    register unsigned long sp asm("rsp");
>> +
>> +    ASSERT( vcpu->user_mode == 0 );
>> +    ASSERT( vcpu->stack != 0 );
>> +    ASSERT( vcpu->rsp == 0 );
>> +
>> +    /* Flip the SCE bit to allow sysret/call */
>> +    write_efer(efer | EFER_SCE);
>> +
>> +    /* Save the msr lstar and star. Xen does lazy loading of these
>> +     * so we need to put the host values in and then restore the
>> +     * guest ones once we're done.
>> +     */
>> +    rdmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    rdmsrl(MSR_STAR, vcpu->msr_star);
>> +    wrmsrl(MSR_LSTAR,get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR]);
>> +    wrmsrl(MSR_STAR, get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR]);
>
> A partial context switch like this should be implemented as two new
> hvm_ops such as hvm_op.depriv_ctxt_switch_{to,from}()
>
> This allows you to keep the common code clean of vendor specific code.
>
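Makes sense, I'll add those. Presumably something along these lines
(hook names illustrative):

    /* new members of struct hvm_function_table */
    void (*depriv_ctxt_switch_to)(struct vcpu *v);
    void (*depriv_ctxt_switch_from)(struct vcpu *v);

with the VMX implementation doing the LSTAR/STAR save/restore.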
>> +
>> +    /* The assembly routine to handle moving into/out of deprivileged mode */
>> +    hvm_deprivileged_user_mode_asm();
>> +
>> +    /* If our copy failed */
>> +    if( unlikely(vcpu->rsp == 0) )
>> +    {
>> +        gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __FUNCTION__);
>
> __func__ please.  It conforms to C99 whereas __FUNCTION__ is a gnuism.
>
got it.
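i.e.:

    gdprintk(XENLOG_ERR, "HVM: Stack too large in %s\n", __func__);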
>> +        domain_crash_synchronous();
>> +    }
>> +
>> +    /* Debug info */
>> +    vcpu->old_msr_lstar = get_host_msr_state()->msrs[VMX_INDEX_MSR_LSTAR];
>> +    vcpu->old_msr_star = get_host_msr_state()->msrs[VMX_INDEX_MSR_STAR];
>> +    vcpu->old_rsp = sp;
>> +    vcpu->old_processor = smp_processor_id();
>> +
>> +    /* Restore the efer and saved msr registers */
>> +    write_efer(efer);
>> +    wrmsrl(MSR_LSTAR, vcpu->msr_lstar);
>> +    wrmsrl(MSR_STAR, vcpu->msr_star);
>> +    vcpu->user_mode = 0;
>> +    vcpu->rsp = 0;
>> +}
>> +
>> +/* Called when the user mode operation has completed
>> + * Perform C-level processing on return path
>> + */
>> +void hvm_deprivileged_finish_user_mode(void)
>> +{
>> +    /* If we are not returning from user mode: bail */
>> +    ASSERT(get_current()->user_mode == 1);
>> +
>> +    hvm_deprivileged_finish_user_mode_asm();
>> +}
>> +
>> +void hvm_deprivileged_check_trap(const char* func_name)
>> +{
>> +    if( current->user_mode == 1 )
>> +    {
>> +        printk("HVM Deprivileged Mode: Trap whilst in user mode, %s\n",
>> +               func_name);
>> +        domain_crash_synchronous();
>> +    }
>> +}
>> +
>> +
>> +
>> diff --git a/xen/arch/x86/hvm/deprivileged_asm.S b/xen/arch/x86/hvm/deprivileged_asm.S
>> new file mode 100644
>> index 0000000..00a9e9c
>> --- /dev/null
>> +++ b/xen/arch/x86/hvm/deprivileged_asm.S
>> @@ -0,0 +1,205 @@
>> +/*
>> + * HVM security enhancements assembly code
>> + */
>> +#include <xen/config.h>
>> +#include <xen/errno.h>
>> +#include <xen/softirq.h>
>> +#include <asm/asm_defns.h>
>> +#include <asm/apicdef.h>
>> +#include <asm/page.h>
>> +#include <public/xen.h>
>> +#include <irq_vectors.h>
>> +#include <xen/hvm/deprivileged.h>
>> +
>> +/* Handles entry into the deprivileged mode and returning from this
>> + * mode. This requires copying the current Xen privileged stack across
>> + * to a per-vcpu buffer as we need to be able to handle interrupts and
>> + * exceptions whilst in this mode. Xen is non-preemptable so our
>> + * privileged mode stack would be clobbered if we did not save it.
>> + *
>> + * If we are entering deprivileged mode, then we use a sysret to get there.
>> + * If we are returning from deprivileged mode, then we need to unwind the stack
>> + * so we copy it back over the current stack so that we can return from the
>> + * call path where we came in from.
>> + *
>> + * We're doing sort-of a long jump/set jump with copying to a stack to
>> + * preserve it and allow returning code to continue executing from
>> + * within this method.
>> + */
>> +ENTRY(hvm_deprivileged_user_mode_asm)
>> +        /* Save our registers */
>> +        push   %rax
>> +        push   %rbx
>> +        push   %rcx
>> +        push   %rdx
>> +        push   %rsi
>> +        push   %rdi
>> +        push   %rbp
>> +        push   %r8
>> +        push   %r9
>> +        push   %r10
>> +        push   %r11
>> +        push   %r12
>> +        push   %r13
>> +        push   %r14
>> +        push   %r15
>> +        pushfq
>> +
>> +        /* Perform a near call to push rip onto the stack */
>> +        call   1f
>> +
>> +        /* Magic: Add to the stored rip the size of the code between
>> +         * label 1 and label 2. This allows us to restart execution at label 2.
>> +         */
>> +1:      addq   $2f-1b, (%rsp)
>> +
>> +        GET_CURRENT(%r8)
>> +        xor    %rsi, %rsi
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info))
>> +         * This gets us to the top of the stack.
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq   $STACK_SIZE, %rcx
>> +
>> +        movq   VCPU_stack(%r8), %rdi
>> +
>> +        /* We need copy the current stack across to our buffer
>> +         * Calculate the number of bytes to copy:
>> +         * (top of stack - current stack pointer)
>> +         * NOTE: We must not push any more data onto our stack after this point
>> +         * as it won't be saved.
>> +         */
>> +        sub    %rsp, %rcx
>> +
>> +        /* If the stack is too big, we don't do the copy: handled by caller. */
>> +        cmpq   $STACK_SIZE, %rcx
>> +        ja     3f
>> +
>> +        mov    %rsp, %rsi
>> +/* USER MODE ENTRY POINT */
>> +2:
>> +        /* More magic: If we came here from preparing to go into user mode,
>
> There is a very fine line between magic and gross hack ;)
>
> I haven't quite decided which this is yet, but it certainly is neat, if
> rather opaque.
>
>> +         * then we copy our current stack to the buffer (the lines above
>> +         * have setup rsi, rdi and rcx to do this).
>> +         *
>> +         * If we came here from user mode, then we movsb to copy from
>> +         * our buffer into our current stack so that we can continue
>> +         * execution from the current code point, and return back to the guest
>> +         * via the path we came in. rsi, rdi and rcx have been setup by the
>> +         * de-privileged return path for this.
>> +         */
>> +        rep    movsb
>> +        mov    %rsp, %rsi
>> +
>> +        GET_CURRENT(%r8)
>> +        movq   VCPU_user_mode(%r8), %rdx
>> +        movq   VCPU_rsp(%r8), %rax
>> +
>> +        /* If !user_mode  */
>> +        cmpq   $0, %rdx
>> +        jne    3f
>> +        cli
>> +
>> +        movabs $HVM_DEPRIVILEGED_TEXT_ADDR, %rcx /* RIP in user mode */
>> +
>> +        movq   $0x10200, %r11          /* RFLAGS user mode enable interrupts */
>
> Please use $(X86_FLAGS_IF | X86_FLAGS_MBS) to be more clear which flags
> are being set.
>
will do.
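i.e. something like (assuming the usual X86_EFLAGS_* spellings from the
headers):

    movq   $(X86_EFLAGS_IF | X86_EFLAGS_MBS), %r11 /* user mode RFLAGS */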
> Also, by enabling interrupts, you need some hook to short-circuit the
> scheduling softirq.  As it currently stands, a timer interrupt
> interrupting depriv mode is liable to swap all your state out from under
> you.
>
We need interrupts to be enabled so that we can prevent a DoS from 
depriv by allowing the scheduler to decide to deschedule it. That's also 
why we needed some of the return path changes.
>> +        movq   $1, VCPU_user_mode(%r8) /* Now in user mode */
>> +        movq   %rsi, VCPU_rsp(%r8)     /* The rsp to restore to */
>> +
>> +        /* Stack ptr is set by user mode to avoid race conditions.
>
> What race condition are you referring to?
>
>> +         * See Intel manual 2 on the sysret instruction.
>
> As a general rule, read both the Intel and the AMD manual for bits like
> this.  sysret is one of the areas where implementations differ.
>
>> +         */
>> +        movq   $HVM_STACK_PTR, %rbx
>> +        sysretq                         /* Enter deprivileged mode */
>> +
>> +3:      GET_CURRENT(%r8)
>> +        movq   %rsi, VCPU_rsp(%r8)
>> +        pop    %rax    /* Pop off rip: used in a jump so still on stack */
>> +
>> +        /* Restore registers */
>> +        popfq
>> +        pop    %r15
>> +        pop    %r14
>> +        pop    %r13
>> +        pop    %r12
>> +        pop    %r11
>> +        pop    %r10
>> +        pop    %r9
>> +        pop    %r8
>> +        pop    %rbp
>> +        pop    %rdi
>> +        pop    %rsi
>> +        pop    %rdx
>> +        pop    %rcx
>> +        pop    %rbx
>> +        pop    %rax
>> +        ret
>> +
>> +/* Finished in user mode so return */
>> +ENTRY(hvm_deprivileged_finish_user_mode_asm)
>> +        /* The source is the copied stack in our buffer.
>> +         * The destination is our current stack.
>> +         *
>> +         * We need to:
>> +         * - Move the stack pointer to where it was before we entered
>> +         *   deprivileged mode.
>> +         * - Setup rsi, rdi and rcx as needed to perform the copy
>> +         * - Jump to the address held at the top of the stack which
>> +         *   is the user mode return address
>> +         */
>> +        cli
>> +        GET_CURRENT(%rbx)
>> +        movq   VCPU_stack(%rbx), %rsi
>> +        movq   VCPU_rsp(%rbx), %rdi
>> +
>> +        /* The return address that the near call pushed onto the
>> +         * buffer is pointed to by stack, so use that for rip.
>> +         */
>> +        movq   %rdi, %rsp
>> +
>> +        /* The following is equivalent to
>> +         * (get_cpu_info() + sizeof(struct cpu_info) - vcpu->rsp)
>> +         * This works out how many bytes we need to copy:
>> +         * (top of stack - bottom of stack)
>> +         */
>> +        GET_STACK_BASE(%rcx)
>> +        addq   $STACK_SIZE, %rcx
>> +        subq   %rdi, %rcx
>> +
>> +        /* Go to user mode return code */
>> +        jmp    *(%rsi)
>> +
>> +/* Entry point from the assembly syscall handlers */
>> +ENTRY(hvm_deprivileged_handle_user_mode)
>> +
>> +        /* Handle a user mode hypercall here */
>> +
>> +
>> +        /* We are finished in user mode */
>> +        call hvm_deprivileged_finish_user_mode
>> +
>> +        ret
>> +
>> +.section .hvm_deprivileged_enhancement.text,"ax"
>> +/* HVM deprivileged code */
>> +ENTRY(hvm_deprivileged_ring3)
>> +        /* sysret has loaded eip from rcx and rflags from r11.
>> +         * CS and SS have been loaded from the MSR for ring 3.
>> +         * We now need to switch to the user mode stack
>> +         */
>> +        /* Setup usermode stack */
>> +        movabs $HVM_STACK_PTR, %rsp
>> +
>> +        /* Perform user mode processing */
>> +
>> +        mov $0xf, %rcx
>> +1: dec  %rcx
>> +        cmp $0, %rcx
>> +        jne 1b
>> +
>> +        /* Return to ring 0 */
>> +        syscall
>> +
>> +.previous
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index c32d863..595b0f2 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -59,6 +59,8 @@
>>   #include <asm/event.h>
>>   #include <asm/monitor.h>
>>   #include <public/arch-x86/cpuid.h>
>> +#include <xen/hvm/deprivileged.h>
>> +
>>
>>   static bool_t __initdata opt_force_ept;
>>   boolean_param("force-ept", opt_force_ept);
>> @@ -194,6 +196,10 @@ void vmx_save_host_msrs(void)
>>           set_bit(VMX_INDEX_MSR_ ## address, &host_msr_state->flags);     \
>>       } while ( 0 )
>>
>> +struct vmx_msr_state *get_host_msr_state(void) {
>> +    return &this_cpu(host_msr_state);
>> +}
>> +
>>   static enum handler_return
>>   long_mode_do_msr_read(unsigned int msr, uint64_t *msr_content)
>>   {
>> @@ -272,6 +278,7 @@ long_mode_do_msr_write(unsigned int msr, uint64_t msr_content)
>>       case MSR_LSTAR:
>>           if ( !is_canonical_address(msr_content) )
>>               goto uncanonical_address;
>> +
>
> Please avoid spurious changes like this.
>
apologies.
>>           WRITE_MSR(LSTAR);
>>           break;
>>
>> diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
>> index 447c650..fd5de44 100644
>> --- a/xen/arch/x86/x86_64/asm-offsets.c
>> +++ b/xen/arch/x86/x86_64/asm-offsets.c
>> @@ -115,6 +115,11 @@ void __dummy__(void)
>>       OFFSET(VCPU_nsvm_hap_enabled, struct vcpu, arch.hvm_vcpu.nvcpu.u.nsvm.ns_hap_enabled);
>>       BLANK();
>>
>> +    OFFSET(VCPU_stack, struct vcpu, stack);
>> +    OFFSET(VCPU_rsp, struct vcpu, rsp);
>> +    OFFSET(VCPU_user_mode, struct vcpu, user_mode);
>> +    BLANK();
>> +
>>       OFFSET(DOMAIN_is_32bit_pv, struct domain, arch.is_32bit_pv);
>>       BLANK();
>>
>> diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
>> index 74677a2..fa9155c 100644
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -102,6 +102,15 @@ restore_all_xen:
>>           RESTORE_ALL adj=8
>>           iretq
>>
>> +/* Returning from user mode */
>> +handle_hvm_user_mode:
>> +
>> +        call hvm_deprivileged_handle_user_mode
>> +
>> +        /* Go back into user mode */
>> +        cli
>> +        jmp  restore_all_guest
>> +
>>   /*
>>    * When entering SYSCALL from kernel mode:
>>    *  %rax                            = hypercall vector
>> @@ -131,6 +140,14 @@ ENTRY(lstar_enter)
>>           testb $TF_kernel_mode,VCPU_thread_flags(%rbx)
>>           jz    switch_to_kernel
>>
>> +        /* Were we in Xen's ring 3?  */
>> +        push %rbx
>> +        GET_CURRENT(%rbx)
>> +        movq VCPU_user_mode(%rbx), %rbx
>> +        cmp  $1, %rbx
>> +        je   handle_hvm_user_mode
>> +        pop  %rbx
>
> No need for the movq or rbx clobber.  This entire block can be:
>
> cmpb $1, VCPU_user_mode(%rbx)
> je handle_hvm_user_mode
>
> Similar to the $TF_kernel_mode check in context above.
>
got it. Thanks!
>
>
>> +
>>   /*hypercall:*/
>>           movq  %r10,%rcx
>>           cmpq  $NR_hypercalls,%rax
>> @@ -487,6 +504,13 @@ ENTRY(common_interrupt)
>>   /* No special register assumptions. */
>>   ENTRY(ret_from_intr)
>>           GET_CURRENT(%rbx)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        cmpq $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je    restore_all_guest
>> +        sti
>> +
>
> None of this should be necessary - the exception frame created by
> lstar_enter should cause ret_from_intr to do the correct thing.
>

I think this is needed: we have interrupts enabled, so we can take 
interrupts from paths other than lstar_enter. This check ensures that 
Xen doesn't treat our depriv mode as a PV guest, which previously led 
to random page faults, general protection faults, etc.

>>           testb $3,UREGS_cs(%rsp)
>>           jz    restore_all_xen
>>           movq  VCPU_domain(%rbx),%rax
>> @@ -509,6 +533,14 @@ handle_exception_saved:
>>           GET_CURRENT(%rbx)
>>           PERFC_INCR(exceptions, %rax, %rbx)
>>           callq *(%rdx,%rax,8)
>> +
>> +        /* If we are in Xen's user mode, return into it */
>> +        /* TODO: Test this path */
>> +        cmpq  $1,VCPU_user_mode(%rbx)
>> +        cli
>> +        je    restore_all_guest
>> +        sti
>> +
>>           testb $3,UREGS_cs(%rsp)
>>           jz    restore_all_xen
>>           leaq  VCPU_trap_bounce(%rbx),%rdx
>> @@ -664,6 +696,9 @@ handle_ist_exception:
>>           movl  $EVENT_CHECK_VECTOR,%edi
>>           call  send_IPI_self
>>   1:      movq  VCPU_domain(%rbx),%rax
>> +        /* This also handles Xen ring3 return for us.
>> +         * So, there is no need to explicitly do a user mode check.
>> +         */
>>           cmpb  $0,DOMAIN_is_32bit_pv(%rax)
>>           je    restore_all_guest
>>           jmp   compat_restore_all_guest
>> diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
>> index 3fbfa44..98e269e 100644
>> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
>> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
>> @@ -565,4 +565,6 @@ typedef struct {
>>       u16 eptp_index;
>>   } ve_info_t;
>>
>> +struct vmx_msr_state *get_host_msr_state(void);
>> +
>>   #endif /* __ASM_X86_HVM_VMX_VMX_H__ */
>> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
>> index 6cc803e..e42f39a 100644
>> --- a/xen/include/xen/hvm/deprivileged.h
>> +++ b/xen/include/xen/hvm/deprivileged.h
>> @@ -68,6 +68,37 @@ int hvm_deprivileged_copy_l1(struct domain *d,
>>                                unsigned int l1_flags);
>>
>>
>> +/* Used to prepare each vcpu's data for user mode. Call for each HVM vcpu. */
>> +int hvm_deprivileged_prepare_vcpu(struct vcpu *vcpu);
>> +
>> +/* Destroy each vcpu's data for Xen user mode. Again, call for each vcpu. */
>> +void hvm_deprivileged_destroy_vcpu(struct vcpu *vcpu);
>> +
>> +/* Called to perform a user mode operation. */
>> +void hvm_deprivileged_user_mode(void);
>> +
>> +/* Called when the user mode operation has completed */
>> +void hvm_deprivileged_finish_user_mode(void);
>> +
>> +/* Called to move into and then out of user mode. Needed for accessing
>> + * assembly features.
>> + */
>> +void hvm_deprivileged_user_mode_asm(void);
>> +
>> +/* Called on the return path to return to the correct execution point */
>> +void hvm_deprivileged_finish_user_mode_asm(void);
>> +
>> +/* Handle any syscalls that the user mode makes */
>> +void hvm_deprivileged_handle_user_mode(void);
>> +
>> +/* The ring 3 code */
>> +void hvm_deprivileged_ring3(void);
>> +
>> +/* Call when inside a trap that should cause a domain crash if in user mode
>> + * e.g. an invalid_op is trapped whilst in user mode.
>> + */
>> +void hvm_deprivileged_check_trap(const char* func_name);
>> +
>>   /* The segments where the user mode .text and .data are stored */
>>   extern unsigned long int __hvm_deprivileged_text_start;
>>   extern unsigned long int __hvm_deprivileged_text_end;
>> @@ -91,4 +122,11 @@ extern unsigned long int __hvm_deprivileged_data_end;
>>
>>   #define HVM_ERR_PG_ALLOC -1
>>
>> +/* The user mode stack pointer.
>> + * The stack grows down so set this to top of the stack region. Then,
>> + * as this is 0-indexed, move into the stack, not just after it.
>> + * Subtract 16 bytes for correct stack alignment.
>> + */
>> +#define HVM_STACK_PTR (HVM_DEPRIVILEGED_STACK_ADDR + STACK_SIZE - 16)
>> +
>>   #endif
>> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
>> index 73d3bc8..180643e 100644
>> --- a/xen/include/xen/sched.h
>> +++ b/xen/include/xen/sched.h
>> @@ -137,7 +137,7 @@ void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */
>>
>>   struct waitqueue_vcpu;
>>
>> -struct vcpu
>> +struct vcpu
>
> Trailing whitespace is nasty, but we avoid inflating the patch by
> dropping whitespace on lines not touched by semantic changes.
>
>>   {
>>       int              vcpu_id;
>>
>> @@ -158,6 +158,22 @@ struct vcpu
>>
>>       void            *sched_priv;    /* scheduler-specific data */
>>
>> +    /* HVM deprivileged mode state */
>> +    void *stack;             /* Location of stack to save data onto */
>> +    unsigned long rsp;       /* rsp of our stack to restore our data to */
>> +    unsigned long user_mode; /* Are we (possibly moving into) in user mode? */
>> +
>> +    /* The mstar of the processor that we are currently executing on.
>> +     *  we need to save this because Xen does lazy saving of these.
>> +     */
>> +    unsigned long int msr_lstar; /* lstar */
>> +    unsigned long int msr_star;
>
> There should be no need to store this like this.  Follow what the
> current context switching code does.
>
ok, I'll take a look.
> ~Andrew
>
>> +
>> +    /* Debug info */
>> +    unsigned long int old_rsp;
>> +    unsigned long int old_processor;
>> +    unsigned long int old_msr_lstar;
>> +    unsigned long int old_msr_star;
>>       struct vcpu_runstate_info runstate;
>>   #ifndef CONFIG_COMPAT
>>   # define runstate_guest(v) ((v)->runstate_guest)
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-07 12:51     ` Ben Catterall
@ 2015-08-07 13:08       ` David Vrabel
  2015-08-07 14:24       ` Andrew Cooper
  1 sibling, 0 replies; 53+ messages in thread
From: David Vrabel @ 2015-08-07 13:08 UTC (permalink / raw)
  To: Ben Catterall, Andrew Cooper, xen-devel
  Cc: george.dunlap, keir, tim, ian.campbell, jbeulich

On 07/08/15 13:51, Ben Catterall wrote:
> 
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv mode,
> which would mean this per-pcpu data is held in the stack copy, which is
> then migrated to another pcpu incorrectly.

IMO, DoS attacks on depriv'd emulators aren't very interesting.

I think it is counter-productive to address this attack in this initial
implementation at the expense (delays/complexity/etc.) of solving the
key requirement of mitigating information leaks and privilege escalation
attacks.

David

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-07  9:57     ` Ben Catterall
@ 2015-08-07 13:14       ` Andrew Cooper
  2015-08-10  8:50       ` Tim Deegan
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-07 13:14 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 07/08/15 10:57, Ben Catterall wrote:
> On 06/08/15 20:22, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> This allocation function is used by the deprivileged mode
>>> initialisation code
>>> to allocate pages for the new page table mappings and page frames on
>>> the HAP
>>> page heap.
>>>
>>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>> This is fine for your test box, but isn't fine for systems out there
>> without hardware EPT/NPT support.  For older systems like that (or in
>> certain specific workloads), shadow paging is used instead.
>>
>> This feature is applicable to any HVM domain, which means that it
>> shouldn't depend on HAP or shadow paging.
>>
>> How much memory is allocated for the depriv area, and what exactly is
>> allocated in total?
> So, per-vcpu:
> - a user mode stack which, from your comments in [RFC 2/4], can be
>   2 pages
> - local data (may or may not be needed, depends on the device) which
>   will be around a page or two.
>
> Text segment: as per your comments in RFC 2/4, this will be changed
> to be an alias, so no extra memory.

Plus pagetables for all of these.

The stack definitely doesn't need to be per-vcpu; per-pcpu is fine, and
will reduce the overhead.

I still don't see why local data would be needed between calls into
depriv code.  Small enough data can live on the stack, and allowing data
to persist across calls risks getting state out-of-sync with the device
model under emulation.

(That is not to say that local data isn't needed.  I just can't see a
viable use for it at this stage.)

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-07 12:32     ` Ben Catterall
@ 2015-08-07 13:19       ` Andrew Cooper
  2015-08-07 13:26         ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-07 13:19 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 07/08/15 13:32, Ben Catterall wrote:
>
>
> On 06/08/15 22:24, Andrew Cooper wrote:
>> On 06/08/2015 17:45, Ben Catterall wrote:
>>> Added trap handlers to catch exceptions such as a page fault, general
>>> protection fault, etc. These handlers will crash the domain as such
>>> exceptions
>>> would indicate that either there is a bug in deprivileged mode or it
>>> has been
>>> compromised by an attacker.
>>>
>>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>>> ---
>>>   xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>>>   xen/arch/x86/traps.c      | 41
>>> ++++++++++++++++++++++++++++++++++++++++-
>>>   2 files changed, 49 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>>> index abc5113..43bde89 100644
>>> --- a/xen/arch/x86/mm/hap/hap.c
>>> +++ b/xen/arch/x86/mm/hap/hap.c
>>> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v,
>>> unsigned long va,
>>>   {
>>>       struct domain *d = v->domain;
>>>   +    /* If we get a page fault whilst in HVM security user mode */
>>> +    if( v->user_mode == 1 )
>>> +    {
>>> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
>>> +                 d->domain_id, v->vcpu_id);
>> %pv is your friend.  Like Linux, we have custom printk formats.  In this
>> case, passing 'v' as a parameter to %pv will cause d$Xv$Y to be
>> printed.  (The example below predates %pv being introduced).
> ok, will do. thanks!
>>
>>> +        domain_crash_synchronous();
>> No need for _synchronous() here.  _synchronous() should only be used
>> when you can't safely recover.  It ends up spinning in a tight loop
>> waiting for the next timer interrupt, which is anything up to 30ms away.
> I'm not sure if we can safely recover from this. This will only be
> triggered if there is a bug in depriv mode or if the mode has been
> compromised and an attacker has tried to access unavailable memory.
> From my understanding (am I missing something?): domain_crash
> effectively sets flags to tell the scheduler that the domain should
> be killed the next time the scheduler runs, and then returns. In
> which case, if we don't do a synchronous crash, this return path
> would go back into deprivileged mode without the page mapped in (as
> it shouldn't be), and we would then take another fault.
>
> What do you think is the best way forward? Thanks!
>

Given that there is a use of domain_crash(d) in context below, it is
clearly safe to use from here.  (Although my general point about hap vs
shadow code still applies, meaning that hap_page_fault() is not the
correct function to hook like this.)

domain_crash() sets a flag, but exiting out from a fault handler heading
back towards ring3 code should check for pending softirqs.  However,
because of the way you have hooked return-to-depriv, you might have
broken this.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables
  2015-08-06 19:52   ` Andrew Cooper
@ 2015-08-07 13:19     ` Ben Catterall
  2015-08-07 15:20       ` Andrew Cooper
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-07 13:19 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich



On 06/08/15 20:52, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
>> The paging structure mappings for the deprivileged mode are added
>> to the monitor page table for HVM guests. The entries are generated by
>> walking the page tables and mapping in new pages. If a higher-level page table
>> mapping exists, we attempt to traverse all the way down to a leaf page and add
>> the page there. Access bits are flipped as needed.
>>
>> The page entries are generated for deprivileged .text, .data and a stack. The
>> data is copied from sections allocated by the linker. The mappings are setup in
>> an unused portion of the Xen virtual address space. The pages are mapped in as
>> user mode accessible, with NX bits set for the data and stack regions and the
>> code region is set to be executable and read-only.
>>
>> The needed pages are allocated on the HAP page heap and are deallocated when
>> those heap pages are deallocated (on domain destruction).
>>
>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>> ---
>>   xen/arch/x86/hvm/Makefile          |   1 +
>>   xen/arch/x86/hvm/deprivileged.c    | 441 +++++++++++++++++++++++++++++++++++++
>>   xen/arch/x86/mm/hap/hap.c          |  10 +-
>>   xen/arch/x86/xen.lds.S             |  19 ++
>>   xen/include/asm-x86/config.h       |  10 +-
>>   xen/include/xen/hvm/deprivileged.h |  94 ++++++++
>>   6 files changed, 573 insertions(+), 2 deletions(-)
>>   create mode 100644 xen/arch/x86/hvm/deprivileged.c
>>   create mode 100644 xen/include/xen/hvm/deprivileged.h
>>
>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>> index 794e793..bd83ba3 100644
>> --- a/xen/arch/x86/hvm/Makefile
>> +++ b/xen/arch/x86/hvm/Makefile
>> @@ -16,6 +16,7 @@ obj-y += pmtimer.o
>>   obj-y += quirks.o
>>   obj-y += rtc.o
>>   obj-y += save.o
>> +obj-y += deprivileged.o
>
> We attempt to keep objects linked in alphabetical order, where possible.
>
apologies
>>   obj-y += stdvga.o
>>   obj-y += vioapic.o
>>   obj-y += viridian.o
>> diff --git a/xen/arch/x86/hvm/deprivileged.c b/xen/arch/x86/hvm/deprivileged.c
>> new file mode 100644
>> index 0000000..071d900
>> --- /dev/null
>> +++ b/xen/arch/x86/hvm/deprivileged.c
>> @@ -0,0 +1,441 @@
>> +/*
>> + * HVM deprivileged mode to provide support for running operations in
>> + * user mode from Xen
>> + */
>> +#include <xen/lib.h>
>> +#include <xen/mm.h>
>> +#include <xen/domain_page.h>
>> +#include <xen/config.h>
>> +#include <xen/types.h>
>> +#include <xen/sched.h>
>> +#include <asm/paging.h>
>> +#include <xen/compiler.h>
>> +#include <asm/hap.h>
>> +#include <asm/paging.h>
>> +#include <asm-x86/page.h>
>> +#include <public/domctl.h>
>> +#include <xen/domain_page.h>
>> +#include <asm/hvm/vmx/vmx.h>
>> +#include <xen/hvm/deprivileged.h>
>> +
>> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
>> +{
>> +    int ret;
>> +    void *p;
>> +    unsigned long int size;
>> +
>> +    /* Copy the .text segment for deprivileged code */
>
> Why do you need to copy the deprivileged text section at all?  It is
> read only and completely common.  All you should need to do is make
> userspace mappings to it where it resides in the Xen linked area.
>
tbh: at the time, I was unsure whether this could open a hole: if Xen 
switched over to superpages and the deprivileged code was not aligned 
on a 4KB page boundary, mapping it in might prove problematic, mainly 
because we'd map in some of Xen's ring 0 code by accident, which could 
prove useful to an attacker who could then call it from userspace.

Though, now I've thought about it more, I can ensure in the linker that 
it's aligned properly and I think map_pages_to_xen() will do what I need.
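
Roughly something like this, I think (flags purely illustrative, and
assuming the linker script page-aligns the section):

    map_pages_to_xen(HVM_DEPRIVILEGED_TEXT_ADDR,
                     virt_to_mfn(&__hvm_deprivileged_text_start),
                     PFN_UP(size),
                     (__PAGE_HYPERVISOR & ~_PAGE_RW) | _PAGE_USER);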

> i.e. this is an operation which should be performed once at startup, not
> once per domain.  You will also reduce your memory overhead by doing this.
>
got it.
>> +    size = (unsigned long int)&__hvm_deprivileged_text_end -
>
> Don't bother qualifying unsigned long with a further "int".  A plain
> "unsigned long" is the expected declaration.
understood.
>
>> +           (unsigned long int)&__hvm_deprivileged_text_start;
>> +
>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>> +                              (unsigned long int)&__hvm_deprivileged_text_start,
>> +                              (unsigned long int)HVM_DEPRIVILEGED_TEXT_ADDR,
>> +                              size, 0 /* No write */);
>> +
>> +    if( ret )
>> +    {
>> +        printk("HVM: Error when initialising depriv .text. Code: %d", ret);
>> +        domain_crash(d);
>> +        return;
>> +    }
>> +
>> +    /* Copy the .data segment for ring3 code */
>> +    size = (unsigned long int)&__hvm_deprivileged_data_end -
>> +           (unsigned long int)&__hvm_deprivileged_data_start;
>> +
>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>> +                              (unsigned long int)&__hvm_deprivileged_data_start,
>> +                              (unsigned long int)HVM_DEPRIVILEGED_DATA_ADDR,
>> +                              size, _PAGE_NX | _PAGE_RW);
>
> What do you expect to end up in a data section like this?
>
It's for passing in configuration data at the start when we run the 
device (to reduce the number of system calls needed) and for local 
data which the emulated devices may need. It can also be used to hold 
the results of any system calls that the deprivileged code makes.
>> +
>> +    if( ret )
>> +    {
>> +        printk("HVM: Error when initialising depriv .data. Code: %d", ret);
>> +        domain_crash(d);
>> +        return;
>> +    }
>> +
>> +    /* Setup the stack mappings.
>
> Xen recommended style for multiline comments is
>
> /*
>   * Setup ...
>
apologies. I'll revise all of these.
>> +     * This is a bit of a hack: by allocating a blank area
>> +     * we can reuse the hvm_deprivileged_copy_pages code.
>> +     * The stacks are actually allocated in the per-vcpu initialisation
>> +     * routine.
>> +     */
>> +    size = STACK_SIZE;
>> +
>> +    p = alloc_xenheap_pages(STACK_ORDER, 0);
>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>> +                              (unsigned long int)p,
>> +                              (unsigned long int)HVM_DEPRIVILEGED_STACK_ADDR,
>> +                              size, _PAGE_NX | _PAGE_RW);
>> +
>> +    free_xenheap_pages(p, STACK_ORDER);
>
> The userspace stacks really don't need to be this large.  Xen's primary
> stack, which runs all of this code normally, is limited at 2 pages,
> which is 1/4 of STACK_SIZE.
>
got it. will do.
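e.g. (names illustrative):

    #define DEPRIV_STACK_ORDER 1                        /* 2 pages */
    #define DEPRIV_STACK_SIZE  (PAGE_SIZE << DEPRIV_STACK_ORDER)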
>> +
>> +    if( ret )
>> +    {
>> +        printk("HVM: Error when initialising depriv stack. Code: %d", ret);
>> +        domain_crash(d);
>> +        return;
>> +    }
>> +}
>> +
>> +void hvm_deprivileged_destroy(struct domain *d)
>> +{
>> +    struct page_info *pg = NULL;
>> +
>> +    /* Free up all the pages we allocated from the HAP HVM list */
>> +    while( (pg = page_list_remove_head(&d->arch.paging.hap.hvmlist)) )
>> +    {
>> +        d->arch.paging.hap.free_pages++;
>> +        page_list_add_tail(pg, &d->arch.paging.hap.freelist);
>> +    }
>
> See c/s 0174da5b7 for why loops like this are bad in general.  However,
> if you use the shadow pool, this should be able to disappear.
>
I'll switch pools as you suggested.
>> +}
>> +
>> +/* Create a copy of the data at the specified virtual address.
>> + * When writing to the source, it will walk the page table hierarchy, creating
>> + * new levels as needed.
>> + *
>> + * If we find a leaf entry in a page table (one which holds the
>> + * mfn of a 4KB, 2MB, etc. page frame) which has already been
>> + * mapped in, then we bail as we have a collision and this likely
>> + * means a bug or the memory configuration has been changed.
>> + *
>> + * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT set by
>> + * default. The extra l1_flags are used for extra control e.g. PAGE_RW.
>> + * The PAGE_RW flag will be enabled for all page structure entries
>> + * above the leaf page if that leaf page has PAGE_RW set. This is needed to
>> + * permit the writes on the leaf pages. See the Intel manual 3A section 4.6.
>> + *
>> + * TODO: We proceed down to L1 4KB pages and then map these in. We should
>> + * stop the recursion on L3/L2 for a 1GB or 2MB page which would mean faster
>> + * page access. When we stop would depend on size (e.g. use 2MB pages for a
>> + * few MBs)
>> + */
>> +int hvm_deprivileged_copy_pages(struct domain *d,
>> +                            l4_pgentry_t *l4t_base,
>> +                            unsigned long int src_start,
>> +                            unsigned long int dst_start,
>> +                            unsigned long int size,
>> +                            unsigned int l1_flags)
>
> <large snip>
>
> Does map_pages_to_xen() not do everything you need here?  It certainly
> should be fine for the .text section, and will keep the idle and monitor
> tables up to date for you.
>
As it updates the idle tables, I haven't used it. From my 
understanding, those idle tables are also used by PV guests: when a 
guest is created, the idle table is copied from FIRST_XEN_SLOT up to 
the end of the table. In config.h, the region where I'm mapping in my 
new mode has been allocated for PV guest-defined usage. I can edit 
init_guest_l4_table so that it doesn't copy this last slot across, but 
I don't know if this would have further repercussions.

What would you suggest? Thanks!

> Either way, it would be useful to start early on with a description of
> what the new memory layout is expected to be, and which bits are global,
> which are per pcpu and which are per vcpu.
>
Ok, I'll add this to the comments. Briefly:

The .text mappings are global for all HVM vcpus.
The data mappings need to be per vcpu as each vcpu depriv mode has its 
own local data and configuration.
The stack mappings need to be per vcpu for the same reason.

These are mapped into 0xffffff8000000000, which is currently unused for 
HVM guests.

I'll revise the global page bits as, after further consideration (and 
our discussion clarifying them), they are no longer needed.

>>
>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>> index 593cba7..abc5113 100644
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -40,7 +40,8 @@
>>   #include <asm/domain.h>
>>   #include <xen/numa.h>
>>   #include <asm/hvm/nestedhvm.h>
>> -
>> +#include <xen/hvm/deprivileged.h>
>> +#include <asm/hvm/vmx/vmx.h>
>
> This looks like a spurious addition.
>
apologies.
>>   #include "private.h"
>>
>>   /* Override macros from asm/page.h to make them work with mfn_t */
>> @@ -401,6 +402,9 @@ static void hap_install_xen_entries_in_l4(struct vcpu *v, mfn_t l4mfn)
>>              &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
>>              ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));
>>
>> +    /* Initialise the HVM deprivileged mode feature */
>> +    hvm_deprivileged_init(d, l4e);
>> +
>>       /* Install the per-domain mappings for this domain */
>>       l4e[l4_table_offset(PERDOMAIN_VIRT_START)] =
>>           l4e_from_pfn(mfn_x(page_to_mfn(d->arch.perdomain_l3_pg)),
>> @@ -439,6 +443,10 @@ static void hap_destroy_monitor_table(struct vcpu* v, mfn_t mmfn)
>>
>>       /* Put the memory back in the pool */
>>       hap_free(d, mmfn);
>> +
>> +    /* Destroy the HVM tables */
>> +    ASSERT(paging_locked_by_me(d));
>> +    hvm_deprivileged_destroy(d);
>>   }
>>
>>   /************************************************/
>> diff --git a/xen/arch/x86/xen.lds.S b/xen/arch/x86/xen.lds.S
>> index 6553cff..0bfe0cf 100644
>> --- a/xen/arch/x86/xen.lds.S
>> +++ b/xen/arch/x86/xen.lds.S
>> @@ -50,6 +50,25 @@ SECTIONS
>>          _etext = .;             /* End of text section */
>>     } :text = 0x9090
>>
>> +  /* HVM deprivileged mode segments
>> +   * Used to map the ring3 static data and .text
>> +   */
>> +
>> +  . = ALIGN(PAGE_SIZE);
>> +  .hvm_deprivileged_text : {
>> +      __hvm_deprivileged_text_start = . ;
>> +      *(.hvm_deprivileged_enhancement.text)
>> +      __hvm_deprivileged_text_end = . ;
>> +  } : text
>> +
>> +  . = ALIGN(PAGE_SIZE);
>> +  .hvm_deprivileged_data : {
>> +      __hvm_deprivileged_data_start = . ;
>> +      *(.hvm_deprivileged_enhancement.data)
>> +      __hvm_deprivileged_data_end = . ;
>> +  } : text
>> +
>> +  . = ALIGN(PAGE_SIZE);
>>     .rodata : {
>>          /* Bug frames table */
>>          . = ALIGN(4);
>> diff --git a/xen/include/asm-x86/config.h b/xen/include/asm-x86/config.h
>> index 3e9be83..302399d 100644
>> --- a/xen/include/asm-x86/config.h
>> +++ b/xen/include/asm-x86/config.h
>> @@ -186,7 +186,7 @@ extern unsigned char boot_edid_info[128];
>>    *  0xffff880000000000 - 0xffffff7fffffffff [119.5TB,           PML4:272-510]
>>    *    HVM/idle: continuation of 1:1 mapping
>>    *  0xffffff8000000000 - 0xffffffffffffffff [512GB, 2^39 bytes  PML4:511]
>> - *    HVM/idle: unused
>> + *    HVM: HVM deprivileged mode
>>    *
>>    * Compatibility guest area layout:
>>    *  0x0000000000000000 - 0x00000000f57fffff [3928MB,            PML4:0]
>> @@ -280,6 +280,14 @@ extern unsigned char boot_edid_info[128];
>>   #endif
>>   #define DIRECTMAP_VIRT_END      (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE)
>>
>> +/* Slot 511: HVM deprivileged mode
>> + * The virtual addresses where the .text, .data and stack should be
>> + * placed.
>> + */
>> +#define HVM_DEPRIVILEGED_TEXT_ADDR  (PML4_ADDR(511))
>> +#define HVM_DEPRIVILEGED_DATA_ADDR  (PML4_ADDR(511) + 0xa000000)
>> +#define HVM_DEPRIVILEGED_STACK_ADDR (PML4_ADDR(511) + 0xc000000)
>> +
>>   #ifndef __ASSEMBLY__
>>
>>   /* This is not a fixed value, just a lower limit. */
>> diff --git a/xen/include/xen/hvm/deprivileged.h b/xen/include/xen/hvm/deprivileged.h
>> new file mode 100644
>> index 0000000..6cc803e
>> --- /dev/null
>> +++ b/xen/include/xen/hvm/deprivileged.h
>> @@ -0,0 +1,94 @@
>> +#ifndef __X86_HVM_DEPRIVILEGED
>> +#define __X86_HVM_DEPRIVILEGED
>> +
>> +/* This is also included in the HVM deprivileged mode .S file */
>> +#ifndef __ASSEMBLY__
>> +
>> +#include <asm/page.h>
>> +#include <xen/lib.h>
>> +#include <xen/mm.h>
>> +#include <xen/domain_page.h>
>> +#include <xen/config.h>
>> +#include <xen/types.h>
>> +#include <xen/sched.h>
>> +#include <asm/paging.h>
>> +#include <asm/hap.h>
>> +#include <asm/paging.h>
>> +#include <asm-x86/page.h>
>> +#include <public/domctl.h>
>> +#include <xen/domain_page.h>
>> +
>> +/* Initialise the HVM deprivileged mode. This just sets up the general
>> + * page mappings for .text and .data. It does not prepare each HVM vcpu's data
>> + * or stack which needs to be done separately using
>> + * hvm_deprivileged_prepare_vcpu.
>> + */
>> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e);
>> +
>> +/* Free up the data used by the HVM deprivileged enhancements.
>> + * This frees general page mappings. It does not destroy the per-vcpu
>> + * data so hvm_deprivileged_destroy_vcpu also needs to be called for each vcpu.
>> + * This method should be called after those per-vcpu destruction routines.
>> + */
>> +void hvm_deprivileged_destroy(struct domain *d);
>> +
>> +/* Use create a copy of the data at the specified virtual address.
>> + * When writing to the source, it will walk the page table hierarchy, creating
>> + * new levels as needed, and then copy the data across.
>> + */
>> +int hvm_deprivileged_copy_pages(struct domain *d,
>> +                                l4_pgentry_t *l4e_base,
>> +                                unsigned long int src_start,
>> +                                unsigned long int dst_start,
>> +                                unsigned long int size,
>> +                                unsigned int l1_flags);
>> +
>> +int hvm_deprivileged_copy_l3(struct domain *d,
>> +                             l3_pgentry_t *l1t_base,
>> +                             unsigned long int src_start,
>> +                             unsigned long int dst_start,
>> +                             unsigned long int size,
>> +                             unsigned int l1_flags);
>> +
>> +int hvm_deprivileged_copy_l2(struct domain *d,
>> +                             l2_pgentry_t *l1t_base,
>> +                             unsigned long int src_start,
>> +                             unsigned long int dst_start,
>> +                             unsigned long int size,
>> +                             unsigned int l1_flags);
>> +
>> +/* The left case of the copy. Will allocate the pages and actually copy the
>> + * data.
>> + */
>> +int hvm_deprivileged_copy_l1(struct domain *d,
>> +                             l1_pgentry_t *l1t_base,
>> +                             unsigned long int src_start,
>> +                             unsigned long int dst_start,
>> +                             unsigned long int size,
>> +                             unsigned int l1_flags);
>> +
>> +
>> +/* The segments where the user mode .text and .data are stored */
>> +extern unsigned long int __hvm_deprivileged_text_start;
>> +extern unsigned long int __hvm_deprivileged_text_end;
>> +extern unsigned long int __hvm_deprivileged_data_start;
>> +extern unsigned long int __hvm_deprivileged_data_end;
>
> extern unsigned long __hvm_$foo[];
>
> Will be a more useful type to work with, and harder to accidentally misuse.
>
thanks, will do!
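i.e.:

    extern unsigned long __hvm_deprivileged_text_start[];
    extern unsigned long __hvm_deprivileged_text_end[];

    size = (unsigned long)__hvm_deprivileged_text_end -
           (unsigned long)__hvm_deprivileged_text_start;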
>> +
>> +#endif
>> +
>> +/* The sizes of the pages */
>> +#define HVM_1GB_PAGESIZE 0x40000000ul
>> +#define HVM_2MB_PAGESIZE 0x00200000ul
>> +#define HVM_4KB_PAGESIZE 0x00001000ul
>> +
>> +/* The size in bytes that a single L(1,2,3,4} entry covers.
>> + * There are 512 (left shift by 9) entries in each page-structure.
>> + */
>> +#define HVM_L4_PAGE_RANGE (HVM_1GB_PAGESIZE << 9)
>> +#define HVM_L3_PAGE_RANGE (HVM_2MB_PAGESIZE << 9)
>> +#define HVM_L2_PAGE_RANGE (HVM_4KB_PAGESIZE << 9)
>> +#define HVM_L1_PAGE_RANGE (HVM_4KB_PAGESIZE     )
>
> Values like this should be straight from page.h rather than redefining
> here.  if page.h is missing some, that is the correct place to add them.
>
got it.
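e.g., expressing these in terms of the existing definitions (sketch):

    #define HVM_L4_PAGE_RANGE (1UL << L4_PAGETABLE_SHIFT) /* 512GB */
    #define HVM_L3_PAGE_RANGE (1UL << L3_PAGETABLE_SHIFT) /*   1GB */
    #define HVM_L2_PAGE_RANGE (1UL << L2_PAGETABLE_SHIFT) /*   2MB */
    #define HVM_L1_PAGE_RANGE (1UL << L1_PAGETABLE_SHIFT) /*   4KB */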
> ~Andrew
>
>> +
>> +#define HVM_ERR_PG_ALLOC -1
>> +
>> +#endif
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-07 13:19       ` Andrew Cooper
@ 2015-08-07 13:26         ` Ben Catterall
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-07 13:26 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich



On 07/08/15 14:19, Andrew Cooper wrote:
> On 07/08/15 13:32, Ben Catterall wrote:
>>
>>
>> On 06/08/15 22:24, Andrew Cooper wrote:
>>> On 06/08/2015 17:45, Ben Catterall wrote:
>>>> Added trap handlers to catch exceptions such as a page fault, general
>>>> protection fault, etc. These handlers will crash the domain as such
>>>> exceptions
>>>> would indicate that either there is a bug in deprivileged mode or it
>>>> has been
>>>> compromised by an attacker.
>>>>
>>>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>>>> ---
>>>>    xen/arch/x86/mm/hap/hap.c |  9 +++++++++
>>>>    xen/arch/x86/traps.c      | 41
>>>> ++++++++++++++++++++++++++++++++++++++++-
>>>>    2 files changed, 49 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
>>>> index abc5113..43bde89 100644
>>>> --- a/xen/arch/x86/mm/hap/hap.c
>>>> +++ b/xen/arch/x86/mm/hap/hap.c
>>>> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v,
>>>> unsigned long va,
>>>>    {
>>>>        struct domain *d = v->domain;
>>>>    +    /* If we get a page fault whilst in HVM security user mode */
>>>> +    if( v->user_mode == 1 )
>>>> +    {
>>>> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
>>>> +                 d->domain_id, v->vcpu_id);
>>> %pv is your friend.  Like Linux, we have custom printk formats.  In this
>>> case, passing 'v' as a parameter to %pv will cause d$Xv$Y to be
>>> printed.  (The example below predates %pv being introduced).
>> ok, will do. thanks!
>>>
>>>> +        domain_crash_synchronous();
>>> No need for _synchronous() here.  _synchronous() should only be used
>>> when you can't safely recover.  It ends up spinning in a tight loop
>>> waiting for the next timer interrupt, which is anything up to 30ms away.
>> I'm not sure if we can safely recover from this. This will only be
>> triggered if there is a bug in depriv mode or if the mode has been
>> compromised and an attacker has tried to access unavailable memory.
>> From my understanding (am I missing something?): domain_crash
>> effectively sets flags to tell the scheduler that the domain should
>> be killed the next time the scheduler runs, and then returns. In
>> which case, if we don't do a synchronous crash, this return path
>> would go back into deprivileged mode without the page mapped in (as
>> it shouldn't be), and we would then take another fault.
>>
>> What do you think is the best way forward? Thanks!
>>
>
> Given that there is a use of domain_crash(d) in context below, it is
> clearly safe to use from here.  (Although my general point about hap vs
> shadow code still applies, meaning that hap_page_fault() is not the
> correct function to hook like this.)
>
> domain_crash() sets a flag, but exiting out from a fault handler heading
> back towards ring3 code should check for pending softirqs.  However,
> because of the way you have hooked return-to-depriv, you might have
> broken this.
>
Understood, I'll move the handler, make this change and examine the 
return path. Thanks!
> ~Andrew
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-07 12:51     ` Ben Catterall
  2015-08-07 13:08       ` David Vrabel
@ 2015-08-07 14:24       ` Andrew Cooper
  1 sibling, 0 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-07 14:24 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 07/08/15 13:51, Ben Catterall wrote:
> On 06/08/15 21:55, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> The process to switch into and out of deprivileged mode can be
>>> likened to
>>> setjmp/longjmp.
>>>
>>> To enter deprivileged mode, we take a copy of the stack from the
>>> guest's
>>> registers up to the current stack pointer. This allows us to restore
>>> the stack
>>> when we have finished the deprivileged mode operation, meaning we
>>> can continue
>>> execution from that point. This is similar to if a context switch
>>> had happened.
>>>
>>> To exit deprivileged mode, we copy the stack back, replacing the
>>> current stack.
>>> We can then continue execution from where we left off, which will
>>> unwind the
>>> stack and free up resources. This method means that we do not need to
>>> change any other code paths and its invocation will be transparent
>>> to callers.
>>> This should allow the feature to be more easily deployed to
>>> different parts
>>> of Xen.
>>>
>>> Note that this copy of the stack is per-vcpu but, it will contain
>>> per-pcpu data.
>>> Extra work is needed to properly migrate vcpus between pcpus.
>>
>> Under what circumstances do you see there being persistent state in the
>> depriv area between calls, given that the calls are synchronous from VM
>> actions?
>
> I don't know if we can make these synchronous as we need a way to
> interrupt the vcpu if it's spinning for a long time. Otherwise an
> attacker could just spin in depriv and cause a DoS. With that in mind,
> the scheduler may decide to migrate the vcpu whilst it's in depriv
> mode, which would mean this per-pcpu data is held in the stack copy,
> which is then migrated to another pcpu incorrectly.

If the emulator spins for a sufficient time, it is fine to shoot the
domain.  This is a strict improvement on the current behaviour where a
spinning emulator would shoot the host, via a watchdog timeout.

As said elsewhere, this kind of DoS is not a very interesting attack
vector.  State handling errors which cause Xen to change the wrong thing
are far more interesting from a guest's point of view.

http://xenbits.xen.org/xsa/advisory-123.html (full host compromise) or
http://xenbits.xen.org/xsa/advisory-108.html (read other guests data)
are examples of kinds of interesting issues which could potentially be
mitigated with this depriv infrastructure.

>
>>
>>>
>>> The switch to and from deprivileged mode is performed using sysret
>>> and syscall
>>> respectively.
>>
>> I suspect we need to borrow the SS attribute workaround from Linux to
>> make this function reliably on AMD systems.
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=61f01dd941ba9e06d2bf05994450ecc3d61b6b8b
>>
>>
> Ah! ok, I'll look into this. Thanks!

Just be aware of it.  Don't spend your time attempting to retrofit it to
Xen.  It is more work than it looks.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables
  2015-08-07 13:19     ` Ben Catterall
@ 2015-08-07 15:20       ` Andrew Cooper
  0 siblings, 0 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-07 15:20 UTC (permalink / raw)
  To: Ben Catterall, xen-devel; +Cc: george.dunlap, tim, keir, ian.campbell, jbeulich

On 07/08/15 14:19, Ben Catterall wrote:
> On 06/08/15 20:52, Andrew Cooper wrote:
>> On 06/08/15 17:45, Ben Catterall wrote:
>>> The paging structure mappings for the deprivileged mode are added
>>> to the monitor page table for HVM guests. The entries are generated by
>>> walking the page tables and mapping in new pages. If a higher-level
>>> page table
>>> mapping exists, we attempt to traverse all the way down to a leaf
>>> page and add
>>> the page there. Access bits are flipped as needed.
>>>
>>> The page entries are generated for deprivileged .text, .data and a
>>> stack. The
>>> data is copied from sections allocated by the linker. The mappings
>>> are setup in
>>> an unused portion of the Xen virtual address space. The pages are
>>> mapped in as
>>> user mode accessible, with NX bits set for the data and stack
>>> regions and the
>>> code region is set to be executable and read-only.
>>>
>>> The needed pages are allocated on the HAP page heap and are
>>> deallocated when
>>> those heap pages are deallocated (on domain destruction).
>>>
>>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>>> ---
>>>   xen/arch/x86/hvm/Makefile          |   1 +
>>>   xen/arch/x86/hvm/deprivileged.c    | 441
>>> +++++++++++++++++++++++++++++++++++++
>>>   xen/arch/x86/mm/hap/hap.c          |  10 +-
>>>   xen/arch/x86/xen.lds.S             |  19 ++
>>>   xen/include/asm-x86/config.h       |  10 +-
>>>   xen/include/xen/hvm/deprivileged.h |  94 ++++++++
>>>   6 files changed, 573 insertions(+), 2 deletions(-)
>>>   create mode 100644 xen/arch/x86/hvm/deprivileged.c
>>>   create mode 100644 xen/include/xen/hvm/deprivileged.h
>>>
>>> diff --git a/xen/arch/x86/hvm/Makefile b/xen/arch/x86/hvm/Makefile
>>> index 794e793..bd83ba3 100644
>>> --- a/xen/arch/x86/hvm/Makefile
>>> +++ b/xen/arch/x86/hvm/Makefile
>>> @@ -16,6 +16,7 @@ obj-y += pmtimer.o
>>>   obj-y += quirks.o
>>>   obj-y += rtc.o
>>>   obj-y += save.o
>>> +obj-y += deprivileged.o
>>
>> We attempt to keep objects linked in alphabetical order, where possible.
>>
> apologies
>>>   obj-y += stdvga.o
>>>   obj-y += vioapic.o
>>>   obj-y += viridian.o
>>> diff --git a/xen/arch/x86/hvm/deprivileged.c
>>> b/xen/arch/x86/hvm/deprivileged.c
>>> new file mode 100644
>>> index 0000000..071d900
>>> --- /dev/null
>>> +++ b/xen/arch/x86/hvm/deprivileged.c
>>> @@ -0,0 +1,441 @@
>>> +/*
>>> + * HVM deprivileged mode to provide support for running operations in
>>> + * user mode from Xen
>>> + */
>>> +#include <xen/lib.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/config.h>
>>> +#include <xen/types.h>
>>> +#include <xen/sched.h>
>>> +#include <asm/paging.h>
>>> +#include <xen/compiler.h>
>>> +#include <asm/hap.h>
>>> +#include <asm/paging.h>
>>> +#include <asm-x86/page.h>
>>> +#include <public/domctl.h>
>>> +#include <xen/domain_page.h>
>>> +#include <asm/hvm/vmx/vmx.h>
>>> +#include <xen/hvm/deprivileged.h>
>>> +
>>> +void hvm_deprivileged_init(struct domain *d, l4_pgentry_t *l4e)
>>> +{
>>> +    int ret;
>>> +    void *p;
>>> +    unsigned long int size;
>>> +
>>> +    /* Copy the .text segment for deprivileged code */
>>
>> Why do you need to copy the deprivileged text section at all?  It is
>> read only and completely common.  All you should need to do is make
>> userspace mappings to it where it resides in the Xen linked area.
>>
> tbh: At the time, I was unsure whether it could open a hole: if Xen
> switched over to superpages and the deprivileged code was not aligned
> on a 4KB page boundary, mapping it in might prove problematic, mainly
> because we'd map in some of Xen's ring 0 code by accident, which could
> prove useful to an attacker, as they could then call it from userspace.

The monitor tables are only used in the context of HVM guests.  PV
guests run in their own pagetables (including a subset of the idle
tables), but have no chance of accidentally gaining these mappings. (A
64bit PV kernel will get legitimately irate if part of it was replaced
with some hvm depriv code.)

Creating user mappings of Xen's .text section does allow a PV guest to
execute the code, but only at the current privilege of the PV guest. 
Therefore, while a bug, it doesn't present an increased attack surface. 
(It might be interesting as a side channel to analyse for ROP purposes,
but not as a direct attack surface.)

>
> Though, now I've thought about it more, I can ensure in the linker
> that it's aligned properly, and I think map_pages_to_xen() will do
> what I need.

Playing with alignment is easy.
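
For instance, in xen.lds.S (sketch only; the start/end symbols are the
ones the patch already defines, the input-section name is made up):

    . = ALIGN(PAGE_SIZE);
    __hvm_deprivileged_text_start = .;
    *(.hvm_deprivileged.text)
    . = ALIGN(PAGE_SIZE);
    __hvm_deprivileged_text_end = .;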

>
>>
>>> +           (unsigned long int)&__hvm_deprivileged_text_start;
>>> +
>>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>>> +                              (unsigned long int)&__hvm_deprivileged_text_start,
>>> +                              (unsigned long int)HVM_DEPRIVILEGED_TEXT_ADDR,
>>> +                              size, 0 /* No write */);
>>> +
>>> +    if( ret )
>>> +    {
>>> +        printk("HVM: Error when initialising depriv .text. Code: %d", ret);
>>> +        domain_crash(d);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Copy the .data segment for ring3 code */
>>> +    size = (unsigned long int)&__hvm_deprivileged_data_end -
>>> +           (unsigned long int)&__hvm_deprivileged_data_start;
>>> +
>>> +    ret = hvm_deprivileged_copy_pages(d, l4e,
>>> +                              (unsigned long int)&__hvm_deprivileged_data_start,
>>> +                              (unsigned long int)HVM_DEPRIVILEGED_DATA_ADDR,
>>> +                              size, _PAGE_NX | _PAGE_RW);
>>
>> What do you expect to end up in a data section like this?
>>
> It's for passing in configuration data at the start when we run the
> device (reduce the number of system calls needed) and for local data
> which the emulated devices may need. It can also be used to put the
> results of any system calls that the deprivileged code may call.

You can achieve the same thing by making the "stacks" larger and
segmenting them.  The API could pass a pointer to the data in one of the
registers, and it just happens to be sheer coincidence that the "stack"
and "data area" happen to be adjacent.

>
>>> +}
>>> +
>>> +/* Create a copy of the data at the specified virtual address.
>>> + * When writing to the source, it will walk the page table hierarchy,
>>> + * creating new levels as needed.
>>> + *
>>> + * If we find a leaf entry in a page table (one which holds the
>>> + * mfn of a 4KB, 2MB, etc. page frame) which has already been
>>> + * mapped in, then we bail as we have a collision and this likely
>>> + * means a bug or the memory configuration has been changed.
>>> + *
>>> + * Pages have PAGE_USER, PAGE_GLOBAL (if supported) and PAGE_PRESENT
>>> + * set by default. The extra l1_flags are used for extra control,
>>> + * e.g. PAGE_RW. The PAGE_RW flag will be enabled for all page
>>> + * structure entries above the leaf page if that leaf page has
>>> + * PAGE_RW set. This is needed to permit the writes on the leaf
>>> + * pages. See the Intel manual 3A section 4.6.
>>> + *
>>> + * TODO: We proceed down to L1 4KB pages and then map these in. We
>>> + * should stop the recursion on L3/L2 for a 1GB or 2MB page, which
>>> + * would mean faster page access. When we stop would depend on size
>>> + * (e.g. use 2MB pages for a few MBs).
>>> + */
>>> +int hvm_deprivileged_copy_pages(struct domain *d,
>>> +                            l4_pgentry_t *l4t_base,
>>> +                            unsigned long int src_start,
>>> +                            unsigned long int dst_start,
>>> +                            unsigned long int size,
>>> +                            unsigned int l1_flags)
>>
>> <large snip>
>>
>> Does map_pages_to_xen() not do everything you need here?  It certainly
>> should be fine for the .text section, and will keep the idle and monitor
>> tables up to date for you.
>>
> I haven't used it because it updates the idle tables. From my
> understanding, those idle tables are also used by PV guests: when a
> guest is created, the idle table is copied from FIRST_XEN_SLOT up to
> the end of the table. In config.h, the region where I'm mapping in my
> new mode has been allocated for PV guest-defined usage. I could edit
> init_guest_l4_table so that it doesn't copy this last slot across,
> but I don't know if that would have further repercussions.
>
> What would you suggest? Thanks!

The idle pagetables are Xen's primary set of pagetables.  Globally common
stuff such as the depriv .text should reside in them.

PV guest pagetables are constructed by memcpy()ing certain slots from
the idle pagetables; the vast majority of the virtual address space is
owned by the guest, and so is untouched by Xen.
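
For reference, the slot copy in init_guest_l4_table() looks roughly
like this (quoted from memory, so double-check the exact constants):

    memcpy(&l4tab[ROOT_PAGETABLE_FIRST_XEN_SLOT],
           &idle_pg_table[ROOT_PAGETABLE_FIRST_XEN_SLOT],
           ROOT_PAGETABLE_XEN_SLOTS * sizeof(l4_pgentry_t));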

In this case, your new depriv extensions live in an area which is unused
for HVM monitor tables, but would belong to a PV guest kernel.


However, map_pages_to_xen() isn't quite appropriate for making a set of
Xen pagetables for mappings to be hooked into monitor tables, so you are
going to have to keep the current allocation functions.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-07  9:57     ` Ben Catterall
  2015-08-07 13:14       ` Andrew Cooper
@ 2015-08-10  8:50       ` Tim Deegan
  2015-08-10  8:52         ` Tim Deegan
  1 sibling, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2015-08-10  8:50 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper, jbeulich

Hi,

At 10:57 +0100 on 07 Aug (1438945038), Ben Catterall wrote:
> On 06/08/15 20:22, Andrew Cooper wrote:
> > On 06/08/15 17:45, Ben Catterall wrote:
> >> This allocation function is used by the deprivileged mode initialisation code
> >> to allocate pages for the new page table mappings and page frames on the HAP
> >> page heap.
> >>
> >> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> > This is fine for your test box, but isn't fine for systems out there
> > without hardware EPT/NPT support.  For older systems like that (or in
> > certain specific workloads), shadow paging is used instead.
> >
> > This feature is applicable to any HVM domain, which means that it
> > shouldn't depend on HAP or shadow paging.
> >
> > How much memory is allocated for the depriv area, and what exactly is
> > allocated in total?
> So, per-vcpu:
> - a user mode stack which, from your comments in [RFC 2/4], can be 2 pages
> - local data (may or may not be needed, depends on the device) which
>   will be around a page or two.
> 
> Text segment: as per your comments in RFC 2/4, this will be changed to
> be an alias, so no extra memory.
> > I expect it isn't very much, and would suggest using
> > d->arch.paging.alloc_page() instead (which is the generic "get me some
> > memory accounted against the domain" helper) which looks as if it should
> > suffice.

While I agree that it would be good to account this to the domain,
paging->alloc_page() is an internal _paging assistance_ helper. :)
This new allocation is nothing to do with mm/paging-assistance, so
either it should find its own memory or the hap/shadow pool needs to
be made more generic.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-10  8:50       ` Tim Deegan
@ 2015-08-10  8:52         ` Tim Deegan
  2015-08-10  8:55           ` Andrew Cooper
  0 siblings, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2015-08-10  8:52 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper, jbeulich

At 09:50 +0100 on 10 Aug (1439200241), Tim Deegan wrote:
> Hi,
> 
> At 10:57 +0100 on 07 Aug (1438945038), Ben Catterall wrote:
> > On 06/08/15 20:22, Andrew Cooper wrote:
> > > On 06/08/15 17:45, Ben Catterall wrote:
> > >> This allocation function is used by the deprivileged mode initialisation code
> > >> to allocate pages for the new page table mappings and page frames on the HAP
> > >> page heap.
> > >>
> > >> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
> > > This is fine for your test box, but isn't fine for systems out there
> > > without hardware EPT/NPT support.  For older systems like that (or in
> > > certain specific workloads), shadow paging is used instead.
> > >
> > > This feature is applicable to any HVM domain, which means that it
> > > shouldn't depend on HAP or shadow paging.
> > >
> > > How much memory is allocated for the depriv area, and what exactly is
> > > allocated in total?
> > So, per-vcpu:
> > - a user mode stack which, from your comments in [RFC 2/4], can be 2 pages
> > - local data (may or may not be needed, depends on the device) which
> >   will be around a page or two.
> > 
> > Text segment: as per your comments in RFC 2/4, this will be changed to
> > be an alias, so no extra memory.
> > > I expect it isn't very much, and would suggest using
> > > d->arch.paging.alloc_page() instead (which is the generic "get me some
> > > memory accounted against the domain" helper) which looks as if it should
> > > suffice.
> 
> While I agree that it would be good to account this to the domain,
> paging->alloc_page() is an internal _paging assistance_ helper. :)
> This new allocation is nothing to do with mm/paging-assistance, so
> either it should find its own memory or the hap/shadow pool needs to
> be made more generic.

...at which point other HVM overheads - VMCx pages, bitmaps &c - could
be allocated from it as well.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-10  8:52         ` Tim Deegan
@ 2015-08-10  8:55           ` Andrew Cooper
  2015-08-10 10:08             ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-10  8:55 UTC (permalink / raw)
  To: Tim Deegan, Ben Catterall
  Cc: george.dunlap, xen-devel, keir, ian.campbell, jbeulich

On 10/08/2015 09:52, Tim Deegan wrote:
> At 09:50 +0100 on 10 Aug (1439200241), Tim Deegan wrote:
>> Hi,
>>
>> At 10:57 +0100 on 07 Aug (1438945038), Ben Catterall wrote:
>>> On 06/08/15 20:22, Andrew Cooper wrote:
>>>> On 06/08/15 17:45, Ben Catterall wrote:
>>>>> This allocation function is used by the deprivileged mode initialisation code
>>>>> to allocate pages for the new page table mappings and page frames on the HAP
>>>>> page heap.
>>>>>
>>>>> Signed-off-by: Ben Catterall <Ben.Catterall@citrix.com>
>>>> This is fine for your test box, but isn't fine for systems out there
>>>> without hardware EPT/NPT support.  For older systems like that (or in
>>>> certain specific workloads), shadow paging is used instead.
>>>>
>>>> This feature is applicable to any HVM domain, which means that it
>>>> shouldn't depend on HAP or shadow paging.
>>>>
>>>> How much memory is allocated for the depriv area, and what exactly is
>>>> allocated in total?
>>> So, per-vcpu:
>>> - a user mode stack which, from your comments in [RFC 2/4], can be 2 pages
>>> - local data (may or may not be needed, depends on the device) which
>>>   will be around a page or two.
>>>
>>> Text segment: as per your comments in RFC 2/4, this will be changed to
>>> be an alias, so no extra memory.
>>>> I expect it isn't very much, and would suggest using
>>>> d->arch.paging.alloc_page() instead (which is the generic "get me some
>>>> memory accounted against the domain" helper) which looks as if it should
>>>> suffice.
>> While I agree that it would be good to account this to the domain,
>> paging->alloc_page() is an internal _paging assistance_ helper. :)
>> This new allocation is nothing to do with mm/paging-assistance, so
>> either it should find its own memory or the hap/shadow pool needs to
>> be made more generic.
> ...at which point other HVM overheads - VMCx pages, bitmaps &c - could
> be allocated from it as well.

I agree very much in principle, but I believe other threads have
settled on all allocations being global, or per-pcpu, which means no
per-domain allocation.

(Not that we shouldn't have a general per-domain pool longterm)

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-06 16:45 ` [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
  2015-08-06 20:55   ` Andrew Cooper
@ 2015-08-10  9:49   ` Tim Deegan
  2015-08-10 10:14     ` Andrew Cooper
  2015-08-11 10:35     ` Ben Catterall
  1 sibling, 2 replies; 53+ messages in thread
From: Tim Deegan @ 2015-08-10  9:49 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich

Hi,

At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> The process to switch into and out of deprivileged mode can be likened to
> setjmp/longjmp.
> 
> To enter deprivileged mode, we take a copy of the stack from the guest's
> registers up to the current stack pointer.

This copy is pretty unfortunate, but I can see that avoiding it will
be a bit complex.  Could we do something with more stacks?  AFAICS
there have to be three stacks anyway:

 - one to hold the depriv execution context;
 - one to hold the privileged execution context; and
 - one to take interrupts on.

So maybe we could do some fiddling to make Xen take interrupts on a
different stack while we're depriv'd?

If we do have to copy, we could track whether the original stack has
been clobbered by an interrupt, and so avoid (at least some of) the
copy back afterwards?
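
e.g. (sketch; the flag and all the names are invented):

    /* Interrupt entry path sets this if it wrote over the priv stack. */
    if ( v->arch.depriv_stack_clobbered )
        memcpy(priv_stack, priv_stack_copy, priv_stack_len);
    /* otherwise the original frames are intact: skip the copy back */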

One nit in the assembler - if I've followed correctly, this saved IP:

> +        /* Perform a near call to push rip onto the stack */
> +        call   1f

is returned to (with adjustments) here:

> +        /* Go to user mode return code */
> +        jmp    *(%rsi)

It would be good to make this a matched pair of call/ret if we can;
the CPU has special branch prediction tracking for function calls that
gets confused by a call that's not returned to.
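
e.g. on the way out, push the saved rip back and return through it,
rather than taking the indirect jmp (sketch):

    push   (%rsi)        /* rip saved by the earlier 'call' */
    ret                  /* pairs with that call for the return predictor */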

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-06 16:45 ` [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
  2015-08-06 21:24   ` Andrew Cooper
@ 2015-08-10 10:07   ` Tim Deegan
  2015-08-11 10:33     ` Ben Catterall
  1 sibling, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2015-08-10 10:07 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich

Hi,

> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v, unsigned long va,
>  {
>      struct domain *d = v->domain;
>  
> +    /* If we get a page fault whilst in HVM security user mode */
> +    if( v->user_mode == 1 )
> +    {
> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
> +                 d->domain_id, v->vcpu_id);
> +        domain_crash_synchronous();
> +    }
> +

This should happen in paging_fault() so it can guard the
shadow-pagetable paths too.  Once it's there, it'll need a check for
is_hvm_vcpu() as well as for user_mode.  Maybe have a helper function
'is_hvm_deprivileged_vcpu()' to do both checks, also used in
hvm_deprivileged_check_trap() &c.
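
Something like (a sketch; the exact predicate is whatever the series
settles on):

    static inline bool_t is_hvm_deprivileged_vcpu(const struct vcpu *v)
    {
        /* In depriv mode iff this is an HVM vcpu with user_mode set. */
        return is_hvm_vcpu(v) && (v->user_mode == 1);
    }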

>      HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
>                d->domain_id, v->vcpu_id);
> +
>      domain_crash(d);
>      return 0;
>  }
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 9f5a6c6..19d465f 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -74,6 +74,7 @@
>  #include <asm/vpmu.h>
>  #include <public/arch-x86/cpuid.h>
>  #include <xsm/xsm.h>
> +#include <xen/hvm/deprivileged.h>
>  
>  /*
>   * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
> @@ -500,6 +501,11 @@ static void do_guest_trap(
>      struct trap_bounce *tb;
>      const struct trap_info *ti;
>  
> +    /* If we take the trap whilst in HVM deprivileged mode
> +     * then we should crash the domain.
> +     */
> +    hvm_deprivileged_check_trap(__FUNCTION__);

I wonder whether it would be better to switch to an IDT with all
unacceptable traps stubbed out, rather than have to blacklist them all
separately.  Probably not - this check is cheap, and maintaining the
parallel tables would be a pain. 

Or maybe there's some single point upstream of here, in the asm
handlers, that would catch all the cases where this check is needed?

In any case, the check needs to return an error code so the caller
knows to return without running the rest of the handler (and likewise
elsewhere).
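
i.e. callers would end up shaped like this (sketch; non-zero meaning
"taken in depriv mode, domain already crashed"):

    if ( hvm_deprivileged_check_trap(__FUNCTION__) )
        return;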

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper
  2015-08-10  8:55           ` Andrew Cooper
@ 2015-08-10 10:08             ` Tim Deegan
  0 siblings, 0 replies; 53+ messages in thread
From: Tim Deegan @ 2015-08-10 10:08 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, keir, ian.campbell, george.dunlap, jbeulich, Ben Catterall

At 09:55 +0100 on 10 Aug (1439200516), Andrew Cooper wrote:
> On 10/08/2015 09:52, Tim Deegan wrote:
> >> While I agree that it would be good to account this to the domain,
> >> paging->alloc_page() is an internal _paging assistance_ helper. :)
> >> This new allocation is nothing to do with mm/paging-assistance, so
> >> either it should find its own memory or the hap/shadow pool needs to
> >> be made more generic.
> > ...at which point other HVM overheads - VMCx pages, bitmaps &c - could
> > be allocated from it as well.
> 
> I agree very much in principle, but I believe other threads have
> settled on all allocations being global, or per-pcpu, which means no
> per-domain allocation.

Grand so. :)

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-10  9:49   ` Tim Deegan
@ 2015-08-10 10:14     ` Andrew Cooper
  2015-08-11  9:55       ` Tim Deegan
  2015-08-20 14:42       ` Ben Catterall
  2015-08-11 10:35     ` Ben Catterall
  1 sibling, 2 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-10 10:14 UTC (permalink / raw)
  To: Tim Deegan, Ben Catterall
  Cc: george.dunlap, xen-devel, keir, ian.campbell, jbeulich

On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex.  Could we do something with more stacks?  AFAICS
> there have to be three stacks anyway:
>
>  - one to hold the depriv execution context;
>  - one to hold the privileged execution context; and
>  - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?

That should happen naturally by virtue of the privilege level change
involved in taking the interrupt.  Conceptually, taking interrupts from
depriv mode is no different to taking them in a PV guest.

Some complications which come to mind (none insurmountable):

* Under this model, PV exception handlers should copy themselves onto
the privileged execution stack.
* Currently, the IST handlers  copy themselves onto the primary stack if
they interrupt guest context.
* AMD Task Register on vmexit.  (this old gem)

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-06 20:55   ` Andrew Cooper
  2015-08-07 12:51     ` Ben Catterall
@ 2015-08-11  9:45     ` Ian Campbell
  1 sibling, 0 replies; 53+ messages in thread
From: Ian Campbell @ 2015-08-11  9:45 UTC (permalink / raw)
  To: Andrew Cooper, Ben Catterall, xen-devel
  Cc: george.dunlap, tim, keir, jbeulich

On Thu, 2015-08-06 at 21:55 +0100, Andrew Cooper wrote:
> On 06/08/15 17:45, Ben Catterall wrote:
> > The process to switch into and out of deprivileged mode can be likened
> > to setjmp/longjmp.
> > 
> > To enter deprivileged mode, we take a copy of the stack from the
> > guest's registers up to the current stack pointer. This allows us to
> > restore the stack when we have finished the deprivileged mode
> > operation, meaning we can continue execution from that point. This is
> > similar to if a context switch had happened.
> > 
> > To exit deprivileged mode, we copy the stack back, replacing the
> > current stack. We can then continue execution from where we left off,
> > which will unwind the stack and free up resources. This method means
> > that we do not need to change any other code paths and its invocation
> > will be transparent to callers. This should allow the feature to be
> > more easily deployed to different parts of Xen.
> > 
> > Note that this copy of the stack is per-vcpu but it will contain
> > per-pcpu data. Extra work is needed to properly migrate vcpus between
> > pcpus.
> 
> Under what circumstances do you see there being persistent state in the
> depriv area between calls, given that the calls are synchronous from VM
> actions?

Would we not want to keep (some of) the device model's state in a depriv
area? e.g. anything which is purely internal to the DM which is therefore
only accessed from depriv-land?

Ian.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-10 10:14     ` Andrew Cooper
@ 2015-08-11  9:55       ` Tim Deegan
  2015-08-11 16:51         ` Ben Catterall
  2015-08-20 14:42       ` Ben Catterall
  1 sibling, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2015-08-11  9:55 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, keir, ian.campbell, george.dunlap, jbeulich, Ben Catterall

At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> On 10/08/15 10:49, Tim Deegan wrote:
> > Hi,
> >
> > At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> >> The process to switch into and out of deprivileged mode can be likened to
> >> setjmp/longjmp.
> >>
> >> To enter deprivileged mode, we take a copy of the stack from the guest's
> >> registers up to the current stack pointer.
> > This copy is pretty unfortunate, but I can see that avoiding it will
> > be a bit complex.  Could we do something with more stacks?  AFAICS
> > there have to be three stacks anyway:
> >
> >  - one to hold the depriv execution context;
> >  - one to hold the privileged execution context; and
> >  - one to take interrupts on.
> >
> > So maybe we could do some fiddling to make Xen take interrupts on a
> > different stack while we're depriv'd?
> 
> That should happen naturally by virtue of the privilege level change
> involved in taking the interrupt.

Right, and this is why we need a third stack - so interrupts don't
trash the existing priv state on the 'normal' Xen stack.  And so we
either need to copy the priv stack out (and maybe copy it back), or
tell the CPU to use a different stack.

If we had enough headroom, we could try to be clever and tell the CPU
to take interrupts on the priv stack _below_ the existing state.  That
would avoid the first of your problems below.
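
Roughly (sketch; assumes the per-cpu init_tss is the TSS in use, and
'priv_rsp' stands for wherever the live priv state currently ends):

    struct tss_struct *tss = &this_cpu(init_tss);
    /* Take the next interrupt just below the existing priv frames. */
    tss->rsp0 = priv_rsp & ~0xful;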

> * Under this model, PV exception handlers should copy themselves onto
> the privileged execution stack.
> * Currently, the IST handlers  copy themselves onto the primary stack if
> they interrupt guest context.
> * AMD Task Register on vmexit.  (this old gem)

Gah, this thing. :(

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-10 10:07   ` Tim Deegan
@ 2015-08-11 10:33     ` Ben Catterall
  2015-08-17 13:59       ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-11 10:33 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich



On 10/08/15 11:07, Tim Deegan wrote:
> Hi,
>
>> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v, unsigned long va,
>>   {
>>       struct domain *d = v->domain;
>>
>> +    /* If we get a page fault whilst in HVM security user mode */
>> +    if( v->user_mode == 1 )
>> +    {
>> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
>> +                 d->domain_id, v->vcpu_id);
>> +        domain_crash_synchronous();
>> +    }
>> +
>
> This should happen in paging_fault() so it can guard the
> shadow-pagetable paths too.  Once it's there, it'll need a check for
> is_hvm_vcpu() as well as for user_mode.  Maybe have a helper function
> 'is_hvm_deprivileged_vcpu()' to do both checks, also used in
> hvm_deprivileged_check_trap() &c.
>
Ok, I'll make this change.
>>       HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
>>                 d->domain_id, v->vcpu_id);
>> +
>>       domain_crash(d);
>>       return 0;
>>   }
>> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>> index 9f5a6c6..19d465f 100644
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -74,6 +74,7 @@
>>   #include <asm/vpmu.h>
>>   #include <public/arch-x86/cpuid.h>
>>   #include <xsm/xsm.h>
>> +#include <xen/hvm/deprivileged.h>
>>
>>   /*
>>    * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
>> @@ -500,6 +501,11 @@ static void do_guest_trap(
>>       struct trap_bounce *tb;
>>       const struct trap_info *ti;
>>
>> +    /* If we take the trap whilst in HVM deprivileged mode
>> +     * then we should crash the domain.
>> +     */
>> +    hvm_deprivileged_check_trap(__FUNCTION__);
>
> I wonder whether it would be better to switch to an IDT with all
> unacceptable traps stubbed out, rather than have to blacklist them all
> separately.  Probably not - this check is cheap, and maintaining the
> parallel tables would be a pain.
>
> Or maybe there's some single point upstream of here, in the asm
> handlers, that would catch all the cases where this check is needed?
>
Yep, I think this can be done.
> In any case, the check needs to return an error code so the caller
> knows to return without running the rest of the handler (and likewise
> elsewhere).
>
understood.
> Cheers,
>
> Tim.
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-10  9:49   ` Tim Deegan
  2015-08-10 10:14     ` Andrew Cooper
@ 2015-08-11 10:35     ` Ben Catterall
  1 sibling, 0 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-11 10:35 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich



On 10/08/15 10:49, Tim Deegan wrote:
> Hi,
>
> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>> The process to switch into and out of deprivileged mode can be likened to
>> setjmp/longjmp.
>>
>> To enter deprivileged mode, we take a copy of the stack from the guest's
>> registers up to the current stack pointer.
>
> This copy is pretty unfortunate, but I can see that avoiding it will
> be a bit complex.  Could we do something with more stacks?  AFAICS
> there have to be three stacks anyway:
>
>   - one to hold the depriv execution context;
>   - one to hold the privileged execution context; and
>   - one to take interrupts on.
>
> So maybe we could do some fiddling to make Xen take interrupts on a
> different stack while we're depriv'd?
>
> If we do have to copy, we could track whether the original stack has
> been clobbered by an interrupt, and so avoid (at least some of) the
> copy back afterwards?
>
> One nit in the assembler - if I've followed correctly, this saved IP:
>
>> +        /* Perform a near call to push rip onto the stack */
>> +        call   1f
>
> is returned to (with adjustments) here:
>
>> +        /* Go to user mode return code */
>> +        jmp    *(%rsi)
>
> It would be good to make this a matched pair of call/ret if we can;
> the CPU has special branch prediction tracking for function calls that
> gets confused by a call that's not returned to.
>
sure, will do.
> Cheers,
>
> Tim.
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11  9:55       ` Tim Deegan
@ 2015-08-11 16:51         ` Ben Catterall
  2015-08-11 17:05           ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-11 16:51 UTC (permalink / raw)
  To: Tim Deegan, Andrew Cooper
  Cc: george.dunlap, xen-devel, keir, ian.campbell, jbeulich



On 11/08/15 10:55, Tim Deegan wrote:
> At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
>> On 10/08/15 10:49, Tim Deegan wrote:
>>> Hi,
>>>
>>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>>>> The process to switch into and out of deprivileged mode can be likened to
>>>> setjmp/longjmp.
>>>>
>>>> To enter deprivileged mode, we take a copy of the stack from the guest's
>>>> registers up to the current stack pointer.
>>> This copy is pretty unfortunate, but I can see that avoiding it will
>>> be a bit complex.  Could we do something with more stacks?  AFAICS
>>> there have to be three stacks anyway:
>>>
>>>   - one to hold the depriv execution context;
>>>   - one to hold the privileged execution context; and
>>>   - one to take interrupts on.
>>>
>>> So maybe we could do some fiddling to make Xen take interrupts on a
>>> different stack while we're depriv'd?
>>
>> That should happen naturally by virtue of the privilege level change
>> involved in taking the interrupt.
>
> Right, and this is why we need a third stack - so interrupts don't
> trash the existing priv state on the 'normal' Xen stack.  And so we
> either need to copy the priv stack out (and maybe copy it back), or
> tell the CPU to use a different stack.

The copy is relatively small and paid only on the first and last entries
into the mode. I don't know if this is cheaper than the bookkeeping that
would be needed on entering and returning from the mode to switch to
these stacks. I'm assuming the sp pointers in the TSS and ISTs would
need changing on the first and last entry/exit if we have the extra
stack, is that correct? Or is this a more dramatic change, in that
everything uses this three-stack model rather than just this feature?

I'm not sure how much in Xen would need changing to switch across to
using three stacks. Would this also need to be done for PV guests?
Would that need to be a separate patch series?

What's the overall consensus? Thanks!

>
> If we had enough headroom, we could try to be clever and tell the CPU
> to take interrupts on the priv stack _below_ the existing state.  That
> would avoid the first of your problems below.
>
>> * Under this model, PV exception handlers should copy themselves onto
>> the privileged execution stack.
>> * Currently, the IST handlers  copy themselves onto the primary stack if
>> they interrupt guest context.
>> * AMD Task Register on vmexit.  (this old gem)
>
> Gah, this thing. :(
Curious (and I can't seem to find this in the manuals): What is this thing?
>
> Tim.
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 16:51         ` Ben Catterall
@ 2015-08-11 17:05           ` Tim Deegan
  2015-08-11 17:19             ` Andrew Cooper
  2015-08-12 13:22             ` Ben Catterall
  0 siblings, 2 replies; 53+ messages in thread
From: Tim Deegan @ 2015-08-11 17:05 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper, jbeulich

Hi,

At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:
> On 11/08/15 10:55, Tim Deegan wrote:
> > At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
> >> On 10/08/15 10:49, Tim Deegan wrote:
> >>> Hi,
> >>>
> >>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
> >>>> The process to switch into and out of deprivileged mode can be likened to
> >>>> setjmp/longjmp.
> >>>>
> >>>> To enter deprivileged mode, we take a copy of the stack from the guest's
> >>>> registers up to the current stack pointer.
> >>> This copy is pretty unfortunate, but I can see that avoiding it will
> >>> be a bit complex.  Could we do something with more stacks?  AFAICS
> >>> there have to be three stacks anyway:
> >>>
> >>>   - one to hold the depriv execution context;
> >>>   - one to hold the privileged execution context; and
> >>>   - one to take interrupts on.
> >>>
> >>> So maybe we could do some fiddling to make Xen take interrupts on a
> >>> different stack while we're depriv'd?
> >>
> >> That should happen naturally by virtue of the privilege level change
> >> involved in taking the interrupt.
> >
> > Right, and this is why we need a third stack - so interrupts don't
> > trash the existing priv state on the 'normal' Xen stack.  And so we
> > either need to copy the priv stack out (and maybe copy it back), or
> > tell the CPU to use a different stack.
> 
> The copy is relatively small and paid only on the first and last entries 
> into the mode. I don't know if this is cheaper than the bookkeeping that
> would be needed on entering and returning from the mode to switch to 
> these stacks. I'm assuming the sp pointers in the TSS and ISTs would 
> need changing on the first and last entry/exit if we have the extra 
> stack, is that correct?

Yep.

> Or is this a more dramatic change, in that
> everything uses this three-stack model rather than just this feature?

Well, some other parts would have to change to accommodate this new
behaviour - that was what Andrew was talking about.

BTW, I think there need to be three stacks anyway, since the depriv
code shouldn't be allowed to write to the priv code's stack frames.
Or maybe I've misunderstood how much access the depriv code will have.

> I'm not sure how much in Xen would need changing to switch across to 
> using three stacks. Would this also need to be done for PV guests?
> Would that need to be a separate patch series?
> 
> What's the overall consensus? Thanks!

I'm not sure there is one yet -- needs some more discussion of
whether the non-copying approach is feasible.

> > If we had enough headroom, we could try to be clever and tell the CPU
> > to take interrupts on the priv stack _below_ the existing state.  That
> > would avoid the first of your problems below.
> >
> >> * Under this model, PV exception handlers should copy themselves onto
> >> the privileged execution stack.
> >> * Currently, the IST handlers  copy themselves onto the primary stack if
> >> they interrupt guest context.
> >> * AMD Task Register on vmexit.  (this old gem)
> >
> > Gah, this thing. :(
> Curious (and I can't seem to find this in the manuals): What is this thing?

IIRC: AMD processors don't context switch TR on vmexit, which makes
using IST handlers tricky there.  We'd have to do the TR context
switch ourselves, and that would be expensive.  Andrew, am I
remembering that right?

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 17:05           ` Tim Deegan
@ 2015-08-11 17:19             ` Andrew Cooper
  2015-08-11 18:29               ` Boris Ostrovsky
  2015-08-12 10:10               ` Jan Beulich
  2015-08-12 13:22             ` Ben Catterall
  1 sibling, 2 replies; 53+ messages in thread
From: Andrew Cooper @ 2015-08-11 17:19 UTC (permalink / raw)
  To: Tim Deegan, Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, jbeulich,
	Aravind Gopalakrishnan, Suravee Suthikulpanit, Boris Ostrovsky

On 11/08/15 18:05, Tim Deegan wrote:
>
>>>> * Under this model, PV exception handlers should copy themselves onto
>>>> the privileged execution stack.
>>>> * Currently, the IST handlers  copy themselves onto the primary stack if
>>>> they interrupt guest context.
>>>> * AMD Task Register on vmexit.  (this old gem)
>>> Gah, this thing. :(
>> Curious (and I can't seem to find this in the manuals): What is this thing?
> IIRC: AMD processors don't context switch TR on vmexit,

Correct

> which makes using IST handlers tricky there.

(That is one way of putting it)

IST handlers cannot be used by Xen if Xen does not switch the task
register before stgi, or IST exceptions (NMI, MCE and double fault) will
be taken with guest-supplied stack pointers.

> We'd have to do the TR context switch ourselves, and that would be expensive.

It is suspected to be expensive, but I have never actually seen any
numbers one way or another.

> Andrew, am I remembering that right?

Looks about right.

I have been meaning to investigate this for a while, but never had the time.

Xen opts for disabling interrupt stack tables in the context of AMD HVM
vcpus, which interacts catastrophically with debug builds using
MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
detect stack overflows, but without an IST double fault handler, ends in
a triple fault rather than a host crash detailing the stack overflow.

KVM unilaterally reloads the host task register on vmexit, and I suspect
this is probably the way to go, but have not had time to investigate
whether there is any performance impact from doing so.  Given how little
of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
as expensive as it might have been in legacy modes.
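
The reload itself is small; a sketch (the descriptor's busy bit has to
be cleared first or ltr faults, and 'gdt' here stands for this CPU's
GDT, lookup elided):

    gdt[TSS_ENTRY].b &= ~0x00000200;  /* clear the TSS busy bit (bit 41) */
    asm volatile ( "ltr %w0" : : "rm" (TSS_ENTRY << 3) );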

(CC'ing the AMD SVM maintainers to see if they have any information on
this subject)

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 17:19             ` Andrew Cooper
@ 2015-08-11 18:29               ` Boris Ostrovsky
  2015-08-12 13:29                 ` Andrew Cooper
  2015-08-12 10:10               ` Jan Beulich
  1 sibling, 1 reply; 53+ messages in thread
From: Boris Ostrovsky @ 2015-08-11 18:29 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan, Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, jbeulich,
	Aravind Gopalakrishnan, Suravee Suthikulpanit

On 08/11/2015 01:19 PM, Andrew Cooper wrote:
> On 11/08/15 18:05, Tim Deegan wrote:
>>>>> * Under this model, PV exception handlers should copy themselves onto
>>>>> the privileged execution stack.
>>>>> * Currently, the IST handlers  copy themselves onto the primary stack if
>>>>> they interrupt guest context.
>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>> Gah, this thing. :(
>>> Curious (and I can't seem to find this in the manuals): What is this thing?
>> IIRC: AMD processors don't context switch TR on vmexit,
> Correct
>
>> which makes using IST handlers tricky there.
> (That is one way of putting it)
>
> IST handlers cannot be used by Xen if Xen does not switch the task
> register before stgi, or IST exceptions (NMI, MCE and double fault) will
> be taken with guest-supplied stack pointers.
>
>> We'd have to do the TR context switch ourselves, and that would be expensive.
> It is suspected to be expensive, but I have never actually seen any
> numbers one way or another.
>
>> Andrew, am I remembering that right?
> Looks about right.
>
> I have been meaning to investigate this for a while, but never had the time.
>
> Xen opts for disabling interrupt stack tables in the context of AMD HVM
> vcpus, which interacts catastrophically with debug builds using
> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
> detect stack overflows, but without an IST double fault handler, ends in
> a triple fault rather than a host crash detailing the stack overflow.
>
> KVM unilaterally reloads the host task register on vmexit, and I suspect
> this is probably the way to go, but have not had time to investigate
> whether there is any performance impact from doing so.  Given how little
> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
> as expensive as it might have been in legacy modes.
>
> (CC'ing the AMD SVM maintainers to see if they have any information on
> this subject)
>

I actually didn't even realize that TR is not saved on vmexit ;-/.

Would switching TR only when we know that we need to enter this 
deprivileged mode help? Assuming that it is less expensive than copying 
the stack.

-boris

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations
  2015-08-06 16:45 [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations Ben Catterall
                   ` (3 preceding siblings ...)
  2015-08-06 16:45 ` [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
@ 2015-08-12  9:50 ` Jan Beulich
  2015-08-12 11:27   ` Ben Catterall
  4 siblings, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2015-08-12  9:50 UTC (permalink / raw)
  To: Ben Catterall
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, xen-devel

>>> On 06.08.15 at 18:45, <Ben.Catterall@citrix.com> wrote:
> Performance testing
> -------------------
> Performance testing indicates that the overhead for this deprivileged mode is
> approximately 25%. This overhead is the cost of moving into deprivileged mode
> and then fully back out of deprivileged mode.
> 
> I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
> E5-2407 0 processor. This was done from a python script within the HVM guest
> using time.time() and running Debian Jessie. Each write was trapped to cause a
> vmexit and the time for each write was calculated. These experiments were
> repeated. Note that only the host and this HVM guest were running (both Debian
> Jessie) during the experiments.

I'm not sure I/O port writes (all the more without saying which port)
are the best suited item for evaluating the overhead: Ideally you
would pick an operation that is comparably fast (I/O port accesses
generally aren't if the net effect is that the emulator accesses a
physical port). But it certainly serves as a first indication (not too
bad, but large enough that I wouldn't be convinced this is a good
idea).

Jan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 17:19             ` Andrew Cooper
  2015-08-11 18:29               ` Boris Ostrovsky
@ 2015-08-12 10:10               ` Jan Beulich
  1 sibling, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2015-08-12 10:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Tim Deegan,
	Aravind Gopalakrishnan, Suravee Suthikulpanit, Ben Catterall,
	Boris Ostrovsky

>>> On 11.08.15 at 19:19, <andrew.cooper3@citrix.com> wrote:
> KVM unilaterally reloads the host task register on vmexit, and I suspect
> this is probably the way to go, but have not had time to investigate
> whether there is any performance impact from doing so.  Given how little
> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
> as expensive as it might have been in legacy modes.

How much of the TSS is being used shouldn't matter at all for
LTR execution time, since the instruction doesn't access the
TSS itself (and iirc there's also no caching documented for TSS
uses).

Jan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations
  2015-08-12  9:50 ` [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations Jan Beulich
@ 2015-08-12 11:27   ` Ben Catterall
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-12 11:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, george.dunlap, andrew.cooper3, tim, xen-devel


On 12/08/15 10:50, Jan Beulich wrote:
>>>> On 06.08.15 at 18:45, <Ben.Catterall@citrix.com> wrote:
>> Performance testing
>> -------------------
>> Performance testing indicates that the overhead for this deprivileged mode is
>> approximately 25%. This overhead is the cost of moving into deprivileged mode
>> and then fully back out of deprivileged mode.
>>
>> I performed 100000 writes to a single I/O port on an Intel 2.2GHz Xeon
>> E5-2407 0 processor. This was done from a python script within the HVM guest
>> using time.time() and running Debian Jessie. Each write was trapped to cause a
>> vmexit and the time for each write was calculated. These experiments were
>> repeated. Note that only the host and this HVM guest were running (both Debian
>> Jessie) during the experiments.
>
> I'm not sure I/O port writes (all the more without saying which port)
> are the best suited item for evaluating the overhead: Ideally you
> would pick an operation that is comparably fast (I/O port accesses
> generally aren't if the net effect is that the emulator accesses a
> physical port). But it certainly serves as a first indication (not too
> bad, but large enough that I wouldn't be convinced this is a good
> idea).
>
> Jan
>

Thanks for this Jan: I overlooked this.

Retested with the port (0x1000) I/O emulation operation replaced with a
nop to get the pure overhead. With the deprivileged round trip, the
write takes 5.2e-6s; without it, 1.7e-6s. So the difference is 3.5e-6s,
i.e. about 3x the baseline time (roughly 200% overhead on a nop). So,
I'm guessing that whether this is used or not will depend on the
overhead and frequency of the operation it will be handling.

Thanks,
Ben

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 17:05           ` Tim Deegan
  2015-08-11 17:19             ` Andrew Cooper
@ 2015-08-12 13:22             ` Ben Catterall
  2015-08-12 13:26               ` Tim Deegan
  1 sibling, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-12 13:22 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper, jbeulich



On 11/08/15 18:05, Tim Deegan wrote:
> Hi,
>
> At 17:51 +0100 on 11 Aug (1439315508), Ben Catterall wrote:
>> On 11/08/15 10:55, Tim Deegan wrote:
>>> At 11:14 +0100 on 10 Aug (1439205273), Andrew Cooper wrote:
>>>> On 10/08/15 10:49, Tim Deegan wrote:
>>>>> Hi,
>>>>>
>>>>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>>>>>> The process to switch into and out of deprivileged mode can be likened to
>>>>>> setjmp/longjmp.
>>>>>>
>>>>>> To enter deprivileged mode, we take a copy of the stack from the guest's
>>>>>> registers up to the current stack pointer.
>>>>> This copy is pretty unfortunate, but I can see that avoiding it will
>>>>> be a bit complex.  Could we do something with more stacks?  AFAICS
>>>>> there have to be three stacks anyway:
>>>>>
>>>>>    - one to hold the depriv execution context;
>>>>>    - one to hold the privileged execution context; and
>>>>>    - one to take interrupts on.
>>>>>
>>>>> So maybe we could do some fiddling to make Xen take interrupts on a
>>>>> different stack while we're depriv'd?
>>>>
>>>> That should happen naturally by virtue of the privilege level change
>>>> involved in taking the interrupt.
>>>
>>> Right, and this is why we need a third stack - so interrupts don't
>>> trash the existing priv state on the 'normal' Xen stack.  And so we
>>> either need to copy the priv stack out (and maybe copy it back), or
>>> tell the CPU to use a different stack.
>>
>> The copy is relatively small and paid only on the first and last entries
>> into the mode. I don't know if this is cheaper than the bookkeeping that
>> would be needed on entering and returning from the mode to switch to
>> these stacks. I'm assuming the sp pointers in the TSS and ISTs would
>> need changing on the first and last entry/exit if we have the extra
>> stack, is that correct?
>
> Yep.
>
>> Or is this a more dramatic change, in that
>> everything uses this three-stack model rather than just this feature?
>
> Well, some other parts would have to change to accommodate this new
> behaviour - that was what Andrew was talking about.
>
> BTW, I think there need to be three stacks anyway, since the depriv
> code shouldn't be allowed to write to the priv code's stack frames.
> Or maybe I've misunderstood how much access the depriv code will have.
So, just to clarify:

We have a separate deprivileged stack allocated which the deprivileged 
code uses. This is mapped in user mode.

We have the privileged stack which Xen runs on. To prevent this being 
clobbered when we are in our mode and take an interrupt, we copy this 
out to a buffer. This buffer is the saved privileged stack state.

So, we sort of have three stacks already, just the privileged stack is 
copied out to a buffer, rather than switching pointers to another 
interrupt stack.

Hopefully that clarifies?

>
>> I'm not sure how much in Xen would need changing to switch across to
>> using three stacks. Would this also need to be done for PV guests?
>> Would that need to be a separate patch series?
>>
>> What's the overall consensus? Thanks!
>
> I'm not sure there is one yet -- needs some more discussion of
> whether the non-copying approach is feasible.
>
>>> If we had enough headroom, we could try to be clever and tell the CPU
>>> to take interrupts on the priv stack _below_ the existing state.  That
>>> would avoid the first of your problems below.
>>>
>>>> * Under this model, PV exception handlers should copy themselves onto
>>>> the privileged execution stack.
>>>> * Currently, the IST handlers  copy themselves onto the primary stack if
>>>> they interrupt guest context.
>>>> * AMD Task Register on vmexit.  (this old gem)
>>>
>>> Gah, this thing. :(
>> Curious (and I can't seem to find this in the manuals): What is this thing?
>
> IIRC: AMD processors don't context switch TR on vmexit, which makes
> using IST handlers tricky there.  We'd have to do the TR context
> switch ourselves, and that would be expensive.  Andrew, am I
> remembering that right?
>
Thanks!
> Tim.
>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-12 13:22             ` Ben Catterall
@ 2015-08-12 13:26               ` Tim Deegan
  0 siblings, 0 replies; 53+ messages in thread
From: Tim Deegan @ 2015-08-12 13:26 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper, jbeulich

At 14:22 +0100 on 12 Aug (1439389325), Ben Catterall wrote:
> On 11/08/15 18:05, Tim Deegan wrote:
> > BTW, I think there need to be three stacks anyway, since the depriv
> > code shouldn't be allowed to write to the priv code's stack frames.
> > Or maybe I've misunderstood how much access the depriv code will have.
> So, just to clarify:
> 
> We have a separate deprivileged stack allocated which the deprivileged 
> code uses. This is mapped in user mode.
> 
> We have the privileged stack which Xen runs on. To prevent this being 
> clobbered when we are in our mode and take an interrupt, we copy this 
> out to a buffer. This buffer is the saved privileged stack state.
> 
> So, we sort of have three stacks already, just the privileged stack is 
> copied out to a buffer, rather than switching pointers to another 
> interrupt stack.
> 
> Hopefully that clarifies?

Yes, thanks -- the buffer is what I was thinking of as a third stack.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-11 18:29               ` Boris Ostrovsky
@ 2015-08-12 13:29                 ` Andrew Cooper
  2015-08-12 13:33                   ` Andrew Cooper
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-12 13:29 UTC (permalink / raw)
  To: Boris Ostrovsky, Tim Deegan, Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, jbeulich,
	Aravind Gopalakrishnan, Suravee Suthikulpanit

On 11/08/15 19:29, Boris Ostrovsky wrote:
> On 08/11/2015 01:19 PM, Andrew Cooper wrote:
>> On 11/08/15 18:05, Tim Deegan wrote:
>>>>>> * Under this model, PV exception handlers should copy themselves
>>>>>> onto
>>>>>> the privileged execution stack.
>>>>>> * Currently, the IST handlers  copy themselves onto the primary
>>>>>> stack if
>>>>>> they interrupt guest context.
>>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>>> Gah, this thing. :(
>>>> Curious (and I can't seem to find this in the manuals): What is this
>>>> thing?
>>> IIRC: AMD processors don't context switch TR on vmexit,
>> Correct
>>
>>> which makes using IST handlers tricky there.
>> (That is one way of putting it)
>>
>> IST handlers cannot be used by Xen if Xen does not switch the task
>> register before stgi, or IST exceptions (NMI, MCE and double fault) will
>> be taken with guest-supplied stack pointers.
>>
>>> We'd have to do the TR context switch ourselves, and that would be
>>> expensive.
>> It is suspected to be expensive, but I have never actually seen any
>> numbers one way or another.
>>
>>> Andrew, am I remembering that right?
>> Looks about right.
>>
>> I have been meaning to investigate this for a while, but never had
>> the time.
>>
>> Xen opts for disabling interrupt stack tables in the context of AMD HVM
>> vcpus, which interacts catastrophically with debug builds using
>> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
>> detect stack overflows, but without an IST double fault handler, ends in
>> a triple fault rather than a host crash detailing the stack overflow.
>>
>> KVM unilaterally reloads the host task register on vmexit, and I suspect
>> this is probably the way to go, but have not had time to investigate
>> whether there is any performance impact from doing so.  Given how little
>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
>> as expensive as it might have been in legacy modes.
>>
>> (CC'ing the AMD SVM maintainers to see if they have any information on
>> this subject)
>>
>
> I actually didn't even realize that TR is not saved on vmexit ;-/.
>
> Would switching TR only when we know that we need to enter this
> deprivileged mode help?

This is an absolute must.  It is not safe to use syscall/sysexit without
IST in place for NMIs and MCEs.

> Assuming that it is less expensive than copying the stack.

I was referring to the stack overflow issue, and whether it might be
sensible to pro-actively which TR.

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-12 13:29                 ` Andrew Cooper
@ 2015-08-12 13:33                   ` Andrew Cooper
  2015-08-17 13:53                     ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-12 13:33 UTC (permalink / raw)
  To: Boris Ostrovsky, Tim Deegan, Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap,
	Aravind Gopalakrishnan, jbeulich, Suravee Suthikulpanit

On 12/08/15 14:29, Andrew Cooper wrote:
> On 11/08/15 19:29, Boris Ostrovsky wrote:
>> On 08/11/2015 01:19 PM, Andrew Cooper wrote:
>>> On 11/08/15 18:05, Tim Deegan wrote:
>>>>>>> * Under this model, PV exception handlers should copy themselves
>>>>>>> onto
>>>>>>> the privileged execution stack.
>>>>>>> * Currently, the IST handlers  copy themselves onto the primary
>>>>>>> stack if
>>>>>>> they interrupt guest context.
>>>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>>>> Gah, this thing. :
>>>>> Curious (and I can't seem to find this in the manuals): What is this
>>>>> thing?
>>>> IIRC: AMD processors don't context switch TR on vmexit,
>>> Correct
>>>
>>>> which makes using IST handlers tricky there.
>>> (That is one way of putting it)
>>>
>>> IST handlers cannot be used by Xen if Xen does not switch the task
>>> register before stgi, or IST exceptions (NMI, MCE and double fault) will
>>> be taken with guest-supplied stack pointers.
>>>
>>>> We'd have to do the TR context switch ourselves, and that would be
>>>> expensive.
>>> It is suspected to be expensive, but I have never actually seen any
>>> numbers one way or another.
>>>
>>>> Andrew, am I remembering that right?
>>> Looks about right.
>>>
>>> I have been meaning to investigate this for a while, but never had
>>> the time.
>>>
>>> Xen opts for disabling interrupt stack tables in the context of AMD HVM
>>> vcpus, which interacts catastrophically with debug builds using
>>> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
>>> detect stack overflows, but without an IST double fault handler, ends in
>>> a triple fault rather than a host crash detailing the stack overflow.
>>>
>>> KVM unilaterally reloads the host task register on vmexit, and I suspect
>>> this is probably the way to go, but have not had time to investigate
>>> whether there is any performance impact from doing so.  Given how little
>>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
>>> as expensive as it might have been in legacy modes.
>>>
>>> (CC'ing the AMD SVM maintainers to see if they have any information on
>>> this subject)
>>>
>> I actually didn't even realize that TR is not saved on vmexit ;-/.
>>
>> Would switching TR only when we know that we need to enter this
>> deprivileged mode help?
> This is an absolute must.  It is not safe to use syscall/sysexit without
> IST in place for NMIs and MCEs.
>
>> Assuming that it is less expensive than copying the stack.
> I was referring to the stack overflow issue, and whether it might be
> sensible to pro-actively which TR.

Ahem! s/which/switch/

~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-12 13:33                   ` Andrew Cooper
@ 2015-08-17 13:53                     ` Ben Catterall
  2015-08-17 15:07                       ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-17 13:53 UTC (permalink / raw)
  To: Andrew Cooper, Boris Ostrovsky, Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap,
	Aravind Gopalakrishnan, jbeulich, Suravee Suthikulpanit



On 12/08/15 14:33, Andrew Cooper wrote:
> On 12/08/15 14:29, Andrew Cooper wrote:
>> On 11/08/15 19:29, Boris Ostrovsky wrote:
>>> On 08/11/2015 01:19 PM, Andrew Cooper wrote:
>>>> On 11/08/15 18:05, Tim Deegan wrote:
>>>>>>>> * Under this model, PV exception handlers should copy themselves
>>>>>>>> onto
>>>>>>>> the privileged execution stack.
>>>>>>>> * Currently, the IST handlers  copy themselves onto the primary
>>>>>>>> stack if
>>>>>>>> they interrupt guest context.
>>>>>>>> * AMD Task Register on vmexit.  (this old gem)
>>>>>>> Gah, this thing. :
>>>>>> Curious (and I can't seem to find this in the manuals): What is this
>>>>>> thing?
>>>>> IIRC: AMD processors don't context switch TR on vmexit,
>>>> Correct
>>>>
>>>>> which makes using IST handlers tricky there.
>>>> (That is one way of putting it)
>>>>
>>>> IST handlers cannot be used by Xen if Xen does not switch the task
>>>> register before stgi, or IST exceptions (NMI, MCE and double fault) will
>>>> be taken with guest-supplied stack pointers.
>>>>
>>>>> We'd have to do the TR context switch ourselves, and that would be
>>>>> expensive.
>>>> It is suspected to be expensive, but I have never actually seen any
>>>> numbers one way or another.
>>>>
>>>>> Andrew, am I remembering that right?
>>>> Looks about right.
>>>>
>>>> I have been meaning to investigate this for a while, but never had
>>>> the time.
>>>>
>>>> Xen opts for disabling interrupt stack tables in the context of AMD HVM
>>>> vcpus, which interacts catastrophically with debug builds using
>>>> MEMORY_GUARD.  MEMORY_GUARD shoots a page out of the primary stack to
>>>> detect stack overflows, but without an IST double fault handler, ends in
>>>> a triple fault rather than a host crash detailing the stack overflow.
>>>>
>>>> KVM unilaterally reloads the host task register on vmexit, and I suspect
>>>> this is probably the way to go, but have not had time to investigate
>>>> whether there is any performance impact from doing so.  Given how little
>>>> of a TSS is actually used in long mode, I wouldn't expect an `ltr` to be
>>>> as expensive as it might have been in legacy modes.
>>>>
>>>> (CC'ing the AMD SVM maintainers to see if they have any information on
>>>> this subject)
>>>>
>>> I actually didn't even realize that TR is not saved on vmexit ;-/.
>>>
>>> Would switching TR only when we know that we need to enter this
>>> deprivileged mode help?
>> This is an absolute must.  It is not safe to use syscall/sysexit without
>> IST in place for NMIs and MCEs.
>>
>>> Assuming that it is less expensive than copying the stack.
>> I was referring to the stack overflow issue, and whether it might be
>> sensible to pro-actively which TR.
>
> Ahem! s/which/switch/
>
> ~Andrew
>

So, have we arrived at a decision for this? Thanks!

Ben

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-11 10:33     ` Ben Catterall
@ 2015-08-17 13:59       ` Ben Catterall
  2015-08-17 14:58         ` Tim Deegan
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-17 13:59 UTC (permalink / raw)
  To: Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich



On 11/08/15 11:33, Ben Catterall wrote:
>
>
> On 10/08/15 11:07, Tim Deegan wrote:
>> Hi,
>>
>>> @@ -685,8 +685,17 @@ static int hap_page_fault(struct vcpu *v,
>>> unsigned long va,
>>>   {
>>>       struct domain *d = v->domain;
>>>
>>> +    /* If we get a page fault whilst in HVM security user mode */
>>> +    if( v->user_mode == 1 )
>>> +    {
>>> +        printk("HVM: #PF (%u:%u) whilst in user mode\n",
>>> +                 d->domain_id, v->vcpu_id);
>>> +        domain_crash_synchronous();
>>> +    }
>>> +
>>
>> This should happen in paging_fault() so it can guard the
>> shadow-pagetable paths too.  Once it's there, it'll need a check for
>> is_hvm_vcpu() as well as for user_mode.  Maybe have a helper function
>> 'is_hvm_deprivileged_vcpu()' to do both checks, also used in
>> hvm_deprivileged_check_trap() &c.
I've moved this and now need to add shadow page table support as this 
currently only supports HAP.
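
A minimal sketch of the combined check Tim suggests above, reusing the
RFC's v->user_mode flag and the existing is_hvm_vcpu() predicate
(placement and exact naming illustrative):

    /* True only for an HVM vcpu currently executing in depriv mode;
     * usable as the common guard in paging_fault() and in
     * hvm_deprivileged_check_trap(). */
    static inline bool is_hvm_deprivileged_vcpu(const struct vcpu *v)
    {
        return is_hvm_vcpu(v) && v->user_mode == 1;
    }
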
>>
> Ok, I'll make this change.
>>>       HAP_ERROR("Intercepted a guest #PF (%u:%u) with HAP enabled.\n",
>>>                 d->domain_id, v->vcpu_id);
>>> +
>>>       domain_crash(d);
>>>       return 0;
>>>   }
>>> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
>>> index 9f5a6c6..19d465f 100644
>>> --- a/xen/arch/x86/traps.c
>>> +++ b/xen/arch/x86/traps.c
>>> @@ -74,6 +74,7 @@
>>>   #include <asm/vpmu.h>
>>>   #include <public/arch-x86/cpuid.h>
>>>   #include <xsm/xsm.h>
>>> +#include <xen/hvm/deprivileged.h>
>>>
>>>   /*
>>>    * opt_nmi: one of 'ignore', 'dom0', or 'fatal'.
>>> @@ -500,6 +501,11 @@ static void do_guest_trap(
>>>       struct trap_bounce *tb;
>>>       const struct trap_info *ti;
>>>
>>> +    /* If we take the trap whilst in HVM deprivileged mode
>>> +     * then we should crash the domain.
>>> +     */
>>> +    hvm_deprivileged_check_trap(__FUNCTION__);
>>
>> I wonder whether it would be better to switch to an IDT with all
>> unacceptable traps stubbed out, rather than have to blacklist them all
>> separately.  Probably not - this check is cheap, and maintaining the
>> parallel tables would be a pain.
>>
>> Or maybe there's some single point upstream of here, in the asm
>> handlers, that would catch all the cases where this check is needed?
>>
> Yep, I think this can be done.
Had a deeper look at this. There is a single point in the asm where all 
exceptions come in (handle_exception in entry.S), and we could branch 
out there. It does mean that we'd treat _all_ exceptions that occur in 
this mode the same way, whereas the current approach means that we can 
treat them differently (e.g. get __func__). So, should I make all 
exceptions go to the same point or keep it as is?

Thanks!
>> In any case, the check needs to return an error code so the caller
>> knows to return without running the rest of the handler (and likewise
>> elsewhere).
>>
> understood.
>> Cheers,
>>
>> Tim.
>>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-17 13:59       ` Ben Catterall
@ 2015-08-17 14:58         ` Tim Deegan
  2015-08-17 15:14           ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Tim Deegan @ 2015-08-17 14:58 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, andrew.cooper3, jbeulich

At 14:59 +0100 on 17 Aug (1439823550), Ben Catterall wrote:
> On 11/08/15 11:33, Ben Catterall wrote:
> > On 10/08/15 11:07, Tim Deegan wrote:
> >> This should happen in paging_fault() so it can guard the
> >> shadow-pagetable paths too.  Once it's there, it'll need a check for
> >> is_hvm_vcpu() as well as for user_mode.  Maybe have a helper function
> >> 'is_hvm_deprivileged_vcpu()' to do both checks, also used in
> >> hvm_deprivileged_check_trap() &c.
> I've moved this and now need to add shadow page table support as this 
> currently only supports HAP.

Thanks.  There shouldn't be very much work to do for this -- after all
we're not running in the guest's pagetables so the actual shadow PT
mechanism won't be needed.  Maybe just a common helper function called
from the shadow & hap monitor-table builders?

> >> I wonder whether it would be better to switch to an IDT with all
> >> unacceptable traps stubbed out, rather than have to blacklist them all
> >> separately.  Probably not - this check is cheap, and maintaining the
> >> parallel tables would be a pain.
> >>
> >> Or maybe there's some single point upstream of here, in the asm
> >> handlers, that would catch all the cases where this check is needed?
> >>
> > Yep, I think this can be done.
> Had a deeper look at this. There is a single point in the asm where all 
> exceptions come in (handle_exception in entry.S), and we could branch 
> out there. It does mean that we'd treat _all_ exceptions that occur in 
> this mode the same way, whereas the current approach means that we can 
> treat them differently (e.g. get __func__). So, should I make all 
> exceptions go to the same point or keep it as is?

I think trap them all at the same point, unless you plan to have any
exceptions that don't just kill the guest.  I don't think you do, do
you?

This code is really Jan and Andrew's area, though.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-17 13:53                     ` Ben Catterall
@ 2015-08-17 15:07                       ` Tim Deegan
  2015-08-17 15:17                         ` Jan Beulich
  2015-08-18 16:55                         ` Andrew Cooper
  0 siblings, 2 replies; 53+ messages in thread
From: Tim Deegan @ 2015-08-17 15:07 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper,
	Aravind Gopalakrishnan, jbeulich, Boris Ostrovsky,
	Suravee Suthikulpanit

At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
> On 12/08/15 14:33, Andrew Cooper wrote:
> > On 12/08/15 14:29, Andrew Cooper wrote:
> >> On 11/08/15 19:29, Boris Ostrovsky wrote:
> >>> Would switching TR only when we know that we need to enter this
> >>> deprivileged mode help?
> >> This is an absolute must.  It is not safe to use syscall/sysexit without
> >> IST in place for NMIs and MCEs.
> >>
> >>> Assuming that it is less expensive than copying the stack.
> >> I was referring to the stack overflow issue, and whether it might be
> >> sensible to pro-actively which TR.
> >
> > Ahem! s/which/switch/
> >
> > ~Andrew
> >
> 
> So, have we arrived at a decision for this? Thanks!

Seems to have stalled a bit.  OK, I propose that:
 - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
 - we make that SP be an extension of the main stack, so that things
   like current() Just Work[tm];
 - we set this up and tear it down when we enter/leave depriv mode.
 - someone ought to look at the case where IST handlers copy
   themselves to the main stack, and see if we need to adjust that too.

Any other proposals?

I think we can leave the question of TR switching on VMEXIT as a
separate issue.

Cheers,

Tim.
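
To illustrate the second bullet: current() is derived from the stack
pointer by rounding up to the top of the per-cpu stack allocation, so any
SP inside the same block resolves to the same struct cpu_info. A
simplified sketch of that arithmetic (constants as in Xen's 8-page
stacks; struct contents elided, names illustrative):

    #define PAGE_SIZE  4096UL
    #define STACK_SIZE (8 * PAGE_SIZE)   /* Xen's per-cpu stack: 8 pages */

    struct vcpu;
    struct cpu_info {
        /* ... saved guest registers ... */
        struct vcpu *current_vcpu;       /* what current() ultimately reads */
    };

    /* Any sp within the same 8-page stack block rounds up to the same
     * struct cpu_info at the top of the block -- hence the interrupt SP
     * must stay inside the main stack allocation for current() to work. */
    static inline struct cpu_info *cpu_info_from_sp(unsigned long sp)
    {
        return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
    }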

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for deprivileged mode
  2015-08-17 14:58         ` Tim Deegan
@ 2015-08-17 15:14           ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2015-08-17 15:14 UTC (permalink / raw)
  To: Ben Catterall, Tim Deegan
  Cc: george.dunlap, andrew.cooper3, xen-devel, keir, ian.campbell

>>> On 17.08.15 at 16:58, <tim@xen.org> wrote:
> At 14:59 +0100 on 17 Aug (1439823550), Ben Catterall wrote:
>> On 11/08/15 11:33, Ben Catterall wrote:
>> > On 10/08/15 11:07, Tim Deegan wrote:
>> >> I wonder whether it would be better to switch to an IDT with all
>> >> unacceptable traps stubbed out, rather than have to blacklist them all
>> >> separately.  Probably not - this check is cheap, and maintaining the
>> >> parallel tables would be a pain.
>> >>
>> >> Or maybe there's some single point upstream of here, in the asm
>> >> handlers, that would catch all the cases where this check is needed?
>> >>
>> > Yep, I think this can be done.
>> Had a deeper look at this. There is a single point in the asm where all 
>> exceptions come in (handle_exception in entry.S), and we could branch 
>> out there. It does mean that we'd treat _all_ exceptions that occur in 
>> this mode the same way, whereas the current approach means that we can 
>> treat them differently (e.g. get __func__). So, should I make all 
>> exceptions go to the same point or keep it as is?
> 
> I think trap them all at the same point, unless you plan to have any
> exceptions that don't just kill the guest.  I don't think you do, do
> you?
> 
> This code is really Jan and Andrew's area, though.

And I think that deciding one way or the other here isn't necessary
at this point in time. Once there is a clear picture of whether the
route being explored here is actually usable, we can decide which
one is the better model. For now I'd recommend using whatever is
cheaper to implement.

Jan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-17 15:07                       ` Tim Deegan
@ 2015-08-17 15:17                         ` Jan Beulich
  2015-08-18 10:25                           ` Ben Catterall
  2015-08-18 16:55                         ` Andrew Cooper
  1 sibling, 1 reply; 53+ messages in thread
From: Jan Beulich @ 2015-08-17 15:17 UTC (permalink / raw)
  To: Ben Catterall, Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper,
	Aravind Gopalakrishnan, Suravee Suthikulpanit, Boris Ostrovsky

>>> On 17.08.15 at 17:07, <tim@xen.org> wrote:
> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>> So, have we arrived at a decision for this? Thanks!
> 
> Seems to have stalled a bit.  OK, I propose that:
>  - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
>  - we make that SP be an extension of the main stack, so that things
>    like current() Just Work[tm];
>  - we set this up and tear it down when we enter/leave depriv mode.
>  - someone ought to look at the case where IST handlers copy
>    themselves to the main stack, and see if we need to adjust that too.
> 
> Any other proposals?

No.

> I think we can leave the question of TR switching on VMEXIT as a
> separate issue.

Just like for the other one - at this point I think anything that work
should be okay. Dealing with quirks can be deferred (but it would
be nice if a respective note was added in a prominent place so it
doesn't get forgotten once/if these patches leave RFC state).

Jan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-17 15:17                         ` Jan Beulich
@ 2015-08-18 10:25                           ` Ben Catterall
  2015-08-18 10:26                             ` Ben Catterall
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-18 10:25 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper,
	Aravind Gopalakrishnan, Suravee Suthikulpanit, Boris Ostrovsky



On 17/08/15 16:17, Jan Beulich wrote:
>>>> On 17.08.15 at 17:07, <tim@xen.org> wrote:
>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>> So, have we arrived at a decision for this? Thanks!
>>
>> Seems to have stalled a bit.  OK, I propose that:
>>   - we use TR/IST to make Xen take interrupts/exceptions at a different SP;
>>   - we make that SP be an extension of the main stack, so that things
>>     like current() Just Work[tm];
 From Xen's cpu stack layout, page 4 is currently unused so I'll put it 
here. Is this acceptable?
>>   - we set this up and tear it down when we enter/leave depriv mode.
>>   - someone ought to look at the case where IST handlers copy
>>     themselves to the main stack, and see if we need to adjust that too.
>>
>> Any other proposals?
>
> No.
>


>> I think we can leave the question of TR switching on VMEXIT as a
>> separate issue.
>
> Just like for the other one - at this point I think anything that works
> should be okay. Dealing with quirks can be deferred (but it would
> be nice if a respective note was added in a prominent place so it
> doesn't get forgotten once/if these patches leave RFC state).
>
> Jan
>
Ok, thanks all!

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-18 10:25                           ` Ben Catterall
@ 2015-08-18 10:26                             ` Ben Catterall
  2015-08-18 14:22                               ` Jan Beulich
  0 siblings, 1 reply; 53+ messages in thread
From: Ben Catterall @ 2015-08-18 10:26 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper,
	Aravind Gopalakrishnan, Suravee Suthikulpanit, Boris Ostrovsky



On 18/08/15 11:25, Ben Catterall wrote:
>
>
> On 17/08/15 16:17, Jan Beulich wrote:
>>>>> On 17.08.15 at 17:07, <tim@xen.org> wrote:
>>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>>> So, have we arrived at a decision for this? Thanks!
>>>
>>> Seems to have stalled a bit.  OK, I propose that:
>>>   - we use TR/IST to make Xen take interrupts/exceptions at a
>>> different SP;
>>>   - we make that SP be an extension of the main stack, so that things
>>>     like current() Just Work[tm];
>  From Xen's cpu stack layout, page 4 is currently unused so I'll put it
> here. Is this acceptable?
Or, would it be better to put it at position 5, and move the optional 
MEMORY_GUARD page down to position 4?
>>>   - we set this up and tear it down when we enter/leave depriv mode.
>>>   - someone ought to look at the case where IST handlers copy
>>>     themselves to the main stack, and see if we need to adjust that too.
>>>
>>> Any other proposals?
>>
>> No.
>>
>
>
>>> I think we can leave the question of TR switching on VMEXIT as a
>>> separate issue.
>>
>> Just like for the other one - at this point I think anything that works
>> should be okay. Dealing with quirks can be deferred (but it would
>> be nice if a respective note was added in a prominent place so it
>> doesn't get forgotten once/if these patches leave RFC state).
>>
>> Jan
>>
> Ok, thanks all!
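
For orientation, the per-cpu stack layout being discussed, pieced
together from this thread (an illustration, not a verbatim copy of the
comment in xen/include/asm-x86/current.h):

    /*
     * Per-cpu stack, 8 pages (illustrative):
     *   7 - primary stack (struct cpu_info at the top)
     *   6 - primary stack
     *   5 - optionally not present (MEMORY_GUARD)
     *   4 - currently unused; Ben's first proposal for the depriv stack
     *   3 - syscall trampolines
     *   2 - MCE IST stack
     *   1 - NMI IST stack
     *   0 - double fault IST stack
     *
     * Alternative proposal: depriv stack at 5, MEMORY_GUARD moved to 4.
     */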

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-18 10:26                             ` Ben Catterall
@ 2015-08-18 14:22                               ` Jan Beulich
  0 siblings, 0 replies; 53+ messages in thread
From: Jan Beulich @ 2015-08-18 14:22 UTC (permalink / raw)
  To: Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap, Andrew Cooper,
	Tim Deegan, Aravind Gopalakrishnan, Suravee Suthikulpanit,
	Boris Ostrovsky

>>> On 18.08.15 at 12:26, <Ben.Catterall@citrix.com> wrote:
> On 18/08/15 11:25, Ben Catterall wrote:
>> On 17/08/15 16:17, Jan Beulich wrote:
>>>>>> On 17.08.15 at 17:07, <tim@xen.org> wrote:
>>>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>>>> So, have we arrived at a decision for this? Thanks!
>>>>
>>>> Seems to have stalled a bit.  OK, I propose that:
>>>>   - we use TR/IST to make Xen take interrupts/exceptions at a
>>>> different SP;
>>>>   - we make that SP be an extension of the main stack, so that things
>>>>     like current() Just Work[tm];
>>  From Xen's cpu stack layout, page 4 is currently unused so I'll put it
>> here. Is this acceptable?
> Or, would it be better to put it at position 5, and move the optional 
> MEMORY_GUARD page down to position 4?

At this stage either would do. Once (if) the experiment yielded
positive results, we can still evaluate the pros and cons of either
placement.

Jan

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-17 15:07                       ` Tim Deegan
  2015-08-17 15:17                         ` Jan Beulich
@ 2015-08-18 16:55                         ` Andrew Cooper
  2015-08-19 10:36                           ` Ben Catterall
  1 sibling, 1 reply; 53+ messages in thread
From: Andrew Cooper @ 2015-08-18 16:55 UTC (permalink / raw)
  To: Tim Deegan, Ben Catterall
  Cc: xen-devel, keir, ian.campbell, george.dunlap,
	Aravind Gopalakrishnan, jbeulich, Boris Ostrovsky,
	Suravee Suthikulpanit



On 17/08/15 08:07, Tim Deegan wrote:
> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>> On 12/08/15 14:33, Andrew Cooper wrote:
>>> On 12/08/15 14:29, Andrew Cooper wrote:
>>>> On 11/08/15 19:29, Boris Ostrovsky wrote:
>>>>> Would switching TR only when we know that we need to enter this
>>>>> deprivileged mode help?
>>>> This is an absolute must.  It is not safe to use syscall/sysexit without
>>>> IST in place for NMIs and MCEs.
>>>>
>>>>> Assuming that it is less expensive than copying the stack.
>>>> I was referring to the stack overflow issue, and whether it might be
>>>> sensible to pro-actively which TR.
>>> Ahem! s/which/switch/
>>>
>>> ~Andrew
>>>
>> So, have we arrived at a decision for this? Thanks!

Apologies for the delay - I am currently at the Xen Developer Summit.

> Seems to have stalled a bit.  OK, I propose that:
>   - we use TR/IST to make Xen take interrupts/exceptions at a different SP;

Xen re-enables interrupts in most interrupt handlers, which means that 
they must not have an IST set.  If an IST was set, a second interrupt 
would clobber the frame of the first.

However, just adjusting tss->rsp0 and syscall top-of-stack to the 
current rsp when entering depriv mode should be sufficient, and will 
avoid needing to copy the stack.

>   - we make that SP be an extension of the main stack, so that things
>     like current() Just Work[tm];
>   - we set this up and tear it down when we enter/leave depriv mode.
>   - someone ought to look at the case where IST handlers copy
>     themselves to the main stack, and see if we need to adjust that too.

They will need adjusting, but just disabling the copy entirely should be ok.

>
> Any other proposals?
>
> I think we can leave the question of TR switching on VMEXIT as a
> separate issue.

Agreed.  It is orthogonal to this problem.

~Andrew
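
A sketch of the adjustment Andrew describes, assuming a per-cpu variable
of the kind the syscall entry stub reads for its top-of-stack (all names
below are illustrative placeholders, not Xen's actual symbols):

    #include <stdint.h>

    struct tss64 {
        uint32_t reserved0;
        uint64_t rsp0, rsp1, rsp2;   /* stacks used on CPL transitions */
        /* ... ist[7], iomap base ... */
    } __attribute__((packed));

    /* Hypothetical per-cpu slot the syscall entry stub loads rsp from. */
    extern unsigned long this_cpu_syscall_stack_top;

    static void depriv_mode_enter(struct tss64 *tss, unsigned long cur_rsp)
    {
        /* Interrupts/exceptions and syscalls arriving from depriv mode
         * will then stack their frames as an extension of the main
         * stack -- no copying needed. */
        tss->rsp0 = cur_rsp;
        this_cpu_syscall_stack_top = cur_rsp;
    }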

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-18 16:55                         ` Andrew Cooper
@ 2015-08-19 10:36                           ` Ben Catterall
  0 siblings, 0 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-19 10:36 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan
  Cc: xen-devel, keir, ian.campbell, george.dunlap,
	Aravind Gopalakrishnan, jbeulich, Boris Ostrovsky,
	Suravee Suthikulpanit



On 18/08/15 17:55, Andrew Cooper wrote:
>
>
> On 17/08/15 08:07, Tim Deegan wrote:
>> At 14:53 +0100 on 17 Aug (1439823232), Ben Catterall wrote:
>>> On 12/08/15 14:33, Andrew Cooper wrote:
>>>> On 12/08/15 14:29, Andrew Cooper wrote:
>>>>> On 11/08/15 19:29, Boris Ostrovsky wrote:
>>>>>> Would switching TR only when we know that we need to enter this
>>>>>> deprivileged mode help?
>>>>> This is an absolute must.  It is not safe to use syscall/sysexit
>>>>> without
>>>>> IST in place for NMIs and MCEs.
>>>>>
>>>>>> Assuming that it is less expensive than copying the stack.
>>>>> I was referring to the stack overflow issue, and whether it might be
>>>>> sensible to pro-actively which TR.
>>>> Ahem! s/which/switch/
>>>>
>>>> ~Andrew
>>>>
>>> So, have we arrived at a decision for this? Thanks!
>
> Apologies for the delay - I am currently at the Xen Developer Summit.
>
No worries! Hope you're enjoying the summit. :)
>> Seems to have stalled a bit.  OK, I propose that:
>>   - we use TR/IST to make Xen take interrupts/exceptions at a
>> different SP;
>
> Xen re-enables interrupts in most interrupt handlers, which means that
> they must not have an IST set.  If an IST was set, a second interrupt
> would clobber the frame of the first.
>
> However, just adjusting tss->rsp0 and syscall top-of-stack to the
> current rsp when entering depriv mode should be sufficient, and will
> avoid needing to copy the stack.
Got it, thanks!
>
>>   - we make that SP be an extension of the main stack, so that things
>>     like current() Just Work[tm];
>>   - we set this up and tear it down when we enter/leave depriv mode.
>>   - someone ought to look at the case where IST handlers copy
>>     themselves to the main stack, and see if we need to adjust that too.
>
> They will need adjusting, but just disabling the copy entirely should be
> ok.
ok.
>
>>
>> Any other proposals?
>>
>> I think we can leave the question of TR switching on VMEXIT as a
>> separate issue.
>
> Agreed.  It is orthogonal to this problem.
>
> ~Andrew

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode
  2015-08-10 10:14     ` Andrew Cooper
  2015-08-11  9:55       ` Tim Deegan
@ 2015-08-20 14:42       ` Ben Catterall
  1 sibling, 0 replies; 53+ messages in thread
From: Ben Catterall @ 2015-08-20 14:42 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan
  Cc: george.dunlap, xen-devel, keir, ian.campbell, jbeulich



On 10/08/15 11:14, Andrew Cooper wrote:
> On 10/08/15 10:49, Tim Deegan wrote:
>> Hi,
>>
>> At 17:45 +0100 on 06 Aug (1438883118), Ben Catterall wrote:
>>> The process to switch into and out of deprivileged mode can be likened to
>>> setjmp/longjmp.
>>>
>>> To enter deprivileged mode, we take a copy of the stack from the guest's
>>> registers up to the current stack pointer.
>> This copy is pretty unfortunate, but I can see that avoiding it will
>> be a bit complex.  Could we do something with more stacks?  AFAICS
>> there have to be three stacks anyway:
>>
>>   - one to hold the depriv execution context;
>>   - one to hold the privileged execution context; and
>>   - one to take interrupts on.
>>
>> So maybe we could do some fiddling to make Xen take interrupts on a
>> different stack while we're depriv'd?
>
> That should happen naturally by virtue of the privilege level change
> involved in taking the interrupt.  Conceptually, taking interrupts from
> depriv mode is no different to taking them in a PV guest.
>
> Some complications which come to mind (none insurmountable):
>
> * Under this model, PV exception handlers should copy themselves onto
> the privileged execution stack.
> * Currently, the IST handlers  copy themselves onto the primary stack if
> they interrupt guest context.
From what I understand of entry.S's assembly:
handle_ist_exception is used for the machine_check and nmi ISTs, and 
these perform this copy. The double fault handler does not.
  - we take the exception on a different stack page via the IST
  - the handler copies the guest's registers from that page to the 
    bottom of the privileged stack, so the access routines for them 
    still work as usual
  - it moves its rsp to just after this structure on the privileged stack
  - it calls do_nmi
  - it does a ret_from_intr with the stack pointer on the privileged stack

Now, I _think_ it's sufficient to perform this copy and then just keep 
the rsp on the IST stack page (rather than moving it across as is 
currently done) so that we don't clobber the privileged stack.

Then, on the return path, move our rsp back to the privileged stack, 
just after the guest registers so that ret_from_intr can use the copied 
(and possibly modified) guest's registers.

Does that sound reasonable?

Thanks in advance!
> * AMD Task Register on vmexit.  (this old gem)
>
> ~Andrew
>
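
A rough C rendering of the IST return path Ben proposes above (entry.S
does this in assembly; do_nmi, ret_from_intr and struct cpu_user_regs
are existing Xen names, the other helpers are hypothetical placeholders):

    /* Stand-in for Xen's register frame; real layout elided. */
    struct cpu_user_regs { unsigned long r[24]; };

    extern struct cpu_user_regs *priv_stack_guest_regs(void); /* hypothetical */
    extern void do_nmi(struct cpu_user_regs *regs);           /* Xen's handler */
    extern void ret_from_intr(void);                          /* entry.S path */
    extern void switch_stack_and_jump(void *rsp,
                                      void (*cont)(void));    /* hypothetical */

    void ist_exception_depriv(struct cpu_user_regs *ist_regs)
    {
        struct cpu_user_regs *copy = priv_stack_guest_regs();

        /* Copy the guest frame to the bottom of the privileged stack so
         * the usual register accessors keep working... */
        *copy = *ist_regs;

        /* ...but handle the exception with rsp still on the IST page,
         * so the privileged stack is not clobbered. */
        do_nmi(copy);

        /* Only on the return path move rsp to sit just after the copied
         * (and possibly modified) frame and let ret_from_intr use it. */
        switch_stack_and_jump(copy, ret_from_intr);
    }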

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2015-08-20 14:42 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-06 16:45 [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations Ben Catterall
2015-08-06 16:45 ` [RFC 1/4] HVM x86 deprivileged mode: Page allocation helper Ben Catterall
2015-08-06 19:22   ` Andrew Cooper
2015-08-07  9:57     ` Ben Catterall
2015-08-07 13:14       ` Andrew Cooper
2015-08-10  8:50       ` Tim Deegan
2015-08-10  8:52         ` Tim Deegan
2015-08-10  8:55           ` Andrew Cooper
2015-08-10 10:08             ` Tim Deegan
2015-08-06 16:45 ` [RFC 2/4] HVM x86 deprivileged mode: Create deprivileged page tables Ben Catterall
2015-08-06 19:52   ` Andrew Cooper
2015-08-07 13:19     ` Ben Catterall
2015-08-07 15:20       ` Andrew Cooper
2015-08-06 16:45 ` [RFC 3/4] HVM x86 deprivileged mode: Code for switching into/out of deprivileged mode Ben Catterall
2015-08-06 20:55   ` Andrew Cooper
2015-08-07 12:51     ` Ben Catterall
2015-08-07 13:08       ` David Vrabel
2015-08-07 14:24       ` Andrew Cooper
2015-08-11  9:45     ` Ian Campbell
2015-08-10  9:49   ` Tim Deegan
2015-08-10 10:14     ` Andrew Cooper
2015-08-11  9:55       ` Tim Deegan
2015-08-11 16:51         ` Ben Catterall
2015-08-11 17:05           ` Tim Deegan
2015-08-11 17:19             ` Andrew Cooper
2015-08-11 18:29               ` Boris Ostrovsky
2015-08-12 13:29                 ` Andrew Cooper
2015-08-12 13:33                   ` Andrew Cooper
2015-08-17 13:53                     ` Ben Catterall
2015-08-17 15:07                       ` Tim Deegan
2015-08-17 15:17                         ` Jan Beulich
2015-08-18 10:25                           ` Ben Catterall
2015-08-18 10:26                             ` Ben Catterall
2015-08-18 14:22                               ` Jan Beulich
2015-08-18 16:55                         ` Andrew Cooper
2015-08-19 10:36                           ` Ben Catterall
2015-08-12 10:10               ` Jan Beulich
2015-08-12 13:22             ` Ben Catterall
2015-08-12 13:26               ` Tim Deegan
2015-08-20 14:42       ` Ben Catterall
2015-08-11 10:35     ` Ben Catterall
2015-08-06 16:45 ` [RFC 4/4] HVM x86 deprivileged mode: Trap handlers for " Ben Catterall
2015-08-06 21:24   ` Andrew Cooper
2015-08-07 12:32     ` Ben Catterall
2015-08-07 13:19       ` Andrew Cooper
2015-08-07 13:26         ` Ben Catterall
2015-08-10 10:07   ` Tim Deegan
2015-08-11 10:33     ` Ben Catterall
2015-08-17 13:59       ` Ben Catterall
2015-08-17 14:58         ` Tim Deegan
2015-08-17 15:14           ` Jan Beulich
2015-08-12  9:50 ` [RFC 0/4] HVM x86 enhancements to run Xen deprivileged mode operations Jan Beulich
2015-08-12 11:27   ` Ben Catterall
