* [PATCH RFC v2 0/4] Add mem_access support for PV domains
@ 2014-07-08  2:50 Aravindh Puthiyaparambil
  2014-07-08  2:50 ` [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
                   ` (4 more replies)
  0 siblings, 5 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-07-08  2:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Tim Deegan,
	Ian Jackson, Jan Beulich

This patch series adds mem_access support for PV domains. To do this, the PV
domain has to be run with shadow paging. A p2m implementation for mem_access
has been added to track the access permissions. Since a special ring page is
not created for PV domains at domain creation time, it is allocated as part of
enabling mem_access. This page is freed when mem_access is disabled or when
the domain is destroyed.

When mem_access is enabled for a PV domain, shadow paging is turned on and all
the shadows are dropped. On the resulting page faults, the shadow entries are
created with the default access permissions. On subsequent page faults, if a
violation is detected, a mem_event is sent to the mem_access listener, which
then resolves it.

The access permission for an individual page is stored in the shadow_flags
field of its page_info structure. Getting the access permission for a page
simply reads this field. Setting the access permission for a page writes the
new value into shadow_flags and drops the shadows for that gmfn; on the
resulting fault, the new PTE is created with the new permission. A new API has
been added to set the default access permission for PV domains.
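
For reference, a rough sketch of how a listener might drive this from
userspace, using the libxc wrappers added in patch 3 (illustrative only;
domid and mfn are placeholders, and the event loop and error handling are
omitted):

    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    domid_t domid = 1;              /* placeholder PV domain id */
    uint64_t mfn = 0x1234;          /* placeholder guest mfn to restrict */
    uint32_t port;
    void *ring_page;

    /* For a PV guest this also enables shadow paging and allocates the
     * xenheap ring page. */
    ring_page = xc_mem_access_enable(xch, domid, &port);

    /* PV only: set the default permission for all of the guest's pages. */
    xc_set_mem_access_default(xch, domid, XENMEM_access_rx2rw);

    /* Restrict one mfn; its shadows are dropped and recreated with the new
     * permission on the next fault, which may raise a violation event. */
    xc_set_mem_access(xch, domid, XENMEM_access_r, mfn, 1);

    /* ... consume and resolve violation events from ring_page ... */

    xc_mem_access_disable(xch, domid);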

Patches are based on top of commit f9cff088.

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

  x86/mm: Shadow and p2m changes for PV mem_access
  x86/mem_access: mem_access and mem_event changes to support PV domains
  tools/libxc: Add APIs for PV mem_access
  tool/xen-access: Add support for PV domains

 tools/libxc/xc_mem_access.c          |  42 ++++++
 tools/libxc/xc_mem_event.c           |  23 +++-
 tools/libxc/xc_private.h             |   9 ++
 tools/libxc/xenctrl.h                |  28 +++-
 tools/tests/xen-access/xen-access.c  | 104 +++++++++------
 xen/arch/x86/domain.c                |  12 ++
 xen/arch/x86/mm/Makefile             |   2 +-
 xen/arch/x86/mm/mem_access.c         | 244 ++++++++++++++++++++++++++++++++++-
 xen/arch/x86/mm/mem_event.c          |  62 +++++++--
 xen/arch/x86/mm/p2m-ma.c             | 148 +++++++++++++++++++++
 xen/arch/x86/mm/p2m.c                |  52 +++++---
 xen/arch/x86/mm/paging.c             |   7 +
 xen/arch/x86/mm/shadow/common.c      |  75 ++++++++++-
 xen/arch/x86/mm/shadow/multi.c       | 101 ++++++++++++++-
 xen/arch/x86/mm/shadow/private.h     |   7 +
 xen/arch/x86/srat.c                  |   1 +
 xen/arch/x86/usercopy.c              |  12 ++
 xen/common/page_alloc.c              |   3 +
 xen/drivers/video/vesa.c             |   1 +
 xen/include/asm-x86/domain.h         |   9 ++
 xen/include/asm-x86/mem_access.h     |   3 +
 xen/include/asm-x86/mm.h             |   1 -
 xen/include/asm-x86/p2m.h            |  17 +++
 xen/include/asm-x86/paging.h         |   1 +
 xen/include/asm-x86/shadow.h         |  15 +++
 xen/include/asm-x86/x86_64/uaccess.h |   7 +
 xen/include/public/memory.h          |   3 +
 27 files changed, 899 insertions(+), 90 deletions(-)
 create mode 100644 xen/arch/x86/mm/p2m-ma.c

-- 
1.9.1

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
@ 2014-07-08  2:50 ` Aravindh Puthiyaparambil
  2014-07-24 14:29   ` Jan Beulich
  2014-07-08  2:50 ` [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-07-08  2:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich, Tim Deegan

Changes to shadow code
----------------------
If the shadow pagefault handler detects that a mem_access listener is
present, it checks whether a violation occurred. If one did, the vCPU is
paused and an event is sent to the listener. The only case in which this does
not happen is when a listener has registered for write violations and Xen
itself writes to a guest page.
Similarly, if the propagation code detects that a mem_access listener is
present, it creates the PTE after applying the access permissions to it.
We do not police Xen writes to guest memory, keeping PV on par with HVM.
The method to do this, using the CR0.WP bit, was suggested by Jan Beulich.
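
Condensed, the write policing exception looks as follows (a sketch pieced
together from the multi.c and usercopy.c hunks in this patch):

    /* sh_page_fault(): the faulting write came from Xen itself */
    if ( violation && access_w &&
         regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
    {
        violation = 0;                              /* do not report it */
        if ( (read_cr0() & X86_CR0_WP) &&
             (guest_l1e_get_flags(gw.l1e) & _PAGE_RW) )
        {
            write_cr0(read_cr0() & ~X86_CR0_WP);    /* let the write through */
            v->arch.pv_vcpu.need_cr0_wp_set = 1;
        }
    }

    /* __copy_to_user_ll() / __put_user_size(): restore CR0.WP afterwards */
    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) )
    {
        write_cr0(read_cr0() | X86_CR0_WP);
        current->arch.pv_vcpu.need_cr0_wp_set = 0;
    }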

P2M changes
-----------
Add a new p2m implementation for mem_access. The access permissions are
stashed in the shadow_flags field of the page_info structure, as suggested by
Tim Deegan. p2m_mem_access_set_entry() sets the access value of the mfn given
as input and blows away the shadow entries for that mfn.
p2m_mem_access_get_entry() returns the access value of the mfn given as
input.
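
The access value is packed into otherwise unused bits of shadow_flags, above
the SH_type_* bits, and read back the same way. A minimal illustration using
the helpers added in this patch (pg stands for an arbitrary page_info pointer
owned by the PV domain):

    /* Stash the permission in page_info->shadow_flags ... */
    shadow_set_access(pg, p2m_access_rx);

    /* ... and read it back when propagating shadow entries or on query. */
    p2m_access_t a = shadow_get_access(pg);    /* a == p2m_access_rx */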

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>

---
Changes from RFC v1:
Removed shadow mem_access mode.
Removed the access lookup table and instead use the shadow_flags in the
page_info structure to stash the access permissions.
Modify p2m_access_to_flags() to only set restrictive permissions.
Replace if with case statement in p2m_mem_access_set_default().
Fix setting of default access value.
Do not police Xen writes to guest memory making PV on par with HVM.

NOTES
-----
Including sched.h in x86_64/uaccess.h caused a circular dependency.

make[3]: Entering directory `/kumo/knc-xen/xen/arch/x86'
gcc -o asm-offsets.s x86_64/asm-offsets.c <truncated>
In file included from /kumo/knc-xen/xen/include/asm/mm.h:9:0,
                 from /kumo/knc-xen/xen/include/xen/mm.h:115,
                 from /kumo/knc-xen/xen/include/asm/domain.h:5,
                 from /kumo/knc-xen/xen/include/xen/domain.h:6,
                 from /kumo/knc-xen/xen/include/xen/sched.h:10,
                 from x86_64/asm-offsets.c:10:
/kumo/knc-xen/xen/include/asm/uaccess.h: In function ‘__copy_to_user’:
/kumo/knc-xen/xen/include/asm/uaccess.h:197:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:197:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:200:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:200:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:203:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:203:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:206:13: error: dereferencing pointer to incomplete type
/kumo/knc-xen/xen/include/asm/uaccess.h:206:13: error: dereferencing pointer to incomplete type
make[3]: *** [asm-offsets.s] Error 1

The fix for this is the reason for the include changes to mm.h, paging.h, srat.c and vesa.c.
 
 xen/arch/x86/mm/Makefile             |   2 +-
 xen/arch/x86/mm/p2m-ma.c             | 148 +++++++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c                |  52 ++++++++----
 xen/arch/x86/mm/paging.c             |   7 ++
 xen/arch/x86/mm/shadow/common.c      |  75 ++++++++++++++++--
 xen/arch/x86/mm/shadow/multi.c       | 101 +++++++++++++++++++++++-
 xen/arch/x86/mm/shadow/private.h     |   7 ++
 xen/arch/x86/srat.c                  |   1 +
 xen/arch/x86/usercopy.c              |  12 +++
 xen/common/page_alloc.c              |   3 +
 xen/drivers/video/vesa.c             |   1 +
 xen/include/asm-x86/domain.h         |   6 ++
 xen/include/asm-x86/mm.h             |   1 -
 xen/include/asm-x86/p2m.h            |  17 ++++
 xen/include/asm-x86/paging.h         |   1 +
 xen/include/asm-x86/shadow.h         |  15 ++++
 xen/include/asm-x86/x86_64/uaccess.h |   7 ++
 17 files changed, 430 insertions(+), 26 deletions(-)
 create mode 100644 xen/arch/x86/mm/p2m-ma.c

diff --git a/xen/arch/x86/mm/Makefile b/xen/arch/x86/mm/Makefile
index 73dcdf4..41128a4 100644
--- a/xen/arch/x86/mm/Makefile
+++ b/xen/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@ subdir-y += shadow
 subdir-y += hap
 
 obj-y += paging.o
-obj-y += p2m.o p2m-pt.o p2m-ept.o p2m-pod.o
+obj-y += p2m.o p2m-pt.o p2m-ept.o p2m-pod.o p2m-ma.o
 obj-y += guest_walk_2.o
 obj-y += guest_walk_3.o
 obj-$(x86_64) += guest_walk_4.o
diff --git a/xen/arch/x86/mm/p2m-ma.c b/xen/arch/x86/mm/p2m-ma.c
new file mode 100644
index 0000000..d8ad12c
--- /dev/null
+++ b/xen/arch/x86/mm/p2m-ma.c
@@ -0,0 +1,148 @@
+/******************************************************************************
+ * arch/x86/mm/p2m-ma.c
+ *
+ * Implementation of p2m data structures, for use by PV mem_access code.
+ *
+ * Copyright (c) 2014 Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <xen/hypercall.h>
+#include <xen/sched.h>
+#include <asm/p2m.h>
+#include <asm/shadow.h>
+#include "mm-locks.h"
+
+/* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_to_page
+#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
+
+/* Convert access permissions to page table flags */
+void p2m_access_to_flags(u32 *flags, p2m_access_t access)
+{
+    /*
+     * Restrict the PTE flags according to the access permission, while
+     * preserving any more restrictive guest permissions.
+     */
+    switch ( access )
+    {
+    case p2m_access_r:
+        *flags &= ~_PAGE_RW;
+        *flags |= _PAGE_NX_BIT;
+        break;
+    case p2m_access_rx:
+    case p2m_access_rx2rw:
+        *flags &= ~_PAGE_RW;
+        break;
+    case p2m_access_rw:
+        *flags |= _PAGE_NX_BIT;
+        break;
+    case p2m_access_rwx:
+    default:
+        break;
+    }
+}
+
+/*
+ * Set the page permission of the mfn. This in effect removes all shadow
+ * mappings of that mfn. The access type of that mfn is stored in the
+ * shadow_flags field of its page_info structure.
+ */
+static int
+p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
+                         unsigned int page_order, p2m_type_t p2mt,
+                         p2m_access_t p2ma)
+{
+    struct domain *d = p2m->domain;
+    struct page_info *page = mfn_to_page(mfn);
+
+    ASSERT(shadow_mode_enabled(d));
+
+    /*
+     * For PV domains we only support r, rw, rx, rx2rw and rwx access
+     * permissions
+     */
+    switch ( p2ma )
+    {
+    case p2m_access_n:
+    case p2m_access_w:
+    case p2m_access_x:
+    case p2m_access_wx:
+    case p2m_access_n2rwx:
+        return -EINVAL;
+    default:
+        break;
+    }
+
+    if ( page_get_owner(page) != d )
+        return -ENOENT;
+
+    paging_lock(d);
+
+    shadow_set_access(page, p2ma);
+
+    ASSERT(d->vcpu && d->vcpu[0]);
+    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
+        flush_tlb_mask(d->domain_dirty_cpumask);
+
+    paging_unlock(d);
+
+    return 0;
+}
+
+/* Get the page permission of the mfn from page_info->shadow_flags */
+static mfn_t
+p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
+                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
+                         unsigned int *page_order)
+{
+    struct domain *d = p2m->domain;
+    /* For PV guests mfn == gfn */
+    mfn_t mfn = _mfn(gfn);
+    struct page_info *page = mfn_to_page(mfn);
+
+    ASSERT(shadow_mode_enabled(d));
+
+    *t = p2m_ram_rw;
+
+    if ( page_get_owner(page) != d )
+        return _mfn(INVALID_MFN);
+
+    *a = shadow_get_access(page);
+    return mfn;
+}
+
+/* Reset the set_entry and get_entry function pointers */
+void p2m_mem_access_reset(struct p2m_domain *p2m)
+{
+    p2m_pt_init(p2m);
+}
+
+/* Set the set_entry and get_entry function pointers */
+void p2m_mem_access_init(struct p2m_domain *p2m)
+{
+    p2m->set_entry = p2m_mem_access_set_entry;
+    p2m->get_entry = p2m_mem_access_get_entry;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 642ec28..b275bfc 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -33,6 +33,7 @@
 #include <asm/mem_event.h>
 #include <public/mem_event.h>
 #include <asm/mem_sharing.h>
+#include <asm/shadow.h>
 #include <xen/event.h>
 #include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/svm/amd-iommu-proto.h>
@@ -247,7 +248,9 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
     if ( q & P2M_UNSHARE )
         q |= P2M_ALLOC;
 
-    if ( !p2m || !paging_mode_translate(p2m->domain) )
+    if ( !p2m ||
+         (!paging_mode_translate(p2m->domain) &&
+         !mem_event_check_ring(&p2m->domain->mem_event->access)) )
     {
         /* Not necessarily true, but for non-translated guests, we claim
          * it's the most generic kind of memory */
@@ -284,7 +287,9 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
 
 void __put_gfn(struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !p2m || !paging_mode_translate(p2m->domain) )
+    if ( !p2m ||
+         (!paging_mode_translate(p2m->domain) &&
+         !mem_event_check_ring(&p2m->domain->mem_event->access)) )
         /* Nothing to do in this case */
         return;
 
@@ -1426,18 +1431,10 @@ void p2m_mem_access_resume(struct domain *d)
     }
 }
 
-/* Set access type for a region of pfns.
- * If start_pfn == -1ul, sets the default access type */
-long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
-                        uint32_t start, uint32_t mask, xenmem_access_t access)
+int p2m_convert_xenmem_access(struct p2m_domain *p2m,
+                              xenmem_access_t mem_access, p2m_access_t *a)
 {
-    struct p2m_domain *p2m = p2m_get_hostp2m(d);
-    p2m_access_t a, _a;
-    p2m_type_t t;
-    mfn_t mfn;
-    long rc = 0;
-
-    static const p2m_access_t memaccess[] = {
+    static const p2m_access_t p2ma[] = {
 #define ACCESS(ac) [XENMEM_access_##ac] = p2m_access_##ac
         ACCESS(n),
         ACCESS(r),
@@ -1452,21 +1449,42 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
 #undef ACCESS
     };
 
-    switch ( access )
+    switch ( mem_access )
     {
-    case 0 ... ARRAY_SIZE(memaccess) - 1:
-        a = memaccess[access];
+    case 0 ... ARRAY_SIZE(p2ma) - 1:
+        *a = p2ma[mem_access];
         break;
     case XENMEM_access_default:
-        a = p2m->default_access;
+        *a = p2m->default_access;
         break;
     default:
         return -EINVAL;
     }
+    return 0;
+}
+
+/*
+ * Set access type for a region of pfns.
+ * If start_pfn == -1ul, sets the default access type for HVM domains
+ */
+long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
+                        uint32_t start, uint32_t mask, xenmem_access_t access)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    p2m_access_t a, _a;
+    p2m_type_t t;
+    mfn_t mfn;
+    long rc = 0;
+
+    rc = p2m_convert_xenmem_access(p2m, access, &a);
+    if ( rc != 0 )
+        return rc;
 
     /* If request to set default access */
     if ( pfn == ~0ul )
     {
+        if ( is_pv_domain(d) )
+            return -ENOSYS;
         p2m->default_access = a;
         return 0;
     }
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index 32764ba..47397f1 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -627,6 +627,13 @@ void paging_teardown(struct domain *d)
     /* clean up log dirty resources. */
     paging_log_dirty_teardown(d);
 
+    /*
+     * Reset p2m setup in the case where a mem_access listener is present while
+     * the domain is being destroyed or it crashed without cleaning up.
+     */
+    if ( is_pv_domain(d) )
+        p2m_mem_access_reset(p2m_get_hostp2m(d));
+
     /* Move populate-on-demand cache back to domain_list for destruction */
     p2m_pod_empty_cache(d);
 }
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 3c803b6..9aacd8e 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -36,6 +36,7 @@
 #include <asm/current.h>
 #include <asm/flushtlb.h>
 #include <asm/shadow.h>
+#include <asm/mem_event.h>
 #include <xen/numa.h>
 #include "private.h"
 
@@ -1356,7 +1357,7 @@ void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
 
 /* Deliberately free all the memory we can: this will tear down all of
  * this domain's shadows */
-static void shadow_blow_tables(struct domain *d) 
+void shadow_blow_tables(struct domain *d)
 {
     struct page_info *sp, *t;
     struct vcpu *v = d->vcpu[0];
@@ -2435,15 +2436,20 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
     /* If that didn't catch the mapping, something is very wrong */
     if ( !sh_check_page_has_no_refs(page) )
     {
-        /* Don't complain if we're in HVM and there are some extra mappings: 
+        /*
+         * Don't complain if we're in HVM and there are some extra mappings:
          * The qemu helper process has an untyped mapping of this dom's RAM 
          * and the HVM restore program takes another.
          * Also allow one typed refcount for xenheap pages, to match
-         * share_xen_page_with_guest(). */
+         * share_xen_page_with_guest().
+         * PV domains that have a mem_access listener run in shadow mode
+         * without refcounts.
+         */
         if ( !(shadow_mode_external(v->domain)
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
-                   == !!is_xen_heap_page(page))) )
+                   == !!is_xen_heap_page(page))) &&
+             !mem_event_check_ring(&v->domain->mem_event->access) )
         {
             SHADOW_ERROR("can't find all mappings of mfn %lx: "
                           "c=%08lx t=%08lx\n", mfn_x(gmfn), 
@@ -2953,7 +2959,7 @@ int shadow_enable(struct domain *d, u32 mode)
         paging_unlock(d);
     }
 
-    /* Allow p2m and log-dirty code to borrow shadow memory */
+    /* Allow p2m, log-dirty and mem_access code to borrow shadow memory */
     d->arch.paging.alloc_page = shadow_alloc_p2m_page;
     d->arch.paging.free_page = shadow_free_p2m_page;
 
@@ -3197,7 +3203,7 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
         }
     }
 
-    /* Allow p2m and log-dirty code to borrow shadow memory */
+    /* Allow p2m, log-dirty and mem_access code to borrow shadow memory */
     d->arch.paging.alloc_page = shadow_alloc_p2m_page;
     d->arch.paging.free_page = shadow_free_p2m_page;
 
@@ -3661,6 +3667,63 @@ out:
 }
 
 /**************************************************************************/
+/* mem_access support */
+
+/*
+ * Shadow specific code which is called in XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
+ * for PV guests.
+ * Return 0 on success.
+ */
+int shadow_enable_mem_access(struct domain *d)
+{
+    int ret;
+
+    paging_lock(d);
+
+#if (SHADOW_OPTIMIZATIONS & SHOPT_LINUX_L3_TOPLEVEL)
+    /*
+     * 32bit PV guests on 64bit xen behave like older 64bit linux: they
+     * change an l4e instead of cr3 to switch tables.  Give them the
+     * same optimization
+     */
+    if ( is_pv_32on64_domain(d) )
+        d->arch.paging.shadow.opt_flags = SHOPT_LINUX_L3_TOPLEVEL;
+#endif
+
+    ret = shadow_one_bit_enable(d, PG_SH_enable);
+    paging_unlock(d);
+
+    return ret;
+}
+
+/*
+ * Shadow specific code which is called in
+ * XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE for PV guests
+ */
+int shadow_disable_mem_access(struct domain *d)
+{
+    int ret;
+
+    paging_lock(d);
+    ret = shadow_one_bit_disable(d, PG_SH_enable);
+    paging_unlock(d);
+
+    return ret;
+}
+
+void shadow_set_access(struct page_info *page, p2m_access_t a)
+{
+    page->shadow_flags = (page->shadow_flags & ~SHF_access_mask) |
+                         a << SHF_access_shift;
+
+}
+
+p2m_access_t shadow_get_access(struct page_info *page)
+{
+    return (page->shadow_flags & SHF_access_mask) >> SHF_access_shift;
+}
+
+/**************************************************************************/
 /* Shadow-control XEN_DOMCTL dispatcher */
 
 int shadow_domctl(struct domain *d, 
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index c6c9d10..db30396 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -38,6 +38,8 @@
 #include <asm/hvm/cacheattr.h>
 #include <asm/mtrr.h>
 #include <asm/guest_pt.h>
+#include <asm/mem_event.h>
+#include <asm/mem_access.h>
 #include <public/sched.h>
 #include "private.h"
 #include "types.h"
@@ -625,6 +627,14 @@ _sh_propagate(struct vcpu *v,
             }
     }
 
+    /* Propagate access permissions */
+    if ( unlikely(mem_event_check_ring(&d->mem_event->access)) &&
+         level == 1 && !sh_mfn_is_a_page_table(target_mfn) )
+    {
+        p2m_access_t a = shadow_get_access(mfn_to_page(target_mfn));
+        p2m_access_to_flags(&sflags, a);
+    }
+
     // Set the A&D bits for higher level shadows.
     // Higher level entries do not, strictly speaking, have dirty bits, but
     // since we use shadow linear tables, each of these entries may, at some
@@ -2822,6 +2832,7 @@ static int sh_page_fault(struct vcpu *v,
     int r;
     fetch_type_t ft = 0;
     p2m_type_t p2mt;
+    mem_event_request_t *req_ptr = NULL;
     uint32_t rc;
     int version;
 #if SHADOW_OPTIMIZATIONS & SHOPT_FAST_EMULATION
@@ -3009,7 +3020,84 @@ static int sh_page_fault(struct vcpu *v,
 
     /* What mfn is the guest trying to access? */
     gfn = guest_l1e_get_gfn(gw.l1e);
-    gmfn = get_gfn(d, gfn, &p2mt);
+    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
+        gmfn = get_gfn(d, gfn, &p2mt);
+    /*
+     * A mem_access listener is present, so we will first check if a violation
+     * has occurred.
+     */
+    else
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
+        p2m_access_t p2ma;
+
+        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
+        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
+             && regs->error_code & PFEC_page_present
+             && !(regs->error_code & PFEC_reserved_bit) )
+        {
+            int violation = 0;
+            bool_t access_w = !!(regs->error_code & PFEC_write_access);
+            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
+            bool_t access_r = access_x ? 0 : !access_w;
+
+            /* If the access is against the permissions, then send to mem_event */
+            switch ( p2ma )
+            {
+            case p2m_access_r:
+                violation = access_w || access_x;
+                break;
+            case p2m_access_rx:
+            case p2m_access_rx2rw:
+                violation = access_w;
+                break;
+            case p2m_access_rw:
+                violation = access_x;
+                break;
+            case p2m_access_rwx:
+            default:
+                break;
+            }
+
+            /*
+             * Do not police writes to guest memory from the Xen hypervisor.
+             * This keeps PV mem_access on par with HVM. Turn off CR0.WP here to
+             * allow the write to go through if the guest has marked the page as
+             * writable. Turn it back on in the guest access functions
+             * __copy_to_user / __put_user_size() after the write is completed.
+             */
+            if ( violation && access_w &&
+                 regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
+            {
+                unsigned long cr0 = read_cr0();
+
+                violation = 0;
+                if ( cr0 & X86_CR0_WP &&
+                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
+                {
+                    cr0 &= ~X86_CR0_WP;
+                    write_cr0(cr0);
+                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
+                }
+            }
+
+            if ( violation )
+            {
+                paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
+                              (va & ((1 << PAGE_SHIFT) - 1));
+                if ( !p2m_mem_access_check(gpa, 1, va, access_r, access_w,
+                                           access_x, &req_ptr) )
+                {
+                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" p2ma: %d\n",
+                                  (access_r ? 'r' : '-'),
+                                  (access_w ? 'w' : '-'),
+                                  (access_x ? 'x' : '-'), mfn_x(gmfn), p2ma);
+                    /* Rights not promoted, vcpu paused, work here is done */
+                    goto out_put_gfn;
+                }
+            }
+        }
+    }
 
     if ( shadow_mode_refcounts(d) && 
          ((!p2m_is_valid(p2mt) && !p2m_is_grant(p2mt)) ||
@@ -3214,7 +3302,18 @@ static int sh_page_fault(struct vcpu *v,
     SHADOW_PRINTK("fixed\n");
     shadow_audit_tables(v);
     paging_unlock(d);
+ out_put_gfn:
     put_gfn(d, gfn_x(gfn));
+
+    /* Send access violation to mem_access listener */
+    if ( unlikely(req_ptr != NULL) )
+    {
+        SHADOW_PRINTK("mem_access SEND violation mfn: 0x%"PRI_mfn"\n",
+                      mfn_x(gmfn));
+        mem_access_send_req(d, req_ptr);
+        xfree(req_ptr);
+    }
+
     return EXCRET_fault_fixed;
 
  emulate:
diff --git a/xen/arch/x86/mm/shadow/private.h b/xen/arch/x86/mm/shadow/private.h
index b778fcf..eddb3db 100644
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -260,6 +260,13 @@ static inline int sh_type_has_up_pointer(struct vcpu *v, unsigned int t)
 
 #define SHF_L1_ANY  (SHF_L1_32|SHF_L1_PAE|SHF_L1_64)
 
+/*
+ * Bits 14-17 of page_info->shadow_flags are used to store the p2m_access_t
+ * values of PV shadow domain pages.
+ */
+#define SHF_access_shift (SH_type_max_shadow + 1u)
+#define SHF_access_mask (0xfu << SHF_access_shift)
+
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) 
 /* Marks a guest L1 page table which is shadowed but not write-protected.
  * If set, then *only* L1 shadows (SHF_L1_*) are allowed. 
diff --git a/xen/arch/x86/srat.c b/xen/arch/x86/srat.c
index 2b05272..83d46bc 100644
--- a/xen/arch/x86/srat.c
+++ b/xen/arch/x86/srat.c
@@ -18,6 +18,7 @@
 #include <xen/acpi.h>
 #include <xen/numa.h>
 #include <xen/pfn.h>
+#include <xen/errno.h>
 #include <asm/e820.h>
 #include <asm/page.h>
 
diff --git a/xen/arch/x86/usercopy.c b/xen/arch/x86/usercopy.c
index 4cc78f5..eecf429 100644
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -45,6 +45,18 @@ unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned n)
         : "memory" );
     clac();
 
+    /*
+     * A mem_access listener was present and Xen tried to write to guest memory.
+     * To allow this write to go through without an event being sent to the
+     * listener or the pagetable entry being modified, we disabled CR0.WP in the
+     * shadow pagefault handler. We are enabling it back here again.
+     */
+    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) )
+    {
+        write_cr0(read_cr0() | X86_CR0_WP);
+        current->arch.pv_vcpu.need_cr0_wp_set = 0;
+    }
+
     return __n;
 }
 
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 7b4092d..5b6f747 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -43,6 +43,7 @@
 #include <asm/page.h>
 #include <asm/numa.h>
 #include <asm/flushtlb.h>
+#include <asm/shadow.h>
 #ifdef CONFIG_X86
 #include <asm/p2m.h>
 #include <asm/setup.h> /* for highmem_start only */
@@ -1660,6 +1661,8 @@ int assign_pages(
         page_set_owner(&pg[i], d);
         smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
         pg[i].count_info = PGC_allocated | 1;
+        if ( is_pv_domain(d) )
+            shadow_set_access(&pg[i], p2m_get_hostp2m(d)->default_access);
         page_list_add_tail(&pg[i], &d->page_list);
     }
 
diff --git a/xen/drivers/video/vesa.c b/xen/drivers/video/vesa.c
index 575db62..e7aa54a 100644
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -10,6 +10,7 @@
 #include <xen/xmalloc.h>
 #include <xen/kernel.h>
 #include <xen/vga.h>
+#include <xen/errno.h>
 #include <asm/io.h>
 #include <asm/page.h>
 #include "font.h"
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index abf55fb..f7b0262 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -380,6 +380,12 @@ struct pv_vcpu
     /* Deferred VA-based update state. */
     bool_t need_update_runstate_area;
     struct vcpu_time_info pending_system_time;
+
+    /*
+     * Flag that tracks if CR0.WP needs to be set after a Xen write to guest
+     * memory when a PV domain has a mem_access listener attached to it.
+     */
+    bool_t need_cr0_wp_set;
 };
 
 struct arch_vcpu
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index d253117..ec95feb 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -6,7 +6,6 @@
 #include <xen/list.h>
 #include <xen/spinlock.h>
 #include <asm/io.h>
-#include <asm/uaccess.h>
 
 /*
  * Per-page-frame information.
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 0ddbadb..029eea8 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -603,6 +603,10 @@ bool_t p2m_mem_access_check(paddr_t gpa, bool_t gla_valid, unsigned long gla,
 /* Resumes the running of the VCPU, restarting the last instruction */
 void p2m_mem_access_resume(struct domain *d);
 
+/* Convert xenmem_access_t to p2m_access_t */
+int p2m_convert_xenmem_access(struct p2m_domain *p2m,
+                              xenmem_access_t mem_access, p2m_access_t *a);
+
 /* Set access type for a region of pfns.
  * If start_pfn == -1ul, sets the default access type */
 long p2m_set_mem_access(struct domain *d, unsigned long start_pfn, uint32_t nr,
@@ -613,6 +617,19 @@ long p2m_set_mem_access(struct domain *d, unsigned long start_pfn, uint32_t nr,
 int p2m_get_mem_access(struct domain *d, unsigned long pfn,
                        xenmem_access_t *access);
 
+/*
+ * Functions specific to the p2m-ma implementation
+ */
+
+/* Set up p2m function pointers for the mem_access implementation */
+void p2m_mem_access_init(struct p2m_domain *p2m);
+
+/* Reset p2m function pointers */
+void p2m_mem_access_reset(struct p2m_domain *p2m);
+
+/* Convert access permissions to page table flags */
+void p2m_access_to_flags(u32 *flags, p2m_access_t access);
+
 /* 
  * Internal functions, only called by other p2m code
  */
diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
index 9b8f8de..d30c569 100644
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -32,6 +32,7 @@
 #include <xen/domain_page.h>
 #include <asm/flushtlb.h>
 #include <asm/domain.h>
+#include <asm/uaccess.h>
 
 /*****************************************************************************
  * Macros to tell which paging mode a domain is in */
diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
index f40cab4..0420dd8 100644
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -86,6 +86,18 @@ int shadow_disable_log_dirty(struct domain *d);
 /* shadow code to call when bitmap is being cleaned */
 void shadow_clean_dirty_bitmap(struct domain *d);
 
+/* shadow code to call when mem_access is enabled */
+int shadow_enable_mem_access(struct domain *d);
+
+/* shadow code to call when mem access is disabled */
+int shadow_disable_mem_access(struct domain *d);
+
+/* Set the access value in shadow_flags */
+void shadow_set_access(struct page_info *page, p2m_access_t a);
+
+/* Get the access value from shadow_flags */
+p2m_access_t shadow_get_access(struct page_info *page);
+
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
  * Called to initialize paging structures if the paging mode
  * has changed, and when bringing up a VCPU for the first time. */
@@ -114,6 +126,9 @@ static inline void shadow_remove_all_shadows(struct vcpu *v, mfn_t gmfn)
 /* Discard _all_ mappings from the domain's shadows. */
 void shadow_blow_tables_per_domain(struct domain *d);
 
+/* Tear down all of this domain's shadows */
+void shadow_blow_tables(struct domain *d);
+
 #endif /* _XEN_SHADOW_H */
 
 /*
diff --git a/xen/include/asm-x86/x86_64/uaccess.h b/xen/include/asm-x86/x86_64/uaccess.h
index 953abe7..6d13ec6 100644
--- a/xen/include/asm-x86/x86_64/uaccess.h
+++ b/xen/include/asm-x86/x86_64/uaccess.h
@@ -1,6 +1,8 @@
 #ifndef __X86_64_UACCESS_H
 #define __X86_64_UACCESS_H
 
+#include <xen/sched.h>
+
 #define COMPAT_ARG_XLAT_VIRT_BASE ((void *)ARG_XLAT_START(current))
 #define COMPAT_ARG_XLAT_SIZE      (2*PAGE_SIZE)
 struct vcpu;
@@ -65,6 +67,11 @@ do {									\
 	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
 	default: __put_user_bad();					\
 	}								\
+    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
+    { \
+        write_cr0(read_cr0() | X86_CR0_WP); \
+        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
+    } \
 } while (0)
 
 #define __get_user_size(x,ptr,size,retval,errret)			\
-- 
1.9.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
  2014-07-08  2:50 ` [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
@ 2014-07-08  2:50 ` Aravindh Puthiyaparambil
  2014-07-24 14:38   ` Jan Beulich
  2014-07-08  2:50 ` [PATCH RFC v2 3/4] tools/libxc: Add APIs for PV mem_access Aravindh Puthiyaparambil
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-07-08  2:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan, Keir Fraser, Jan Beulich

mem_access changes
------------------
New memory access sub-ops, XENMEM_access_op_set_default,
XENMEM_access_op_create_ring_page and XENMEM_access_op_get_ring_mfn have
been added. The mem_access listener makes these calls during setup.
The ring page is created from the xenheap and shared with the guest. It
is freed when mem_access is disabled or when the domain is shut down or
destroyed.

XENMEM_access_op_set_default has been added to set the default
permission for the pages belonging to the PV domain. Unlike for a HVM
domain, the mem_access listener cannot set access permissions for all
pages since it does not know all the mfns that belong to the PV domain.
The other reason for adding this as a separate sub-op, rather than
folding it into p2m_set_mem_access(), is that the page_info pointer from
which to resume setting default permissions after a hypercall
continuation will not fit in the hypercall op field.

XENMEM_access_op_[sg]et_access hypercalls are modified to accommodate
calls for a PV domain. When setting the access permissions for an mfn,
all shadows for that mfn are dropped. They get recreated with the new
permissions on the next page fault for that mfn. To get the permissions
for an mfn, the value is read from shadow_flags.

mem_event changes
-----------------
The XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE/DISABLE hypercalls are
modified to allow mem_access to work with PV domains. When the access
listener goes to enable mem_access for a PV domain, shadow mode is turned
on and the p2m structures are initialized. When disabling, shadow mode is
turned off.
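
Condensed, the PV branch of the enable operation does roughly the following
(a sketch of the mem_event_domctl() change below, with the error handling
and failure cleanup trimmed):

    /* XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE, PV guest: */
    if ( !shadow_mode_enabled(d) )
        rc = shadow_enable_mem_access(d);       /* turn on shadow paging */
    p2m_mem_access_init(p2m_get_hostp2m(d));    /* install set/get_entry hooks */
    rc = mem_event_enable(d, mec, med, _VPF_mem_access,
                          HVM_PARAM_ACCESS_RING_PFN, mem_access_notification);
    /* On failure the p2m hooks are reset, shadow mode is disabled again and
     * the xenheap ring page is freed via mem_access_free_pv_ring(). */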

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>

---
Changes from RFC v1:
Fallout due to changes in p2m and shadow code.
Add XENMEM_access_op_set_default access sub-op.

 xen/arch/x86/domain.c            |  12 ++
 xen/arch/x86/mm/mem_access.c     | 244 +++++++++++++++++++++++++++++++++++++--
 xen/arch/x86/mm/mem_event.c      |  62 +++++++---
 xen/include/asm-x86/domain.h     |   3 +
 xen/include/asm-x86/mem_access.h |   3 +
 xen/include/public/memory.h      |   3 +
 6 files changed, 307 insertions(+), 20 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index e896210..49d8545 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -57,6 +57,7 @@
 #include <asm/nmi.h>
 #include <asm/mce.h>
 #include <asm/amd.h>
+#include <asm/mem_access.h>
 #include <xen/numa.h>
 #include <xen/iommu.h>
 #include <compat/vcpu.h>
@@ -593,8 +594,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         }
     }
     else
+    {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
+        d->arch.pv_domain.access_ring_mfn = _mfn(INVALID_MFN);
+    }
 
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
@@ -632,8 +636,16 @@ void arch_domain_destroy(struct domain *d)
 
     free_perdomain_mappings(d);
     if ( is_pv_domain(d) )
+    {
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
 
+        /*
+         * Free the PV mem_access ring xenheap page in the case where a
+         * mem_access listener is present while the domain is being destroyed.
+         */
+        mem_access_free_pv_ring(d);
+    }
+
     free_xenheap_page(d->shared_info);
     cleanup_domain_irq_mapping(d);
 }
diff --git a/xen/arch/x86/mm/mem_access.c b/xen/arch/x86/mm/mem_access.c
index e8465a5..8060446 100644
--- a/xen/arch/x86/mm/mem_access.c
+++ b/xen/arch/x86/mm/mem_access.c
@@ -26,8 +26,122 @@
 #include <xen/hypercall.h>
 #include <asm/p2m.h>
 #include <asm/mem_event.h>
+#include <asm/event.h>
+#include <asm/shadow.h>
 #include <xsm/xsm.h>
+#include "mm-locks.h"
 
+/* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
+#undef mfn_to_page
+#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
+
+static inline bool_t domain_valid_for_mem_access(struct domain *d)
+{
+    if ( is_hvm_domain(d) )
+    {
+        /* Only HAP is supported */
+        if ( !hap_enabled(d) )
+            return 0;
+
+        /* Currently only EPT is supported */
+        if ( !cpu_has_vmx )
+            return 0;
+    }
+    /*
+     * Only PV guests using shadow mode and running on CPUs with the NX bit are
+     * supported.
+     */
+    else if ( !shadow_mode_enabled(d) || !cpu_has_nx )
+        return 0;
+
+    return 1;
+}
+
+/*
+ * Set the default permission for all pages for a PV domain.
+ * Unlike for a HVM domain, the mem_access listener cannot set access
+ * permissions for all pages since it does not know all the mfns that belong to
+ * the PV domain. All it can do is set permission for individual pages. This
+ * function blows away the shadows in lieu of that so that new faults will set
+ * the pagetable entry permissions to the default value. The function also sets
+ * the default access value in the page_info->shadow_flags for each page in the
+ * domain. start_page is the page from which to resume setting default
+ * permissions after a hypercall continuation. This is also the reason why this
+ * function cannot be folded into p2m_set_mem_access(), as the pointer won't
+ * fit in the hypercall op field.
+ */
+static int mem_access_set_default(struct domain *d, uint64_t *start_page,
+                           xenmem_access_t access)
+{
+    struct p2m_domain *p2m = p2m_get_hostp2m(d);
+    struct page_info *page;
+    struct page_list_head head;
+    p2m_access_t a;
+    int rc = 0, ctr = 0;
+
+    if ( !is_pv_domain(d) )
+        return -ENOSYS;
+
+    ASSERT(shadow_mode_enabled(d));
+
+    rc = p2m_convert_xenmem_access(p2m, access, &a);
+    if ( rc != 0 )
+        return rc;
+
+    /*
+     * For PV domains we only support r, rw, rx, rx2rw and rwx access
+     * permissions
+     */
+    switch ( a )
+    {
+    case p2m_access_n:
+    case p2m_access_w:
+    case p2m_access_x:
+    case p2m_access_wx:
+    case p2m_access_n2rwx:
+        return -EINVAL;
+    default:
+        break;
+    }
+
+    paging_lock_recursive(d);
+
+    if ( *start_page )
+    {
+        head.next = (struct page_info *)*start_page;
+        head.tail = d->page_list.tail;
+    }
+    else
+        head = d->page_list;
+
+    page_list_for_each(page, &head)
+    {
+        shadow_set_access(page, a);
+        if ( page != head.tail && !(++ctr & MEMOP_CMD_MASK) &&
+             hypercall_preempt_check() )
+        {
+            struct page_info *next = page_list_next(page, &head);
+            if ( next )
+            {
+                *start_page = (uint64_t)next;
+                rc = -EAGAIN;
+            }
+            break;
+        }
+    }
+
+    if ( rc == 0 )
+    {
+        p2m->default_access = a;
+        shadow_blow_tables(d);
+    }
+
+    paging_unlock(d);
+
+    return rc;
+}
 
 int mem_access_memop(unsigned long cmd,
                      XEN_GUEST_HANDLE_PARAM(xen_mem_access_op_t) arg)
@@ -43,16 +157,14 @@ int mem_access_memop(unsigned long cmd,
     if ( rc )
         return rc;
 
-    rc = -EINVAL;
-    if ( !is_hvm_domain(d) )
-        goto out;
-
     rc = xsm_mem_event_op(XSM_DM_PRIV, d, XENMEM_access_op);
     if ( rc )
         goto out;
 
     rc = -ENODEV;
-    if ( unlikely(!d->mem_event->access.ring_page) )
+    if ( unlikely(!d->mem_event->access.ring_page) &&
+         mao.op != XENMEM_access_op_create_ring_page &&
+         mao.op != XENMEM_access_op_get_ring_mfn )
         goto out;
 
     switch ( mao.op )
@@ -67,10 +179,21 @@ int mem_access_memop(unsigned long cmd,
         unsigned long start_iter = cmd & ~MEMOP_CMD_MASK;
 
         rc = -EINVAL;
+        if ( !domain_valid_for_mem_access(d) )
+            break;
+
+        /*
+         * max_pfn for PV domains is obtained from the shared_info structures
+         * that the guest maintains. It is up to the guest to maintain this and
+         * is not filled in during early boot. So we do not check if we are
+         * crossing max_pfn here and will depend on the checks in
+         * p2m_mem_access_set_entry().
+         */
         if ( (mao.pfn != ~0ull) &&
              (mao.nr < start_iter ||
               ((mao.pfn + mao.nr - 1) < mao.pfn) ||
-              ((mao.pfn + mao.nr - 1) > domain_get_maximum_gpfn(d))) )
+              ((mao.pfn + mao.nr - 1) > domain_get_maximum_gpfn(d) &&
+                !is_pv_domain(d))) )
             break;
 
         rc = p2m_set_mem_access(d, mao.pfn, mao.nr, start_iter,
@@ -89,7 +212,18 @@ int mem_access_memop(unsigned long cmd,
         xenmem_access_t access;
 
         rc = -EINVAL;
-        if ( (mao.pfn > domain_get_maximum_gpfn(d)) && mao.pfn != ~0ull )
+        if ( !domain_valid_for_mem_access(d) )
+            break;
+
+        /*
+         * max_pfn for PV domains is obtained from the shared_info structures
+         * that the guest maintains. It is up to the guest to maintain this and
+         * is not filled in during early boot. So we do not check if we are
+         * crossing max_pfn here and will depend on the checks in
+         * p2m_mem_access_get_entry().
+         */
+        if ( (mao.pfn > domain_get_maximum_gpfn(d) && !is_pv_domain(d)) &&
+             mao.pfn != ~0ull )
             break;
 
         rc = p2m_get_mem_access(d, mao.pfn, &access);
@@ -102,6 +236,87 @@ int mem_access_memop(unsigned long cmd,
         break;
     }
 
+    case XENMEM_access_op_set_default:
+        /*
+         * mem_access listeners for HVM domains call
+         * xc_set_mem_access(first_pfn = ~0) to set default access.
+         */
+        rc = -ENOSYS;
+        if ( !is_pv_domain(d) )
+            break;
+
+        rc = mem_access_set_default(d, (uint64_t *)&mao.pfn, mao.access);
+        if ( rc == -EAGAIN )
+        {
+            ASSERT(mao.pfn != 0);
+            rc = __copy_field_to_guest(arg, &mao, pfn) ? -EFAULT : 0;
+            if ( rc == 0 )
+                rc = hypercall_create_continuation(__HYPERVISOR_memory_op, "lh",
+                                                   XENMEM_access_op, arg);
+
+        }
+        break;
+
+    case XENMEM_access_op_create_ring_page:
+    {
+        void *access_ring_va;
+
+        /*
+         * The special ring page for HVM domains would have been set up during
+         * domain creation.
+         */
+        rc = -ENOSYS;
+        if ( !is_pv_domain(d) )
+            break;
+
+        /*
+         * The ring page was created by a mem_access listener but was not
+         * freed. Do not allow another xenheap page to be allocated.
+         */
+        if ( mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        {
+            rc = -EPERM;
+            break;
+        }
+
+        access_ring_va = alloc_xenheap_page();
+        if ( access_ring_va == NULL )
+        {
+            rc = -ENOMEM;
+            break;
+        }
+
+        clear_page(access_ring_va);
+        share_xen_page_with_guest(virt_to_page(access_ring_va), d,
+                                  XENSHARE_writable);
+
+        d->arch.pv_domain.access_ring_mfn = _mfn(virt_to_mfn(access_ring_va));
+
+        rc = 0;
+        break;
+    }
+
+    case XENMEM_access_op_get_ring_mfn:
+        /*
+         * mem_access listeners for HVM domains should call xc_hvm_param_get()
+         * instead of xc_mem_access_get_ring_mfn().
+         */
+        rc = -ENOSYS;
+        if ( !is_pv_domain(d) )
+            break;
+
+        if ( !mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        {
+            rc = -ENODEV;
+            break;
+        }
+
+        mao.pfn = mfn_x(d->arch.pv_domain.access_ring_mfn);
+        rc = __copy_field_to_guest(arg, &mao, pfn) ? -EFAULT : 0;
+
+        rc = 0;
+        break;
+
     default:
         rc = -ENOSYS;
         break;
@@ -123,6 +338,21 @@ int mem_access_send_req(struct domain *d, mem_event_request_t *req)
     return 0;
 } 
 
+/* Free the xenheap page used for the PV access ring */
+void mem_access_free_pv_ring(struct domain *d)
+{
+    struct page_info *pg = mfn_to_page(d->arch.pv_domain.access_ring_mfn);
+
+    if ( !mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        return;
+
+    BUG_ON(page_get_owner(pg) != d);
+    if ( test_and_clear_bit(_PGC_allocated, &pg->count_info) )
+        put_page(pg);
+    free_xenheap_page(mfn_to_virt(mfn_x(d->arch.pv_domain.access_ring_mfn)));
+    d->arch.pv_domain.access_ring_mfn = _mfn(INVALID_MFN);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index 40ae841..06ac9f4 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -25,6 +25,7 @@
 #include <xen/event.h>
 #include <xen/wait.h>
 #include <asm/p2m.h>
+#include <asm/shadow.h>
 #include <asm/mem_event.h>
 #include <asm/mem_paging.h>
 #include <asm/mem_access.h>
@@ -49,7 +50,12 @@ static int mem_event_enable(
     xen_event_channel_notification_t notification_fn)
 {
     int rc;
-    unsigned long ring_gfn = d->arch.hvm_domain.params[param];
+    unsigned long ring_gfn;
+
+    if ( is_pv_domain(d) && param == HVM_PARAM_ACCESS_RING_PFN )
+        ring_gfn = mfn_x(d->arch.pv_domain.access_ring_mfn);
+    else
+        ring_gfn = d->arch.hvm_domain.params[param];
 
     /* Only one helper at a time. If the helper crashed,
      * the ring is in an undefined state and so is the guest.
@@ -587,28 +593,58 @@ int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec,
         switch( mec->op )
         {
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE:
-        {
             rc = -ENODEV;
-            /* Only HAP is supported */
-            if ( !hap_enabled(d) )
-                break;
+            if ( !is_pv_domain(d) )
+            {
+                /* Only HAP is supported */
+                if ( !hap_enabled(d) )
+                    break;
 
-            /* Currently only EPT is supported */
-            if ( !cpu_has_vmx )
-                break;
+                /* Currently only EPT is supported */
+                if ( !cpu_has_vmx )
+                    break;
+            }
+            /* PV guests use shadow mode for mem_access */
+            else
+            {
+                if ( !shadow_mode_enabled(d) )
+                {
+                    rc = shadow_enable_mem_access(d);
+                    if ( rc != 0 )
+                        goto pv_out;
+                }
+                p2m_mem_access_init(p2m_get_hostp2m(d));
+            }
 
             rc = mem_event_enable(d, mec, med, _VPF_mem_access, 
                                     HVM_PARAM_ACCESS_RING_PFN,
                                     mem_access_notification);
-        }
-        break;
+
+ pv_out:
+            if ( rc != 0 && is_pv_domain(d) )
+            {
+                p2m_mem_access_reset(p2m_get_hostp2m(d));
+                if ( shadow_mode_enabled(d) )
+                    shadow_disable_mem_access(d);
+                mem_access_free_pv_ring(d);
+            }
+            break;
 
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE:
-        {
             if ( med->ring_page )
                 rc = mem_event_disable(d, med);
-        }
-        break;
+
+            if ( is_pv_domain(d) )
+            {
+                domain_pause(d);
+                p2m_mem_access_reset(p2m_get_hostp2m(d));
+                if ( shadow_mode_enabled(d) )
+                    shadow_disable_mem_access(d);
+
+                mem_access_free_pv_ring(d);
+                domain_unpause(d);
+            }
+            break;
 
         default:
             rc = -ENOSYS;
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index f7b0262..cf2ae2a 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -227,6 +227,9 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
+
+    /* mfn of the mem_access ring page for PV domains */
+    mfn_t access_ring_mfn;
 };
 
 struct arch_domain
diff --git a/xen/include/asm-x86/mem_access.h b/xen/include/asm-x86/mem_access.h
index 5c7c5fd..bf9fce9 100644
--- a/xen/include/asm-x86/mem_access.h
+++ b/xen/include/asm-x86/mem_access.h
@@ -27,6 +27,9 @@ int mem_access_memop(unsigned long cmd,
                      XEN_GUEST_HANDLE_PARAM(xen_mem_access_op_t) arg);
 int mem_access_send_req(struct domain *d, mem_event_request_t *req);
 
+/* Free the xenheap page used for the access ring */
+void mem_access_free_pv_ring(struct domain *d);
+
 #endif /* _XEN_ASM_MEM_ACCESS_H */
 
 /*
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 2c57aa0..5ba1581 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -389,6 +389,9 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_event_op_t);
 #define XENMEM_access_op_resume             0
 #define XENMEM_access_op_set_access         1
 #define XENMEM_access_op_get_access         2
+#define XENMEM_access_op_set_default        3
+#define XENMEM_access_op_create_ring_page   4
+#define XENMEM_access_op_get_ring_mfn       5
 
 typedef enum {
     XENMEM_access_n,
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH RFC v2 3/4] tools/libxc: Add APIs for PV mem_access
  2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
  2014-07-08  2:50 ` [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
  2014-07-08  2:50 ` [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
@ 2014-07-08  2:50 ` Aravindh Puthiyaparambil
  2014-07-08  2:50 ` [PATCH RFC v2 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
  2014-07-08 16:27 ` [PATCH RFC v2 0/4] Add mem_access " Konrad Rzeszutek Wilk
  4 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-07-08  2:50 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Jackson, Ian Campbell, Stefano Stabellini

Add APIs, xc_set_mem_access_default(), xc_mem_access_create_ring_page()
and xc_mem_access_get_ring_mfn(). xc_mem_event_enable() will call
xc_mem_access_create_ring_page() before enabling mem_access for PV
domains. This is not needed for HVM domains as the page is created
at domain creation time. It can then call
xc_mem_access_get_ring_mfn() to get the mfn of the created page to map
in. This is equivalent to xc_get_hvm_param(HVM_PARAM_ACCESS_RING_PFN)
for HVM domains.

xc_set_mem_access_default sets the default permission for a PV domain.
This should not be called for HVM domains. A mem_access listener for a
HVM domain does this in two steps:
    xc_set_mem_access(xch, domid, default_access, ~0ull, 0);
    xc_set_mem_access(xch, domid, default_access, 0, max_pages);
However, for a PV domain this is not possible, as the address
translations are done by the guest and the listener does not know all the
mfns that belong to the PV domain. This function performs the operation
on behalf of the mem_access listener. Additionally, this was not done as
part of step 1 due to the way hypercall continuation works in the hypervisor.
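
A sketch of how xc_mem_event_enable() ends up using the two new ring page
calls for a PV domain (simplified from the xc_mem_event.c change below; pfn
is the same uint64_t already used for the HVM path, and error handling is
trimmed):

    /* PV only: the ring page does not exist yet, so create it in Xen ... */
    if ( xc_mem_access_create_ring_page(xch, domain_id) != 0 )
        goto out;

    /* ... then ask Xen for its mfn, the PV counterpart of
     * xc_hvm_param_get(xch, domain_id, HVM_PARAM_ACCESS_RING_PFN, &pfn). */
    if ( xc_mem_access_get_ring_mfn(xch, domain_id, &pfn) != 0 )
        goto out;

    /* pfn is then mapped and initialised exactly as for an HVM guest. */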

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>

---
Changes from RFC v1:
Added xc_set_mem_access_default() API.
The ring page setup has been moved to xc_mem_event_enable() because of
the XSA-99 changes.
 
 tools/libxc/xc_mem_access.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 tools/libxc/xc_mem_event.c  | 23 ++++++++++++++++++++++-
 tools/libxc/xc_private.h    |  9 +++++++++
 tools/libxc/xenctrl.h       | 28 +++++++++++++++++++++++++++-
 4 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/xc_mem_access.c b/tools/libxc/xc_mem_access.c
index 461f0e9..f7699fa 100644
--- a/tools/libxc/xc_mem_access.c
+++ b/tools/libxc/xc_mem_access.c
@@ -87,6 +87,48 @@ int xc_get_mem_access(xc_interface *xch,
     return rc;
 }
 
+int xc_set_mem_access_default(xc_interface *xch, domid_t domain_id,
+                              xenmem_access_t default_access)
+{
+    xen_mem_access_op_t mao =
+    {
+        .op     = XENMEM_access_op_set_default,
+        .domid  = domain_id,
+        .access = default_access
+    };
+
+    return do_memory_op(xch, XENMEM_access_op, &mao, sizeof(mao));
+}
+
+int xc_mem_access_create_ring_page(xc_interface *xch, domid_t domain_id)
+{
+    xen_mem_access_op_t mao =
+    {
+        .op    = XENMEM_access_op_create_ring_page,
+        .domid = domain_id
+    };
+
+    return do_memory_op(xch, XENMEM_access_op, &mao, sizeof(mao));
+}
+
+int xc_mem_access_get_ring_mfn(xc_interface *xch, domid_t domain_id,
+                               uint64_t *mfn)
+{
+    int rc;
+    xen_mem_access_op_t mao =
+    {
+        .op    = XENMEM_access_op_get_ring_mfn,
+        .domid = domain_id
+    };
+
+    rc = do_memory_op(xch, XENMEM_access_op, &mao, sizeof(mao));
+
+    if ( !rc )
+        *mfn = mao.pfn;
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xc_mem_event.c b/tools/libxc/xc_mem_event.c
index faf1cc6..41be04f 100644
--- a/tools/libxc/xc_mem_event.c
+++ b/tools/libxc/xc_mem_event.c
@@ -64,6 +64,7 @@ void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
     xen_pfn_t ring_pfn, mmap_pfn;
     unsigned int op, mode;
     int rc1, rc2, saved_errno;
+    xc_domaininfo_t dom_info;
 
     if ( !port )
     {
@@ -71,6 +72,13 @@ void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
         return NULL;
     }
 
+    rc1 = xc_domain_getinfolist(xch, domain_id, 1, &dom_info);
+    if ( rc1 != 1 || dom_info.domain != domain_id )
+    {
+        PERROR("Error getting domain info\n");
+        return NULL;
+     }
+
     /* Pause the domain for ring page setup */
     rc1 = xc_domain_pause(xch, domain_id);
     if ( rc1 != 0 )
@@ -80,7 +88,20 @@ void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
     }
 
     /* Get the pfn of the ring page */
-    rc1 = xc_hvm_param_get(xch, domain_id, param, &pfn);
+    if ( dom_info.flags & XEN_DOMINF_hvm_guest )
+        rc1 = xc_hvm_param_get(xch, domain_id, param, &pfn);
+    else if ( param == HVM_PARAM_ACCESS_RING_PFN )
+    {
+        rc1 = xc_mem_access_create_ring_page(xch, domain_id);
+        if ( rc1 != 0 )
+        {
+            PERROR("Failed to create ring page\n");
+            goto out;
+        }
+
+        rc1 = xc_mem_access_get_ring_mfn(xch, domain_id, &pfn);
+    }
+
     if ( rc1 != 0 )
     {
         PERROR("Failed to get pfn of ring page\n");
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 6cc0f2b..c583c26 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -367,4 +367,13 @@ int xc_mem_event_memop(xc_interface *xch, domid_t domain_id,
 void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
                           uint32_t *port);
 
+/*
+ * Create the ring page for PV domains. This need not be called for HVM domains.
+ */
+int xc_mem_access_create_ring_page(xc_interface *xch, domid_t domain_id);
+
+/* Get the mfn of the ring page for PV domains. */
+int xc_mem_access_get_ring_mfn(xc_interface *xch, domid_t domain_id,
+                               uint64_t *mfn);
+
 #endif /* __XC_PRIVATE_H__ */
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 3578b09..2d25043 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -2260,15 +2260,25 @@ int xc_mem_paging_load(xc_interface *xch, domid_t domain_id,
  * Enables mem_access and returns the mapped ring page.
  * Will return NULL on error.
  * Caller has to unmap this page when done.
+ * Calling this for PV domains will enable shadow paging.
  */
 void *xc_mem_access_enable(xc_interface *xch, domid_t domain_id, uint32_t *port);
+
+/*
+ * For PV domains, this function has to be called even if xc_mem_access_enable()
+ * returns an error. This is to disable shadow paging and destroy the mem_access
+ * ring page.
+ */
 int xc_mem_access_disable(xc_interface *xch, domid_t domain_id);
 int xc_mem_access_resume(xc_interface *xch, domid_t domain_id);
 
 /*
  * Set a range of memory to a specific access.
  * Allowed types are XENMEM_access_default, XENMEM_access_n, any combination of
- * XENMEM_access_ + (rwx), and XENMEM_access_rx2rw
+ * XENMEM_access_ + (rwx), XENMEM_access_rx2rw and XENMEM_access_n2rwx for HVM
+ * domains.
+ * Allowed types are XENMEM_access_default, XENMEM_access_r, XENMEM_access_rw,
+ * XENMEM_access_rwx and XENMEM_access_rx2rw for PV domains.
  */
 int xc_set_mem_access(xc_interface *xch, domid_t domain_id,
                       xenmem_access_t access, uint64_t first_pfn,
@@ -2280,6 +2290,22 @@ int xc_set_mem_access(xc_interface *xch, domid_t domain_id,
 int xc_get_mem_access(xc_interface *xch, domid_t domain_id,
                       uint64_t pfn, xenmem_access_t *access);
 
+/*
+ * Set the default permission for a PV domain. This should not be called for HVM
+ * domains. A mem_access listener for a HVM domain does this in two steps:
+ * xc_set_mem_access(xch, domid, default_access, ~0ull, 0);
+ * xc_set_mem_access(xch, domid, default_access, 0, max_pages);
+ * However for a PV domain, this is not possible as the address translations are
+ * done by the guest and the listener does not know all the mfns that belong to
+ * the PV domain. This function performs the operation for the mem_access
+ * listener. Additionally this was not done as part of step 1 due to the way
+ * hypercall continuation works in the hypervisor.
+ * Allowed access types are XENMEM_access_r, XENMEM_access_rw, XENMEM_access_rwx
+ * and XENMEM_access_rx2rw.
+ */
+int xc_set_mem_access_default(xc_interface *xch, domid_t domain_id,
+                              xenmem_access_t default_access);
+
 /***
  * Memory sharing operations.
  *
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* [PATCH RFC v2 4/4] tool/xen-access: Add support for PV domains
  2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
                   ` (2 preceding siblings ...)
  2014-07-08  2:50 ` [PATCH RFC v2 3/4] tools/libxc: Add APIs for PV mem_access Aravindh Puthiyaparambil
@ 2014-07-08  2:50 ` Aravindh Puthiyaparambil
  2014-07-08 16:27 ` [PATCH RFC v2 0/4] Add mem_access " Konrad Rzeszutek Wilk
  4 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-07-08  2:50 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Ian Jackson, Ian Campbell, Aravindh Puthiyaprambil

Add support to the xen-access test program so that it also works with PV domains.

Signed-off-by: Aravindh Puthiyaprambil <aravindp@cisco.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>

---
Changes from RFC v1:
Add call to xc_set_mem_access_default().
PV ring page setup is now done as part of xc_mem_access_enable() due to
xsa-99.

 tools/tests/xen-access/xen-access.c | 104 +++++++++++++++++++++---------------
 1 file changed, 62 insertions(+), 42 deletions(-)

diff --git a/tools/tests/xen-access/xen-access.c b/tools/tests/xen-access/xen-access.c
index 090df5f..02ea0c9 100644
--- a/tools/tests/xen-access/xen-access.c
+++ b/tools/tests/xen-access/xen-access.c
@@ -114,7 +114,8 @@ typedef struct xenaccess {
 } xenaccess_t;
 
 static int interrupted;
-bool evtchn_bind = 0, evtchn_open = 0, mem_access_enable = 0;
+static bool evtchn_bind = 0, evtchn_open = 0, mem_access_enable = 0, hvm = 0,
+            pv_cleanup = 0;
 
 static void close_handler(int sig)
 {
@@ -173,7 +174,7 @@ int xenaccess_teardown(xc_interface *xch, xenaccess_t *xenaccess)
     if ( xenaccess->mem_event.ring_page )
         munmap(xenaccess->mem_event.ring_page, XC_PAGE_SIZE);
 
-    if ( mem_access_enable )
+    if ( mem_access_enable  || (!hvm && pv_cleanup) )
     {
         rc = xc_mem_access_disable(xenaccess->xc_handle,
                                    xenaccess->mem_event.domain_id);
@@ -241,6 +242,27 @@ xenaccess_t *xenaccess_init(xc_interface **xch_r, domid_t domain_id)
     /* Set domain id */
     xenaccess->mem_event.domain_id = domain_id;
 
+    /* Get domaininfo */
+    xenaccess->domain_info = malloc(sizeof(xc_domaininfo_t));
+    if ( xenaccess->domain_info == NULL )
+    {
+        ERROR("Error allocating memory for domain info");
+        goto err;
+    }
+
+    rc = xc_domain_getinfolist(xenaccess->xc_handle, domain_id, 1,
+                               xenaccess->domain_info);
+    if ( rc != 1 )
+    {
+        ERROR("Error getting domain info");
+        goto err;
+    }
+
+    if ( xenaccess->domain_info->flags & XEN_DOMINF_hvm_guest )
+        hvm = 1;
+    else
+        pv_cleanup = 1;
+
     /* Initialise lock */
     mem_event_ring_lock_init(&xenaccess->mem_event);
 
@@ -293,24 +315,6 @@ xenaccess_t *xenaccess_init(xc_interface **xch_r, domid_t domain_id)
                    (mem_event_sring_t *)xenaccess->mem_event.ring_page,
                    XC_PAGE_SIZE);
 
-    /* Get domaininfo */
-    xenaccess->domain_info = malloc(sizeof(xc_domaininfo_t));
-    if ( xenaccess->domain_info == NULL )
-    {
-        ERROR("Error allocating memory for domain info");
-        goto err;
-    }
-
-    rc = xc_domain_getinfolist(xenaccess->xc_handle, domain_id, 1,
-                               xenaccess->domain_info);
-    if ( rc != 1 )
-    {
-        ERROR("Error getting domain info");
-        goto err;
-    }
-
-    DPRINTF("max_pages = %"PRIx64"\n", xenaccess->domain_info->max_pages);
-
     return xenaccess;
 
  err:
@@ -485,30 +489,38 @@ int main(int argc, char *argv[])
     }
 
     /* Set the default access type and convert all pages to it */
-    rc = xc_set_mem_access(xch, domain_id, default_access, ~0ull, 0);
-    if ( rc < 0 )
-    {
-        ERROR("Error %d setting default mem access type\n", rc);
-        goto exit;
-    }
+    if ( hvm )
+        rc = xc_set_mem_access(xch, domain_id, default_access, ~0ull, 0);
+    else
+        rc = xc_set_mem_access_default(xch, domain_id, default_access);
 
-    rc = xc_set_mem_access(xch, domain_id, default_access, 0,
-                           xenaccess->domain_info->max_pages);
     if ( rc < 0 )
     {
-        ERROR("Error %d setting all memory to access type %d\n", rc,
-              default_access);
+        ERROR("Error %d setting default mem access type\n", rc);
         goto exit;
     }
 
-    if ( int3 )
-        rc = xc_hvm_param_set(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_sync);
-    else
-        rc = xc_hvm_param_set(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_disabled);
-    if ( rc < 0 )
+    if ( hvm )
     {
-        ERROR("Error %d setting int3 mem_event\n", rc);
-        goto exit;
+        rc = xc_set_mem_access(xch, domain_id, default_access, 0,
+                               xenaccess->domain_info->max_pages);
+        if ( rc < 0 )
+        {
+            ERROR("Error %d setting all memory to access type %d\n", rc,
+                  default_access);
+            goto exit;
+        }
+        if ( int3 )
+            rc = xc_hvm_param_set(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3,
+                                  HVMPME_mode_sync);
+        else
+            rc = xc_hvm_param_set(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3,
+                                  HVMPME_mode_disabled);
+        if ( rc < 0 )
+        {
+            ERROR("Error %d setting int3 mem_event\n", rc);
+            goto exit;
+        }
     }
 
     /* Wait for access */
@@ -519,11 +531,19 @@ int main(int argc, char *argv[])
             DPRINTF("xenaccess shutting down on signal %d\n", interrupted);
 
             /* Unregister for every event */
-            rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, ~0ull, 0);
-            rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, 0,
-                                   xenaccess->domain_info->max_pages);
-            rc = xc_hvm_param_set(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_disabled);
-
+            if ( hvm )
+            {
+                rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, ~0ull,
+                                       0);
+                rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, 0,
+                                       xenaccess->domain_info->max_pages);
+                rc = xc_hvm_param_set(xch, domain_id,
+                                      HVM_PARAM_MEMORY_EVENT_INT3,
+                                      HVMPME_mode_disabled);
+            }
+            else
+                rc = xc_set_mem_access_default(xch, domain_id,
+                                               XENMEM_access_rwx);
             shutting_down = 1;
         }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 0/4] Add mem_access support for PV domains
  2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
                   ` (3 preceding siblings ...)
  2014-07-08  2:50 ` [PATCH RFC v2 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
@ 2014-07-08 16:27 ` Konrad Rzeszutek Wilk
  2014-07-08 17:57   ` Aravindh Puthiyaparambil (aravindp)
  2014-07-09  0:31   ` Aravindh Puthiyaparambil (aravindp)
  4 siblings, 2 replies; 85+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-07-08 16:27 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Tim Deegan,
	Ian Jackson, Jan Beulich, xen-devel

On Mon, Jul 07, 2014 at 07:50:01PM -0700, Aravindh Puthiyaparambil wrote:
> This patch series adds mem_access support for PV domains. To do this the PV
> domain domain has to be run with shadow paging. A p2m implementation for
> mem_access has been added to track the access permissions. Since special ring
> pages are not created for PV domains, this is done as part of enabling
> mem_access.This page is freed when mem_access is disabled or when the domain
> is destroyed.
> 
> When mem_access is enabled for a PV domain, shadow paging is turned on and all
> the shadows are dropped. In the resulting pagefaults, the entries are created
> with the default access permissions. On future pagefaults, if there is a violation,
> a mem_event is sent to the mem_access listener who will then resolve it.
> 
> The access permissions for individual pages are stored in the shadow_flags field
> in the page_info structure. To get the access permissions for individual pages,
> this field is referenced. To set the access permission of individual pages, the new
> permission is set in the shadow_flags and the shadow for the gmfn is dropped. On the
> resulting fault, the new PTE entry will be created with the new permission. A
> new API has been added to set the default access permissions for PV domains.

In regards to the new ops - you also would need to add the XSM hooks.

I recall that in the past Jan had some questions, but I don't recall
exactly what they were - does this patchset address that?

Thanks!
> 
> Patches are based on top of commit f9cff088.
> 
> Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Keir Fraser <keir@xen.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> 
>   x86/mm: Shadow and p2m changes for PV mem_access
>   x86/mem_access: mem_access and mem_event changes to support PV domains
>   tools/libxc: Add APIs for PV mem_access
>   tool/xen-access: Add support for PV domains
> 
>  tools/libxc/xc_mem_access.c          |  42 ++++++
>  tools/libxc/xc_mem_event.c           |  23 +++-
>  tools/libxc/xc_private.h             |   9 ++
>  tools/libxc/xenctrl.h                |  28 +++-
>  tools/tests/xen-access/xen-access.c  | 104 +++++++++------
>  xen/arch/x86/domain.c                |  12 ++
>  xen/arch/x86/mm/Makefile             |   2 +-
>  xen/arch/x86/mm/mem_access.c         | 244 ++++++++++++++++++++++++++++++++++-
>  xen/arch/x86/mm/mem_event.c          |  62 +++++++--
>  xen/arch/x86/mm/p2m-ma.c             | 148 +++++++++++++++++++++
>  xen/arch/x86/mm/p2m.c                |  52 +++++---
>  xen/arch/x86/mm/paging.c             |   7 +
>  xen/arch/x86/mm/shadow/common.c      |  75 ++++++++++-
>  xen/arch/x86/mm/shadow/multi.c       | 101 ++++++++++++++-
>  xen/arch/x86/mm/shadow/private.h     |   7 +
>  xen/arch/x86/srat.c                  |   1 +
>  xen/arch/x86/usercopy.c              |  12 ++
>  xen/common/page_alloc.c              |   3 +
>  xen/drivers/video/vesa.c             |   1 +
>  xen/include/asm-x86/domain.h         |   9 ++
>  xen/include/asm-x86/mem_access.h     |   3 +
>  xen/include/asm-x86/mm.h             |   1 -
>  xen/include/asm-x86/p2m.h            |  17 +++
>  xen/include/asm-x86/paging.h         |   1 +
>  xen/include/asm-x86/shadow.h         |  15 +++
>  xen/include/asm-x86/x86_64/uaccess.h |   7 +
>  xen/include/public/memory.h          |   3 +
>  27 files changed, 899 insertions(+), 90 deletions(-)
>  create mode 100644 xen/arch/x86/mm/p2m-ma.c
> 
> -- 
> 1.9.1
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 0/4] Add mem_access support for PV domains
  2014-07-08 16:27 ` [PATCH RFC v2 0/4] Add mem_access " Konrad Rzeszutek Wilk
@ 2014-07-08 17:57   ` Aravindh Puthiyaparambil (aravindp)
  2014-07-09  0:31   ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-08 17:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Tim Deegan,
	Ian Jackson, Jan Beulich, xen-devel

>In regards to the new ops - you also would need to add the XSM hooks.

OK, I will look into adding the XSM hooks.

>I recall that in the past Jan had some questions, but I don't recall exactly what
>they were - does this patchset address that?

Yes, this patchset takes a shot at addressing them. The feedback from Jan and Tim will decide whether the approach is viable.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 0/4] Add mem_access support for PV domains
  2014-07-08 16:27 ` [PATCH RFC v2 0/4] Add mem_access " Konrad Rzeszutek Wilk
  2014-07-08 17:57   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-09  0:31   ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-09  0:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Tim Deegan,
	Ian Jackson, Jan Beulich, xen-devel

>>In regards to the new ops - you also would need to add the XSM hooks.
>
>OK, I will look in to adding the XSM hooks.

The two new ops, XENMEM_access_op_create_ring_page and XENMEM_access_op_get_ring_mfn, are sub-ops under XENMEM_access_op, which already has an XSM hook. Please see mem_access_memop().

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-08  2:50 ` [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
@ 2014-07-24 14:29   ` Jan Beulich
  2014-07-24 23:34     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-24 14:29 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 08.07.14 at 04:50, <aravindp@cisco.com> wrote:
> +/* Get the page permission of the mfn from page_info->shadow_flags */
> +static mfn_t
> +p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
> +                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
> +                         unsigned int *page_order)
> +{
> +    struct domain *d = p2m->domain;
> +    /* For PV guests mfn == gfn */

Valid, but why is this not being checked in any way in the earlier
"set" counterpart?

> +    mfn_t mfn = _mfn(gfn);
> +    struct page_info *page = mfn_to_page(mfn);
> +
> +    ASSERT(shadow_mode_enabled(d));
> +
> +    *t = p2m_ram_rw;
> +
> +    if ( page_get_owner(page) != d )

I think there's a mfn_valid() check missing prior to this re-ref of page.

> @@ -1356,7 +1357,7 @@ void shadow_prealloc(struct domain *d, u32 type, 
> unsigned int count)
>  
>  /* Deliberately free all the memory we can: this will tear down all of
>   * this domain's shadows */
> -static void shadow_blow_tables(struct domain *d) 
> +void shadow_blow_tables(struct domain *d)

This doesn't belong here - the patch doesn't add any new/external
caller of this function.

> @@ -2435,15 +2436,20 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>      /* If that didn't catch the mapping, something is very wrong */
>      if ( !sh_check_page_has_no_refs(page) )
>      {
> -        /* Don't complain if we're in HVM and there are some extra mappings: 
> +        /*
> +         * Don't complain if we're in HVM and there are some extra mappings:
>           * The qemu helper process has an untyped mapping of this dom's RAM 
> 
>           * and the HVM restore program takes another.
>           * Also allow one typed refcount for xenheap pages, to match
> -         * share_xen_page_with_guest(). */
> +         * share_xen_page_with_guest().
> +         * PV domains that have a mem_access listener, runs in shadow mode
> +         * without refcounts.
> +         */
>          if ( !(shadow_mode_external(v->domain)
>                 && (page->count_info & PGC_count_mask) <= 3
>                 && ((page->u.inuse.type_info & PGT_count_mask)
> -                   == !!is_xen_heap_page(page))) )
> +                   == !!is_xen_heap_page(page))) &&
> +             !mem_event_check_ring(&v->domain->mem_event->access) )

To me this doesn't look to be in sync with the comment, as the new
check is being carried out regardless of domain type. Furthermore
this continues to have the problem of also hiding issues unrelated
to mem-access handling.

> @@ -3009,7 +3020,84 @@ static int sh_page_fault(struct vcpu *v,
>  
>      /* What mfn is the guest trying to access? */
>      gfn = guest_l1e_get_gfn(gw.l1e);
> -    gmfn = get_gfn(d, gfn, &p2mt);
> +    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
> +        gmfn = get_gfn(d, gfn, &p2mt);
> +    /*
> +     * A mem_access listener is present, so we will first check if a violation
> +     * has occurred.
> +     */
> +    else
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
> +        p2m_access_t p2ma;
> +
> +        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
> +        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
> +             && regs->error_code & PFEC_page_present
> +             && !(regs->error_code & PFEC_reserved_bit) )
> +        {
> +            int violation = 0;
> +            bool_t access_w = !!(regs->error_code & PFEC_write_access);
> +            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
> +            bool_t access_r = access_x ? 0 : !access_w;

"violation" looks to be a boolean just like the other three, so wants
to also be bool_t.

> +
> +            /* If the access is against the permissions, then send to mem_event */
> +            switch ( p2ma )
> +            {
> +            case p2m_access_r:
> +                violation = access_w || access_x;
> +                break;
> +            case p2m_access_rx:
> +            case p2m_access_rx2rw:
> +                violation = access_w;
> +                break;
> +            case p2m_access_rw:
> +                violation = access_x;
> +                break;
> +            case p2m_access_rwx:
> +            default:
> +                break;
> +            }
> +
> +            /*
> +             * Do not police writes to guest memory from the Xen hypervisor.
> +             * This keeps PV mem_access on par with HVM. Turn off CR0.WP here to
> +             * allow the write to go through if the guest has marked the page as
> +             * writable. Turn it back on in the guest access functions
> +             * __copy_to_user / __put_user_size() after the write is completed.
> +             */
> +            if ( violation && access_w &&
> +                 regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )

Definitely < instead of <= on the right side. But - is this safe, the more
that this again doesn't appear to be sitting in a guest kind specific block?
I'd at least expect this to be qualified by a regs->cs and/or
guest_mode() check.

> +            {
> +                unsigned long cr0 = read_cr0();
> +
> +                violation = 0;
> +                if ( cr0 & X86_CR0_WP &&
> +                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
> +                {
> +                    cr0 &= ~X86_CR0_WP;
> +                    write_cr0(cr0);
> +                    v->arch.pv_vcpu.need_cr0_wp_set = 1;

PV field access within a non-PV-only code block?

> +                }

I wonder how well the window where you're running with CR0.WP clear
is bounded: The flag serves as a kind of security measure, and hence
shouldn't be left off for extended periods of time.

> +            }
> +
> +            if ( violation )
> +            {
> +                paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
> +                              (va & ((1 << PAGE_SHIFT) - 1));
> +                if ( !p2m_mem_access_check(gpa, 1, va, access_r, access_w,
> +                                           access_x, &req_ptr) )
> +                {
> +                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" p2ma: %d\n",
> +                                  (access_r ? 'r' : '-'),
> +                                  (access_w ? 'w' : '-'),
> +                                  (access_x ? 'x' : '-'), mfn_x(gmfn), p2ma);
> +                    /* Rights not promoted, vcpu paused, work here is done */
> +                    goto out_put_gfn;

Rather than re-using just two lines from the normal exit path and
introducing to it a mem-access specific code block, put the exit
processing here, at once allowing req_ptr to be limited in scope?

> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -43,6 +43,7 @@
>  #include <asm/page.h>
>  #include <asm/numa.h>
>  #include <asm/flushtlb.h>
> +#include <asm/shadow.h>

Breaking ARM?

>  #ifdef CONFIG_X86
>  #include <asm/p2m.h>
>  #include <asm/setup.h> /* for highmem_start only */
> @@ -1660,6 +1661,8 @@ int assign_pages(
>          page_set_owner(&pg[i], d);
>          smp_wmb(); /* Domain pointer must be visible before updating refcnt. */
>          pg[i].count_info = PGC_allocated | 1;
> +        if ( is_pv_domain(d) )
> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)->default_access);

I don't think you should call shadow code from here.

> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -380,6 +380,12 @@ struct pv_vcpu
>      /* Deferred VA-based update state. */
>      bool_t need_update_runstate_area;
>      struct vcpu_time_info pending_system_time;
> +
> +    /*
> +     * Flag that tracks if CR0.WP needs to be set after a Xen write to guest
> +     * memory when a PV domain has a mem_access listener attached to it.
> +     */
> +    bool_t need_cr0_wp_set;
>  };

The new field would better go above need_update_runstate_area
for better space usage.

> --- a/xen/include/asm-x86/x86_64/uaccess.h
> +++ b/xen/include/asm-x86/x86_64/uaccess.h
> @@ -1,6 +1,8 @@
>  #ifndef __X86_64_UACCESS_H
>  #define __X86_64_UACCESS_H
>  
> +#include <xen/sched.h>

This is pretty ugly. You only reference the needed fields in a macro,
so you don't strictly need to include this here as long as all (or at least
most - the rest could be patched up) use sites include it.

> @@ -65,6 +67,11 @@ do {									\
>  	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
>  	default: __put_user_bad();					\
>  	}								\
> +    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
> +    { \
> +        write_cr0(read_cr0() | X86_CR0_WP); \
> +        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
> +    } \
>  } while (0)

Please obey to the original indentation method.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-08  2:50 ` [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
@ 2014-07-24 14:38   ` Jan Beulich
  2014-07-24 23:52     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-24 14:38 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil; +Cc: xen-devel, Keir Fraser, Tim Deegan

>>> On 08.07.14 at 04:50, <aravindp@cisco.com> wrote:
> +static int mem_access_set_default(struct domain *d, uint64_t *start_page,
> +                           xenmem_access_t access)
> +{
> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +    struct page_info *page;
> +    struct page_list_head head;
> +    p2m_access_t a;
> +    int rc = 0, ctr = 0;
> +
> +    if ( !is_pv_domain(d) )
> +        return -ENOSYS;
> +
> +    ASSERT(shadow_mode_enabled(d));
> +
> +    rc = p2m_convert_xenmem_access(p2m, access, &a);
> +    if ( rc != 0 )
> +        return rc;
> +
> +    /*
> +     * For PV domains we only support r, rw, rx, rx2rw and rwx access
> +     * permissions
> +     */
> +    switch ( a )
> +    {
> +    case p2m_access_n:
> +    case p2m_access_w:
> +    case p2m_access_x:
> +    case p2m_access_wx:
> +    case p2m_access_n2rwx:
> +        return -EINVAL;
> +    default:
> +        break;
> +    }
> +
> +    paging_lock_recursive(d);
> +
> +    if ( *start_page )
> +    {
> +        head.next = (struct page_info *)*start_page;

What guarantees that the continuation page is still on d->page_list,
or that no other page got inserted ahead of it? And anyway you're
iterating the list without holding d->page_alloc_lock.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-24 14:29   ` Jan Beulich
@ 2014-07-24 23:34     ` Aravindh Puthiyaparambil (aravindp)
  2014-07-25  7:19       ` Jan Beulich
  2014-08-28  9:09       ` Tim Deegan
  0 siblings, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-24 23:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>> +/* Get the page permission of the mfn from page_info->shadow_flags */
>> +static mfn_t
>> +p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long
>gfn,
>> +                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
>> +                         unsigned int *page_order)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    /* For PV guests mfn == gfn */
>
>Valid, but why is this not being checked in any way in the earlier
>"set" counterpart?

I will add that check.

>> +    mfn_t mfn = _mfn(gfn);
>> +    struct page_info *page = mfn_to_page(mfn);
>> +
>> +    ASSERT(shadow_mode_enabled(d));
>> +
>> +    *t = p2m_ram_rw;
>> +
>> +    if ( page_get_owner(page) != d )
>
>I think there's a mfn_valid() check missing prior to this re-ref of page.

I will add that too.

>> @@ -1356,7 +1357,7 @@ void shadow_prealloc(struct domain *d, u32 type,
>> unsigned int count)
>>
>>  /* Deliberately free all the memory we can: this will tear down all of
>>   * this domain's shadows */
>> -static void shadow_blow_tables(struct domain *d)
>> +void shadow_blow_tables(struct domain *d)
>
>This doesn't belong here - the patch doesn't add any new/external
>caller of this function.

I was trying to bunch shadow changes in to one patch. I will add this to the mem_access patch where the external caller is being added.

>> @@ -2435,15 +2436,20 @@ int sh_remove_all_mappings(struct vcpu *v,
>mfn_t gmfn)
>>      /* If that didn't catch the mapping, something is very wrong */
>>      if ( !sh_check_page_has_no_refs(page) )
>>      {
>> -        /* Don't complain if we're in HVM and there are some extra mappings:
>> +        /*
>> +         * Don't complain if we're in HVM and there are some extra mappings:
>>           * The qemu helper process has an untyped mapping of this dom's RAM
>>
>>           * and the HVM restore program takes another.
>>           * Also allow one typed refcount for xenheap pages, to match
>> -         * share_xen_page_with_guest(). */
>> +         * share_xen_page_with_guest().
>> +         * PV domains that have a mem_access listener, runs in shadow mode
>> +         * without refcounts.
>> +         */
>>          if ( !(shadow_mode_external(v->domain)
>>                 && (page->count_info & PGC_count_mask) <= 3
>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>> -                   == !!is_xen_heap_page(page))) )
>> +                   == !!is_xen_heap_page(page))) &&
>> +             !mem_event_check_ring(&v->domain->mem_event->access) )
>
>To me this doesn't look to be in sync with the comment, as the new
>check is being carried out regardless of domain type. Furthermore
>this continues to have the problem of also hiding issues unrelated
>to mem-access handling.

I will add a check for PV domains. This check is with respect to refcounting. From what Tim told me, PV domains cannot run in that mode, so I don't think any issues will be hidden if I add the PV domain check.
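
Something along these lines (untested sketch):

        if ( !(shadow_mode_external(v->domain)
               && (page->count_info & PGC_count_mask) <= 3
               && ((page->u.inuse.type_info & PGT_count_mask)
                   == !!is_xen_heap_page(page))) &&
             !(is_pv_domain(v->domain) &&
               mem_event_check_ring(&v->domain->mem_event->access)) )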

>> @@ -3009,7 +3020,84 @@ static int sh_page_fault(struct vcpu *v,
>>
>>      /* What mfn is the guest trying to access? */
>>      gfn = guest_l1e_get_gfn(gw.l1e);
>> -    gmfn = get_gfn(d, gfn, &p2mt);
>> +    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
>> +        gmfn = get_gfn(d, gfn, &p2mt);
>> +    /*
>> +     * A mem_access listener is present, so we will first check if a violation
>> +     * has occurred.
>> +     */
>> +    else
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
>> +        p2m_access_t p2ma;
>> +
>> +        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0,
>NULL);
>> +        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
>> +             && regs->error_code & PFEC_page_present
>> +             && !(regs->error_code & PFEC_reserved_bit) )
>> +        {
>> +            int violation = 0;
>> +            bool_t access_w = !!(regs->error_code & PFEC_write_access);
>> +            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
>> +            bool_t access_r = access_x ? 0 : !access_w;
>
>"violation" looks to be a boolean just like the other three, so wants
>to also be bool_t.

Done.

>> +
>> +            /* If the access is against the permissions, then send to mem_event
>*/
>> +            switch ( p2ma )
>> +            {
>> +            case p2m_access_r:
>> +                violation = access_w || access_x;
>> +                break;
>> +            case p2m_access_rx:
>> +            case p2m_access_rx2rw:
>> +                violation = access_w;
>> +                break;
>> +            case p2m_access_rw:
>> +                violation = access_x;
>> +                break;
>> +            case p2m_access_rwx:
>> +            default:
>> +                break;
>> +            }
>> +
>> +            /*
>> +             * Do not police writes to guest memory from the Xen hypervisor.
>> +             * This keeps PV mem_access on par with HVM. Turn off CR0.WP
>here to
>> +             * allow the write to go through if the guest has marked the page as
>> +             * writable. Turn it back on in the guest access functions
>> +             * __copy_to_user / __put_user_size() after the write is completed.
>> +             */
>> +            if ( violation && access_w &&
>> +                 regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
>
>Definitely < instead of <= on the right side. But - is this safe, the more
>that this again doesn't appear to be sitting in a guest kind specific block?
>I'd at least expect this to be qualified by a regs->cs and/or
>guest_mode() check.

I will add the guest kind check. Should I do it for the whole block, i.e. add is_pv_domain() in addition to mem_event_check_ring()? That will also address the next comment below. I will add the guest_mode() check in addition to the above check for policing Xen writes to guest memory.
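
For the guest_mode() part, I am thinking of something like this
(untested):

        if ( violation && access_w && !guest_mode(regs) &&
             regs->eip >= XEN_VIRT_START && regs->eip < XEN_VIRT_END )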

>> +            {
>> +                unsigned long cr0 = read_cr0();
>> +
>> +                violation = 0;
>> +                if ( cr0 & X86_CR0_WP &&
>> +                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>> +                {
>> +                    cr0 &= ~X86_CR0_WP;
>> +                    write_cr0(cr0);
>> +                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>
>PV field access within a non-PV-only code block?
>
>> +                }
>
>I wonder how well the window where you're running with CR0.WP clear
>is bounded: The flag serves as a kind of security measure, and hence
>shouldn't be left off for extended periods of time.

I agree. Is there a way I can bound this?

>> +            }
>> +
>> +            if ( violation )
>> +            {
>> +                paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
>> +                              (va & ((1 << PAGE_SHIFT) - 1));
>> +                if ( !p2m_mem_access_check(gpa, 1, va, access_r, access_w,
>> +                                           access_x, &req_ptr) )
>> +                {
>> +                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn"
>p2ma: %d\n",
>> +                                  (access_r ? 'r' : '-'),
>> +                                  (access_w ? 'w' : '-'),
>> +                                  (access_x ? 'x' : '-'), mfn_x(gmfn), p2ma);
>> +                    /* Rights not promoted, vcpu paused, work here is done */
>> +                    goto out_put_gfn;
>
>Rather than re-using just two lines from the normal exit path and
>introducing to it a mem-access specific code block, put the exit
>processing here, at once allowing req_ptr to be limited in scope?

Good idea. I will do that.

>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -43,6 +43,7 @@
>>  #include <asm/page.h>
>>  #include <asm/numa.h>
>>  #include <asm/flushtlb.h>
>> +#include <asm/shadow.h>
>
>Breaking ARM?
>>  #ifdef CONFIG_X86
>>  #include <asm/p2m.h>
>>  #include <asm/setup.h> /* for highmem_start only */
>> @@ -1660,6 +1661,8 @@ int assign_pages(
>>          page_set_owner(&pg[i], d);
>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt.
>*/
>>          pg[i].count_info = PGC_allocated | 1;
>> +        if ( is_pv_domain(d) )
>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)->default_access);
>
>I don't think you should call shadow code from here.

Should I add a p2m wrapper for this, so that it is valid only for x86 and a no-op for ARM?
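
For example (untested sketch; the helper name is just a placeholder):

        /* x86 variant, e.g. in asm-x86/p2m.h */
        static inline void p2m_mem_access_init_page(struct domain *d,
                                                    struct page_info *pg)
        {
            if ( is_pv_domain(d) )
                shadow_set_access(pg, p2m_get_hostp2m(d)->default_access);
        }

        /* ARM variant: no-op */
        static inline void p2m_mem_access_init_page(struct domain *d,
                                                    struct page_info *pg)
        {
        }

assign_pages() would then call the wrapper instead of calling
shadow_set_access() directly.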

>> --- a/xen/include/asm-x86/domain.h
>> +++ b/xen/include/asm-x86/domain.h
>> @@ -380,6 +380,12 @@ struct pv_vcpu
>>      /* Deferred VA-based update state. */
>>      bool_t need_update_runstate_area;
>>      struct vcpu_time_info pending_system_time;
>> +
>> +    /*
>> +     * Flag that tracks if CR0.WP needs to be set after a Xen write to guest
>> +     * memory when a PV domain has a mem_access listener attached to it.
>> +     */
>> +    bool_t need_cr0_wp_set;
>>  };
>
>The new field would better go above need_update_runstate_area
>for better space usage.

Done.

>> --- a/xen/include/asm-x86/x86_64/uaccess.h
>> +++ b/xen/include/asm-x86/x86_64/uaccess.h
>> @@ -1,6 +1,8 @@
>>  #ifndef __X86_64_UACCESS_H
>>  #define __X86_64_UACCESS_H
>>
>> +#include <xen/sched.h>
>
>This is pretty ugly. You only reference the needed fields in a macro,
>so you don't strictly need to include this here as long as all (or at least
>most - the rest could be patched up) use sites include it.

OK, I will give that a shot.

>> @@ -65,6 +67,11 @@ do {
>			\
>>  	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
>>  	default: __put_user_bad();					\
>>  	}								\
>> +    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
>> +    { \
>> +        write_cr0(read_cr0() | X86_CR0_WP); \
>> +        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
>> +    } \
>>  } while (0)
>
>Please obey to the original indentation method.

I thought tab characters were not allowed for indentation. But I guess you do not want tabs and spaces for indentation purposes to be mixed within a macro?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-24 14:38   ` Jan Beulich
@ 2014-07-24 23:52     ` Aravindh Puthiyaparambil (aravindp)
  2014-07-25  7:23       ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-24 23:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Tim Deegan

>> +static int mem_access_set_default(struct domain *d, uint64_t
>*start_page,
>> +                           xenmem_access_t access) {
>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +    struct page_info *page;
>> +    struct page_list_head head;
>> +    p2m_access_t a;
>> +    int rc = 0, ctr = 0;
>> +
>> +    if ( !is_pv_domain(d) )
>> +        return -ENOSYS;
>> +
>> +    ASSERT(shadow_mode_enabled(d));
>> +
>> +    rc = p2m_convert_xenmem_access(p2m, access, &a);
>> +    if ( rc != 0 )
>> +        return rc;
>> +
>> +    /*
>> +     * For PV domains we only support r, rw, rx, rx2rw and rwx access
>> +     * permissions
>> +     */
>> +    switch ( a )
>> +    {
>> +    case p2m_access_n:
>> +    case p2m_access_w:
>> +    case p2m_access_x:
>> +    case p2m_access_wx:
>> +    case p2m_access_n2rwx:
>> +        return -EINVAL;
>> +    default:
>> +        break;
>> +    }
>> +
>> +    paging_lock_recursive(d);
>> +
>> +    if ( *start_page )
>> +    {
>> +        head.next = (struct page_info *)*start_page;
>
>What guarantees that the continuation page is still on d->page_list, or that
>no other page got inserted ahead of it? And anyway you're iterating the list
>without holding d->page_alloc_lock.

Good point. Should I grab the lock and release it only when the hypercall completes?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-24 23:34     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-25  7:19       ` Jan Beulich
  2014-07-25 21:39         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28  9:09       ` Tim Deegan
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-25  7:19 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 25.07.14 at 01:34, <aravindp@cisco.com> wrote:
>>> +
>>> +            /* If the access is against the permissions, then send to mem_event
>>*/
>>> +            switch ( p2ma )
>>> +            {
>>> +            case p2m_access_r:
>>> +                violation = access_w || access_x;
>>> +                break;
>>> +            case p2m_access_rx:
>>> +            case p2m_access_rx2rw:
>>> +                violation = access_w;
>>> +                break;
>>> +            case p2m_access_rw:
>>> +                violation = access_x;
>>> +                break;
>>> +            case p2m_access_rwx:
>>> +            default:
>>> +                break;
>>> +            }
>>> +
>>> +            /*
>>> +             * Do not police writes to guest memory from the Xen hypervisor.
>>> +             * This keeps PV mem_access on par with HVM. Turn off CR0.WP
>>here to
>>> +             * allow the write to go through if the guest has marked the page as
>>> +             * writable. Turn it back on in the guest access functions
>>> +             * __copy_to_user / __put_user_size() after the write is completed.
>>> +             */
>>> +            if ( violation && access_w &&
>>> +                 regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
>>
>>Definitely < instead of <= on the right side. But - is this safe, the more
>>that this again doesn't appear to be sitting in a guest kind specific block?
>>I'd at least expect this to be qualified by a regs->cs and/or
>>guest_mode() check.
> 
> I will add the guest kind check. Should I do it for the whole block i.e add 
> is_pv_domain() in addition to mem_event_check_ring()?

Since mem_event_check_ring() isn't PV specific, you'd need to go
through and amend any such checks with a PV one where needed.
And yes, doing it once for the whole code block seems like the
right thing to do here.

> That will also address 
> next comment below. I will add the guest_mode() in addition to the above 
> check for policing Xen writes to guest memory.
> 
>>> +            {
>>> +                unsigned long cr0 = read_cr0();
>>> +
>>> +                violation = 0;
>>> +                if ( cr0 & X86_CR0_WP &&
>>> +                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>> +                {
>>> +                    cr0 &= ~X86_CR0_WP;
>>> +                    write_cr0(cr0);
>>> +                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>>
>>PV field access within a non-PV-only code block?
>>
>>> +                }
>>
>>I wonder how well the window where you're running with CR0.WP clear
>>is bounded: The flag serves as a kind of security measure, and hence
>>shouldn't be left off for extended periods of time.
> 
> I agree. Is there a way I can bound this?

That's what you need to figure out. The simplistic solution (single
stepping just the critical instruction(s)) is probably not going to be
acceptable due to its fragility. I have no good other suggestions,
but I'm not eager to allow code in that weakens protection.

>>> --- a/xen/common/page_alloc.c
>>> +++ b/xen/common/page_alloc.c
>>> @@ -43,6 +43,7 @@
>>>  #include <asm/page.h>
>>>  #include <asm/numa.h>
>>>  #include <asm/flushtlb.h>
>>> +#include <asm/shadow.h>
>>
>>Breaking ARM?
>>>  #ifdef CONFIG_X86
>>>  #include <asm/p2m.h>
>>>  #include <asm/setup.h> /* for highmem_start only */
>>> @@ -1660,6 +1661,8 @@ int assign_pages(
>>>          page_set_owner(&pg[i], d);
>>>          smp_wmb(); /* Domain pointer must be visible before updating refcnt.
>>*/
>>>          pg[i].count_info = PGC_allocated | 1;
>>> +        if ( is_pv_domain(d) )
>>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)->default_access);
>>
>>I don't think you should call shadow code from here.
> 
> Should I add a p2m wrapper for this, so that it is valid only for x86 and a 
> no-op for ARM?

That's not necessarily enough, but at least presumably the right
route: You also need to avoid fiddling with struct page_info fields
that may be used (now or in the future) for other purposes, i.e.
you need to gate the setting of the flags by more than just
is_pv_domain().

>>> @@ -65,6 +67,11 @@ do {
>>			\
>>>  	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
>>>  	default: __put_user_bad();					\
>>>  	}								\
>>> +    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
>>> +    { \
>>> +        write_cr0(read_cr0() | X86_CR0_WP); \
>>> +        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
>>> +    } \
>>>  } while (0)
>>
>>Please obey to the original indentation method.
> 
> I thought tab characters were not allowed for indentation. But I guess you 
> do not want tabs and spaces for indentation purposes to be mixed within a 
> macro?

Correct - ./CODING_STYLE explicitly says that you should match
existing coding style when an entire file (or a significant code
portion) was imported from elsewhere without style adjustment.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-24 23:52     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-25  7:23       ` Jan Beulich
  2014-07-25 21:47         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-25  7:23 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, KeirFraser, Tim Deegan

>>> On 25.07.14 at 01:52, <aravindp@cisco.com> wrote:
>> > +static int mem_access_set_default(struct domain *d, uint64_t
>>*start_page,
>>> +                           xenmem_access_t access) {
>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>> +    struct page_info *page;
>>> +    struct page_list_head head;
>>> +    p2m_access_t a;
>>> +    int rc = 0, ctr = 0;
>>> +
>>> +    if ( !is_pv_domain(d) )
>>> +        return -ENOSYS;
>>> +
>>> +    ASSERT(shadow_mode_enabled(d));
>>> +
>>> +    rc = p2m_convert_xenmem_access(p2m, access, &a);
>>> +    if ( rc != 0 )
>>> +        return rc;
>>> +
>>> +    /*
>>> +     * For PV domains we only support r, rw, rx, rx2rw and rwx access
>>> +     * permissions
>>> +     */
>>> +    switch ( a )
>>> +    {
>>> +    case p2m_access_n:
>>> +    case p2m_access_w:
>>> +    case p2m_access_x:
>>> +    case p2m_access_wx:
>>> +    case p2m_access_n2rwx:
>>> +        return -EINVAL;
>>> +    default:
>>> +        break;
>>> +    }
>>> +
>>> +    paging_lock_recursive(d);
>>> +
>>> +    if ( *start_page )
>>> +    {
>>> +        head.next = (struct page_info *)*start_page;
>>
>>What guarantees that the continuation page is still on d->page_list, or that
>>now other page got inserted ahead of it? And anyway you're iterating the list
>>without holding d->page_alloc_lock.
> 
> Good point. Should I grab the lock and release it only when the hypercall 
> completes?

If "completes" means after any eventual continuation, then you
should be able to answer this with "no" yourself. If "completes" is
meant only up to the next continuation, then this wouldn't help
you anyway. IOW this needs a more sophisticated solution, or
you need to restrict the memory size of guests that can be subject
to mem-access handling.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-25  7:19       ` Jan Beulich
@ 2014-07-25 21:39         ` Aravindh Puthiyaparambil (aravindp)
  2014-07-28  6:49           ` Jan Beulich
  2014-08-28  9:14           ` Tim Deegan
  0 siblings, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-25 21:39 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>I wonder how well the window where you're running with CR0.WP clear
>>>is bounded: The flag serves as a kind of security measure, and hence
>>>shouldn't be left off for extended periods of time.
>>
>> I agree. Is there a way I can bound this?
>
>That's what you need to figure out. The simplistic solution (single
>stepping just the critical instruction(s)) is probably not going to be
>acceptable due to its fragility. I have no good other suggestions,
>but I'm not eager to allow code in that weakens protection.

From the debugging I have done to get this working, this is what the flow should be. Xen tries to write to a guest page marked read-only and a page fault occurs, so __copy_to_user_ll() -> handle_exception_saved -> do_page_fault() runs and CR0.WP is cleared. Once the fault is handled, __copy_to_user_ll() is retried and the write goes through, at the end of which CR0.WP is turned back on. So this is the only window in which pv_vcpu.need_cr0_wp_set should be true. Is there a spot outside this window where I could check whether it is set and, if so, turn it back on again? Would that be a sufficient bound?
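
The check itself would just mirror what the patch already adds to
__put_user_size(), e.g. (untested):

        if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) )
        {
            write_cr0(read_cr0() | X86_CR0_WP);
            current->arch.pv_vcpu.need_cr0_wp_set = 0;
        }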

>>>>          pg[i].count_info = PGC_allocated | 1;
>>>> +        if ( is_pv_domain(d) )
>>>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)-
>>default_access);
>>>
>>>I don't think you should call shadow code from here.
>>
>> Should I add a p2m wrapper for this, so that it is valid only for x86 and a
>> no-op for ARM?
>
>That's not necessarily enough, but at least presumably the right
>route: You also need to avoid fiddling with struct page_info fields
>that may be used (now or in the future) for other purposes, i.e.
>you need to gate the setting of the flags by more than just
>is_pv_domain().

Coupled with your response to the other thread, I am thinking I should move away from using the shadow_flags for access permissions. Tim's other suggestion was to try and re-use the p2m-pt implementation.

>>>Please obey to the original indentation method.
>>
>> I thought tab characters were not allowed for indentation. But I guess you
>> do not want tabs and spaces for indentation purposes to be mixed within a
>> macro?
>
>Correct - ./CODING_STYLE explicitly says that you should match
>existing coding style when an entire file (or a significant code
>portion) was imported from elsewhere without style adjustment.

Got it.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-25  7:23       ` Jan Beulich
@ 2014-07-25 21:47         ` Aravindh Puthiyaparambil (aravindp)
  2014-07-28  6:56           ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-25 21:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, KeirFraser, Tim Deegan



>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@suse.com]
>Sent: Friday, July 25, 2014 12:23 AM
>To: Aravindh Puthiyaparambil (aravindp)
>Cc: xen-devel@lists.xenproject.org; KeirFraser; Tim Deegan
>Subject: RE: [PATCH RFC v2 2/4] x86/mem_access: mem_access and
>mem_event changes to support PV domains
>
>>>> On 25.07.14 at 01:52, <aravindp@cisco.com> wrote:
>>> > +static int mem_access_set_default(struct domain *d, uint64_t
>>>*start_page,
>>>> +                           xenmem_access_t access) {
>>>> +    struct p2m_domain *p2m = p2m_get_hostp2m(d);
>>>> +    struct page_info *page;
>>>> +    struct page_list_head head;
>>>> +    p2m_access_t a;
>>>> +    int rc = 0, ctr = 0;
>>>> +
>>>> +    if ( !is_pv_domain(d) )
>>>> +        return -ENOSYS;
>>>> +
>>>> +    ASSERT(shadow_mode_enabled(d));
>>>> +
>>>> +    rc = p2m_convert_xenmem_access(p2m, access, &a);
>>>> +    if ( rc != 0 )
>>>> +        return rc;
>>>> +
>>>> +    /*
>>>> +     * For PV domains we only support r, rw, rx, rx2rw and rwx access
>>>> +     * permissions
>>>> +     */
>>>> +    switch ( a )
>>>> +    {
>>>> +    case p2m_access_n:
>>>> +    case p2m_access_w:
>>>> +    case p2m_access_x:
>>>> +    case p2m_access_wx:
>>>> +    case p2m_access_n2rwx:
>>>> +        return -EINVAL;
>>>> +    default:
>>>> +        break;
>>>> +    }
>>>> +
>>>> +    paging_lock_recursive(d);
>>>> +
>>>> +    if ( *start_page )
>>>> +    {
>>>> +        head.next = (struct page_info *)*start_page;
>>>
>>>What guarantees that the continuation page is still on d->page_list, or that
>>>no other page got inserted ahead of it? And anyway you're iterating the
>list
>>>without holding d->page_alloc_lock.
>>
>> Good point. Should I grab the lock and release it only when the hypercall
>> completes?
>
>If "completes" means after any eventual continuation, then you
>should be able to answer this with "no" yourself. If "completes" is

Oh no, I really worded my question badly.

>meant only up to the next continuation, then this wouldn't help
>you anyway. 

Yes, that is what I meant. And yes, I realize it won't be of any help.

>IOW this needs a more sophisticated solution, or

OK, I will look into reusing the p2m-pt implementation.

>you need to restrict the memory size of guests that can be subject
>to mem-access handling.

If I were to stick with using the shadow_flags, what should the memory size restriction be so as not to need a hypercall continuation?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-25 21:39         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-28  6:49           ` Jan Beulich
  2014-07-28 21:14             ` Aravindh Puthiyaparambil (aravindp)
  2014-07-30  4:05             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28  9:14           ` Tim Deegan
  1 sibling, 2 replies; 85+ messages in thread
From: Jan Beulich @ 2014-07-28  6:49 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 25.07.14 at 23:39, <aravindp@cisco.com> wrote:
>>>>I wonder how well the window where you're running with CR0.WP clear
>>>>is bounded: The flag serves as a kind of security measure, and hence
>>>>shouldn't be left off for extended periods of time.
>>>
>>> I agree. Is there a way I can bound this?
>>
>>That's what you need to figure out. The simplistic solution (single
>>stepping just the critical instruction(s)) is probably not going to be
>>acceptable due to its fragility. I have no good other suggestions,
>>but I'm not eager to allow code in that weakens protection.
> 
> From the debugging I have done to get this working, this is what the flow 
> should be. Xen tries to write to guest page marked read only and page fault 
> occurs. So __copy_to_user_ll() -> handle_exception_saved->do_page_fault() and 
> CR0.WP is cleared. Once the fault is handled __copy_to_user_ll() is retried 
> and it goes through. At the end of which CR0.WP is turned on. So this is the 
> only window that pv_vcpu.need_cr0_wp_set should be true. Is there a spot 
> outside of this window that I check to see if it is set and if it is, turn it 
> back on again? Would that be a sufficient bound?

That's the obvious (direct) path. What you leave aside are any
interrupts occurring in between.

>>>>>          pg[i].count_info = PGC_allocated | 1;
>>>>> +        if ( is_pv_domain(d) )
>>>>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)->default_access);
>>>>
>>>>I don't think you should call shadow code from here.
>>>
>>> Should I add a p2m wrapper for this, so that it is valid only for x86 and a
>>> no-op for ARM?
>>
>>That's not necessarily enough, but at least presumably the right
>>route: You also need to avoid fiddling with struct page_info fields
>>that may be used (now or in the future) for other purposes, i.e.
>>you need to gate the setting of the flags by more than just
>>is_pv_domain().
> 
> Coupled with your response to the other thread, I am thinking I should move 
> away from using the shadow_flags for access permissions. Tim's other 
> suggestion was to try and re-use the p2m-pt implementation.

I'll leave that to Tim and you.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-25 21:47         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-28  6:56           ` Jan Beulich
  2014-07-28 21:16             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-28  6:56 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, Keir Fraser, Tim Deegan

>>> On 25.07.14 at 23:47, <aravindp@cisco.com> wrote:
>>you need to restrict the memory size of guests that can be subject
>>to mem-access handling.
> 
> If I was to stick to using the shadow_flags, what should the memory size 
> restriction be so as to not have a hypercall continuation? 

That's very hard to tell: What we care about is maximum processing
time for an individual hypercall (or non-preemptible portion thereof).
Hence you could either take a low enough guessed value that all of
us are convinced won't cause any problems, or come up with a more
or less sophisticated formula that _you_ would need to prove is
never going to cause any problems. I can only repeat what I
(perhaps indirectly) stated before: You want the new feature, so
it's going to be primarily up to you to solve the problems associated with
it. We're there to help where possible, but as far as I'm concerned
if I don't offer an alternative suggestion right away then this usually
is because I can't think of one. Beyond that our primary role here
is to avoid new code causing damage or introducing (security or
other) risks.
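
Purely to illustrate what "non-preemptible portion" means: the generic
pattern, sketched below with a placeholder do_one_page(), is to bound the
work done between two preemption checks rather than the hypercall as a
whole.

/* Illustrative sketch only; do_one_page() is a placeholder and the cursor
 * would have to survive across continuations. */
static int process_pages(struct domain *d, unsigned long *cursor,
                         unsigned long nr)
{
    for ( ; *cursor < nr; ++*cursor )
    {
        do_one_page(d, *cursor);        /* hypothetical per-page work */

        if ( hypercall_preempt_check() )
            return -EAGAIN;             /* caller creates a continuation */
    }

    return 0;
}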

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-28  6:49           ` Jan Beulich
@ 2014-07-28 21:14             ` Aravindh Puthiyaparambil (aravindp)
  2014-07-30  4:05             ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-28 21:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>That's what you need to figure out. The simplistic solution (single
>>>stepping just the critical instruction(s)) is probably not going to be
>>>acceptable due to its fragility. I have no good other suggestions, but
>>>I'm not eager to allow code in that weakens protection.
>>
>> From the debugging I have done to get this working, this is what the
>> flow should be. Xen tries to write to guest page marked read only and
>> page fault occurs. So __copy_to_user_ll() ->
>> handle_exception_saved->do_page_fault() and CR0.WP is cleared. Once
>> the fault is handled __copy_to_user_ll() is retried and it goes
>> through. At the end of which CR0.WP is turned on. So this is the only
>> window that pv_vcpu.need_cr0_wp_set should be true. Is there a spot
>> outside of this window that I check to see if it is set and if it is, turn it back on
>again? Would that be a sufficient bound?
>
>That's the obvious (direct) path. What you leave aside are any interrupts
>occurring in between.

True. I was thinking about disabling interrupts in this window but that wouldn't account for non-maskable ones. This is going to be a tough nut to crack.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-07-28  6:56           ` Jan Beulich
@ 2014-07-28 21:16             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-28 21:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Tim Deegan

>That's very hard to tell: What we care about is maximum processing time for
>an individual hypercall (or non-preemptible portion thereof).
>Hence you could either take a low enough guessed value that all of us are
>convinced won't cause any problems, or come up with a more or less
>sophisticated formula that _you_ would need to prove is never going to cause
>any problems. I can only repeat what I (perhaps indirectly) stated before: You
>want the new feature, so it's going to be primarily up to you to solve the problems
>associated with it. We're there to help where possible, but as far as I'm
>concerned if I don't offer an alternative suggestion right away then this usually
>is because I can't think of one. Beyond that our primary role here is to avoid
>new code causing damage or introducing (security or
>other) risks.

I agree with your viewpoint. Sorry if I have bugged you too much for help. Thanks for taking the time to review and provide feedback. Let me try solving these problems and I will post more patches.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-28  6:49           ` Jan Beulich
  2014-07-28 21:14             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-30  4:05             ` Aravindh Puthiyaparambil (aravindp)
  2014-07-30  7:11               ` Jan Beulich
  1 sibling, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-30  4:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>>That's what you need to figure out. The simplistic solution (single
>>>>stepping just the critical instruction(s)) is probably not going to be
>>>>acceptable due to its fragility. I have no good other suggestions, but
>>>>I'm not eager to allow code in that weakens protection.
>>>
>>> From the debugging I have done to get this working, this is what the
>>> flow should be. Xen tries to write to guest page marked read only and
>>> page fault occurs. So __copy_to_user_ll() ->
>>> handle_exception_saved->do_page_fault() and CR0.WP is cleared. Once
>>> the fault is handled __copy_to_user_ll() is retried and it goes
>>> through. At the end of which CR0.WP is turned on. So this is the only
>>> window that pv_vcpu.need_cr0_wp_set should be true. Is there a spot
>>> outside of this window that I check to see if it is set and if it is, turn it back
>on
>>again? Would that be a sufficient bound?
>>
>>That's the obvious (direct) path. What you leave aside are any interrupts
>>occurring in between.
>
>True. I was thinking about disabling interrupts in this window but that
>wouldn't account for non-maskable ones. This is going to be a tough nut to
>crack.

I took another stab at solving this issue. This is what the solution looks like:

1. mem_access listener attaches to a PV domain and listens for write violation.
2. Xen tries to write to a guest page that is marked not writable.
3. PF occurs.
	- Do not pass on the violation to the mem_access listener.
	- Temporarily change the access permission for the page in question to R/W.
	- Allow the fault to be handled which will end up with a PTE with the RW bit set.
	- Stash the mfn in question in the pv_vcpu structure.
	- Check if this field is set on exiting from guest access functions. If it is, reset the page permission to the default value and drop its shadows.
4. I had to take care that the checks in the guest access functions for resetting the page permissions do not kick in when the shadow code is trying to construct the PTE for the page in question or when removing all the mappings.

Here is a POC patch that does the above. I have it on top of the patch that was using CR0.WP to highlight the difference. I realize some of the other comments, like using shadow_flags, have not been addressed here. I just want to get feedback on whether this is a viable solution before addressing those issues.

Thanks,
Aravindh

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 49d8545..e6fd08e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -734,6 +734,8 @@ int arch_set_info_guest(
         if ( ((c(ldt_base) & (PAGE_SIZE - 1)) != 0) ||
              (c(ldt_ents) > 8192) )
             return -EINVAL;
+
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
     }
     else if ( is_pvh_vcpu(v) )
     {
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index d4473c1..f60be0c 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -1168,9 +1168,12 @@ int __init construct_dom0(
                COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
     }
 
-    /* Pages that are part of page tables must be read only. */
     if  ( is_pv_domain(d) )
+    {
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+        /* Pages that are part of page tables must be read only. */
         mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+    }
 
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
diff --git a/xen/arch/x86/mm/p2m-ma.c b/xen/arch/x86/mm/p2m-ma.c
index d8ad12c..aae972a 100644
--- a/xen/arch/x86/mm/p2m-ma.c
+++ b/xen/arch/x86/mm/p2m-ma.c
@@ -27,6 +27,8 @@
 #include "mm-locks.h"
 
 /* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
 #undef mfn_to_page
 #define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
 
@@ -125,6 +127,32 @@ p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
     return mfn;
 }
 
+void p2m_mem_access_reset_entry(void)
+{
+    mfn_t mfn = _mfn(current->arch.pv_vcpu.mfn_access_reset);
+    struct page_info *page;
+    struct domain *d = current->domain;
+
+    if ( !mfn_valid(mfn) )
+        return;
+
+    page = mfn_to_page(mfn);
+    if ( page_get_owner(page) != d )
+        return;
+
+    ASSERT(!paging_locked_by_me(d));
+    paging_lock(d);
+
+    shadow_set_access(page, p2m_get_hostp2m(d)->default_access);
+
+    if ( sh_remove_all_mappings(current, mfn) )
+        flush_tlb_mask(d->domain_dirty_cpumask);
+
+    current->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+
+    paging_unlock(d);
+}
+
 /* Reset the set_entry and get_entry function pointers */
 void p2m_mem_access_reset(struct p2m_domain *p2m)
 {
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 9aacd8e..76e6fa9 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1132,8 +1132,33 @@ int shadow_write_guest_entry(struct vcpu *v, intpte_t *p,
  * appropriately.  Returns 0 if we page-faulted, 1 for success. */
 {
     int failed;
+    unsigned long mfn_access_reset_saved = INVALID_MFN;
     paging_lock(v->domain);
+
+    /*
+     * If a mem_access listener is present for a PV guest and is listening for
+     * write violations, we want to allow Xen writes to guest memory to go
+     * through. To allow this we are setting PTE.RW for the MFN in question and
+     * resetting this after the write has gone through. The resetting is kicked
+     * off at the end of the guest access functions __copy_to_user_ll() and
+     * __put_user_size() if mfn_access_reset is a valid MFN. Since these
+     * functions are also called by the shadow code when setting the PTE.RW or
+     * sh_remove_all_mappings(), we temporarily set mfn_access_reset to an
+     * invalid value to prevent p2m_mem_access_reset_entry() from firing.
+     */
+    if ( unlikely(mem_event_check_ring(&v->domain->mem_event->access)) &&
+         mfn_valid(_mfn(v->arch.pv_vcpu.mfn_access_reset)) &&
+         is_pv_domain(v->domain) )
+    {
+        mfn_access_reset_saved = v->arch.pv_vcpu.mfn_access_reset;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+    }
+
     failed = __copy_to_user(p, &new, sizeof(new));
+
+    if ( unlikely(mfn_valid(_mfn(mfn_access_reset_saved))) )
+        v->arch.pv_vcpu.mfn_access_reset = mfn_access_reset_saved;
+
     if ( failed != sizeof(new) )
         sh_validate_guest_entry(v, gmfn, p, sizeof(new));
     paging_unlock(v->domain);
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index db30396..f572d23 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -795,7 +795,7 @@ static inline void safe_write_entry(void *dst, void *src)
 
 
 static inline void 
-shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
+shadow_write_entries(struct vcpu *v, void *d, void *s, int entries, mfn_t mfn)
 /* This function does the actual writes to shadow pages.
  * It must not be called directly, since it doesn't do the bookkeeping
  * that shadow_set_l*e() functions do. */
@@ -804,6 +804,26 @@ shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
     shadow_l1e_t *src = s;
     void *map = NULL;
     int i;
+    unsigned long mfn_access_reset_saved = INVALID_MFN;
+
+    /*
+     * If a mem_access listener is present for a PV guest and is listening for
+     * write violations, we want to allow Xen writes to guest memory to go
+     * through. To allow this we are setting PTE.RW for the MFN in question and
+     * resetting this after the write has gone through. The resetting is kicked
+     * off at the end of the guest access functions __copy_to_user_ll() and
+     * __put_user_size() if mfn_access_reset is a valid MFN. Since these
+     * functions are also called by the shadow code when setting the PTE.RW or
+     * sh_remove_all_mappings(), we temporarily set mfn_access_reset to an
+     * invalid value to prevent p2m_mem_access_reset_entry() from firing.
+     */
+    if ( unlikely(mem_event_check_ring(&v->domain->mem_event->access)) &&
+         mfn_valid(_mfn(v->arch.pv_vcpu.mfn_access_reset)) &&
+         is_pv_domain(v->domain) )
+    {
+        mfn_access_reset_saved = v->arch.pv_vcpu.mfn_access_reset;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+    }
 
     /* Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
@@ -817,6 +837,9 @@ shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
         dst = map + ((unsigned long)dst & (PAGE_SIZE - 1));
     }
 
+    if ( unlikely(mfn_valid(_mfn(mfn_access_reset_saved))) &&
+         is_pv_domain(v->domain) )
+        v->arch.pv_vcpu.mfn_access_reset = mfn_access_reset_saved;
 
     for ( i = 0; i < entries; i++ )
         safe_write_entry(dst++, src++);
@@ -924,7 +947,7 @@ static int shadow_set_l4e(struct vcpu *v,
     }
 
     /* Write the new entry */
-    shadow_write_entries(sl4e, &new_sl4e, 1, sl4mfn);
+    shadow_write_entries(v, sl4e, &new_sl4e, 1, sl4mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( shadow_l4e_get_flags(old_sl4e) & _PAGE_PRESENT ) 
@@ -969,7 +992,7 @@ static int shadow_set_l3e(struct vcpu *v,
     }
 
     /* Write the new entry */
-    shadow_write_entries(sl3e, &new_sl3e, 1, sl3mfn);
+    shadow_write_entries(v, sl3e, &new_sl3e, 1, sl3mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( shadow_l3e_get_flags(old_sl3e) & _PAGE_PRESENT ) 
@@ -1051,9 +1074,9 @@ static int shadow_set_l2e(struct vcpu *v,
 
     /* Write the new entry */
 #if GUEST_PAGING_LEVELS == 2
-    shadow_write_entries(sl2e, &pair, 2, sl2mfn);
+    shadow_write_entries(v, sl2e, &pair, 2, sl2mfn);
 #else /* normal case */
-    shadow_write_entries(sl2e, &new_sl2e, 1, sl2mfn);
+    shadow_write_entries(v, sl2e, &new_sl2e, 1, sl2mfn);
 #endif
     flags |= SHADOW_SET_CHANGED;
 
@@ -1218,7 +1241,7 @@ static int shadow_set_l1e(struct vcpu *v,
     } 
 
     /* Write the new entry */
-    shadow_write_entries(sl1e, &new_sl1e, 1, sl1mfn);
+    shadow_write_entries(v, sl1e, &new_sl1e, 1, sl1mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( (shadow_l1e_get_flags(old_sl1e) & _PAGE_PRESENT) 
@@ -3069,15 +3092,11 @@ static int sh_page_fault(struct vcpu *v,
             if ( violation && access_w &&
                  regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
             {
-                unsigned long cr0 = read_cr0();
-
                 violation = 0;
-                if ( cr0 & X86_CR0_WP &&
-                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
+                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
                 {
-                    cr0 &= ~X86_CR0_WP;
-                    write_cr0(cr0);
-                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
+                    v->arch.pv_vcpu.mfn_access_reset = mfn_x(gmfn);
+                    shadow_set_access(mfn_to_page(gmfn), p2m_access_rw);
                 }
             }
 
diff --git a/xen/arch/x86/usercopy.c b/xen/arch/x86/usercopy.c
index eecf429..e2d192d 100644
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -8,6 +8,7 @@
 
 #include <xen/lib.h>
 #include <xen/sched.h>
+#include <asm/p2m.h>
 #include <asm/uaccess.h>
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned n)
@@ -47,16 +48,12 @@ unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned n)
 
     /*
      * A mem_access listener was present and Xen tried to write to guest memory.
-     * To allow this write to go through without an event being sent to the
-     * listener or the pagetable entry being modified, we disabled CR0.WP in the
-     * shadow pagefault handler. We are enabling it back here again.
+     * To allow this write to go through we modified the PTE in the shadow page
+     * fault handler. We are resetting the access permission of the page that
+     * was written to its default value here.
      */
-    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) )
-    {
-        write_cr0(read_cr0() | X86_CR0_WP);
-        current->arch.pv_vcpu.need_cr0_wp_set = 0;
-    }
-
+    if ( unlikely(mfn_valid(current->arch.pv_vcpu.mfn_access_reset)) )
+        p2m_mem_access_reset_entry();
     return __n;
 }
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index cf2ae2a..4b6b782 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -385,10 +385,10 @@ struct pv_vcpu
     struct vcpu_time_info pending_system_time;
 
     /*
-     * Flag that tracks if CR0.WP needs to be set after a Xen write to guest
-     * memory when a PV domain has a mem_access listener attached to it.
+     * Track if a mfn's access permission needs to be reset after a Xen write to
+     * guest memory when a PV domain has a mem_access listener attached to it.
      */
-    bool_t need_cr0_wp_set;
+    unsigned long mfn_access_reset;
 };
 
 struct arch_vcpu
diff --git a/xen/include/asm-x86/uaccess.h b/xen/include/asm-x86/uaccess.h
index 947470d..333d4b1 100644
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -21,6 +21,8 @@ unsigned long __copy_from_user_ll(void *to, const void *from, unsigned n);
 extern long __get_user_bad(void);
 extern void __put_user_bad(void);
 
+extern void p2m_mem_access_reset_entry(void);
+
 /**
  * get_user: - Get a simple variable from user space.
  * @x:   Variable to store result.
diff --git a/xen/include/asm-x86/x86_64/uaccess.h b/xen/include/asm-x86/x86_64/uaccess.h
index 6d13ec6..7cacbcd 100644
--- a/xen/include/asm-x86/x86_64/uaccess.h
+++ b/xen/include/asm-x86/x86_64/uaccess.h
@@ -1,8 +1,6 @@
 #ifndef __X86_64_UACCESS_H
 #define __X86_64_UACCESS_H
 
-#include <xen/sched.h>
-
 #define COMPAT_ARG_XLAT_VIRT_BASE ((void *)ARG_XLAT_START(current))
 #define COMPAT_ARG_XLAT_SIZE      (2*PAGE_SIZE)
 struct vcpu;
@@ -67,11 +65,8 @@ do {									\
 	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
 	default: __put_user_bad();					\
 	}								\
-    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
-    { \
-        write_cr0(read_cr0() | X86_CR0_WP); \
-        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
-    } \
+    if ( unlikely(mfn_valid(current->arch.pv_vcpu.mfn_access_reset)) ) \
+        p2m_mem_access_reset_entry(); \
 } while (0)
 
 #define __get_user_size(x,ptr,size,retval,errret)			\

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-30  4:05             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-07-30  7:11               ` Jan Beulich
  2014-07-30 18:35                 ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-07-30  7:11 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 30.07.14 at 06:05, <aravindp@cisco.com> wrote:
> I took another stab at solving this issue. This is what the solution looks 
> like:
> 
> 1. mem_access listener attaches to a PV domain and listens for write 
> violation.
> 2. Xen tries to write to a guest page that is marked not writable.
> 3. PF occurs.
> 	- Do not pass on the violation to the mem_access listener.
> 	- Temporarily change the access permission for the page in question to R/W.
> 	- Allow the fault to be handled which will end up with a PTE with the RW bit set.
> 	- Stash the mfn in question in the pv_vcpu structure.
> 	- Check if this field is set on exiting from guest access functions. If it 
> is, reset the page permission to the default value and drop its shadows.
> 4. I had to take care that the checks in the guest access functions for 
> resetting the page permissions do not kick in when the shadow code is trying 
> to construct the PTE for the page in question or when removing all the 
> mappings.
> 
> Here is a POC patch that does the above. I have it on top of the patch that 
> was using CR0.WP to highlight the difference. I realize some of the other 
> comments like using shadow_flags have not been addressed here. I just want to 
> get feedback if this is a viable solution before addressing those issues.

Still rather ugly, and still leaving the window in time as big as the
CR0.WP variant (the window in space got shrunk to just a page).
Furthermore, how would you guarantee no other vCPU of the
same guest modifies the now writable page in the meantime?

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-30  7:11               ` Jan Beulich
@ 2014-07-30 18:35                 ` Aravindh Puthiyaparambil (aravindp)
  2014-08-01  6:39                   ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-07-30 18:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>> I took another stab at solving this issue. This is what the solution
>> looks
>> like:
>>
>> 1. mem_access listener attaches to a PV domain and listens for write
>> violation.
>> 2. Xen tries to write to a guest page that is marked not writable.
>> 3. PF occurs.
>> 	- Do not pass on the violation to the mem_access listener.
>> 	- Temporarily change the access permission for the page in question
>to R/W.
>> 	- Allow the fault to be handled which will end up with a PTE with the
>RW bit set.
>> 	- Stash the mfn in question in the pv_vcpu structure.
>> 	- Check if this field is set on exiting from guest access functions.
>> If it is, reset the page permission to the default value and drop its shadows.
>> 4. I had to take care that the checks in the guest access functions
>> for resetting the page permissions do not kick in when the shadow code
>> is trying to construct the PTE for the page in question or when
>> removing all the mappings.
>>
>> Here is a POC patch that does the above. I have it on top of the patch
>> that was using CR0.WP to highlight the difference. I realize some of
>> the other comments like using shadow_flags have not been addressed
>> here. I just want to get feedback if this is a viable solution before addressing
>those issues.
>
>Still rather ugly, and still leaving the window in time as big as the CR0.WP
>variant (the window in space got shrunk to just a page).
>Furthermore, how would you guarantee no other vCPU of the same guest
>modifies the now writable page in the meantime?

Good point. The only other thing I can think of to get around this issue is to pause the domain during these writes and unpause it on the way out of the guest access functions.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-30 18:35                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-01  6:39                   ` Jan Beulich
  2014-08-01 18:08                     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-01  6:39 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 30.07.14 at 20:35, <aravindp@cisco.com> wrote:
>>Still rather ugly, and still leaving the window in time as big as the CR0.WP
>>variant (the window in space got shrunk to just a page).
>>Furthermore, how would you guarantee no other vCPU of the same guest
>>modifies the now writable page in the meantime?
> 
> Good point. The only other thing I can think of to get around this issue is 
> to pause the domain during these writes and unpause it on the way out of the 
> guest access functions.

If that's tolerable guest-performance-wise...

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-01  6:39                   ` Jan Beulich
@ 2014-08-01 18:08                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-04  7:03                       ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-01 18:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>> On 30.07.14 at 20:35, <aravindp@cisco.com> wrote:
>>>Still rather ugly, and still leaving the window in time as big as the
>>>CR0.WP variant (the window in space got shrunk to just a page)?
>>>Furthermore, how would you guarantee no other vCPU of the same guest
>>>modifies the now writable page in the meantime?
>>
>> Good point. The only other thing I can think of to get around this
>> issue is to pause the domain during these writes and unpause it on the
>> way out of the guest access functions.
>
>If that's tolerable guest-performance-wise...

It would be for the use cases we have. The performance hit would only be for users who want to watch for writes to memory that is shared between Xen and the guest. Given that, which variant would you prefer between the CR0.WP and pagetable methods?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-01 18:08                     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-04  7:03                       ` Jan Beulich
  2014-08-05  0:14                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-04  7:03 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 01.08.14 at 20:08, <aravindp@cisco.com> wrote:
>>>>> On 30.07.14 at 20:35, <aravindp@cisco.com> wrote:
>>>>Still rather ugly, and still leaving the window in time as big as the
>>>>CR0.WP variant (the window in space got shrunk to just a page).
>>>>Furthermore, how would you guarantee no other vCPU of the same guest
>>>>modifies the now writable page in the meantime?
>>>
>>> Good point. The only other thing I can think of to get around this
>>> issue is to pause the domain during these writes and unpause it on the
>>> way out of the guest access functions.
>>
>>If that's tolerable guest-performance-wise...
> 
> It would be for the use cases we have. The performance hit would only be for 
> users who want to watch for writes to memory that is shared between Xen and 
> the guest. Given that, which variant would you prefer between the CR0.WP and 
> pagetable methods?

Since the page table one has the overall smaller window of reduced
protection, I think I'd prefer that one. However, judging the overhead
acceptability by just the specific use case you have is perhaps
insufficient for including your changes in the public tree, especially
with no clear perspective of how to reduce it if someone indeed cared.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-04  7:03                       ` Jan Beulich
@ 2014-08-05  0:14                         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-05  6:33                           ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-05  0:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>> Good point. The only other thing I can think of to get around this
>>>> issue is to pause the domain during these writes and unpause it on the
>>>> way out of the guest access functions.
>>>
>>>If that's tolerable guest-performance-wise...
>>
>> It would be for the use cases we have. The performance hit would only be for
>> users who want to watch for writes to memory that is shared between Xen
>and
>> the guest. Given that, which variant would you prefer between the CR0.WP
>and
>> pagetable methods?
>
>Since the page table one has the overall smaller window of reduced
>protection, I think I'd prefer that one. However, judging the overhead
>acceptability by just the specific use case you have is perhaps
>insufficient for including your changes in the public tree, especially
>with no clear perspective of how to reduce it if someone indeed cared.

I am a little lost by your statement about specific use case. People using mem_access typically use it for security and guest inspection purposes. They are aware of the performance hits that come along with that. Given that use case, would you please reconsider including these changes? Or were you talking about other use cases?

Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-05  0:14                         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-05  6:33                           ` Jan Beulich
  2014-08-13 22:14                             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-05  6:33 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 05.08.14 at 02:14, <aravindp@cisco.com> wrote:
>>>>> Good point. The only other thing I can think of to get around this
>>>>> issue is to pause the domain during these writes and unpause it on the
>>>>> way out of the guest access functions.
>>>>
>>>>If that's tolerable guest-performance-wise...
>>>
>>> It would be for the use cases we have. The performance hit would only be for
>>> users who want to watch for writes to memory that is shared between Xen
>>and
>>> the guest. Given that, which variant would you prefer between the CR0.WP
>>and
>>> pagetable methods?
>>
>>Since the page table one has the overall smaller window of reduced
>>protection, I think I'd prefer that one. However, judging the overhead
>>acceptability by just the specific use case you have is perhaps
>>insufficient for including your changes in the public tree, especially
>>with no clear perspective of how to reduce it if someone indeed cared.
> 
> I am a little lost by your statement about specific use case. People using 
> mem_access typically use it for security and guest inspection purposes. They 
> are aware of the performance hits that come along with that. Given that use 
> case, would you please reconsider including these changes? Or were you 
> talking about other use cases?

No, at least not about unspecified hypothetical ones. But again - a
vague statement like you gave, without any kind of quantification of
the imposed overhead, isn't going to be good enough a judgment.
After all pausing a domain can be quite problematic for its performance
if that happens reasonably frequently. Otoh I admit that the user of
your new mechanism has a certain level of control over the impact via
the number of pages (s)he wants to write-protect. So yes, perhaps it
isn't going to be too bad as long as the hackery you need to do isn't.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-05  6:33                           ` Jan Beulich
@ 2014-08-13 22:14                             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-13 22:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>Since the page table one has the overall smaller window of reduced
>>>protection, I think I'd prefer that one. However, judging the overhead
>>>acceptability by just the specific use case you have is perhaps
>>>insufficient for including your changes in the public tree, especially
>>>with no clear perspective of how to reduce it if someone indeed cared.
>>
>> I am a little lost by your statement about specific use case. People using
>> mem_access typically use it for security and guest inspection purposes. They
>> are aware of the performance hits that come along with that. Given that use
>> case, would you please reconsider including these changes? Or were you
>> talking about other use cases?
>
>No, at least not about unspecified hypothetical ones. But again - a
>vague statement like you gave, without any kind of quantification of
>the imposed overhead, isn't going to be good enough a judgment.
>After all pausing a domain can be quite problematic for its performance
>if that happens reasonably frequently. Otoh I admit that the user of
>your new mechanism has a certain level of control over the impact via
>the number of pages (s)he wants to write-protect. So yes, perhaps it
>isn't going to be too bad as long as the hackery you need to do isn't.

I just wanted to give an update as to where I stand so as to not leave this thread hanging. As I was working through pausing and unpausing the domain during Xen writes to guest memory, I found a spot where the write happens outside of __copy_to_user_ll() and __put_user_size(). It occurs in create_bounce_frame() where Xen writes to the guest stack to setup an exception frame. At that moment, if the guest stack is marked read-only, we end up in the same situation as with the copy to guest functions but with no obvious place to revert the page table entry back to read-only. I am at the moment looking for a spot where I can do this. 

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-05  6:33                           ` Jan Beulich
  2014-08-13 22:14                             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22  9:34                               ` Andrew Cooper
  2014-08-22 15:33                               ` Jan Beulich
  1 sibling, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-22  2:29 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>No, at least not about unspecified hypothetical ones. But again - a
>>vague statement like you gave, without any kind of quantification of
>>the imposed overhead, isn't going to be good enough a judgment.
>>After all pausing a domain can be quite problematic for its performance
>>if that happens reasonably frequently. Otoh I admit that the user of
>>your new mechanism has a certain level of control over the impact via
>>the number of pages (s)he wants to write-protect. So yes, perhaps it
>>isn't going to be too bad as long as the hackery you need to do isn't.
>
>I just wanted to give an update as to where I stand so as to not leave this
>thread hanging. As I was working through pausing and unpausing the domain
>during Xen writes to guest memory, I found a spot where the write happens
>outside of __copy_to_user_ll() and __put_user_size(). It occurs in
>create_bounce_frame() where Xen writes to the guest stack to setup an
>exception frame. At that moment, if the guest stack is marked read-only, we
>end up in the same situation as with the copy to guest functions but with no
>obvious place to revert the page table entry back to read-only. I am at the
>moment looking for a spot where I can do this.

I have a solution for the create_bounce_frame() issue I described above. Please find below a POC patch that includes pausing and unpausing the domain during the Xen writes to guest memory. I have it on top of the patch that was using CR0.WP to highlight the difference. Please take a look and let me know if this solution is acceptable. 

PS: I do realize whatever I do to create_bounce_frame() will have to be reflected in the compat version. If this is the correct approach I will do the same there too.

Thanks,
Aravindh

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 49d8545..758f7f8 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -734,6 +734,9 @@ int arch_set_info_guest(
         if ( ((c(ldt_base) & (PAGE_SIZE - 1)) != 0) ||
              (c(ldt_ents) > 8192) )
             return -EINVAL;
+
+        v->arch.pv_vcpu.mfn_access_reset_req = 0;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
     }
     else if ( is_pvh_vcpu(v) )
     {
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index d4473c1..6e6b7f8 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -1168,9 +1168,13 @@ int __init construct_dom0(
                COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
     }
 
-    /* Pages that are part of page tables must be read only. */
     if  ( is_pv_domain(d) )
+    {
+        v->arch.pv_vcpu.mfn_access_reset_req = 0;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+        /* Pages that are part of page tables must be read only. */
         mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
+    }
 
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
diff --git a/xen/arch/x86/mm/p2m-ma.c b/xen/arch/x86/mm/p2m-ma.c
index d8ad12c..ad37db0 100644
--- a/xen/arch/x86/mm/p2m-ma.c
+++ b/xen/arch/x86/mm/p2m-ma.c
@@ -27,6 +27,8 @@
 #include "mm-locks.h"
 
 /* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
 #undef mfn_to_page
 #define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
 
@@ -125,6 +127,34 @@ p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
     return mfn;
 }
 
+void p2m_mem_access_reset_mfn_entry(void)
+{
+    mfn_t mfn = _mfn(current->arch.pv_vcpu.mfn_access_reset);
+    struct page_info *page;
+    struct domain *d = current->domain;
+
+    if ( unlikely(!mfn_valid(mfn)) )
+        return;
+
+    page = mfn_to_page(mfn);
+    if ( page_get_owner(page) != d )
+        return;
+
+    ASSERT(!paging_locked_by_me(d));
+    paging_lock(d);
+
+    shadow_set_access(page, p2m_get_hostp2m(d)->default_access);
+
+    if ( sh_remove_all_mappings(current, mfn) )
+        flush_tlb_mask(d->domain_dirty_cpumask);
+
+    current->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+    current->arch.pv_vcpu.mfn_access_reset_req = 0;
+
+    paging_unlock(d);
+    domain_unpause(d);
+}
+
 /* Reset the set_entry and get_entry function pointers */
 void p2m_mem_access_reset(struct p2m_domain *p2m)
 {
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 9aacd8e..5f02948 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1132,8 +1132,36 @@ int shadow_write_guest_entry(struct vcpu *v, intpte_t *p,
  * appropriately.  Returns 0 if we page-faulted, 1 for success. */
 {
     int failed;
+    unsigned long mfn_access_reset_saved = INVALID_MFN;
     paging_lock(v->domain);
+
+    /*
+     * If a mem_access listener is present for a PV guest and is listening for
+     * write violations, we want to allow Xen writes to guest memory to go
+     * through. To allow this we are setting PTE.RW for the MFN in question and
+     * resetting this after the write has gone through. The resetting is kicked
+     * off at the end of the guest access functions __copy_to_user_ll() and
+     * __put_user_size() if mfn_access_reset_req is set. Since these functions
+     * are also called by the shadow code when setting the PTE.RW or
+     * sh_remove_all_mappings(), we temporarily set mfn_access_reset_req to 0
+     * to prevent p2m_mem_access_reset_mfn_entry() from firing.
+     */
+    if ( unlikely(mem_event_check_ring(&v->domain->mem_event->access)) &&
+         is_pv_domain(v->domain) && v->arch.pv_vcpu.mfn_access_reset_req )
+    {
+        mfn_access_reset_saved = v->arch.pv_vcpu.mfn_access_reset;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+        v->arch.pv_vcpu.mfn_access_reset_req = 0;
+    }
+
     failed = __copy_to_user(p, &new, sizeof(new));
+
+    if ( unlikely(mfn_valid(_mfn(mfn_access_reset_saved))) )
+    {
+        v->arch.pv_vcpu.mfn_access_reset = mfn_access_reset_saved;
+        v->arch.pv_vcpu.mfn_access_reset_req = 1;
+    }
+
     if ( failed != sizeof(new) )
         sh_validate_guest_entry(v, gmfn, p, sizeof(new));
     paging_unlock(v->domain);
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index db30396..bcf8fdf 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -795,7 +795,7 @@ static inline void safe_write_entry(void *dst, void *src)
 
 
 static inline void 
-shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
+shadow_write_entries(struct vcpu *v, void *d, void *s, int entries, mfn_t mfn)
 /* This function does the actual writes to shadow pages.
  * It must not be called directly, since it doesn't do the bookkeeping
  * that shadow_set_l*e() functions do. */
@@ -804,6 +804,26 @@ shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
     shadow_l1e_t *src = s;
     void *map = NULL;
     int i;
+    unsigned long mfn_access_reset_saved = INVALID_MFN;
+
+    /*
+     * If a mem_access listener is present for a PV guest and is listening for
+     * write violations, we want to allow Xen writes to guest memory to go
+     * through. To allow this we are setting PTE.RW for the MFN in question and
+     * resetting this after the write has gone through. The resetting is kicked
+     * off at the end of the guest access functions __copy_to_user_ll() and
+     * __put_user_size() if mfn_access_reset_req is set. Since these functions
+     * are also called by the shadow code when setting the PTE.RW or
+     * sh_remove_all_mappings(), we temporarily set mfn_access_reset_req to 0
+     * to prevent p2m_mem_access_reset_mfn_entry() from firing.
+     */
+    if ( unlikely(mem_event_check_ring(&v->domain->mem_event->access)) &&
+         v->arch.pv_vcpu.mfn_access_reset_req && is_pv_domain(v->domain) )
+    {
+        mfn_access_reset_saved = v->arch.pv_vcpu.mfn_access_reset;
+        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
+        v->arch.pv_vcpu.mfn_access_reset_req = 0;
+    }
 
     /* Because we mirror access rights at all levels in the shadow, an
      * l2 (or higher) entry with the RW bit cleared will leave us with
@@ -817,6 +837,12 @@ shadow_write_entries(void *d, void *s, int entries, mfn_t mfn)
         dst = map + ((unsigned long)dst & (PAGE_SIZE - 1));
     }
 
+    if ( unlikely(mfn_valid(_mfn(mfn_access_reset_saved))) &&
+         is_pv_domain(v->domain) )
+    {
+        v->arch.pv_vcpu.mfn_access_reset = mfn_access_reset_saved;
+        v->arch.pv_vcpu.mfn_access_reset_req = 1;
+    }
 
     for ( i = 0; i < entries; i++ )
         safe_write_entry(dst++, src++);
@@ -924,7 +950,7 @@ static int shadow_set_l4e(struct vcpu *v,
     }
 
     /* Write the new entry */
-    shadow_write_entries(sl4e, &new_sl4e, 1, sl4mfn);
+    shadow_write_entries(v, sl4e, &new_sl4e, 1, sl4mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( shadow_l4e_get_flags(old_sl4e) & _PAGE_PRESENT ) 
@@ -969,7 +995,7 @@ static int shadow_set_l3e(struct vcpu *v,
     }
 
     /* Write the new entry */
-    shadow_write_entries(sl3e, &new_sl3e, 1, sl3mfn);
+    shadow_write_entries(v, sl3e, &new_sl3e, 1, sl3mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( shadow_l3e_get_flags(old_sl3e) & _PAGE_PRESENT ) 
@@ -1051,9 +1077,9 @@ static int shadow_set_l2e(struct vcpu *v,
 
     /* Write the new entry */
 #if GUEST_PAGING_LEVELS == 2
-    shadow_write_entries(sl2e, &pair, 2, sl2mfn);
+    shadow_write_entries(v, sl2e, &pair, 2, sl2mfn);
 #else /* normal case */
-    shadow_write_entries(sl2e, &new_sl2e, 1, sl2mfn);
+    shadow_write_entries(v, sl2e, &new_sl2e, 1, sl2mfn);
 #endif
     flags |= SHADOW_SET_CHANGED;
 
@@ -1218,7 +1244,7 @@ static int shadow_set_l1e(struct vcpu *v,
     } 
 
     /* Write the new entry */
-    shadow_write_entries(sl1e, &new_sl1e, 1, sl1mfn);
+    shadow_write_entries(v, sl1e, &new_sl1e, 1, sl1mfn);
     flags |= SHADOW_SET_CHANGED;
 
     if ( (shadow_l1e_get_flags(old_sl1e) & _PAGE_PRESENT) 
@@ -3061,23 +3087,24 @@ static int sh_page_fault(struct vcpu *v,
 
             /*
              * Do not police writes to guest memory from the Xen hypervisor.
-             * This keeps PV mem_access on par with HVM. Turn off CR0.WP here to
-             * allow the write to go through if the guest has marked the page as
-             * writable. Turn it back on in the guest access functions
-             * __copy_to_user / __put_user_size() after the write is completed.
+             * This keeps PV mem_access on par with HVM. Pause the guest and
+             * mark the access entry as RW here to allow the write to go through
+             * if the guest has marked the page as writable. Unpause the guest
+             * and set the access value back to the default at the end of
+             * __copy_to_user, __put_user_size() and create_bounce_frame() after
+             * the write is completed. The guest is paused to prevent other
+             * VCPUs from writing to this page during this window.
              */
             if ( violation && access_w &&
                  regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
             {
-                unsigned long cr0 = read_cr0();
-
                 violation = 0;
-                if ( cr0 & X86_CR0_WP &&
-                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
+                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
                 {
-                    cr0 &= ~X86_CR0_WP;
-                    write_cr0(cr0);
-                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
+                    domain_pause_nosync(d);
+                    v->arch.pv_vcpu.mfn_access_reset = mfn_x(gmfn);
+                    v->arch.pv_vcpu.mfn_access_reset_req = 1;
+                    shadow_set_access(mfn_to_page(gmfn), p2m_access_rw);
                 }
             }
 
diff --git a/xen/arch/x86/usercopy.c b/xen/arch/x86/usercopy.c
index eecf429..a0d11a9 100644
--- a/xen/arch/x86/usercopy.c
+++ b/xen/arch/x86/usercopy.c
@@ -8,6 +8,7 @@
 
 #include <xen/lib.h>
 #include <xen/sched.h>
+#include <asm/p2m.h>
 #include <asm/uaccess.h>
 
 unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned n)
@@ -47,16 +48,12 @@ unsigned long __copy_to_user_ll(void __user *to, const void *from, unsigned n)
 
     /*
      * A mem_access listener was present and Xen tried to write to guest memory.
-     * To allow this write to go through without an event being sent to the
-     * listener or the pagetable entry being modified, we disabled CR0.WP in the
-     * shadow pagefault handler. We are enabling it back here again.
+     * To allow this write to go through we modified the PTE in the shadow page
+     * fault handler. We are resetting the access permission of the page that
+     * was written to its default value here.
      */
-    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) )
-    {
-        write_cr0(read_cr0() | X86_CR0_WP);
-        current->arch.pv_vcpu.need_cr0_wp_set = 0;
-    }
-
+    if ( unlikely(current->arch.pv_vcpu.mfn_access_reset_req) )
+        p2m_mem_access_reset_mfn_entry();
     return __n;
 }
 
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 3994f4d..97ea966 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -86,6 +86,7 @@ void __dummy__(void)
     OFFSET(VCPU_trap_ctxt, struct vcpu, arch.pv_vcpu.trap_ctxt);
     OFFSET(VCPU_kernel_sp, struct vcpu, arch.pv_vcpu.kernel_sp);
     OFFSET(VCPU_kernel_ss, struct vcpu, arch.pv_vcpu.kernel_ss);
+    OFFSET(VCPU_mfn_access_reset_req, struct vcpu, arch.pv_vcpu.mfn_access_reset_req);
     OFFSET(VCPU_guest_context_flags, struct vcpu, arch.vgc_flags);
     OFFSET(VCPU_nmi_pending, struct vcpu, nmi_pending);
     OFFSET(VCPU_mce_pending, struct vcpu, mce_pending);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index a3ed216..27fc97f 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -441,6 +441,12 @@ UNLIKELY_START(z, create_bounce_frame_bad_bounce_ip)
         jmp   asm_domain_crash_synchronous  /* Does not return */
 __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
         movq  %rax,UREGS_rip+8(%rsp)
+        cmpb  $1, VCPU_mfn_access_reset_req(%rbx)
+        je    2f
+        ret
+2:      SAVE_ALL
+        call  p2m_mem_access_reset_mfn_entry
+        RESTORE_ALL
         ret
         _ASM_EXTABLE(.Lft2,  dom_crash_sync_extable)
         _ASM_EXTABLE(.Lft3,  dom_crash_sync_extable)
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index cf2ae2a..fa58444 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -385,10 +385,13 @@ struct pv_vcpu
     struct vcpu_time_info pending_system_time;
 
     /*
-     * Flag that tracks if CR0.WP needs to be set after a Xen write to guest
-     * memory when a PV domain has a mem_access listener attached to it.
+     * Track if a mfn's access permission needs to be reset after a Xen write to
+     * guest memory when a PV domain has a mem_access listener attached to it.
+     * The boolean is used to allow for easy checking for this condition in
+     * create_bounce_frame().
      */
-    bool_t need_cr0_wp_set;
+    bool_t mfn_access_reset_req;
+    unsigned long mfn_access_reset;
 };
 
 struct arch_vcpu
diff --git a/xen/include/asm-x86/uaccess.h b/xen/include/asm-x86/uaccess.h
index 947470d..80c7a78 100644
--- a/xen/include/asm-x86/uaccess.h
+++ b/xen/include/asm-x86/uaccess.h
@@ -21,6 +21,8 @@ unsigned long __copy_from_user_ll(void *to, const void *from, unsigned n);
 extern long __get_user_bad(void);
 extern void __put_user_bad(void);
 
+extern void p2m_mem_access_reset_mfn_entry(void);
+
 /**
  * get_user: - Get a simple variable from user space.
  * @x:   Variable to store result.
diff --git a/xen/include/asm-x86/x86_64/uaccess.h b/xen/include/asm-x86/x86_64/uaccess.h
index 6d13ec6..111e4ae 100644
--- a/xen/include/asm-x86/x86_64/uaccess.h
+++ b/xen/include/asm-x86/x86_64/uaccess.h
@@ -67,11 +67,8 @@ do {									\
 	case 8: __put_user_asm(x,ptr,retval,"q","","ir",errret);break;	\
 	default: __put_user_bad();					\
 	}								\
-    if ( unlikely(current->arch.pv_vcpu.need_cr0_wp_set) ) \
-    { \
-        write_cr0(read_cr0() | X86_CR0_WP); \
-        current->arch.pv_vcpu.need_cr0_wp_set = 0; \
-    } \
+    if ( unlikely(current->arch.pv_vcpu.mfn_access_reset_req) ) \
+        p2m_mem_access_reset_mfn_entry(); \
 } while (0)
 
 #define __get_user_size(x,ptr,size,retval,errret)			\

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-22  9:34                               ` Andrew Cooper
  2014-08-22 10:02                                 ` Jan Beulich
                                                   ` (2 more replies)
  2014-08-22 15:33                               ` Jan Beulich
  1 sibling, 3 replies; 85+ messages in thread
From: Andrew Cooper @ 2014-08-22  9:34 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Jan Beulich
  Cc: xen-devel, Tim Deegan, Keir Fraser, Ian Jackson, Ian Campbell

On 22/08/14 03:29, Aravindh Puthiyaparambil (aravindp) wrote:
>>> No, at least not about unspecified hypothetical ones. But again - a
>>> vague statement like you gave, without any kind of quantification of
>>> the imposed overhead, isn't going to be good enough a judgment.
>>> After all pausing a domain can be quite problematic for its performance
>>> if that happens reasonably frequently. Otoh I admit that the user of
>>> your new mechanism has a certain level of control over the impact via
>>> the number of pages (s)he wants to write-protect. So yes, perhaps it
>>> isn't going to be too bad as long as the hackery you need to do isn't.
>> I just wanted to give an update as to where I stand so as to not leave this
>> thread hanging. As I was working through pausing and unpausing the domain
>> during Xen writes to guest memory, I found a spot where the write happens
>> outside of __copy_to_user_ll() and __put_user_size(). It occurs in
>> create_bounce_frame() where Xen writes to the guest stack to setup an
>> exception frame. At that moment, if the guest stack is marked read-only, we
>> end up in the same situation as with the copy to guest functions but with no
>> obvious place to revert the page table entry back to read-only. I am at the
>> moment looking for a spot where I can do this.
> I have a solution for the create_bounce_frame() issue I described above. Please find below a POC patch that includes pausing and unpausing the domain during the Xen writes to guest memory. I have it on top of the patch that was using CR0.WP to highlight the difference. Please take a look and let me know if this solution is acceptable. 
>
> PS: I do realize whatever I do to create_bounce_frame() will have to be reflected in the compat version. If this is correct approach I will do the same there too.
>
> Thanks,
> Aravindh

What is wrong with just making use of CR0.WP to solve this issue?

Alternatively, locate the page in question and use map_domain_page() to
get a supervisor rw mapping.
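
Roughly (sketch only, with the virtual-to-mfn lookup and all error handling
elided; write_guest_frame() is not an existing function):

/* Sketch: write through a temporary supervisor mapping of the frame,
 * bypassing the read-only shadow mapping.  'mfn' is assumed to have been
 * looked up and validated already, and the write is assumed not to cross
 * a page boundary. */
static void write_guest_frame(unsigned long mfn, unsigned int offset,
                              const void *src, size_t len)
{
    char *va = map_domain_page(mfn);

    memcpy(va + offset, src, len);
    unmap_domain_page(va);
}

That would leave the guest-visible access state untouched, at the cost of
doing the lookup for the address being written by hand.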


I am concerned with the addition of the vcpu specifics to
shadow_write_entries().  Most of the shadow code is already vcpu centric
where it should be domain centric, and steps are being made to alleviate
these problems.  Any access from a toolstack/device model hypercall
will probably be using vcpu[0], which will cause this logic to be
applied in an erroneous context.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22  9:34                               ` Andrew Cooper
@ 2014-08-22 10:02                                 ` Jan Beulich
  2014-08-22 10:14                                   ` Andrew Cooper
  2014-08-22 18:28                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-08-25 12:45                                 ` Gianluca Guida
  2 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-22 10:02 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 22.08.14 at 11:34, <andrew.cooper3@citrix.com> wrote:
> On 22/08/14 03:29, Aravindh Puthiyaparambil (aravindp) wrote:
>>>> No, at least not about unspecified hypothetical ones. But again - a
>>>> vague statement like you gave, without any kind of quantification of
>>>> the imposed overhead, isn't going to be good enough a judgment.
>>>> After all pausing a domain can be quite problematic for its performance
>>>> if that happens reasonably frequently. Otoh I admit that the user of
>>>> your new mechanism has a certain level of control over the impact via
>>>> the number of pages (s)he wants to write-protect. So yes, perhaps it
>>>> isn't going to be too bad as long as the hackery you need to do isn't.
>>> I just wanted to give an update as to where I stand so as to not leave this
>>> thread hanging. As I was working through pausing and unpausing the domain
>>> during Xen writes to guest memory, I found a spot where the write happens
>>> outside of __copy_to_user_ll() and __put_user_size(). It occurs in
>>> create_bounce_frame() where Xen writes to the guest stack to setup an
>>> exception frame. At that moment, if the guest stack is marked read-only, we
>>> end up in the same situation as with the copy to guest functions but with no
>>> obvious place to revert the page table entry back to read-only. I am at the
>>> moment looking for a spot where I can do this.
>> I have a solution for the create_bounce_frame() issue I described above. 
> Please find below a POC patch that includes pausing and unpausing the domain 
> during the Xen writes to guest memory. I have it on top of the patch that was 
> using CR0.WP to highlight the difference. Please take a look and let me know 
> if this solution is acceptable. 
>>
>> PS: I do realize whatever I do to create_bounce_frame() will have to be 
> reflected in the compat version. If this is the correct approach I will do the 
> same there too.
> 
> What is wrong with just making use of CR0.WP to solve this issue?

The problem is that the period of time during which that flag would
remain clear isn't well bounded (due to the potential of interrupts
kicking in meanwhile).

> Alternatively, locate the page in question and use map_domain_page() to
> get a supervisor rw mapping.

Certainly not nice especially in the create_bounce_frame() case
(albeit I think a callout is being used anyway according to what
Aravindh said - I didn't look at the proposed changes in detail
yet), and the necessarily involved manual page table walk likely
wouldn't be nice either.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 10:02                                 ` Jan Beulich
@ 2014-08-22 10:14                                   ` Andrew Cooper
  0 siblings, 0 replies; 85+ messages in thread
From: Andrew Cooper @ 2014-08-22 10:14 UTC (permalink / raw)
  To: Jan Beulich, Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 22/08/14 11:02, Jan Beulich wrote:
>>>> On 22.08.14 at 11:34, <andrew.cooper3@citrix.com> wrote:
>> On 22/08/14 03:29, Aravindh Puthiyaparambil (aravindp) wrote:
>>>>> No, at least not about unspecified hypothetical ones. But again - a
>>>>> vague statement like you gave, without any kind of quantification of
>>>>> the imposed overhead, isn't going to be good enough a judgment.
>>>>> After all pausing a domain can be quite problematic for its performance
>>>>> if that happens reasonably frequently. Otoh I admit that the user of
>>>>> your new mechanism has a certain level of control over the impact via
>>>>> the number of pages (s)he wants to write-protect. So yes, perhaps it
>>>>> isn't going to be too bad as long as the hackery you need to do isn't.
>>>> I just wanted to give an update as to where I stand so as to not leave this
>>>> thread hanging. As I was working through pausing and unpausing the domain
>>>> during Xen writes to guest memory, I found a spot where the write happens
>>>> outside of __copy_to_user_ll() and __put_user_size(). It occurs in
>>>> create_bounce_frame() where Xen writes to the guest stack to setup an
>>>> exception frame. At that moment, if the guest stack is marked read-only, we
>>>> end up in the same situation as with the copy to guest functions but with no
>>>> obvious place to revert the page table entry back to read-only. I am at the
>>>> moment looking for a spot where I can do this.
>>> I have a solution for the create_bounce_frame() issue I described above. 
>> Please find below a POC patch that includes pausing and unpausing the domain 
>> during the Xen writes to guest memory. I have it on top of the patch that was 
>> using CR0.WP to highlight the difference. Please take a look and let me know 
>> if this solution is acceptable. 
>>> PS: I do realize whatever I do to create_bounce_frame() will have to be 
>> reflected in the compat version. If this is the correct approach I will do the 
>> same there too.
>>
>> What is wrong with just making use of CR0.WP to solve this issue?
> The problem is that the period of time during which that flag would
> remain clear isn't well bounded (due to the potential of interrupts
> kicking in meanwhile).

Very true - I retract the suggestion.

>
>> Alternatively, locate the page in question and use map_domain_page() to
>> get a supervisor rw mapping.
> Certainly not nice especially in the create_bounce_frame() case
> (albeit I think a callout is being used anyway according to what
> Aravindh said - I didn't look at the proposed changes in detail
> yet), and the necessarily involved manual page table walk likely
> wouldn't be nice either.

Hmm - that isn't nice.

On further consideration, neither this nor the suggested patch deals with
create_bounce_frame() crossing a page boundary and encountering a
different mfn which is also read-only.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22  9:34                               ` Andrew Cooper
@ 2014-08-22 15:33                               ` Jan Beulich
  2014-08-22 19:07                                 ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-22 15:33 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 22.08.14 at 04:29, <aravindp@cisco.com> wrote:
> I have a solution for the create_bounce_frame() issue I described above. 
> Please find below a POC patch that includes pausing and unpausing the domain 
> during the Xen writes to guest memory. I have it on top of the patch that was 
> using CR0.WP to highlight the difference. Please take a look and let me know 
> if this solution is acceptable. 

As Andrew already pointed out, you absolutely need to deal with
page crossing accesses, and I think you also need to deal with
hypervisor accesses extending beyond a page worth of memory
(I'm not sure we have a firmly determined upper bound of how
much memory we may copy in one go).

> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -1168,9 +1168,13 @@ int __init construct_dom0(
>                 COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
>      }
>  
> -    /* Pages that are part of page tables must be read only. */
>      if  ( is_pv_domain(d) )
> +    {
> +        v->arch.pv_vcpu.mfn_access_reset_req = 0;
> +        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
> +        /* Pages that are part of page tables must be read only. */
>          mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);

The order of these should be reversed, with a blank line in between,
to have the important thing first.

>              if ( violation && access_w &&
>                   regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
>              {
> -                unsigned long cr0 = read_cr0();
> -
>                  violation = 0;
> -                if ( cr0 & X86_CR0_WP &&
> -                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
> +                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>                  {
> -                    cr0 &= ~X86_CR0_WP;
> -                    write_cr0(cr0);
> -                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
> +                    domain_pause_nosync(d);

I don't think a "nosync" pause is enough here, as that leaves a
window for the guest to write to the page. Since the sync version
may take some time to complete it may become difficult for you to
actually handle this in an acceptable way.

> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -441,6 +441,12 @@ UNLIKELY_START(z, create_bounce_frame_bad_bounce_ip)
>          jmp   asm_domain_crash_synchronous  /* Does not return */
>  __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
>          movq  %rax,UREGS_rip+8(%rsp)
> +        cmpb  $1, VCPU_mfn_access_reset_req(%rbx)
> +        je    2f

Please avoid comparing boolean values against other than zero.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22  9:34                               ` Andrew Cooper
  2014-08-22 10:02                                 ` Jan Beulich
@ 2014-08-22 18:28                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22 18:52                                   ` Andrew Cooper
  2014-08-25 12:45                                 ` Gianluca Guida
  2 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-22 18:28 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: xen-devel, Tim Deegan, Keir Fraser, Ian Jackson, Ian Campbell

>Please find below a POC patch that includes pausing and unpausing the
>domain during the Xen writes to guest memory. I have it on top of the patch
>that was using CR0.WP to highlight the difference. Please take a look and let
>me know if this solution is acceptable.
>>
>> PS: I do realize whatever I do to create_bounce_frame() will have to be
>reflected in the compat version. If this is correct approach I will do the same
>there too.
>>
>> Thanks,
>> Aravindh
>
>I am concerned with the addition of the vcpu specifics to
>shadow_write_entries().  Most of the shadow code is already vcpu centric
>where it should be domain centric, and steps are being made to alleviate
>these problems.  

All the call sites of shadow_write_entries() are vcpu specific, which is why I thought it was OK to extend this to shadow_write_entries(). What are the steps being taken to alleviate the problems? Maybe I can piggy back on them?

>Any access coming in from a toolstack/device model hypercall
>will probably be using vcpu[0], which will cause this logic to be
>applied in an erroneous context.

If the access from a toolstack/device model hypercall causes a Xen write to guest memory, it will be recorded for vcpu[0]. So won't it be ok?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 18:28                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-22 18:52                                   ` Andrew Cooper
  0 siblings, 0 replies; 85+ messages in thread
From: Andrew Cooper @ 2014-08-22 18:52 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Jan Beulich
  Cc: xen-devel, Tim Deegan, Keir Fraser, Ian Jackson, Ian Campbell

On 22/08/14 19:28, Aravindh Puthiyaparambil (aravindp) wrote:
>> Please find below a POC patch that includes pausing and unpausing the
>> domain during the Xen writes to guest memory. I have it on top of the patch
>> that was using CR0.WP to highlight the difference. Please take a look and let
>> me know if this solution is acceptable.
>>> PS: I do realize whatever I do to create_bounce_frame() will have to be
>> reflected in the compat version. If this is correct approach I will do the same
>> there too.
>>> Thanks,
>>> Aravindh
>> I am concerned with the addition of the vcpu specifics to
>> shadow_write_entries().  Most of the shadow code is already vcpu centric
>> where it should be domain centric, and steps are being made to alleviate
>> these problems.  
> All the call sites of shadow_write_entries() are vcpu specific, which is why I thought it was OK to extend this to shadow_write_entries(). What are the steps being taken to alleviate the problems? Maybe I can piggy back on them?

You are not the first person to make this assumption.  I have a patch
series that I work on when I am less busy, but I don't think there is
anything useful you could piggy back on.

This problem aside, your current proposal does not work when crossing
page boundaries where the adjacent page is also read-only.  This is an
issue which really does need fixing.

Unfortunately, I am at a loss as to what to suggest.  No practical
solution comes to mind without using CR0.WP, and that has its
associated problems.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 15:33                               ` Jan Beulich
@ 2014-08-22 19:07                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22 19:24                                   ` Andrew Cooper
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-22 19:07 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell,
	Andrew Cooper (andrew.cooper3@citrix.com),
	Ian Jackson, Tim Deegan, xen-devel

>> I have a solution for the create_bounce_frame() issue I described above.
>> Please find below a POC patch that includes pausing and unpausing the
>> domain during the Xen writes to guest memory. I have it on top of the
>> patch that was using CR0.WP to highlight the difference. Please take a
>> look and let me know if this solution is acceptable.
>
>As Andrew already pointed out, you absolutely need to deal with page
>crossing accesses, 

Is this for, say, an unsigned long that lives across two pages? Off the top of my head, I think always allowing writes to the page in question and the next one, followed by reverting both pages to the default at the end of the write, should take care of this. I would have to walk the page tables to figure out the next mfn. Or am I on the wrong track here?

> and I think you also need to deal with hypervisor accesses
>extending beyond a page worth of memory (I'm not sure we have a firmly
>determined upper bound of how much memory we may copy in one go).

Let me try to understand what happens in the non-mem_access case. Say the hypervisor is writing to three pages and all of them are not accessible in the guest. Which one of the following is true?
1. There is a pagefault for the first page which is resolved. The write is then retried which causes a fault for the second page which is resolved. Then the write is retried starting from the second page and so on for the third page too.
2. Or does the write get retried starting from the first page each time the page fault is resolved?

>> --- a/xen/arch/x86/domain_build.c
>> +++ b/xen/arch/x86/domain_build.c
>> @@ -1168,9 +1168,13 @@ int __init construct_dom0(
>>                 COMPAT_L2_PAGETABLE_XEN_SLOTS(d) * sizeof(*l2tab));
>>      }
>>
>> -    /* Pages that are part of page tables must be read only. */
>>      if  ( is_pv_domain(d) )
>> +    {
>> +        v->arch.pv_vcpu.mfn_access_reset_req = 0;
>> +        v->arch.pv_vcpu.mfn_access_reset = INVALID_MFN;
>> +        /* Pages that are part of page tables must be read only. */
>>          mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages);
>
>The order of these should be reversed, with a blank line in between, to have
>the important thing first.

Will do.

>>              if ( violation && access_w &&
>>                   regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
>>              {
>> -                unsigned long cr0 = read_cr0();
>> -
>>                  violation = 0;
>> -                if ( cr0 & X86_CR0_WP &&
>> -                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>> +                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>                  {
>> -                    cr0 &= ~X86_CR0_WP;
>> -                    write_cr0(cr0);
>> -                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>> +                    domain_pause_nosync(d);
>
>I don't think a "nosync" pause is enough here, as that leaves a window for the
>guest to write to the page. Since the sync version may take some time to
>complete it may become difficult for you to actually handle this in an
>acceptable way.

Are you worried about performance or is there some other issue? 

>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -441,6 +441,12 @@ UNLIKELY_START(z,
>create_bounce_frame_bad_bounce_ip)
>>          jmp   asm_domain_crash_synchronous  /* Does not return */
>>  __UNLIKELY_END(create_bounce_frame_bad_bounce_ip)
>>          movq  %rax,UREGS_rip+8(%rsp)
>> +        cmpb  $1, VCPU_mfn_access_reset_req(%rbx)
>> +        je    2f
>
>Please avoid comparing boolean values against other than zero.

OK, I will do:
+        cmpb  $0, VCPU_mfn_access_reset_req(%rbx)
+        jne    2f

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 19:07                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-22 19:24                                   ` Andrew Cooper
  2014-08-22 19:48                                     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Andrew Cooper @ 2014-08-22 19:24 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Jan Beulich
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 22/08/14 20:07, Aravindh Puthiyaparambil (aravindp) wrote:
>>> I have a solution for the create_bounce_frame() issue I described above.
>>> Please find below a POC patch that includes pausing and unpausing the
>>> domain during the Xen writes to guest memory. I have it on top of the
>>> patch that was using CR0.WP to highlight the difference. Please take a
>>> look and let me know if this solution is acceptable.
>> As Andrew already pointed out, you absolutely need to deal with page
>> crossing accesses, 
> Is this for say an unsigned long that lives across two pages? Off the top of my head, I think always allowing writes to the page in question and the next followed by reverting to default for both pages at the end of the write should take care of this. I would have to walk the page tables to figure out the next mfn. Or am I on the wrong track here?

create_bounce_frame puts several adjacent words on a guest stack, and
this is very capable of crossing a page boundary.

Even an unaligned uint16_t can cross a page boundary.
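
As a point of reference, whether a write of a given length starting at a
given virtual address spans two pages is a simple offset check; a sketch
(PAGE_SIZE/PAGE_MASK as used elsewhere in Xen):

    /* Sketch: true if [va, va + len) does not fit inside a single page. */
    static inline bool_t crosses_page_boundary(unsigned long va, unsigned int len)
    {
        return ((va & ~PAGE_MASK) + len) > PAGE_SIZE;
    }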

>
>> and I think you also need to deal with hypervisor accesses
>> extending beyond a page worth of memory (I'm not sure we have a firmly
>> determined upper bound of how much memory we may copy in one go).
> Let me try to understand what happens in the non-mem_access case. Say the hypervisor is writing to three pages and all of them are not accessible in the guest. Which one of the following is true?
> 1. There is a pagefault for the first page which is resolved. The write is then retried which causes a fault for the second page which is resolved. Then the write is retried starting from the second page and so on for the third page too.
> 2. Or does the write get retried starting from the first page each time the page fault is resolved?

For the non-mem_access case, all faults cause failures.

copy_to/from_user() will typically result in an -EFAULT being handed
back to the hypercaller.  For create_bounce_frame, the results are more
severe and might result in a domain crash or an injection of a failsafe
callback.

No attempt is made to play with the page permissions, as it is the
guest's fault that the pages have the wrong permissions.

What mem_access introduces is a case where it is Xen's fault that a
write fault occurred, and the fault should be worked around as the guest
is unaware that its pages are actually read-only.
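
To illustrate the contrast, the usual pattern on a hypercall path looks
roughly like this (illustrative only; arg and out are placeholders, and
copy_to_guest() returns non-zero when some bytes could not be copied):

    /* Sketch: a failed guest copy is simply reported back to the caller;
     * no attempt is made to adjust page permissions. */
    if ( copy_to_guest(arg, &out, 1) )
        return -EFAULT;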

>>>              if ( violation && access_w &&
>>>                   regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
>>>              {
>>> -                unsigned long cr0 = read_cr0();
>>> -
>>>                  violation = 0;
>>> -                if ( cr0 & X86_CR0_WP &&
>>> -                     guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>> +                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>>                  {
>>> -                    cr0 &= ~X86_CR0_WP;
>>> -                    write_cr0(cr0);
>>> -                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>>> +                    domain_pause_nosync(d);
>> I don't think a "nosync" pause is enough here, as that leaves a window for the
>> guest to write to the page. Since the sync version may take some time to
>> complete it may become difficult for you to actually handle this in an
>> acceptable way.
> Are you worried about performance or is there some other issue? 

Both performance and correctness.  With nosync(), guest vcpus can still
be running on other pcpus, and playing with this pagetable entry.

The synchronous variants can block for a moderate period of time.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 19:24                                   ` Andrew Cooper
@ 2014-08-22 19:48                                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22 20:02                                       ` Andrew Cooper
  2014-08-25  7:29                                       ` Jan Beulich
  0 siblings, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-22 19:48 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> As Andrew already pointed out, you absolutely need to deal with page
>>> crossing accesses,
>> Is this for say an unsigned long that lives across two pages? Off the top of
>my head, I think always allowing writes to the page in question and the next
>followed by reverting to default for both pages at the end of the write should
>take care of this. I would have to walk the page tables to figure out the next
>mfn. Or am I on the wrong track here?
>
>create_bounce_frame puts several adjacent words on a guest stack, and this
>is very capable of crossing a page boundary.
>
>Even an unaligned uint16_t can cross a page boundary.

OK, so marking two adjacent pages as writable and reverting after the write went through should solve this problem.

>>> and I think you also need to deal with hypervisor accesses extending
>>> beyond a page worth of memory (I'm not sure we have a firmly
>>> determined upper bound of how much memory we may copy in one go).
>> Let me try to understand what happens in the non-mem_access case. Say
>the hypervisor is writing to three pages and all of them are not accessible in
>the guest. Which one of the following is true?
>> 1. There is a pagefault for the first page which is resolved. The write is then
>retried which causes a fault for the second page which is resolved. Then the
>write is retried starting from the second page and so on for the third page too.
>> 2. Or does the write get retried starting from the first page each time the
>page fault is resolved?
>
>For the non-mem_access case, all faults cause failures.
>
>copy_to/from_user() will typically result in an -EFAULT being handed back to
>the hypercaller.  For create_bounce_frame, the results are more severe and
>might result in a domain crash or an injection of a failsafe callback.
>
>No attempt is made to play with the page permissions, as it is the guests fault
>that the pages have the wrong permissions.
>
>What mem_access introduces is a case where it is Xen's fault that a write fault
>occured, and the fault should be worked around as the guest is unaware that
>its pages are actually read-only.

Ouch, this does make things complicated. The only thing I can think of trying is your suggestion "Alternatively, locate the page in question and use map_domain_page() to get a supervisor rw mapping." Do this only in __copy_to_user_ll() for copies that span multiple pages, in cases where a mem_access listener is present and listening for write violations.

Sigh, if only I could bound the CR0.WP solution :-(

>>>> +                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>>>                  {
>>>> -                    cr0 &= ~X86_CR0_WP;
>>>> -                    write_cr0(cr0);
>>>> -                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>>>> +                    domain_pause_nosync(d);
>>> I don't think a "nosync" pause is enough here, as that leaves a
>>> window for the guest to write to the page. Since the sync version may
>>> take some time to complete it may become difficult for you to
>>> actually handle this in an acceptable way.
>> Are you worried about performance or is there some other issue?
>
>Both performance and correctness.  With nosync(), guest vcpus can still be
>running on other pcpus, and playing with this pagetable entry.
>
>The synchronous variants can block for a moderate period of time.

OK, I don't follow why pausing the other vcpus synchronously is an issue here. But if even pausing other guest vcpus synchronously is not an option, then it looks like I am at a dead end even if I solve the writes-spanning-multiple-pages issue.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 19:48                                     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-22 20:02                                       ` Andrew Cooper
  2014-08-22 20:13                                         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-25  7:33                                         ` Jan Beulich
  2014-08-25  7:29                                       ` Jan Beulich
  1 sibling, 2 replies; 85+ messages in thread
From: Andrew Cooper @ 2014-08-22 20:02 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Jan Beulich
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>>>> As Andrew already pointed out, you absolutely need to deal with page
>>>> crossing accesses,
>>> Is this for say an unsigned long that lives across two pages? Off the top of
>> my head, I think always allowing writes to the page in question and the next
>> followed by reverting to default for both pages at the end of the write should
>> take care of this. I would have to walk the page tables to figure out the next
>> mfn. Or am I on the wrong track here?
>>
>> create_bounce_frame puts several adjacent words on a guest stack, and this
>> is very capable of crossing a page boundary.
>>
>> Even an unaligned uint16_t can cross a page boundary.
> OK, so marking two adjacent pages as writable and reverting after the write went through should solve this problem.
>
>>>> and I think you also need to deal with hypervisor accesses extending
>>>> beyond a page worth of memory (I'm not sure we have a firmly
>>>> determined upper bound of how much memory we may copy in one go).
>>> Let me try to understand what happens in the non-mem_access case. Say
>> the hypervisor is writing to three pages and all of them are not accessible in
>> the guest. Which one of the following is true?
>>> 1. There is a pagefault for the first page which is resolved. The write is then
>> retried which causes a fault for the second page which is resolved. Then the
>> write is retried starting from the second page and so on for the third page too.
>>> 2. Or does the write get retried starting from the first page each time the
>> page fault is resolved?
>>
>> For the non-mem_access case, all faults cause failures.
>>
>> copy_to/from_user() will typically result in an -EFAULT being handed back to
>> the hypercaller.  For create_bounce_frame, the results are more severe and
>> might result in a domain crash or an injection of a failsafe callback.
>>
>> No attempt is made to play with the page permissions, as it is the guests fault
>> that the pages have the wrong permissions.
>>
>> What mem_access introduces is a case where it is Xen's fault that a write fault
>> occured, and the fault should be worked around as the guest is unaware that
>> its pages are actually read-only.
> Ouch, this does make things complicated. The only thing I can think of trying is your suggestion "Alternatively, locate the page in question and use map_domain_page() to get a supervisor rw mapping.". Do this only in __copy_to_user_ll() for copies that span multiple pages in the cases where a mem_access listener is present and listening for write violations. 
>
> Sigh, if only I could bound the CR0.WP solution :-(

I wonder whether, in the case that mem_access is enabled, it would be
reasonable to perform the CR0.WP sections with interrupts disabled?

The user/admin is already taking a performance hit from mem_access in
the first place, and delaying hardware interrupts a little is almost
certainly a lesser evil than whatever mad scheme we devise to fix these
issues.

With interrupts disabled, the CR0.WP problem is very well bounded.  The
only faults which will occur will be as a direct result of the actions
performed, where the fault handlers will follow the extable redirection
and return quickly.
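
A rough sketch of that idea (do_the_write() is a made-up stand-in for
whichever guest-memory write is being performed; not from the series):

    /* Sketch: keep the window with CR0.WP clear bounded by disabling
     * interrupts around the write. */
    unsigned long flags, cr0;
    int ret;

    local_irq_save(flags);

    cr0 = read_cr0();
    write_cr0(cr0 & ~X86_CR0_WP);   /* allow writes to read-only mappings */

    ret = do_the_write();           /* hypothetical: the actual guest write */

    write_cr0(cr0);                 /* restore WP before re-enabling IRQs */
    local_irq_restore(flags);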

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 20:02                                       ` Andrew Cooper
@ 2014-08-22 20:13                                         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-25  7:34                                           ` Jan Beulich
  2014-08-25  7:33                                         ` Jan Beulich
  1 sibling, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-22 20:13 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>>> As Andrew already pointed out, you absolutely need to deal with
>>>>> page crossing accesses,
>>>> Is this for say an unsigned long that lives across two pages? Off
>>>> the top of
>>> my head, I think always allowing writes to the page in question and
>>> the next followed by reverting to default for both pages at the end
>>> of the write should take care of this. I would have to walk the page
>>> tables to figure out the next mfn. Or am I on the wrong track here?
>>>
>>> create_bounce_frame puts several adjacent words on a guest stack, and
>>> this is very capable of crossing a page boundary.
>>>
>>> Even an unaligned uint16_t can cross a page boundary.
>> OK, so marking two adjacent pages as writable and reverting after the write
>went through should solve this problem.
>>
>>>>> and I think you also need to deal with hypervisor accesses
>>>>> extending beyond a page worth of memory (I'm not sure we have a
>>>>> firmly determined upper bound of how much memory we may copy in
>one go).
>>>> Let me try to understand what happens in the non-mem_access case.
>>>> Say
>>> the hypervisor is writing to three pages and all of them are not
>>> accessible in the guest. Which one of the following is true?
>>>> 1. There is a pagefault for the first page which is resolved. The
>>>> write is then
>>> retried which causes a fault for the second page which is resolved.
>>> Then the write is retried starting from the second page and so on for the
>third page too.
>>>> 2. Or does the write get retried starting from the first page each
>>>> time the
>>> page fault is resolved?
>>>
>>> For the non-mem_access case, all faults cause failures.
>>>
>>> copy_to/from_user() will typically result in an -EFAULT being handed
>>> back to the hypercaller.  For create_bounce_frame, the results are
>>> more severe and might result in a domain crash or an injection of a failsafe
>callback.
>>>
>>> No attempt is made to play with the page permissions, as it is the
>>> guests fault that the pages have the wrong permissions.
>>>
>>> What mem_access introduces is a case where it is Xen's fault that a
>>> write fault occured, and the fault should be worked around as the
>>> guest is unaware that its pages are actually read-only.
>> Ouch, this does make things complicated. The only thing I can think of trying
>is your suggestion "Alternatively, locate the page in question and use
>map_domain_page() to get a supervisor rw mapping.". Do this only in
>__copy_to_user_ll() for copies that span multiple pages in the cases where a
>mem_access listener is present and listening for write violations.
>>
>> Sigh, if only I could bound the CR0.WP solution :-(
>
>I wonder whether, in the case that mem_access is enabled, it would be
>reasonable to perform the CR0.WP sections with interrupts disabled?
>
>The user/admin is already taking a performance hit from mem_access in the
>first place, and delaying hardware interrupts a little almost certainly a lesser
>evil than whatever mad scheme we devise to fix these issues.
>
>With interrupts disabled, the CR0.WP problem is very well bounded.  The only
>faults which will occur will be as a direct result of the actions performed,
>where the fault handlers will follow the extable redirection and return quickly.

I am fine with using your approach and taking the performance hit, especially given that it is a corner case for mem_access listeners watching for Xen writes to guest memory.

Jan, are you OK with it?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 19:48                                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-22 20:02                                       ` Andrew Cooper
@ 2014-08-25  7:29                                       ` Jan Beulich
  2014-08-25 16:40                                         ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-25  7:29 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>>> On 22.08.14 at 21:48, <aravindp@cisco.com> wrote:
>>>> As Andrew already pointed out, you absolutely need to deal with page
>>>> crossing accesses,
>>> Is this for say an unsigned long that lives across two pages? Off the top of
>>my head, I think always allowing writes to the page in question and the next
>>followed by reverting to default for both pages at the end of the write 
> should
>>take care of this. I would have to walk the page tables to figure out the 
> next
>>mfn. Or am I on the wrong track here?
>>
>>create_bounce_frame puts several adjacent words on a guest stack, and this
>>is very capable of crossing a page boundary.
>>
>>Even an unaligned uint16_t can cross a page boundary.
> 
> OK, so marking two adjacent pages as writable and reverting after the write 
> went through should solve this problem.

For create_bounce_frame() yes, but not for the generic
copy_to_user(). But there you'd have the option of reverting the
first page's permissions when you hit a fault on the third one (i.e.
two slots for tracking pages should still suffice).
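
A sketch of the two-slot tracking described here (the structure and the
revert_access() helper are made up for illustration):

    /* Sketch: remember at most two mfns whose permissions were relaxed for
     * a Xen write; when both slots are full, revert the older one first. */
    struct mfn_access_reset {
        unsigned long mfn[2];        /* INVALID_MFN when unused */
        unsigned int  nr;
    };

    static void track_writable_mfn(struct mfn_access_reset *r, unsigned long mfn)
    {
        if ( r->nr == 2 )
        {
            revert_access(r->mfn[0]);    /* hypothetical: restore permissions */
            r->mfn[0] = r->mfn[1];
            r->nr = 1;
        }
        r->mfn[r->nr++] = mfn;
    }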

>>>>> +                if ( guest_l1e_get_flags(gw.l1e) & _PAGE_RW )
>>>>>                  {
>>>>> -                    cr0 &= ~X86_CR0_WP;
>>>>> -                    write_cr0(cr0);
>>>>> -                    v->arch.pv_vcpu.need_cr0_wp_set = 1;
>>>>> +                    domain_pause_nosync(d);
>>>> I don't think a "nosync" pause is enough here, as that leaves a
>>>> window for the guest to write to the page. Since the sync version may
>>>> take some time to complete it may become difficult for you to
>>>> actually handle this in an acceptable way.
>>> Are you worried about performance or is there some other issue?
>>
>>Both performance and correctness.  With nosync(), guest vcpus can still be
>>running on other pcpus, and playing with this pagetable entry.
>>
>>The synchronous variants can block for a moderate period of time.
> 
> OK, I don't follow why pausing the other vcpus synchronously is an issue 
> here.

So what is it you don't understand? The remote vCPU-s won't
necessarily be paused by the time domain_pause_nosync()
returns (that's the nature of the "nosync"). Since the page table
entry then gets modified, those remote vCPU-s might access
memory using that entry, and the listener wouldn't see the write
access. And when you use the synchronous form, the waiting for
the remote vCPU-s to become de-scheduled will take some time,
which is particularly bad for many-vCPU guests with the way
domain_pause() currently works (of course this could be improved
by parallelizing the pausing of the individual vCPU-s, first issuing
vcpu_sleep_nosync() on all of them and then waiting for all of
them to become non-running).
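
A sketch of that parallelised pausing, built on the existing
vcpu_sleep_*() primitives (not an actual patch):

    /* Sketch: issue the sleep requests to all vCPUs first, and only then
     * wait for each of them to be descheduled. */
    static void domain_pause_parallel(struct domain *d)
    {
        struct vcpu *v;

        atomic_inc(&d->pause_count);

        for_each_vcpu ( d, v )
            vcpu_sleep_nosync(v);

        for_each_vcpu ( d, v )
            vcpu_sleep_sync(v);
    }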

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 20:02                                       ` Andrew Cooper
  2014-08-22 20:13                                         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-25  7:33                                         ` Jan Beulich
  2014-08-25 12:49                                           ` Andrew Cooper
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-25  7:33 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 22.08.14 at 22:02, <andrew.cooper3@citrix.com> wrote:
> On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>> Sigh, if only I could bound the CR0.WP solution :-(
> 
> I wonder whether, in the case that mem_access is enabled, it would be
> reasonable to perform the CR0.WP sections with interrupts disabled?

Which still wouldn't cover NMIs (albeit we might be able to live with
that). But what's worse - taking faults with interrupts disabled
requires extra care, and annotating code normally run with
interrupts enabled with the special .ex_table.pre annotations
doesn't seem like a very nice route, as that could easily hide other
problems in the future.

Jan

> The user/admin is already taking a performance hit from mem_access in
> the first place, and delaying hardware interrupts a little almost
> certainly a lesser evil than whatever mad scheme we devise to fix these
> issues.
> 
> With interrupts disabled, the CR0.WP problem is very well bounded.  The
> only faults which will occur will be as a direct result of the actions
> performed, where the fault handlers will follow the extable redirection
> and return quickly.
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22 20:13                                         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-25  7:34                                           ` Jan Beulich
  0 siblings, 0 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-25  7:34 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 22.08.14 at 22:13, <aravindp@cisco.com> wrote:
>>> Sigh, if only I could bound the CR0.WP solution :-(
>>
>>I wonder whether, in the case that mem_access is enabled, it would be
>>reasonable to perform the CR0.WP sections with interrupts disabled?
>>
>>The user/admin is already taking a performance hit from mem_access in the
>>first place, and delaying hardware interrupts a little almost certainly a lesser
>>evil than whatever mad scheme we devise to fix these issues.
>>
>>With interrupts disabled, the CR0.WP problem is very well bounded.  The only
>>faults which will occur will be as a direct result of the actions performed,
>>where the fault handlers will follow the extable redirection and return quickly.
> 
> I am fine with using your approach and taking the performance hit, especially 
> given that it is a corner case for mem_access listeners watching for Xen 
> writes to guest memory.
> 
> Jan, are you OK with it?

See the other reply to Andrew's mail I just sent.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-22  9:34                               ` Andrew Cooper
  2014-08-22 10:02                                 ` Jan Beulich
  2014-08-22 18:28                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-25 12:45                                 ` Gianluca Guida
  2014-08-25 13:01                                   ` Jan Beulich
  2014-08-25 13:02                                   ` Andrew Cooper
  2 siblings, 2 replies; 85+ messages in thread
From: Gianluca Guida @ 2014-08-25 12:45 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, Jan Beulich,
	Aravindh Puthiyaparambil (aravindp),
	xen-devel



On Fri, Aug 22, 2014 at 10:34 AM, Andrew Cooper <andrew.cooper3@citrix.com>
wrote:

> I am concerned with the addition of the vcpu specifics to
> shadow_write_entries().  Most of the shadow code is already vcpu centric
> where it should be domain centric, and steps are being made to alleviate
> these problems.


The historical reason the code is set up this way, if you are referring to
the fact that every shadow operation is vcpu-specific while the interface
to it is domain specific, is due to the design choice of leaving the door
open to experiment with per-vcpu shadows.
That always looked like a nice feature, though I am not sure anybody ever
implemented it. I would advocate -- for the sake of code consistency -- to
keep the current shadow internal interfaces per-vcpu in upcoming patches,
and change it when you propose your domain-centric patch, effectively
killing this probably never-exploited opportunity.

Honestly, I haven't been following shadow code in a while, so probably
consistency has already been lost, in which case you should feel free to
ignore this comment.

Gianluca




>   Any access coming in from a toolstack/device model hypercall
> will probably be using vcpu[0], which will cause this logic to be
> applied in an erroneous context.
>
> ~Andrew
>


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25  7:33                                         ` Jan Beulich
@ 2014-08-25 12:49                                           ` Andrew Cooper
  2014-08-25 13:09                                             ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Andrew Cooper @ 2014-08-25 12:49 UTC (permalink / raw)
  To: Jan Beulich, Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 25/08/14 08:33, Jan Beulich wrote:
>>>> On 22.08.14 at 22:02, <andrew.cooper3@citrix.com> wrote:
>> On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>>> Sigh, if only I could bound the CR0.WP solution :-(
>> I wonder whether, in the case that mem_access is enabled, it would be
>> reasonable to perform the CR0.WP sections with interrupts disabled?
> Which still wouldn't cover NMIs (albeit we might be able to live with
> that).

NMIs and MCEs are short, possibly raise softirqs, or call panic().  We
have much larger problems in general if the lack of CR0.WP would
adversely affect the NMI or MCE paths.

> But what's worse - taking faults with interrupts disabled
> requires extra care, and annotating code normally run with
> interrupts enabled with the special .ex_table.pre annotations
> doesn't seem like a very nice route, as that could easily hide other
> problems in the future.

Does it?  In exception_with_ints_disabled, if the .ex_table.pre search
fails, we jump back in after the sti label and continue with the regular
handler.

This requires no more careful handling than existing constructs such as
wrmsr_safe() inside a spinlock_irq{,save}() region.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 12:45                                 ` Gianluca Guida
@ 2014-08-25 13:01                                   ` Jan Beulich
  2014-08-25 13:02                                   ` Andrew Cooper
  1 sibling, 0 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-25 13:01 UTC (permalink / raw)
  To: Gianluca Guida
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Ian Jackson,
	Tim Deegan, Aravindh Puthiyaparambil (aravindp),
	xen-devel

>>> On 25.08.14 at 14:45, <glguida@gmail.com> wrote:
> On Fri, Aug 22, 2014 at 10:34 AM, Andrew Cooper <andrew.cooper3@citrix.com>
> wrote:
> 
>> I am concerned with the addition of the vcpu specifics to
>> shadow_write_entries().  Most of the shadow code is already vcpu centric
>> where it should be domain centric, and steps are being made to alleviate
>> these problems.
> 
> 
> The historical reason the code is set up this way, if you are referring to
> the fact that every shadow operation is vcpu-specific while the interface
> to it is domain specific, is due to the design choice of leaving the door
> open to experiment with per-vcpu shadows.
> That always looked like a nice feature, I am not sure anybody ever
> implemented it. I would advocate -- for the sake of code consistency -- to
> keep the current shadow internal interfaces per-vcpu in upcoming patches,
> and change it when you propose your domain-centric patch, effectively
> killing this probably never-exploited opportunity.

Except that a number of places broke that possible route already,
by using vCPU 0 where only a domain gets passed in.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 12:45                                 ` Gianluca Guida
  2014-08-25 13:01                                   ` Jan Beulich
@ 2014-08-25 13:02                                   ` Andrew Cooper
  2014-08-25 13:59                                     ` Gianluca Guida
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Cooper @ 2014-08-25 13:02 UTC (permalink / raw)
  To: Gianluca Guida
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, Jan Beulich,
	Aravindh Puthiyaparambil (aravindp),
	xen-devel



On 25/08/14 13:45, Gianluca Guida wrote:
> On Fri, Aug 22, 2014 at 10:34 AM, Andrew Cooper
> <andrew.cooper3@citrix.com <mailto:andrew.cooper3@citrix.com>> wrote:
>
>     I am concerned with the addition of the vcpu specifics to
>     shadow_write_entries().  Most of the shadow code is already vcpu
>     centric
>     where it should be domain centric, and steps are being made to
>     alleviate
>     these problems.
>
>
> The historical reason the code is set up this way, if you are
> referring to the fact that every shadow operation is vcpu-specific
> while the interface to it is domain specific, is due to the design
> choice of leaving the door open to experiment with per-vcpu shadows.
> That always looked like a nice feature, I am not sure anybody ever
> implemented it. I would advocate -- for the sake of code consistency
> -- to keep the current shadow internal interfaces per-vcpu in upcoming
> patches, and change it when you propose your domain-centric patch,
> effectively killing this probably never-exploited opportunity.
>
> Honestly, haven't been following shadow code in a while, so probably
> consistency has already been lost, in which case you should feel free
> to ignore this comment.
>
> Gianluca

I appreciate that it was done with vcpus to experiment with per-vcpu
shadows, but it is fundamentally wrong (even in a per-vcpu shadows case)
for certain paths to arbitrarily use d->vcpu[0] when calling into the
shadow code.  At the very least, it adversely messes with the heuristics.

There are fundamentally two separate entry paths into the shadow code.
The first is from the pagefault handler, which certainly is vcpu-centric;
the second is from toolstack operations, which are very much domain centric.
A per-vcpu shadow setup would need to use for_each_vcpu() under the hood,
as it certainly couldn't trust that the caller has handed it an appropriate
vcpu.
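
As a sketch of that split, a domain-centric entry point would iterate over
the vcpus itself rather than trusting a caller-supplied one (sh_sync_vcpu()
is a made-up name for whatever per-vcpu work is needed):

    /* Sketch: domain-centric wrapper for a toolstack-originated operation. */
    static void sh_domain_op(struct domain *d)
    {
        struct vcpu *v;

        for_each_vcpu ( d, v )
            sh_sync_vcpu(v);    /* hypothetical per-vcpu shadow operation */
    }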

~Andrew


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 12:49                                           ` Andrew Cooper
@ 2014-08-25 13:09                                             ` Jan Beulich
  2014-08-25 16:56                                               ` Aravindh Puthiyaparambil (aravindp)
  2014-08-25 17:44                                               ` Andrew Cooper
  0 siblings, 2 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-25 13:09 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 25.08.14 at 14:49, <andrew.cooper3@citrix.com> wrote:
> On 25/08/14 08:33, Jan Beulich wrote:
>>>>> On 22.08.14 at 22:02, <andrew.cooper3@citrix.com> wrote:
>>> On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>>>> Sigh, if only I could bound the CR0.WP solution :-(
>>> I wonder whether, in the case that mem_access is enabled, it would be
>>> reasonable to perform the CR0.WP sections with interrupts disabled?
>> Which still wouldn't cover NMIs (albeit we might be able to live with
>> that).
> 
> NMIs and MCEs are short, possibly raise softirqs, or call panic().  We
> have much larger problems in general if the lack of CR0.WP would
> adversely affect the NMI or MCE paths.

I agree for MCEs, but NMIs don't necessarily mean severe problems.

>> But what's worse - taking faults with interrupts disabled
>> requires extra care, and annotating code normally run with
>> interrupts enabled with the special .ex_table.pre annotations
>> doesn't seem like a very nice route, as that could easily hide other
>> problems in the future.
> 
> Does it?  In exception_with_ints_disabled, if the .ex_table.pre search
> fails, we jump back into the regular handler after the sti label and
> continue with the regular handler.
> 
> This requires no more careful handling than existing constructs such as
> wrmsr_safe() inside a spinlock_irq{,save}() region.

Oh, indeed. Which points out that one shouldn't
- use the same numeric label twice in a row, and even inside the
  same function
- jump to a numeric label across one or more non-numeric ones

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 13:02                                   ` Andrew Cooper
@ 2014-08-25 13:59                                     ` Gianluca Guida
  0 siblings, 0 replies; 85+ messages in thread
From: Gianluca Guida @ 2014-08-25 13:59 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Gianluca Guida, Ian Jackson,
	Tim Deegan, Jan Beulich, Aravindh Puthiyaparambil (aravindp),
	xen-devel

I just realised I replied to this email privately.

With apologies to Andrew, here's another reply.

On Mon, Aug 25, 2014 at 02:02:37PM +0100, Andrew Cooper wrote:
> On 25/08/14 13:45, Gianluca Guida wrote:
> > On Fri, Aug 22, 2014 at 10:34 AM, Andrew Cooper
> > <andrew.cooper3@citrix.com <mailto:andrew.cooper3@citrix.com>> wrote:
> >
> >     I am concerned with the addition of the vcpu specifics to
> >     shadow_write_entries().  Most of the shadow code is already vcpu
> >     centric
> >     where it should be domain centric, and steps are being made to
> >     alleviate
> >     these problems.
> >
> >
> > The historical reason the code is set up this way, if you are
> > referring to the fact that every shadow operation is vcpu-specific
> > while the interface to it is domain specific, is due to the design
> > choice of leaving the door open to experiment with per-vcpu shadows.
> > That always looked like a nice feature, I am not sure anybody ever
> > implemented it. I would advocate -- for the sake of code consistency
> > -- to keep the current shadow internal interfaces per-vcpu in upcoming
> > patches, and change it when you propose your domain-centric patch,
> > effectively killing this probably never-exploited opportunity.
> >
> > Honestly, haven't been following shadow code in a while, so probably
> > consistency has already been lost, in which case you should feel free
> > to ignore this comment.
> >
> > Gianluca
> 
> I appreciate that it was done with vcpus to experiment with per-vcpu
> shadows, but it is fundamentally wrong (even in a per-vcpu shadows case)
> for certain paths to arbitrarily use d->vcpu[0] when calling into the
> shadow code.  At the very least, it adversely messes with the heuristics.

It is hackish, and ugly. Yes. [Not my code disclaimer here].
I don't follow it being _fundamentally_ wrong, or broken, but I am probably missing something. More specifically: are there known bugs in there due to the use of vcpu[0] from the callers of shadow code?

> There are fundamentally two separate entry paths into the shadow code. 
> First from the pagefault handler, which certainly is vcpu-centric, but
> also from toolstack operations, which is very much domain centric.  A
> per-vcpu shadow setup would need to use for_each_vcpu() under the hood,
> as it certainly couldn't trust that the caller has handed an appropriate
> vcpu.

This is a correct interpretation. But we should be able to trust the caller, if by that we mean mm.c.
But yes, adding an interface for the "toolstack operations" would make things better, and properly contain the shadow-internal architecture.

Gianluca

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25  7:29                                       ` Jan Beulich
@ 2014-08-25 16:40                                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-25 16:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>>>>>> +                    domain_pause_nosync(d);
>>>>> I don't think a "nosync" pause is enough here, as that leaves a
>>>>> window for the guest to write to the page. Since the sync version
>>>>> may take some time to complete it may become difficult for you to
>>>>> actually handle this in an acceptable way.
>>>> Are you worried about performance or is there some other issue?
>>>
>>>Both performance and correctness.  With nosync(), guest vcpus can
>>>still be running on other pcpus, and playing with this pagetable entry.
>>>
>>>The synchronous variants can block for a moderate period of time.
>>
>> OK, I don't follow why pausing the other vcpus synchronously is an
>> issue here.
>
>So what is it you don't understand? The remote vCPU-s won't necessarily be
>paused by the time domain_pause_nosync() returns (that's the nature of the
>"nosync"). Since the page table entry then gets modified, those remote
>vCPU-s might access memory using that entry, and the listener wouldn't see
>the write access. And when you use the synchronous form, the waiting for
>the remote vCPU-s to become de-scheduled will take some time, which is
>particularly bad for many-vCPU guests with the way
>domain_pause() currently works (of course this could be improved by
>parallelizing the pausing of the individual vCPU-s, first issuing
>vcpu_sleep_nosync() on all of them and then waiting for all of them to
>become non-running).

Understood. Thank you for the explanation.
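
For reference, a rough sketch of the parallelised pausing Jan describes might look like this (the function name is invented for illustration; for_each_vcpu(), vcpu_sleep_nosync() and vcpu_sleep_sync() are the existing primitives):

#include <xen/sched.h>

/*
 * Sketch only: pause all vCPUs of a domain in the two phases outlined
 * above: first ask every vCPU to stop running (non-blocking), then wait
 * for each of them to actually be de-scheduled, instead of pausing them
 * synchronously one after another the way domain_pause() does today.
 */
static void domain_pause_parallel(struct domain *d)
{
    struct vcpu *v;

    atomic_inc(&d->pause_count);

    /* Phase 1: request that every vCPU stop running; do not wait yet. */
    for_each_vcpu ( d, v )
        vcpu_sleep_nosync(v);

    /* Phase 2: wait for each vCPU to become non-running. */
    for_each_vcpu ( d, v )
        vcpu_sleep_sync(v);
}

domain_pause() itself is untouched here; this only illustrates the two-phase idea.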

Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 13:09                                             ` Jan Beulich
@ 2014-08-25 16:56                                               ` Aravindh Puthiyaparambil (aravindp)
  2014-08-26  7:08                                                 ` Jan Beulich
  2014-08-25 17:44                                               ` Andrew Cooper
  1 sibling, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-25 16:56 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>>> Sigh, if only I could bound the CR0.WP solution :-(
>>>> I wonder whether, in the case that mem_access is enabled, it would
>>>> be reasonable to perform the CR0.WP sections with interrupts disabled?
>>> Which still wouldn't cover NMIs (albeit we might be able to live with
>>> that).
>>
>> NMIs and MCEs are short, possibly raise softirqs, or call panic().  We
>> have much larger problems in general if the lack of CR0.WP would
>> adversely affect the NMI or MCE paths.
>
>I agree for MCEs, but NMIs don't necessarily mean severe problems.
>
>>> But what's worse - taking faults with interrupts disabled requires
>>> extra care, and annotating code normally run with interrupts enabled
>>> with the special .ex_table.pre annotations doesn't seem like a very
>>> nice route, as that could easily hide other problems in the future.
>>
>> Does it?  In exception_with_ints_disabled, if the .ex_table.pre search
>> fails, we jump back into the regular handler after the sti label and
>> continue with the regular handler.
>>
>> This requires no more careful handling than existing constructs such
>> as
>> wrmsr_safe() inside a spinlock_irq{,save}() region.
>
>Oh, indeed. Which points out that one shouldn't
>- use the same numeric label twice in a row, and even inside the
>  same function
>- jump to a numeric label across one or more non-numeric ones

Just to be certain as to where we stand:

1. The "page table RW bit flipping" solution is not viable because pausing the domain synchronously takes too long for many vcpus domains. Plus there is the added issue of vcpu vs domain heuristics. This is the case even after solving the page boundary and multiple page copy issues.

2. The "CR0.WP with interrupts disabled" solution is not viable because of NMIs. Or did I misunderstand?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 13:09                                             ` Jan Beulich
  2014-08-25 16:56                                               ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-25 17:44                                               ` Andrew Cooper
  2014-08-26  7:12                                                 ` Jan Beulich
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Cooper @ 2014-08-25 17:44 UTC (permalink / raw)
  To: Jan Beulich, Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 25/08/14 14:09, Jan Beulich wrote:
>>>> On 25.08.14 at 14:49, <andrew.cooper3@citrix.com> wrote:
>> On 25/08/14 08:33, Jan Beulich wrote:
>>>>>> On 22.08.14 at 22:02, <andrew.cooper3@citrix.com> wrote:
>>>> On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>>>>> Sigh, if only I could bound the CR0.WP solution :-(
>>>> I wonder whether, in the case that mem_access is enabled, it would be
>>>> reasonable to perform the CR0.WP sections with interrupts disabled?
>>> Which still wouldn't cover NMIs (albeit we might be able to live with
>>> that).
>> NMIs and MCEs are short, possibly raise softirqs, or call panic().  We
>> have much larger problems in general if the lack of CR0.WP would
>> adversely affect the NMI or MCE paths.
> I agree for MCEs, but NMIs don't necessarily mean severe problems.

Indeed - NMIs are not necessarily severe problems, but the handlers are
very specifically short and touch very little.  There is no reason for
the NMI handler to ever touch data which might be mapped read-only.

Furthermore, as soon as you take a fault in the NMI handler, we are back
into re-entrant NMI territory (which I still haven't gotten around to
fixing) which will certainly cause irreparable problems for Xen.

>
>>> But what's worse - taking faults with interrupts disabled
>>> requires extra care, and annotating code normally run with
>>> interrupts enabled with the special .ex_table.pre annotations
>>> doesn't seem like a very nice route, as that could easily hide other
>>> problems in the future.
>> Does it?  In exception_with_ints_disabled, if the .ex_table.pre search
>> fails, we jump back into the regular handler after the sti label and
>> continue with the regular handler.
>>
>> This requires no more careful handling than existing constructs such as
>> wrmsr_safe() inside a spinlock_irq{,save}() region.
> Oh, indeed. Which points out that one shouldn't
> - use the same numeric label twice in a row, and even inside the
>   same function
> - jump to a numeric label across one or more non-numeric ones

I am sorry, but I don't follow this train of logic.  Has it got
something to do with the subsections used to generate the .ex_table
redirection information?

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 16:56                                               ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-26  7:08                                                 ` Jan Beulich
  2014-08-26 22:27                                                   ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-08-26  7:08 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>>> On 25.08.14 at 18:56, <aravindp@cisco.com> wrote:
> Just to be certain as to where we stand:
> 
> 1. The "page table RW bit flipping" solution is not viable because pausing 
> the domain synchronously takes too long for many vcpus domains. Plus there is 
> the added issue of vcpu vs domain heuristics. This is the case even after 
> solving the page boundary and multiple page copy issues.
> 
> 2. The "CR0.WP with interrupts disabled" solution is not viable because of 
> NMIs. Or did I misunderstand?

For this second option, NMIs are a concern. Whether that makes it
not viable I'm not certain. We really need to weigh benefits and risks
here, and from a project-wide perspective I'm currently viewing the
PV mem-access feature as a niche thing, the more so since I'm unaware
of really widespread use of the HVM mem-access capabilities. I.e. the
most I can currently see happening is for it to go in clearly marked
experimental, provided that no code path used outside of that
feature suffers in any way (functionality and performance). But of
course I'm open to be convinced otherwise, or overruled by a
majority of other maintainers.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-25 17:44                                               ` Andrew Cooper
@ 2014-08-26  7:12                                                 ` Jan Beulich
  0 siblings, 0 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-26  7:12 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan,
	Aravindh Puthiyaparambil (aravindp),
	xen-devel

>>> On 25.08.14 at 19:44, <andrew.cooper3@citrix.com> wrote:
> On 25/08/14 14:09, Jan Beulich wrote:
>>>>> On 25.08.14 at 14:49, <andrew.cooper3@citrix.com> wrote:
>>> On 25/08/14 08:33, Jan Beulich wrote:
>>>>>>> On 22.08.14 at 22:02, <andrew.cooper3@citrix.com> wrote:
>>>>> On 22/08/14 20:48, Aravindh Puthiyaparambil (aravindp) wrote:
>>>>>> Sigh, if only I could bound the CR0.WP solution :-(
>>>>> I wonder whether, in the case that mem_access is enabled, it would be
>>>>> reasonable to perform the CR0.WP sections with interrupts disabled?
>>>> Which still wouldn't cover NMIs (albeit we might be able to live with
>>>> that).
>>> NMIs and MCEs are short, possibly raise softirqs, or call panic().  We
>>> have much larger problems in general if the lack of CR0.WP would
>>> adversely affect the NMI or MCE paths.
>> I agree for MCEs, but NMIs don't necessarily mean severe problems.
> 
> Indeed - NMIs are not necessarily severe problems, but the handlers are
> very specifically short and touch very little.  There is no reason for
> the NMI handler to ever touch data which might be mapped read-only.
> 
> Furthermore, as soon as you take a fault in the NMI handler, we are back
> into re-entrant NMI territory (which I still havn't gotten around to
> fixing) which will certainly cause irreparable problems for Xen.

And possibly isn't even worth fixing: We should just make sure NMIs
are what they're supposed to be - non-reentrant.

>>>> But what's worse - taking faults with interrupts disabled
>>>> requires extra care, and annotating code normally run with
>>>> interrupts enabled with the special .ex_table.pre annotations
>>>> doesn't seem like a very nice route, as that could easily hide other
>>>> problems in the future.
>>> Does it?  In exception_with_ints_disabled, if the .ex_table.pre search
>>> fails, we jump back into the regular handler after the sti label and
>>> continue with the regular handler.
>>>
>>> This requires no more careful handling than existing constructs such as
>>> wrmsr_safe() inside a spinlock_irq{,save}() region.
>> Oh, indeed. Which points out that one shouldn't
>> - use the same numeric label twice in a row, and even inside the
>>   same function
>> - jump to a numeric label across one or more non-numeric ones
> 
> I am sorry, but I don't follow this train of logic.  Has it got
> something to do with the subsections used to generate the .ex_table
> redirection information?

No, simply with the confusion such use of numeric labels causes.
They should - afaic - be limited to strictly local code regions (not
crossing non-numeric labels), and be strictly distinct (not using
the same number twice within the same local code region).

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-26  7:08                                                 ` Jan Beulich
@ 2014-08-26 22:27                                                   ` Aravindh Puthiyaparambil (aravindp)
  2014-08-26 23:30                                                     ` Andrew Cooper
  2014-08-27  6:33                                                     ` Jan Beulich
  0 siblings, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-26 22:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>> Just to be certain as to where we stand:
>>
>> 1. The "page table RW bit flipping" solution is not viable because pausing
>> the domain synchronously takes too long for many vcpus domains. Plus
>there is
>> the added issue of vcpu vs domain heuristics. This is the case even after
>> solving the page boundary and multiple page copy issues.
>>
>> 2. The "CR0.WP with interrupts disabled" solution is not viable because of
>> NMIs. Or did I misunderstand?
>
>For this second option, NMIs are a concern. Whether that makes it
>not viable I'm not certain. We really need to weigh benefits and risks
>here, and from a project wide perspective I'm currently viewing the

From what I can tell, Andrew does think that this route is a viable option and I will defer to you and him about this. If there is agreement that this approach is acceptable, I will send out another version of the patches implementing it.

>PV mem-access feature as a niche thing, the more that I'm unaware
>of really wide spread use if HVM mem-access capabilities. I.e. the
>most I can currently see happening is for it to go in clearly marked
>experimental, provided that no code path used outside of that
>feature suffers in any way (functionality and performance). But of
>course I'm open to be convinced otherwise, or overruled by a
>majority of other maintainers.

I agree that the PV/HVM mem_access feature is indeed niche; however, it is a value-add feature for Xen when compared to other hypervisors. It is attracting users who are interested in developing security and guest inspection/introspection products. And yes, I agree that the code added for mem_access should not adversely affect other areas of the project. I would hope this feature area is given encouragement to grow by the community. Just my two bits...

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-26 22:27                                                   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-26 23:30                                                     ` Andrew Cooper
  2014-08-28  9:34                                                       ` Tim Deegan
  2014-08-27  6:33                                                     ` Jan Beulich
  1 sibling, 1 reply; 85+ messages in thread
From: Andrew Cooper @ 2014-08-26 23:30 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Jan Beulich
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

On 26/08/2014 23:27, Aravindh Puthiyaparambil (aravindp) wrote:
>>> Just to be certain as to where we stand:
>>>
>>> 1. The "page table RW bit flipping" solution is not viable because pausing
>>> the domain synchronously takes too long for many vcpus domains. Plus
>> there is
>>> the added issue of vcpu vs domain heuristics. This is the case even after
>>> solving the page boundary and multiple page copy issues.
>>>
>>> 2. The "CR0.WP with interrupts disabled" solution is not viable because of
>>> NMIs. Or did I misunderstand?
>> For this second option, NMIs are a concern. Whether that makes it
>> not viable I'm not certain. We really need to weigh benefits and risks
>> here, and from a project wide perspective I'm currently viewing the
> From what I can tell, Andrew does think that this route is a viable option and I will defer to you and him about this. If there is agreement that this approach is acceptable, I will send out another version of the patches implementing it.

I currently see the CR0.WP=0 with interrupts disabled as the most viable
solution going.  That is not to say that there isn't a better solution,
but I can't currently spot a plausible alternative.

I do not see the NMI path as problematic with respect to WP=0 or
interrupts disabled.  The logic is very deliberately self contained and
minimal, as it specifically can interrupt other critical regions with
interrupts otherwise disabled in Xen.
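
For concreteness, a minimal sketch of the approach under discussion (not the actual patch; read_cr0(), write_cr0() and local_irq_save()/local_irq_restore() are existing Xen primitives, while the helper name is made up):

/*
 * Sketch only: perform a write that must bypass a read-only shadow
 * mapping with CR0.WP cleared and interrupts disabled, keeping the
 * window in which supervisor write protection is off as small as
 * possible.  As discussed above, an NMI could still land in this window.
 */
static void write_with_wp_clear(void *dst, const void *src, unsigned int len)
{
    unsigned long flags, cr0;

    local_irq_save(flags);
    cr0 = read_cr0();
    write_cr0(cr0 & ~X86_CR0_WP);   /* allow supervisor writes to R/O mappings */
    memcpy(dst, src, len);
    write_cr0(cr0);                 /* restore WP before re-enabling interrupts */
    local_irq_restore(flags);
}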

>
>> PV mem-access feature as a niche thing, the more that I'm unaware
>> of really wide spread use if HVM mem-access capabilities. I.e. the
>> most I can currently see happening is for it to go in clearly marked
>> experimental, provided that no code path used outside of that
>> feature suffers in any way (functionality and performance). But of
>> course I'm open to be convinced otherwise, or overruled by a
>> majority of other maintainers.
> I agree that PV/HVM mem_access feature is indeed niche, however it is a value add feature for Xen when compared to other hypervisors. It is attracting users who are interested in developing security and guest inspection/introspection products. And yes, I agree that the code added for mem_access should not adversely affect other areas of the project. I would hope this feature area is given encouragement to grow by the community. Just my two bits...

I would also agree that PV mem_access is niche, but that is not to say
that it isn't valuable as part of the Xen ecosystem.

If it can be implemented with zero (or more realistically, minimal)
overhead to the rest of Xen (especially in the case where it is not
used), and explicitly documented as "by using PV mem_access, you as the
admin/user are accepting that there is an overhead", then I think that
it is ok.

After all, a working feature (if it doesn't adversely interfere with Xen
when disabled) is better than half a feature which doesn't function,
even if it is somewhat slow.  I have to admit that it is rare to be in a
situation where there is only one apparently feasible solution, and that
solution isn't necessarily fantastic overall.

Having said all of this, I am merely a community member who is voicing
an opinion.  It is ultimately Jan and Tim you have to convince to get
any patches accepted.

~Andrew

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-26 22:27                                                   ` Aravindh Puthiyaparambil (aravindp)
  2014-08-26 23:30                                                     ` Andrew Cooper
@ 2014-08-27  6:33                                                     ` Jan Beulich
  2014-08-27  7:49                                                       ` Tim Deegan
  2014-08-27 17:29                                                       ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 2 replies; 85+ messages in thread
From: Jan Beulich @ 2014-08-27  6:33 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>>> On 27.08.14 at 00:27, <aravindp@cisco.com> wrote:
>>> 2. The "CR0.WP with interrupts disabled" solution is not viable because of
>>> NMIs. Or did I misunderstand?
>>
>>For this second option, NMIs are a concern. Whether that makes it
>>not viable I'm not certain. We really need to weigh benefits and risks
>>here, and from a project wide perspective I'm currently viewing the
> 
> From what I can tell, Andrew does think that this route is a viable option 
> and I will defer to you and him about this. If there is agreement that this 
> approach is acceptable, I will send out another version of the patches 
> implementing it.

Yes, I think you should go ahead with this, understanding that there's
no promise (just reasonable likelihood) to take it in the end. It's kind of
unfortunate that Tim currently has very little time to spend on voicing
his opinion here.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-27  6:33                                                     ` Jan Beulich
@ 2014-08-27  7:49                                                       ` Tim Deegan
  2014-08-27 17:29                                                       ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Tim Deegan @ 2014-08-27  7:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Ian Jackson,
	Aravindh Puthiyaparambil (aravindp),
	xen-devel

At 07:33 +0100 on 27 Aug (1409121231), Jan Beulich wrote:
> >>> On 27.08.14 at 00:27, <aravindp@cisco.com> wrote:
> >>> 2. The "CR0.WP with interrupts disabled" solution is not viable because of
> >>> NMIs. Or did I misunderstand?
> >>
> >>For this second option, NMIs are a concern. Whether that makes it
> >>not viable I'm not certain. We really need to weigh benefits and risks
> >>here, and from a project wide perspective I'm currently viewing the
> > 
> > From what I can tell, Andrew does think that this route is a viable option 
> > and I will defer to you and him about this. If there is agreement that this 
> > approach is acceptable, I will send out another version of the patches 
> > implementing it.
> 
> Yes, I think you should go ahead with this, understanding that there's
> no promise (just reasonable likelihood) to take it in the end. It's kind of
> unfortunate that Tim currently has very little time to spend on voicing
> his opinion here.

My apologies for that; I hope to have some time tomorrow to review this
thread properly.

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-27  6:33                                                     ` Jan Beulich
  2014-08-27  7:49                                                       ` Tim Deegan
@ 2014-08-27 17:29                                                       ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-27 17:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell, Andrew Cooper, Tim Deegan, xen-devel,
	Ian Jackson

>>>For this second option, NMIs are a concern. Whether that makes it
>>>not viable I'm not certain. We really need to weigh benefits and risks
>>>here, and from a project wide perspective I'm currently viewing the
>>
>> From what I can tell, Andrew does think that this route is a viable option
>> and I will defer to you and him about this. If there is agreement that this
>> approach is acceptable, I will send out another version of the patches
>> implementing it.
>
>Yes, I think you should go ahead with this, understanding that there's
>no promise (just reasonable likelihood) to take it in the end.

Understood. I will send out another RFC series.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-24 23:34     ` Aravindh Puthiyaparambil (aravindp)
  2014-07-25  7:19       ` Jan Beulich
@ 2014-08-28  9:09       ` Tim Deegan
  2014-08-28 18:23         ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 1 reply; 85+ messages in thread
From: Tim Deegan @ 2014-08-28  9:09 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

Hi,

Apologies for the very late review - just catching up with this thread.

At 23:34 +0000 on 24 Jul (1406241272), Aravindh Puthiyaparambil (aravindp) wrote:
> >> -         * share_xen_page_with_guest(). */
> >> +         * share_xen_page_with_guest().
> >> +         * PV domains that have a mem_access listener, runs in shadow mode
> >> +         * without refcounts.
> >> +         */
> >>          if ( !(shadow_mode_external(v->domain)
> >>                 && (page->count_info & PGC_count_mask) <= 3
> >>                 && ((page->u.inuse.type_info & PGT_count_mask)
> >> -                   == !!is_xen_heap_page(page))) )
> >> +                   == !!is_xen_heap_page(page))) &&
> >> +             !mem_event_check_ring(&v->domain->mem_event->access) )
> >
> >To me this doesn't look to be in sync with the comment, as the new
> >check is being carried out regardless of domain type. Furthermore
> >this continues to have the problem of also hiding issues unrelated
> >to mem-access handling.
> 
> I will add a check for PV domain. This check is w.r.t.
> refcounting. From what Tim told me PV domains cannot run in that
> mode, so I don't think any issues will be hidden if I add the check
> for PV domain.

I think the right check here is for !shadow_mode_refcounts() -- after
all, this check makes no sense if the refcounts aren't
affected by changes to shadow pagetables. 

> >> +            /*
> >> +             * Do not police writes to guest memory from the Xen hypervisor.
> >> +             * This keeps PV mem_access on par with HVM. Turn off CR0.WP
> >here to
> >> +             * allow the write to go through if the guest has marked the page as
> >> +             * writable. Turn it back on in the guest access functions
> >> +             * __copy_to_user / __put_user_size() after the write is completed.
> >> +             */
> >> +            if ( violation && access_w &&
> >> +                 regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END )
> >
> >Definitely < instead of <= on the right side. But - is this safe, the more
> >that this again doesn't appear to be sitting in a guest kind specific block?
> >I'd at least expect this to be qualified by a regs->cs and/or
> >guest_mode() check.
> 
> I will add the guest kind check. Should I do it for the whole block i.e add is_pv_domain() in addition to mem_event_check_ring()? That will also address next comment below. I will add the guest_mode() in addition to the above check for policing Xen writes to guest memory.
> 

Sounds good -- I would be inclined to replace the %rip test entirely
with a !guest_mode one, which should be a more reliable check for
whether the access is made by Xen.
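
Something along these lines (a sketch only; the helper is hypothetical, while guest_mode() and struct cpu_user_regs are the existing constructs):

/*
 * Sketch only: decide whether a mem_access write violation should be
 * reported to the listener.  Faults raised while Xen itself was executing
 * (i.e. not in guest mode) are let through, matching HVM behaviour,
 * instead of range-checking regs->eip against Xen's virtual addresses.
 */
static bool_t pv_mem_access_police_fault(const struct cpu_user_regs *regs,
                                         bool_t violation, bool_t access_w)
{
    if ( !violation )
        return 0;
    if ( access_w && !guest_mode(regs) )
        return 0;    /* write performed by Xen on the guest's behalf */
    return 1;
}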

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-07-25 21:39         ` Aravindh Puthiyaparambil (aravindp)
  2014-07-28  6:49           ` Jan Beulich
@ 2014-08-28  9:14           ` Tim Deegan
  2014-08-28 18:31             ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 1 reply; 85+ messages in thread
From: Tim Deegan @ 2014-08-28  9:14 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

At 21:39 +0000 on 25 Jul (1406320788), Aravindh Puthiyaparambil (aravindp) wrote:
> >>>I wonder how well the window where you're running with CR0.WP clear
> >>>is bounded: The flag serves as a kind of security measure, and hence
> >>>shouldn't be left off for extended periods of time.
> >>
> >> I agree. Is there a way I can bound this?
> >
> >That's what you need to figure out. The simplistic solution (single
> >stepping just the critical instruction(s)) is probably not going to be
> >acceptable due to its fragility. I have no good other suggestions,
> >but I'm not eager to allow code in that weakens protection.
> 
> From the debugging I have done to get this working, this is what the flow should be. Xen tries to write to guest page marked read only and page fault occurs. So __copy_to_user_ll() -> handle_exception_saved->do_page_fault() and CR0.WP is cleared. Once the fault is handled __copy_to_user_ll() is retried and it goes through. At the end of which CR0.WP is turned on. So this is the only window that pv_vcpu.need_cr0_wp_set should be true. Is there a spot outside of this window that I check to see if it is set and if it is, turn it back on again? Would that be a sufficient bound?
> 
> >>>>          pg[i].count_info = PGC_allocated | 1;
> >>>> +        if ( is_pv_domain(d) )
> >>>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)-
> >>default_access);
> >>>
> >>>I don't think you should call shadow code from here.
> >>
> >> Should I add a p2m wrapper for this, so that it is valid only for x86 and a
> >> no-op for ARM?
> >
> >That's not necessarily enough, but at least presumably the right
> >route: You also need to avoid fiddling with struct page_info fields
> >that may be used (now or in the future) for other purposes, i.e.
> >you need to gate the setting of the flags by more than just
> >is_pv_domain().
> 
> Coupled with your response to the other thread, I am thinking I should move away from using the shadow_flags for access permissions. Tim's other suggestion was to try and re-use the p2m-pt implementation.
> 

I think you should be OK to use shadow_flags here -- after all the
shadow code relies on being able to use them for any guest data page
once shadow mode is enabled. 

To avoid touching them before shadow mode is actually enabled you
could reshuffle the encodings so that 0 is 'default' (shadow code
absolutely relies on this field being 0 when shadow mode is enabled so
any other user will have to maintain that).

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-26 23:30                                                     ` Andrew Cooper
@ 2014-08-28  9:34                                                       ` Tim Deegan
  2014-08-28 18:33                                                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Tim Deegan @ 2014-08-28  9:34 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Jan Beulich,
	Aravindh Puthiyaparambil (aravindp),
	xen-devel

At 00:30 +0100 on 27 Aug (1409095841), Andrew Cooper wrote:
> I currently see the CR0.WP=0 with interrupts disabled as the most viable
> solution going.  That is not to say that there isn't a better solution,
> but I can't currently spot a plausible alternative.

Yes, temporarily adjusting the pagetables sounds like madness --
in order to avoid other guest VCPUs using the temporary mappings to
subvert the write protection you'd have to pause them and that's
going to risk deadlocks.

Temporarily disabling CR0.WP is nasty too but even with interrupts
enabled I think it's better.  We don't do any guest-space writes from
interrupt handlers (since we can't guarantee an interrupt would arrive
in the right context). 

It would want a backstop assertion (in, say, the context switch or
return-to-guest) to catch any new paths that leave CR0.WP disabled.
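
Something as simple as this would do (a sketch; the name and exact placement are illustrative):

/* Sketch only: catch any path heading back to guest with CR0.WP still clear. */
static inline void assert_wp_not_left_clear(void)
{
    ASSERT(read_cr0() & X86_CR0_WP);
}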

For completeness, the other solutions on offer are: 
- have the three (?) paths in question detect write faults and 
  use HVM-like map-write-unmap code in that case.
- have the three paths in question _always_ map and unmap their
  targets when current vcpu is PV+shadow+memaccess.  
- trap and emulate Xen writes and detect/fix this case in that path.

I think I'm going to NAK the temporarily-frob-the-PTE and
trap-and-emulate ideas, but I'd be happy with using explicit mappings
or with disabling CR0.WP.

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28  9:09       ` Tim Deegan
@ 2014-08-28 18:23         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 18:23 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>At 23:34 +0000 on 24 Jul (1406241272), Aravindh Puthiyaparambil (aravindp)
>wrote:
>> >> -         * share_xen_page_with_guest(). */
>> >> +         * share_xen_page_with_guest().
>> >> +         * PV domains that have a mem_access listener, runs in shadow
>mode
>> >> +         * without refcounts.
>> >> +         */
>> >>          if ( !(shadow_mode_external(v->domain)
>> >>                 && (page->count_info & PGC_count_mask) <= 3
>> >>                 && ((page->u.inuse.type_info & PGT_count_mask)
>> >> -                   == !!is_xen_heap_page(page))) )
>> >> +                   == !!is_xen_heap_page(page))) &&
>> >> +             !mem_event_check_ring(&v->domain->mem_event->access) )
>> >
>> >To me this doesn't look to be in sync with the comment, as the new
>> >check is being carried out regardless of domain type. Furthermore
>> >this continues to have the problem of also hiding issues unrelated
>> >to mem-access handling.
>>
>> I will add a check for PV domain. This check is wrt to
>> refcouting. From what Tim told me PV domains cannot run in that
>> mode, so I don't think any issues will be hidden if I add the check
>> for PV domain.
>
>I think the right check here is for !shadow_mode_refcounts() -- after
>all, this check makes no sense if the refcounts aren't
>affected by changes to shadow pagetables.

OK, I will replace the check introduced with the !shadow_mode_refcounts() one.

>> >> +            /*
>> >> +             * Do not police writes to guest memory from the Xen hypervisor.
>> >> +             * This keeps PV mem_access on par with HVM. Turn off CR0.WP
>> >here to
>> >> +             * allow the write to go through if the guest has marked the page
>as
>> >> +             * writable. Turn it back on in the guest access functions
>> >> +             * __copy_to_user / __put_user_size() after the write is
>completed.
>> >> +             */
>> >> +            if ( violation && access_w &&
>> >> +                 regs->eip >= XEN_VIRT_START && regs->eip <=
>XEN_VIRT_END )
>> >
>> >Definitely < instead of <= on the right side. But - is this safe, the more
>> >that this again doesn't appear to be sitting in a guest kind specific block?
>> >I'd at least expect this to be qualified by a regs->cs and/or
>> >guest_mode() check.
>>
>> I will add the guest kind check. Should I do it for the whole block i.e add
>is_pv_domain() in addition to mem_event_check_ring()? That will also
>address next comment below. I will add the guest_mode() in addition to the
>above check for policing Xen writes to guest memory.
>>
>
>Sounds good -- I would be inclined to replace the %rip test entirely
>with a !guest_mode one, which should be a more reliable check for
>whether the access is made by Xen.

OK, I will replace the %rip check with !guest_mode().

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28  9:14           ` Tim Deegan
@ 2014-08-28 18:31             ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28 19:00               ` Tim Deegan
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 18:31 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>At 21:39 +0000 on 25 Jul (1406320788), Aravindh Puthiyaparambil (aravindp)
>wrote:
>> >>>I wonder how well the window where you're running with CR0.WP clear
>> >>>is bounded: The flag serves as a kind of security measure, and hence
>> >>>shouldn't be left off for extended periods of time.
>> >>
>> >> I agree. Is there a way I can bound this?
>> >
>> >That's what you need to figure out. The simplistic solution (single
>> >stepping just the critical instruction(s)) is probably not going to be
>> >acceptable due to its fragility. I have no good other suggestions,
>> >but I'm not eager to allow code in that weakens protection.
>>
>> From the debugging I have done to get this working, this is what the flow
>should be. Xen tries to write to guest page marked read only and page fault
>occurs. So __copy_to_user_ll() -> handle_exception_saved->do_page_fault()
>and CR0.WP is cleared. Once the fault is handled __copy_to_user_ll() is
>retried and it goes through. At the end of which CR0.WP is turned on. So this is
>the only window that pv_vcpu.need_cr0_wp_set should be true. Is there a
>spot outside of this window that I check to see if it is set and if it is, turn it back
>on again? Would that be a sufficient bound?
>>
>> >>>>          pg[i].count_info = PGC_allocated | 1;
>> >>>> +        if ( is_pv_domain(d) )
>> >>>> +            shadow_set_access(&pg[i], p2m_get_hostp2m(d)-
>> >>default_access);
>> >>>
>> >>>I don't think you should call shadow code from here.
>> >>
>> >> Should I add a p2m wrapper for this, so that it is valid only for x86 and a
>> >> no-op for ARM?
>> >
>> >That's not necessarily enough, but at least presumably the right
>> >route: You also need to avoid fiddling with struct page_info fields
>> >that may be used (now or in the future) for other purposes, i.e.
>> >you need to gate the setting of the flags by more than just
>> >is_pv_domain().
>>
>> Coupled with your response to the other thread, I am thinking I should
>move away from using the shadow_flags for access permissions. Tim's other
>suggestion was to try and re-use the p2m-pt implementation.
>>
>
>I think you should be OK to use shadow_flags here -- after all the
>shadow code relies on being able to use them for any guest data page
>once shadow mode is enabled.

Yes, I agree. In fact, using the p2m-pt implementation is not going to help get around the hypercall continuation issue.

>To avoid touching them before shadow mode is actually enabled you
>could reshuffle the encodings so that 0 is 'default' (shadow code
>absolutely relies on this field being 0 when shadow more is enabled so
>any other user will have to maintain that).

Do you mean the p2m_access_t enum when you say encoding? 0 now means p2m_access_n. Are you saying that 0 should now mean p2m_access_rwx?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28  9:34                                                       ` Tim Deegan
@ 2014-08-28 18:33                                                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 18:33 UTC (permalink / raw)
  To: Tim Deegan, Andrew Cooper
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>> I currently see the CR0.WP=0 with interrupts disabled as the most
>> viable solution going.  That is not to say that there isn't a better
>> solution, but I can't currently spot a plausible alternative.
>
>Yes, temporarily adjusting the pagetables sounds like madness -- in order to
>avoid other guest VCPUs using the temporary mappings to subvert the write
>protection you'd have to pause them and that's going to risk deadlocks.

I agree it is madness. It was my desperation coming through :-)

>Temporarily disabling CR0.WP is nasty too but even with interrupts enabled I
>think it's better.  We don't do any guest-space writes from interrupt handlers
>(since we can't guarantee an interrupt would arrive in the right context).
>
>It would want a backstop assertion (in, say the cntext swicth or
>return-to-guest) to catch any new paths that leave CR0.WP disabled.
>
>For completeness, the other solutions on offer are:
>- have the three (?) paths in question detect write faults and
>  use HVM-like map-write-unmap code in that case.
>- have the three paths in question _always_ map and unmap their
>  targets when current vcpu is PV+shadow+memaccess.
>- trap and emulate Xen writes and detect/fix this case in that path.
>
>I think I'm going to NAK the temporarily-frob-the-PTE and trap-and-emulate
>ideas, but I'd be happy with using explicit mappings or with disabling CR0.WP.

I am going to try the CR0.WP route first. If that does not work well I will try the explicit mapping route.
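
For reference, a minimal sketch of the explicit-mapping route (assuming the map_domain_page() variant that takes an MFN as an unsigned long; the helper name is invented):

/*
 * Sketch only: HVM-like map-write-unmap.  Map the target MFN into Xen's
 * address space, write through that mapping and unmap again, so neither
 * CR0.WP nor the guest's read-only shadow mapping is involved.
 */
static int write_guest_page(unsigned long mfn, unsigned int offset,
                            const void *src, unsigned int len)
{
    char *p;

    if ( offset + len > PAGE_SIZE )
        return -EINVAL;

    p = map_domain_page(mfn);
    memcpy(p + offset, src, len);
    unmap_domain_page(p);

    return 0;
}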

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 18:31             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-28 19:00               ` Tim Deegan
  2014-08-28 19:23                 ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Tim Deegan @ 2014-08-28 19:00 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

At 18:31 +0000 on 28 Aug (1409247071), Aravindh Puthiyaparambil (aravindp) wrote:
> >> >That's not necessarily enough, but at least presumably the right
> >> >route: You also need to avoid fiddling with struct page_info fields
> >> >that may be used (now or in the future) for other purposes, i.e.
> >> >you need to gate the setting of the flags by more than just
> >> >is_pv_domain().
> >>
> >> Coupled with your response to the other thread, I am thinking I should
> >move away from using the shadow_flags for access permissions. Tim's other
> >suggestion was to try and re-use the p2m-pt implementation.
> >>
> >
> >I think you should be OK to use shadow_flags here -- after all the
> >shadow code relies on being able to use them for any guest data page
> >once shadow mode is enabled.
> 
> Yes, I agree. In fact using the p2m-pt implementation is not going to help getting around the hypercall continuation issue.
> 
> >To avoid touching them before shadow mode is actually enabled you
> >could reshuffle the encodings so that 0 is 'default' (shadow code
> >absolutely relies on this field being 0 when shadow more is enabled so
> >any other user will have to maintain that).
> 
> Do you mean the p2m_access_t enum when you say encoding?

Not necessarily - but you _could_ reorder the enum (and add a comment
so make sure that other people don't reorder it again) if that does
what you want.  Alternatively, you could use one more bit of
the shadow flags as a 'valid' bit for the access bits, where readers
would replace invalid mappings with whatever the correct default value
is.
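
A sketch of the 'valid bit' variant, purely for illustration (the bit positions, mask and helper names are invented; only the idea is from this discussion):

/*
 * Sketch only: keep the p2m_access_t value in a few bits of
 * page->shadow_flags plus one bit meaning "set since mem_access was
 * enabled"; readers fall back to the default otherwise.
 */
#define PAGE_ACCESS_SHIFT  26
#define PAGE_ACCESS_MASK   (0xfu << PAGE_ACCESS_SHIFT)   /* hypothetical field */
#define PAGE_ACCESS_VALID  (1u << 30)                    /* hypothetical valid bit */

static p2m_access_t page_get_access(const struct page_info *pg,
                                    p2m_access_t default_access)
{
    if ( !(pg->shadow_flags & PAGE_ACCESS_VALID) )
        return default_access;
    return (p2m_access_t)((pg->shadow_flags & PAGE_ACCESS_MASK) >>
                          PAGE_ACCESS_SHIFT);
}

static void page_set_access(struct page_info *pg, p2m_access_t a)
{
    pg->shadow_flags &= ~(PAGE_ACCESS_MASK | PAGE_ACCESS_VALID);
    pg->shadow_flags |= ((uint32_t)a << PAGE_ACCESS_SHIFT) | PAGE_ACCESS_VALID;
}

A reader finding the valid bit clear would treat the page as still having the default access, which is what gives the enable/disable/re-enable semantics asked about below.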

One other question occurs to me: what about the case of enabling,
disabling and re-enabling the mem-access feature?  Is it OK that
access permissions will be retained from the first use into the second
or do they need to be reset somehow?

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 19:00               ` Tim Deegan
@ 2014-08-28 19:23                 ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28 20:37                   ` Tim Deegan
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 19:23 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>> >To avoid touching them before shadow mode is actually enabled you
>> >could reshuffle the encodings so that 0 is 'default' (shadow code
>> >absolutely relies on this field being 0 when shadow more is enabled
>> >so any other user will have to maintain that).
>>
>> Do you mean the p2m_access_t enum when you say encoding?
>
>Not necessarily - but you _could_ reorder the enum (and add a comment so
>make sure that other people don't reorder it again) if that does what you
>want.  Alternatively, you could use one more bit of the shadow flags as a
>'valid' bit for the access bits, where readers would replace invalid mappings
>with whatever the correct default value is.
>
>One other question occurs to me: what about the case of enabling, disabling
>and re-enabling the mem-access feature?  Is it OK that access permissions will
>be retained from the first use into the second or do they need to be reset
>somehow?

With HVM guests, the mem-access listener does the following every time it enables the feature:

1. Set the default access value: 
	xc_set_mem_access(xch, domain_id, default_access, ~0ull, 0). 
	All this does is set p2m->default_access. None of the individual page permissions are changed.
2. Convert  individual pages to the default access value:
	xc_set_mem_access(xch, domain_id, default_access, 0, domain_max_pages);
 
In the PV case, step 2 is problematic as the range of pages that belong to the PV guest is unknown to the mem-access listener. I tried adding another PV-specific API for setting the default access that would walk the page_list and set the shadow_flag to the default. But Jan rightly pointed out issues surrounding hypercall preemption / continuation, during which the page_list could be modified. So my current plan is to blow all shadow pages every time the API for setting the default access is called. Then, on the subsequent page faults where the PTE is marked not present, set the shadow_flag to the default access as part of creating the PTE. The mem-access listener for PV guests hence need not call step 2. Given that I only check for mem-access violations for present pages, this should work. Then, on disabling mem-access, I will set p2m->default_access back to RWX and turn off shadow paging. So anytime mem-access is enabled again, the default value set by the listener will be honored. Does this sound viable?
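
As a rough sketch of that flow (the helper name is invented, and whether paging_mode_shadow(), p2m_get_hostp2m() and shadow_blow_tables_per_domain() are exactly the right calls here is an assumption on my part):

/*
 * Sketch only: record the new default and drop every shadow, so the
 * per-page access value is (re)written lazily as shadows are re-created
 * on subsequent page faults.
 */
static int pv_mem_access_set_default(struct domain *d, p2m_access_t a)
{
    if ( !paging_mode_shadow(d) )
        return -EOPNOTSUPP;

    p2m_get_hostp2m(d)->default_access = a;

    /* Drop all shadows; future faults rebuild entries using 'a'. */
    shadow_blow_tables_per_domain(d);

    return 0;
}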

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 19:23                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-28 20:37                   ` Tim Deegan
  2014-08-28 21:35                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28 22:20                     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 85+ messages in thread
From: Tim Deegan @ 2014-08-28 20:37 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

At 19:23 +0000 on 28 Aug (1409250226), Aravindh Puthiyaparambil (aravindp) wrote:
> In the PV case step 2 is problematic as the range of pages that
> belong to the PV guest is unknown to the mem-access listener. I
> tried adding another PV specific API for setting default access that
> will walk the page_list and set the shadow_flag to default. But Jan
> rightly pointed out issues surrounding hypercall preemption /
> continuation during which, the page_list could be modified. So my
> current plan is to blow all shadow pages every time the API for
> setting default access is called. The on the subsequent page-faults
> where the PTE is marked not present, set the shadow_flag to the
> default access as part of creating the PTE.

How do you know when this step finishes?  I.e. how can you tell the
difference between an access field in the shadow_flags that was set
before you enabled mem_access and one that was set since?

At some point you need some more state - either something resetting
the shadow_flags (though indeed I don't know how to do that reliably
and safely) or an epoch counter (which there isn't room for) or some
other trick that I can't think of right now.

I wonder if it's possible just to mandate in the API that you might
get spurious mem_access faults in this case.  A bit ugly though. :(

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 20:37                   ` Tim Deegan
@ 2014-08-28 21:35                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-28 22:20                     ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 21:35 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>> In the PV case step 2 is problematic as the range of pages that belong
>> to the PV guest is unknown to the mem-access listener. I tried adding
>> another PV specific API for setting default access that will walk the
>> page_list and set the shadow_flag to default. But Jan rightly pointed
>> out issues surrounding hypercall preemption / continuation during
>> which, the page_list could be modified. So my current plan is to blow
>> all shadow pages every time the API for setting default access is
>> called. The on the subsequent page-faults where the PTE is marked not
>> present, set the shadow_flag to the default access as part of creating
>> the PTE.
>
>How do you know when this step finishes?  I.e. how can you tell the
>difference between an access field in the shadow_flags that was set before
>you enabled mem_access and one that was set since?

What I was thinking of doing is, in p2m_mem_access_get_entry(), check if the PTE for the page is marked present and return the access value in the shadow_flags if it is. If it is not present, then return the default access value.

>At some point you need some more state - either something resetting the
>shadow_flags (though indeed I don't know how to do that reliably and safely)
>or an epoch counter (which there isn't room for) or some other trick that I
>can't think of right now.

At the moment, I have some code in assign_pages() to reset the shadow_flags to the default value. So that will take care of resetting pages assigned to new domains.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 20:37                   ` Tim Deegan
  2014-08-28 21:35                     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-28 22:20                     ` Aravindh Puthiyaparambil (aravindp)
  2014-08-29  9:52                       ` Tim Deegan
  1 sibling, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-28 22:20 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>>> In the PV case step 2 is problematic as the range of pages that belong
>>> to the PV guest is unknown to the mem-access listener. I tried adding
>>> another PV specific API for setting default access that will walk the
>>> page_list and set the shadow_flag to default. But Jan rightly pointed
>>> out issues surrounding hypercall preemption / continuation during
>>> which, the page_list could be modified. So my current plan is to blow
>>> all shadow pages every time the API for setting default access is
>>> called. The on the subsequent page-faults where the PTE is marked not
>>> present, set the shadow_flag to the default access as part of creating
>>> the PTE.
>>
>>How do you know when this step finishes?  I.e. how can you tell the
>>difference between an access field in the shadow_flags that was set before
>>you enabled mem_access and one that was set since?
>
>What I was thinking of doing is, in p2m_mem_access_get_entry(), check if PTE
>for the page is marked present and return the access value in the
>shadow_flags if it is. If it is not present, then return the default access value.

I realized this does not cover the case when p2m_mem_access_set_entry() gets called for a non-present page. A subsequent p2m_mem_access_get_entry() will return the default access value. Even worse, a subsequent page fault will overwrite the access value in the shadow flags with the default. Should I disallow p2m_mem_access_set_entry() for non-present pages to work around this?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-28 22:20                     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-29  9:52                       ` Tim Deegan
  2014-08-29 17:52                         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-29 19:03                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 85+ messages in thread
From: Tim Deegan @ 2014-08-29  9:52 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

At 22:20 +0000 on 28 Aug (1409260838), Aravindh Puthiyaparambil (aravindp) wrote:
> >>> In the PV case step 2 is problematic as the range of pages that belong
> >>> to the PV guest is unknown to the mem-access listener. I tried adding
> >>> another PV specific API for setting default access that will walk the
> >>> page_list and set the shadow_flag to default. But Jan rightly pointed
> >>> out issues surrounding hypercall preemption / continuation during
> >>> which, the page_list could be modified. So my current plan is to blow
> >>> all shadow pages every time the API for setting default access is
> >>> called. The on the subsequent page-faults where the PTE is marked not
> >>> present, set the shadow_flag to the default access as part of creating
> >>> the PTE.
> >>
> >>How do you know when this step finishes?  I.e. how can you tell the
> >>difference between an access field in the shadow_flags that was set before
> >>you enabled mem_access and one that was set since?
> >
> >What I was thinking of doing is, in p2m_mem_access_get_entry(), check if PTE
> >for the page is marked present and return the access value in the
> >shadow_flags if it is. If it is not present, then return the default access value.
> 
> I realized this does not cover for the case when p2m_mem_access_set_entry() gets called for a non-present page. Subsequent p2m_mem_access_get_entry() will return the default access value. Even worse subsequent page fault will overwrite the access value in shadow flags with the default. Should I disallow p2m_mem_access_set_entry() for non-present pages to work around this?
> 

Nope, I think you're in even deeper trouble than that -- shadow PTEs
can be dropped at any time (e.g. because of memory pressure) so you
can't rely on their being present or absent to mean anything.

(Also, I suspect this plan might get tangled by pages that have
multiple PTEs mapping them, but I haven't thought that all the way
through yet.)

Tim.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-29  9:52                       ` Tim Deegan
@ 2014-08-29 17:52                         ` Aravindh Puthiyaparambil (aravindp)
  2014-08-29 19:03                         ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-29 17:52 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>> >>> In the PV case step 2 is problematic as the range of pages that belong
>> >>> to the PV guest is unknown to the mem-access listener. I tried adding
>> >>> another PV specific API for setting default access that will walk the
>> >>> page_list and set the shadow_flag to default. But Jan rightly pointed
>> >>> out issues surrounding hypercall preemption / continuation during
>> >>> which, the page_list could be modified. So my current plan is to blow
>> >>> all shadow pages every time the API for setting default access is
>> >>> called. The on the subsequent page-faults where the PTE is marked not
>> >>> present, set the shadow_flag to the default access as part of creating
>> >>> the PTE.
>> >>
>> >>How do you know when this step finishes?  I.e. how can you tell the
>> >>difference between an access field in the shadow_flags that was set
>before
>> >>you enabled mem_access and one that was set since?
>> >
>> >What I was thinking of doing is, in p2m_mem_access_get_entry(), check if
>PTE
>> >for the page is marked present and return the access value in the
>> >shadow_flags if it is. If it is not present, then return the default access
>value.
>>
>> I realized this does not cover for the case when
>p2m_mem_access_set_entry() gets called for a non-present page.
>Subsequent p2m_mem_access_get_entry() will return the default access
>value. Even worse subsequent page fault will overwrite the access value in
>shadow flags with the default. Should I disallow
>p2m_mem_access_set_entry() for non-present pages to work around this?
>>
>
>Nope, I think you're in even deeper trouble than that -- shadow PTEs
>can be dropped at any time (e.g. because of memory pressure) so you
>can't rely on their being present or absent to mean anything.

You are right. Shadow PTEs being present cannot be used as an indicator. So the problem of setting default access and converting all pages to it will exist irrespective of what data structure I use. The only way I can think of doing this is to walk the page_list but if the hypercall gets preempted, there is the possibility that the list could get modified which would break the continuation. Is the page_list the only place that will tell me the range of pages that belong to a PV guest? Is the max PFN stored somewhere? 

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-29  9:52                       ` Tim Deegan
  2014-08-29 17:52                         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-08-29 19:03                         ` Aravindh Puthiyaparambil (aravindp)
  2014-09-01 10:38                           ` Jan Beulich
  1 sibling, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-08-29 19:03 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Jan Beulich, Ian Campbell

>>> >>> In the PV case step 2 is problematic as the range of pages that belong
>>> >>> to the PV guest is unknown to the mem-access listener. I tried adding
>>> >>> another PV specific API for setting default access that will walk the
>>> >>> page_list and set the shadow_flag to default. But Jan rightly pointed
>>> >>> out issues surrounding hypercall preemption / continuation during
>>> >>> which, the page_list could be modified. So my current plan is to blow
>>> >>> all shadow pages every time the API for setting default access is
>>> >>> called. Then on the subsequent page-faults where the PTE is marked
>not
>>> >>> present, set the shadow_flag to the default access as part of creating
>>> >>> the PTE.
>>> >>
>>> >>How do you know when this step finishes?  I.e. how can you tell the
>>> >>difference between an access field in the shadow_flags that was set
>>before
>>> >>you enabled mem_access and one that was set since?
>>> >
>>> >What I was thinking of doing is, in p2m_mem_access_get_entry(), check
>if
>>PTE
>>> >for the page is marked present and return the access value in the
>>> >shadow_flags if it is. If it is not present, then return the default access
>>value.
>>>
>>> I realized this does not cover the case when
>>p2m_mem_access_set_entry() gets called for a non-present page.
>>Subsequent p2m_mem_access_get_entry() will return the default access
>>value. Even worse, a subsequent page fault will overwrite the access value in
>>shadow flags with the default. Should I disallow
>>p2m_mem_access_set_entry() for non-present pages to work around this?
>>>
>>
>>Nope, I think you're in even deeper trouble than that -- shadow PTEs
>>can be dropped at any time (e.g. because of memory pressure) so you
>>can't rely on their being present or absent to mean anything.
>
>You are right. Shadow PTEs being present cannot be used as an indicator. So
>the problem of setting default access and converting all pages to it will exist
>irrespective of what data structure I use. The only way I can think of doing this
>is to walk the page_list but if the hypercall gets preempted, there is the
>possibility that the list could get modified which would break the continuation.
>Is the page_list the only place that will tell me the range of pages that belong
>to a PV guest? Is the max PFN stored somewhere?

The one solution I was kicking around is to add a flag that denotes if the page_list was modified. The flag will be set to 0 at the beginning of mem_access_set_default() if we are starting at the head of page_list and when it is preempted. The flag will be set to 1 in page_list_add/del() and friends if the page belongs to a PV domain that has a mem_access listener. On hypercall preemption / continuation, i.e. when start_page has a value, mem_access_set_default() will check this flag. If it is set then it will restart setting the permissions from the head of page_list. We can give it N number of retries and if it does not go through we can return an error. If it goes through successfully, we will set this flag to 0. Do you think this is a viable solution?
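
Roughly what I have in mind, just to make the idea concrete -- set_access_in_shadow_flags() and the page_list_changed flag are placeholders, mem_access_set_default() is the function discussed above (none of these exist today), and the actual preemption / continuation handling is elided:

/* 'N' from the description above; the value here is arbitrary. */
#define SET_DEFAULT_MAX_RETRIES 3

static int mem_access_set_default(struct domain *d, p2m_access_t a)
{
    unsigned int tries;

    for ( tries = 0; tries < SET_DEFAULT_MAX_RETRIES; tries++ )
    {
        struct page_info *pg;

        /* Cleared here and on preemption; set in page_list_add/del(). */
        d->arch.page_list_changed = 0;

        spin_lock(&d->page_alloc_lock);
        page_list_for_each ( pg, &d->page_list )
        {
            set_access_in_shadow_flags(pg, a);
            /* Preemption would drop the lock here; if the list changed
             * while it was dropped, the check below makes us retry. */
        }
        spin_unlock(&d->page_alloc_lock);

        if ( !d->arch.page_list_changed )
            return 0;            /* walk completed without interference */
    }

    return -EBUSY;               /* gave up after N retries */
}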

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-08-29 19:03                         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-09-01 10:38                           ` Jan Beulich
  2014-09-02 21:57                             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-09-01 10:38 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 29.08.14 at 21:03, <aravindp@cisco.com> wrote:
>>>Nope, I think you're in even deeper trouble than that -- shadow PTEs
>>>can be dropped at any time (e.g. because of memory pressure) so you
>>>can't rely on their being present or absent to mean anything.
>>
>>You are right. Shadow PTEs being present cannot be used as an indicator. So
>>the problem of setting default access and converting all pages to it will 
> exist
>>irrespective of what data structure I use. The only way I can think of doing 
> this
>>is to walk the page_list but if the hypercall gets preempted, there is the
>>possibility that the list could get modified which would break the 
> continuation.
>>Is the page_list the only place that will tell me the range of pages that 
> belong
>>to a PV guest? Is the max PFN stored somewhere?
> 
> The one solution I was kicking around is to add a flag that denotes if the 
> page_list was modified. The flag will be set to 0 at the beginning of 
> mem_access_set_default() if we are starting at the head of page_list and when 
> it is preempted. The flag will be set to 1 in page_list_add/del() and friends 
> if the page belongs to a PV domain that has a mem_access listener. On 
> hypercall preemption / continuation, i.e. when start_page has a value, 
> mem_access_set_default() will check this flag. If it is set then it will 
> restart setting the permissions from the head of page_list. We can give it N 
> number of retries and if it does not go through we can return an error. If it 
> goes through successfully, we will set this flag to 0. Do you think this is a 
> viable solution?

What would you think of a feature that has relatively little chance
of working at all on certain kinds of guests (e.g. with
autoballooning turned on, i.e. where the list of owned pages may
be constantly in flux)?

What we do in at least one other place is shuffle the list from one
list head to another, and put the whole thing back when done. I'm
not sure this is a good option here (in particular not going to break
that other case, as well as then needing to consider that if there
are two sites doing this, there may well appear a third one, and
hence the question would become whether this scales).
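
Roughly, the pattern I'm referring to has this shape (heavily simplified, with locking and the actual preemption check left out):

PAGE_LIST_HEAD(done);
struct page_info *pg;

while ( (pg = page_list_remove_head(&d->page_list)) != NULL )
{
    /* ... process pg (e.g. set its default access) ... */
    page_list_add_tail(pg, &done);
    /* A preemption check can sit here: pages already processed stay
     * on 'done' across a continuation, so no work is repeated. */
}

/* Put the whole thing back when done. */
while ( (pg = page_list_remove_head(&done)) != NULL )
    page_list_add_tail(pg, &d->page_list);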

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-01 10:38                           ` Jan Beulich
@ 2014-09-02 21:57                             ` Aravindh Puthiyaparambil (aravindp)
  2014-09-03  8:31                               ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-02 21:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>>>Nope, I think you're in even deeper trouble than that -- shadow PTEs
>>>>can be dropped at any time (e.g. because of memory pressure) so you
>>>>can't rely on their being present or absent to mean anything.
>>>
>>>You are right. Shadow PTEs being present cannot be used as an
>>>indicator. So the problem of setting default access and converting all
>>>pages to it will
>> exist
>>>irrespective of what data structure I use. The only way I can think of
>>>doing
>> this
>>>is to walk the page_list but if the hypercall gets preempted, there is
>>>the possibility that the list could get modified which would break the
>> continuation.
>>>Is the page_list the only place that will tell me the range of pages
>>>that
>> belong
>>>to a PV guest? Is the max PFN stored somewhere?
>>
>> The one solution I was kicking around is to add a flag that denotes if
>> the page_list was modified. The flag will be set to 0 at the beginning
>> of
>> mem_access_set_default() if we are starting at the head of page_list
>> and when it is preempted. The flag will be set to 1 in
>> page_list_add/del() and friends if the page belongs to a PV domain
>> that has a mem_access listener. On hypercall preemption / continuation
>> i.e. when start_page has a value,
>> mem_access_set_default() will check this flag. If it is set then it
>> will restart setting the permissions from the head of page_list. We
>> can give it N number of retries and if it does not go through we can
>> return an error. If it goes through successfully, we will set this
>> flag to 0. Do you think this is a viable solution?
>
>What would you think of a feature that has relatively little chance of working
>at all on certain kinds of guests (e.g. with autoballooning turned on, i.e. where
>the list of owned pages may be constantly in flux)?

Setting the default access is typically an operation done once during the life of the mem_access listener. It is not something that the listener constantly calls. So this issue will occur only when mem_access_set_default() is called during ballooning. Pages added after the call has been made will get the default access permission via a hook in assign_pages().
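
The hook would be on the order of the fragment below, sitting at the end of assign_pages() once the new pages are on d->page_list. mem_access_listener_attached(), set_access_in_shadow_flags() and d->arch.pv_default_access are invented names standing in for what the series would add:

/* Sketch: tail of assign_pages(); pg and order are its existing parameters. */
if ( is_pv_domain(d) && mem_access_listener_attached(d) )
{
    unsigned int i;

    for ( i = 0; i < (1u << order); i++ )
        set_access_in_shadow_flags(&pg[i], d->arch.pv_default_access);
}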

>What we do in at least one other place is shuffle the list from one
>list head to another, and put the whole thing back when done. I'm
>not sure this is a good option here (in particular not going to break
>that other case, as well as then needing to consider that if there
>are two sites doing this, there may well appear a third one, and
>hence the question would become whether this scales).

Yes, you are right that if the page_list is modified in places outside of the page_list_*() functions, then this check will have to be done in every such location.

Is the page_list the only place that will tell me the range of pages that belong to a PV guest? 

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-02 21:57                             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-09-03  8:31                               ` Jan Beulich
  2014-09-03 18:50                                 ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-09-03  8:31 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 02.09.14 at 23:57, <aravindp@cisco.com> wrote:
> Is the page_list the only place that will tell me the range of pages that 
> belong to a PV guest? 

Yes. No upper bound on the PFN space is being tracked.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-03  8:31                               ` Jan Beulich
@ 2014-09-03 18:50                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-09-04  6:39                                   ` Jan Beulich
       [not found]                                   ` <20140904083906.GA86555@deinos.phlegethon.org>
  0 siblings, 2 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-03 18:50 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>> Is the page_list the only place that will tell me the range of pages that
>> belong to a PV guest?
>
>Yes. No upper bound on the PFN space is being tracked.

OK, in that case the only thing I can think of is to set a max memory limit for PV domains that an access listener can attach to, so that hypercall preemption / continuation is not required. In addition, we could add a kernel command line parameter that overrides this and document that this could cause denial of service issues when used.
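
Concretely, something along these lines is what I am picturing (the parameter name, the default value and pv_access_max_pages are all made up):

/* 0 means no limit; the default here is only illustrative. */
static unsigned int __read_mostly pv_access_max_pages = 262144;
integer_param("pv-mem-access-max-pages", pv_access_max_pages);

/* In the PV mem_access enable path: */
if ( is_pv_domain(d) && pv_access_max_pages &&
     d->tot_pages > pv_access_max_pages )
    return -E2BIG;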

Tim, can you think of any other options here?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-03 18:50                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-09-04  6:39                                   ` Jan Beulich
  2014-09-04 18:24                                     ` Aravindh Puthiyaparambil (aravindp)
       [not found]                                   ` <20140904083906.GA86555@deinos.phlegethon.org>
  1 sibling, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-09-04  6:39 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 03.09.14 at 20:50, <aravindp@cisco.com> wrote:
>> > Is the page_list the only place that will tell me the range of pages that
>>> belong to a PV guest?
>>
>>Yes. No upper bound on the PFN space is being tracked.
> 
> OK, in that case the only thing I can think of is to set a max memory limit 
> for PV domains that an access listener can attach to so that the hypercall 
> preemption / continuation is not required. In addition we could add a kernel 
> command line parameter that overrides this and document that this could cause 
> denial of service issues when used. 

Did you check whether the suggested list shuffling would require
(massive) changes elsewhere (other than in domain cleanup)?

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-04  6:39                                   ` Jan Beulich
@ 2014-09-04 18:24                                     ` Aravindh Puthiyaparambil (aravindp)
  2014-09-05  8:11                                       ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-04 18:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>Did you check whether the suggested list shuffling would require
>(massive) changes elsewhere (other than in domain cleanup)?

"What we do in at least one other place is shuffle the list from one list head to another, and put the hole thing back when done. I'm not sure this is a good option here (in particular not going to break that other case, as well as then needing to consider that if there are two sites doing this, there may well appear a third one, and hence the question would become whether this scales)."

Sorry I did not. From your previous reply I thought this was not a good option as it wouldn't scale. I will look more at this code now.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-04 18:24                                     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-09-05  8:11                                       ` Jan Beulich
  2014-09-05 22:49                                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-09-05  8:11 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 04.09.14 at 20:24, <aravindp@cisco.com> wrote:
>> Did you check whether the suggested list shuffling would require
>>(massive) changes elsewhere (other than in domain cleanup)?
> 
> "What we do in at least one other place is shuffle the list from one list 
> head to another, and put the whole thing back when done. I'm not sure this is 
> a good option here (in particular not going to break that other case, as well 
> as then needing to consider that if there are two sites doing this, there may 
> well appear a third one, and hence the question would become whether this 
> scales)."
> 
> Sorry I did not. From your previous reply I thought this was not a good 
> option as it wouldn't scale. I will look more at this code now.

But Tim's suggestion was an even better one anyway, if it (as both
he and I think it will) results in not too ugly code.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-05  8:11                                       ` Jan Beulich
@ 2014-09-05 22:49                                         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-05 22:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> Did you check whether the suggested list shuffling would require
>>>(massive) changes elsewhere (other than in domain cleanup)?
>>
>> "What we do in at least one other place is shuffle the list from one list
>> head to another, and put the whole thing back when done. I'm not sure this is
>> a good option here (in particular not going to break that other case, as well
>> as then needing to consider that if there are two sites doing this, there may
>> well appear a third one, and hence the question would become whether
>this
>> scales)."
>>
>> Sorry I did not. From your previous reply I thought this was not a good
>> option as it wouldn't scale. I will look more at this code now.
>
>But Tim's suggestion was an even better one anyway, if it (as both
>he and I think it will) results in not too ugly code.

OK, I will take a stab at implementing that.

Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
       [not found]                                     ` <540849430200007800030C47@mail.emea.novell.com>
@ 2014-09-11 19:40                                       ` Aravindh Puthiyaparambil (aravindp)
  2014-09-12  7:21                                         ` Jan Beulich
  0 siblings, 1 reply; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-11 19:40 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell

>> It might be possible to make the walk restartable if we arrange that:
>> - all additions to the page list are at the tail;
>> - walks in progress are recorded (though I'm not sure where: it might
>>   be possible just to use a single bit in the page_info to say "this
>>   page is the cursor of an interrupted walk" and have a separate
>>   per-domain list of interrupted walks); and
>> - anything that removes a page from a domain's page_list must check
>>   that bit; if it finds it set it must update the per-walk state to
>>   point to the next page.
>>
>> Jan, do you think that's viable?  I think it'll need a bit of thought
>> around the edge cases, and the semantics will need to be written down
>> carefully.
>
>Yes, nice idea - I think that could work (with particularly converting
>arch_iommu_populate_page_table() to use that mechanism, while inspecting
>[and perhaps adjusting] the related code in
>p2m_pod_offline_or_broken_hit()) and would not end up being too ugly (i.e.
>not exactly nice, but maintainable).

As I was looking into implementing this, I notice that the functions won't be called for PV domains. So I assume this implementation is a generic mechanism to make page_list walks pre-emptible and not specific to the PV mem_access use case. If that is the case I would prefer committing this as a separate patch in the 4.6 time frame and follow it up with the PV mem_access patch. Is that agreeable?
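
For reference, to make sure I have understood the proposal correctly, the generic piece I picture looks roughly like this (the PGC_walk_cursor bit, the per-domain list of interrupted walks and the helper below are all invented names):

/* Per-walk state, kept on a per-domain list of interrupted walks. */
struct page_list_walk {
    struct page_info *cursor;      /* next page the walk will visit */
    struct list_head  entry;       /* linked on d->interrupted_walks */
};

/*
 * Called, with the page_alloc lock held, by anything about to remove
 * a page from d->page_list.
 */
static void page_list_walk_fixup(struct domain *d, struct page_info *pg)
{
    struct page_list_walk *w;

    if ( !test_bit(_PGC_walk_cursor, &pg->count_info) )
        return;

    list_for_each_entry ( w, &d->interrupted_walks, entry )
        if ( w->cursor == pg )
        {
            /* Advance the cursor past the page being removed. */
            w->cursor = page_list_next(pg, &d->page_list);
            if ( w->cursor )
                set_bit(_PGC_walk_cursor, &w->cursor->count_info);
        }

    clear_bit(_PGC_walk_cursor, &pg->count_info);
}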

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-11 19:40                                       ` Aravindh Puthiyaparambil (aravindp)
@ 2014-09-12  7:21                                         ` Jan Beulich
  2014-09-12 18:01                                           ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 85+ messages in thread
From: Jan Beulich @ 2014-09-12  7:21 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), Tim Deegan
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell

>>> On 11.09.14 at 21:40, <aravindp@cisco.com> wrote:
>> > It might be possible to make the walk restartable if we arrange that:
>>> - all additions to the page list are at the tail;
>>> - walks in progress are recorded (though I'm not sure where: it might
>>>   be possible just to use a single bit in the page_info to say "this
>>>   page is the cursor of an interrupted walk" and have a separate
>>>   per-domain list of interrupted walks); and
>>> - anything that removes a page from a domain's page_list must check
>>>   that bit; if it finds it set it must update the per-walk state to
>>>   point to the next page.
>>>
>>> Jan, do you think that's viable?  I think it'll need a bit of thought
>>> around the edge cases, and the semantics will need to be written down
>>> carefully.
>>
>>Yes, nice idea - I think that could work (with particularly converting
>>arch_iommu_populate_page_table() to use that mechanism, while inspecting
>>[and perhaps adjusting] the related code in
>>p2m_pod_offline_or_broken_hit()) and would not end up being too ugly (i.e.
>>not exactly nice, but maintainable).
> 
> As I was looking into implementing this, I notice that the functions won't 
> be called for PV domains. So I assume this implementation is a generic 
> mechanism to make page_list walks pre-emptible and not specific to the PV 
> mem_access use case. If that is the case I would prefer committing this as a 
> separate patch in the 4.6 time frame and follow it up with the PV mem_access 
> patch. Is that agreeable?

The generic part being a separate patch certainly makes sense. And
yours then going on top does (of course) too. The implication being
that the PV mem-access functionality then all will only go in for 4.6.

Jan

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-09-12  7:21                                         ` Jan Beulich
@ 2014-09-12 18:01                                           ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 85+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-09-12 18:01 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell

>>> > It might be possible to make the walk restartable if we arrange that:
>>>> - all additions to the page list are at the tail;
>>>> - walks in progress are recorded (though I'm not sure where: it might
>>>>   be possible just to use a single bit in the page_info to say "this
>>>>   page is the cursor of an interrupted walk" and have a separate
>>>>   per-domain list of interrupted walks); and
>>>> - anything that removes a page from a domain's page_list must check
>>>>   that bit; if it finds it set it must update the per-walk state to
>>>>   point to the next page.
>>>>
>>>> Jan, do you think that's viable?  I think it'll need a bit of thought
>>>> around the edge cases, and the semantics will need to be written down
>>>> carefully.
>>>
>>>Yes, nice idea - I think that could work (with particularly converting
>>>arch_iommu_populate_page_table() to use that mechanism, while
>inspecting
>>>[and perhaps adjusting] the related code in
>>>p2m_pod_offline_or_broken_hit()) and would not end up being too ugly
>(i.e.
>>>not exactly nice, but maintainable).
>>
>> As I was looking into implementing this, I notice that the functions won't
>> be called for PV domains. So I assume this implementation is a generic
>> mechanism to make page_list walks pre-emptible and not specific to the PV
>> mem_access use case. If that is the case I would prefer committing this as a
>> separate patch in the 4.6 time frame and follow it up with the PV
>mem_access
>> patch. Is that agreeable?
>
>The generic part being a separate patch certainly makes sense. And
>yours then going on top does (of course) too. The implication being
>that the PV mem-access functionality then all will only go in for 4.6.

I am going to make a best effort to get in to 4.5 but it is looking like a long shot at this point.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2014-09-12 18:01 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-08  2:50 [PATCH RFC v2 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
2014-07-08  2:50 ` [PATCH RFC v2 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
2014-07-24 14:29   ` Jan Beulich
2014-07-24 23:34     ` Aravindh Puthiyaparambil (aravindp)
2014-07-25  7:19       ` Jan Beulich
2014-07-25 21:39         ` Aravindh Puthiyaparambil (aravindp)
2014-07-28  6:49           ` Jan Beulich
2014-07-28 21:14             ` Aravindh Puthiyaparambil (aravindp)
2014-07-30  4:05             ` Aravindh Puthiyaparambil (aravindp)
2014-07-30  7:11               ` Jan Beulich
2014-07-30 18:35                 ` Aravindh Puthiyaparambil (aravindp)
2014-08-01  6:39                   ` Jan Beulich
2014-08-01 18:08                     ` Aravindh Puthiyaparambil (aravindp)
2014-08-04  7:03                       ` Jan Beulich
2014-08-05  0:14                         ` Aravindh Puthiyaparambil (aravindp)
2014-08-05  6:33                           ` Jan Beulich
2014-08-13 22:14                             ` Aravindh Puthiyaparambil (aravindp)
2014-08-22  2:29                             ` Aravindh Puthiyaparambil (aravindp)
2014-08-22  9:34                               ` Andrew Cooper
2014-08-22 10:02                                 ` Jan Beulich
2014-08-22 10:14                                   ` Andrew Cooper
2014-08-22 18:28                                 ` Aravindh Puthiyaparambil (aravindp)
2014-08-22 18:52                                   ` Andrew Cooper
2014-08-25 12:45                                 ` Gianluca Guida
2014-08-25 13:01                                   ` Jan Beulich
2014-08-25 13:02                                   ` Andrew Cooper
2014-08-25 13:59                                     ` Gianluca Guida
2014-08-22 15:33                               ` Jan Beulich
2014-08-22 19:07                                 ` Aravindh Puthiyaparambil (aravindp)
2014-08-22 19:24                                   ` Andrew Cooper
2014-08-22 19:48                                     ` Aravindh Puthiyaparambil (aravindp)
2014-08-22 20:02                                       ` Andrew Cooper
2014-08-22 20:13                                         ` Aravindh Puthiyaparambil (aravindp)
2014-08-25  7:34                                           ` Jan Beulich
2014-08-25  7:33                                         ` Jan Beulich
2014-08-25 12:49                                           ` Andrew Cooper
2014-08-25 13:09                                             ` Jan Beulich
2014-08-25 16:56                                               ` Aravindh Puthiyaparambil (aravindp)
2014-08-26  7:08                                                 ` Jan Beulich
2014-08-26 22:27                                                   ` Aravindh Puthiyaparambil (aravindp)
2014-08-26 23:30                                                     ` Andrew Cooper
2014-08-28  9:34                                                       ` Tim Deegan
2014-08-28 18:33                                                         ` Aravindh Puthiyaparambil (aravindp)
2014-08-27  6:33                                                     ` Jan Beulich
2014-08-27  7:49                                                       ` Tim Deegan
2014-08-27 17:29                                                       ` Aravindh Puthiyaparambil (aravindp)
2014-08-25 17:44                                               ` Andrew Cooper
2014-08-26  7:12                                                 ` Jan Beulich
2014-08-25  7:29                                       ` Jan Beulich
2014-08-25 16:40                                         ` Aravindh Puthiyaparambil (aravindp)
2014-08-28  9:14           ` Tim Deegan
2014-08-28 18:31             ` Aravindh Puthiyaparambil (aravindp)
2014-08-28 19:00               ` Tim Deegan
2014-08-28 19:23                 ` Aravindh Puthiyaparambil (aravindp)
2014-08-28 20:37                   ` Tim Deegan
2014-08-28 21:35                     ` Aravindh Puthiyaparambil (aravindp)
2014-08-28 22:20                     ` Aravindh Puthiyaparambil (aravindp)
2014-08-29  9:52                       ` Tim Deegan
2014-08-29 17:52                         ` Aravindh Puthiyaparambil (aravindp)
2014-08-29 19:03                         ` Aravindh Puthiyaparambil (aravindp)
2014-09-01 10:38                           ` Jan Beulich
2014-09-02 21:57                             ` Aravindh Puthiyaparambil (aravindp)
2014-09-03  8:31                               ` Jan Beulich
2014-09-03 18:50                                 ` Aravindh Puthiyaparambil (aravindp)
2014-09-04  6:39                                   ` Jan Beulich
2014-09-04 18:24                                     ` Aravindh Puthiyaparambil (aravindp)
2014-09-05  8:11                                       ` Jan Beulich
2014-09-05 22:49                                         ` Aravindh Puthiyaparambil (aravindp)
     [not found]                                   ` <20140904083906.GA86555@deinos.phlegethon.org>
     [not found]                                     ` <540849430200007800030C47@mail.emea.novell.com>
2014-09-11 19:40                                       ` Aravindh Puthiyaparambil (aravindp)
2014-09-12  7:21                                         ` Jan Beulich
2014-09-12 18:01                                           ` Aravindh Puthiyaparambil (aravindp)
2014-08-28  9:09       ` Tim Deegan
2014-08-28 18:23         ` Aravindh Puthiyaparambil (aravindp)
2014-07-08  2:50 ` [PATCH RFC v2 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
2014-07-24 14:38   ` Jan Beulich
2014-07-24 23:52     ` Aravindh Puthiyaparambil (aravindp)
2014-07-25  7:23       ` Jan Beulich
2014-07-25 21:47         ` Aravindh Puthiyaparambil (aravindp)
2014-07-28  6:56           ` Jan Beulich
2014-07-28 21:16             ` Aravindh Puthiyaparambil (aravindp)
2014-07-08  2:50 ` [PATCH RFC v2 3/4] tools/libxc: Add APIs for PV mem_access Aravindh Puthiyaparambil
2014-07-08  2:50 ` [PATCH RFC v2 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
2014-07-08 16:27 ` [PATCH RFC v2 0/4] Add mem_access " Konrad Rzeszutek Wilk
2014-07-08 17:57   ` Aravindh Puthiyaparambil (aravindp)
2014-07-09  0:31   ` Aravindh Puthiyaparambil (aravindp)
