* [PATCH RFC 0/4] Add mem_access support for PV domains
@ 2014-04-29  4:45 Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
                   ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-04-29  4:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, Tim Deegan,
	Ian Jackson, Jan Beulich

This patch series adds mem_access support for PV domains. To do this, the PV
domain has to be run with shadow paging. A new shadow mode, PG_mem_access,
has been added for this. A p2m implementation for mem_access has also been
added to track the access permissions. Since the special ring page is not
created for PV domains, a new libxc API has been added to create it. The
mem_access listener has to call this API before trying to enable mem_access.
The page is freed when mem_access is disabled or when the domain is destroyed.

When mem_access is enabled for a PV domain, PG_mem_access shadow mode is turned
on and all the shadows are dropped. On the resulting page faults, the entries are
created with the default access permissions. On future page faults, if there is a
violation, a mem_event is sent to the mem_access listener, which then resolves it.

To get the access permissions for an individual page, the p2m access lookup table
is consulted. To set the access permission of an individual page, the new
permission is written to the lookup table and the shadow for the gmfn is dropped;
on the resulting fault, the new PTE entry is created with the new permission.
Note that a mem_access listener for a PV domain does not need to convert all
pages in the PV domain to the default permission during setup and teardown, as
this is done implicitly when enabling and disabling mem_access.
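
For illustration only (this is not part of the series), the listener-side flow
for a PV domain would look roughly like the sketch below. It uses the libxc
calls added in patch 3 together with the existing xc_set_mem_access() and
xc_mem_access_disable() interfaces; the helper name is made up, the
(xch, domid, &port) signature assumed for xc_mem_access_enable() is the
existing libxc one, and error handling is omitted.

    #include <xenctrl.h>

    /* Sketch only: minimal PV mem_access listener lifecycle. */
    static void pv_mem_access_lifecycle(xc_interface *xch, domid_t domid)
    {
        uint32_t port;

        /* PV only: the ring page is not created at domain build time. */
        xc_mem_access_create_ring_page(xch, domid);

        /* Turns on PG_mem_access shadow mode and drops all shadows, so
         * every page implicitly starts with the default permission. */
        xc_mem_access_enable(xch, domid, &port);

        /* pfn == ~0ull sets the default access type for all pages. */
        xc_set_mem_access(xch, domid, XENMEM_access_rx, ~0ull, 0);

        /* ... map the ring and handle mem_event requests ... */

        /* Disabling frees the ring page and the p2m access structures. */
        xc_mem_access_disable(xch, domid);
    }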

Patches are based on top of commit 31ee95.

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

  x86/mm: Shadow and p2m changes for PV mem_access
  x86/mem_access: mem_access and mem_event changes to support PV domains
  tools/libxc: Add APIs to create and get the PV ring page
  tool/xen-access: Add support for PV domains

 tools/libxc/xc_mem_access.c         |  29 ++++
 tools/libxc/xenctrl.h               |  17 ++-
 tools/tests/xen-access/xen-access.c | 122 +++++++++------
 xen/arch/x86/domain.c               |  12 ++
 xen/arch/x86/mm/Makefile            |   2 +-
 xen/arch/x86/mm/mem_access.c        | 146 ++++++++++++++++--
 xen/arch/x86/mm/mem_event.c         |  77 ++++++++--
 xen/arch/x86/mm/p2m-ma.c            | 286 ++++++++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c               |  19 ++-
 xen/arch/x86/mm/paging.c            |  16 ++
 xen/arch/x86/mm/shadow/common.c     |  63 +++++++-
 xen/arch/x86/mm/shadow/multi.c      | 104 ++++++++++++-
 xen/include/asm-x86/domain.h        |   3 +
 xen/include/asm-x86/mem_access.h    |   4 +
 xen/include/asm-x86/p2m.h           |  51 +++++++
 xen/include/asm-x86/paging.h        |   3 +
 xen/include/asm-x86/shadow.h        |  11 ++
 xen/include/public/domctl.h         |   4 +
 xen/include/public/memory.h         |   2 +
 19 files changed, 897 insertions(+), 74 deletions(-)
 create mode 100644 xen/arch/x86/mm/p2m-ma.c

-- 
1.9.1


* [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29  4:45 [PATCH RFC 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
@ 2014-04-29  4:45 ` Aravindh Puthiyaparambil
  2014-04-29  8:50   ` Jan Beulich
                     ` (2 more replies)
  2014-04-29  4:45 ` [PATCH RFC 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-04-29  4:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich, Tim Deegan

Shadow mem_access mode
----------------------
Add a new shadow mode for mem_access. This should only be enabled by a
mem_access listener when it calls XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
for a PV domain.

Changes to shadow code
----------------------
If the shadow pagefault handler detects that a mem_access listener is
present, it checks whether an access violation occurred. If it did, the
vCPU is paused and an event is sent to the listener.
Similarly, if the propagation code detects that a mem_access listener is
present, it creates the PTE after applying access permissions to it.
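
Condensed from the sh_page_fault() hunk further below, the violation check
amounts to comparing the fault's access type against the page's stored
p2m_access_t; a sketch (the helper name is made up):

    /* Sketch: condensed form of the check done in sh_page_fault(). */
    static int is_access_violation(p2m_access_t p2ma,
                                   bool_t access_w, bool_t access_x)
    {
        switch ( p2ma )
        {
        case p2m_access_r:          /* read-only: neither write nor exec */
            return access_w || access_x;
        case p2m_access_rx:
        case p2m_access_rx2rw:      /* no write */
            return access_w;
        case p2m_access_rw:         /* no execute */
            return access_x;
        case p2m_access_rwx:
        default:
            return 0;
        }
    }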

P2M changes
-----------
Add a new p2m implementation for mem_access. Central to this is the
access lookup table. This is an array of mfns; each mfn is a page
allocated from the domain's shadow memory. The pages hold the
p2m_access_t values for each guest gmfn. p2m_mem_access_set_entry()
sets the access value of the mfn given as input and blows away the
shadow entries for that mfn. p2m_mem_access_get_entry() returns the
access value of the mfn given as input.
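
For example, looking up the p2m_access_t for a gfn splits the gfn into an
index into the lookup table and a byte offset within the table page it refers
to (the MEM_ACCESS_TABLE_IDX/MEM_ACCESS_PAGE_IDX macros added to p2m.h). A
condensed sketch of the get path, with a made-up helper name:

    /* Sketch: the core of p2m_mem_access_get_entry(). One byte of
     * p2m_access_t per gfn, NR_ACCESS (== PAGE_SIZE) entries per page. */
    static p2m_access_t lookup_access(mfn_t *access_lookup_table,
                                      unsigned long gfn)
    {
        uint table_idx = gfn / NR_ACCESS;   /* which mfn in the table */
        uint page_idx  = gfn % NR_ACCESS;   /* which byte in that page */
        uint8_t *page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
        p2m_access_t a = page[page_idx];

        unmap_domain_page(page);
        return a;
    }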

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/mm/Makefile        |   2 +-
 xen/arch/x86/mm/p2m-ma.c        | 286 ++++++++++++++++++++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c           |  19 ++-
 xen/arch/x86/mm/paging.c        |  16 +++
 xen/arch/x86/mm/shadow/common.c |  63 ++++++++-
 xen/arch/x86/mm/shadow/multi.c  | 104 ++++++++++++++-
 xen/include/asm-x86/p2m.h       |  51 +++++++
 xen/include/asm-x86/paging.h    |   3 +
 xen/include/asm-x86/shadow.h    |  11 ++
 xen/include/public/domctl.h     |   4 +
 10 files changed, 549 insertions(+), 10 deletions(-)
 create mode 100644 xen/arch/x86/mm/p2m-ma.c

diff --git a/xen/arch/x86/mm/Makefile b/xen/arch/x86/mm/Makefile
index 73dcdf4..41128a4 100644
--- a/xen/arch/x86/mm/Makefile
+++ b/xen/arch/x86/mm/Makefile
@@ -2,7 +2,7 @@ subdir-y += shadow
 subdir-y += hap
 
 obj-y += paging.o
-obj-y += p2m.o p2m-pt.o p2m-ept.o p2m-pod.o
+obj-y += p2m.o p2m-pt.o p2m-ept.o p2m-pod.o p2m-ma.o
 obj-y += guest_walk_2.o
 obj-y += guest_walk_3.o
 obj-$(x86_64) += guest_walk_4.o
diff --git a/xen/arch/x86/mm/p2m-ma.c b/xen/arch/x86/mm/p2m-ma.c
new file mode 100644
index 0000000..634b0eb
--- /dev/null
+++ b/xen/arch/x86/mm/p2m-ma.c
@@ -0,0 +1,286 @@
+/******************************************************************************
+ * arch/x86/mm/p2m-ma.c
+ *
+ * Implementation of p2m data structures, for use by PV mem_access code.
+ *
+ * Copyright (c) 2014 Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <asm/p2m.h>
+#include <asm/shadow.h>
+#include "mm-locks.h"
+
+/* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
+#undef mfn_to_page
+#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
+#undef page_to_mfn
+#define page_to_mfn(_pg) _mfn(__page_to_mfn(_pg))
+
+/* Convert access restrictions to page table flags */
+void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
+{
+
+    /* Restrict with access permissions */
+    switch (access)
+    {
+        case p2m_access_r:
+            *flags &= ~(_PAGE_RW);
+            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
+            break;
+        case p2m_access_rx:
+        case p2m_access_rx2rw:
+            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
+            *flags |= _PAGE_PRESENT;
+            break;
+        case p2m_access_rw:
+            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
+            break;
+        case p2m_access_rwx:
+        default:
+            *flags &= ~(_PAGE_NX_BIT);
+            *flags |= (_PAGE_RW|_PAGE_PRESENT);
+            break;
+    }
+
+    /* Allow more restrictive guest flags to be propagated instead of
+     * access permissions */
+    if ( !(gflags & _PAGE_RW) )
+        *flags &= ~(_PAGE_RW);
+
+    if ( gflags & _PAGE_NX_BIT )
+        *flags |= _PAGE_NX_BIT;
+
+}
+
+/*
+ * Set the page permission of the mfn. This in effect removes all shadow
+ * mappings of that mfn. The access type of that mfn is stored in the access
+ * lookup table.
+ */
+static int
+p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
+                         unsigned int page_order, p2m_type_t p2mt,
+                         p2m_access_t p2ma)
+{
+    struct domain *d = p2m->domain;
+    mfn_t *access_lookup_table = p2m->access_lookup_table;
+    uint table_idx;
+    uint page_idx;
+    uint8_t *access_table_page;
+
+    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
+
+    /* For PV domains we only support r, rw, rx, rx2rw, rwx permissions */
+    if ( unlikely(p2ma != p2m_access_r &&
+                  p2ma != p2m_access_rw &&
+                  p2ma != p2m_access_rx &&
+                  p2ma != p2m_access_rwx &&
+                  p2ma != p2m_access_rx2rw) )
+        return -EINVAL;
+
+    if ( page_get_owner(mfn_to_page(mfn)) != d )
+        return -ENOENT;
+
+    gfn = get_gpfn_from_mfn(mfn_x(mfn));
+
+    /*
+     * Values with the MSB set denote MFNs that aren't really part of the
+     * domain's pseudo-physical memory map (e.g., the shared info frame).
+     * Nothing to do here.
+     */
+    if ( unlikely(!VALID_M2P(gfn)) )
+        return 0;
+
+    if ( gfn > (d->tot_pages - 1) )
+        return -EINVAL;
+
+    paging_lock(d);
+
+    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
+    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
+    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
+    access_table_page[page_idx] = p2ma;
+    unmap_domain_page(access_table_page);
+
+    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
+        flush_tlb_mask(d->domain_dirty_cpumask);
+
+    paging_unlock(d);
+
+    return 0;
+}
+
+/* Get the page permission of the mfn from the access lookup table */
+static mfn_t
+p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
+                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
+                         unsigned int *page_order)
+{
+    struct domain *d = p2m->domain;
+    mfn_t *access_lookup_table = p2m->access_lookup_table;
+    uint table_idx;
+    uint page_idx;
+    uint8_t *access_table_page;
+    mfn_t mfn = _mfn(gfn); /* For PV guests mfn == gfn */
+
+    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
+
+    /* Not necessarily true, but for non-translated guests, we claim
+     * it's the most generic kind of memory */
+    *t = p2m_ram_rw;
+
+    if ( page_get_owner(mfn_to_page(mfn)) != d )
+        return _mfn(INVALID_MFN);
+
+    gfn = get_gpfn_from_mfn(mfn_x(mfn));
+
+    /*
+     * Values with the MSB set denote MFNs that aren't really part of the
+     * domain's pseudo-physical memory map (e.g., the shared info frame).
+     * Return mfn with RW access.
+     */
+    if ( unlikely(!VALID_M2P(gfn)) )
+    {
+        *a = p2m_access_rw;
+        return mfn;
+    }
+
+    if ( gfn > (d->tot_pages - 1) )
+    {
+        *a = p2m->default_access;
+        return _mfn(INVALID_MFN);
+    }
+
+    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
+    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
+
+    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
+
+    /* This is a hint to take the default permissions */
+    if ( access_table_page[page_idx] == p2m_access_n )
+        access_table_page[page_idx] = p2m->default_access;
+    *a = access_table_page[page_idx];
+
+    unmap_domain_page(access_table_page);
+
+    return mfn;
+}
+
+/* Check the default_access value and blow away all the shadows */
+int p2m_mem_access_set_default(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
+
+    ASSERT(shadow_mode_mem_access(d));
+
+    /* For PV domains we only support r, rw, rx, rx2rw, rwx permissions */
+    if ( p2m->default_access != p2m_access_r &&
+         p2m->default_access != p2m_access_rw &&
+         p2m->default_access != p2m_access_rx &&
+         p2m->default_access != p2m_access_rwx &&
+         p2m->default_access != p2m_access_rx2rw )
+        return -EINVAL;
+
+    paging_lock_recursive(d);
+    shadow_blow_tables(d);
+    paging_unlock(d);
+
+    return 0;
+}
+
+/*
+ * Free all the shadow pages used in the access lookup table and free the table
+ * itself.
+ */
+void p2m_mem_access_teardown(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
+    mfn_t *access_lookup_table = p2m->access_lookup_table;
+    uint32_t nr_access_table_pages;
+    uint32_t ctr;
+
+    /* Reset the set_entry and get_entry function pointers */
+    p2m_pt_init(p2m);
+
+    if ( !access_lookup_table  )
+        return;
+
+    nr_access_table_pages = get_domain_nr_access_table_pages(d);
+
+    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
+    {
+        if ( !mfn_valid(access_lookup_table[ctr]) )
+                break;
+        d->arch.paging.free_page(d,
+                                 mfn_to_page(access_lookup_table[ctr]));
+    }
+
+    xfree(p2m->access_lookup_table);
+    p2m->access_lookup_table = NULL;
+}
+
+/*
+ * Allocate the access lookup table array. This is an array of mfns. Each mfn
+ * is a page allocated from the domain's shadow memory. The size of the table
+ * will depend on the domain's memory size.
+ */
+int p2m_mem_access_init(struct p2m_domain *p2m)
+{
+    struct domain *d = p2m->domain;
+    mfn_t *access_lookup_table;
+    uint32_t nr_access_table_pages;
+    uint32_t ctr;
+
+    nr_access_table_pages = get_domain_nr_access_table_pages(d);
+    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
+    if ( !access_lookup_table )
+        return -ENOMEM;
+
+    p2m->access_lookup_table = access_lookup_table;
+
+    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
+    {
+        struct page_info *page;
+
+        page = d->arch.paging.alloc_page(d);
+        if ( page == NULL )
+        {
+            /* This is a hint to p2m_mem_access_teardown() to stop freeing */
+            access_lookup_table[ctr] = _mfn(INVALID_MFN);
+            p2m_mem_access_teardown(p2m);
+            return -ENOMEM;
+        }
+
+        access_lookup_table[ctr] = page_to_mfn(page);
+    }
+
+    p2m->set_entry = p2m_mem_access_set_entry;
+    p2m->get_entry = p2m_mem_access_get_entry;
+
+    return 0;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 1d0528b..c4b9dc4 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -33,6 +33,7 @@
 #include <asm/mem_event.h>
 #include <public/mem_event.h>
 #include <asm/mem_sharing.h>
+#include <asm/shadow.h>
 #include <xen/event.h>
 #include <asm/hvm/nestedhvm.h>
 #include <asm/hvm/svm/amd-iommu-proto.h>
@@ -222,7 +223,9 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
     if ( q & P2M_UNSHARE )
         q |= P2M_ALLOC;
 
-    if ( !p2m || !paging_mode_translate(p2m->domain) )
+    if ( !p2m ||
+          (!paging_mode_translate(p2m->domain) &&
+           !paging_mode_mem_access(p2m->domain)) )
     {
         /* Not necessarily true, but for non-translated guests, we claim
          * it's the most generic kind of memory */
@@ -259,7 +262,9 @@ mfn_t __get_gfn_type_access(struct p2m_domain *p2m, unsigned long gfn,
 
 void __put_gfn(struct p2m_domain *p2m, unsigned long gfn)
 {
-    if ( !p2m || !paging_mode_translate(p2m->domain) )
+    if ( !p2m ||
+         (!paging_mode_translate(p2m->domain) &&
+          !paging_mode_mem_access(p2m->domain)) )
         /* Nothing to do in this case */
         return;
 
@@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
     if ( pfn == ~0ul )
     {
         p2m->default_access = a;
+        if ( is_pv_domain(d) )
+            return p2m_mem_access_set_default(p2m);
         return 0;
     }
 
@@ -1421,6 +1428,14 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
     for ( pfn += start; nr > start; ++pfn )
     {
         mfn = p2m->get_entry(p2m, pfn, &t, &_a, 0, NULL);
+
+        /* Return error on invalid MFN for PV guests */
+        if ( unlikely(is_pv_domain(d) && !mfn_valid(mfn)) )
+        {
+            rc = -EFAULT;
+            break;
+        }
+
         rc = p2m->set_entry(p2m, pfn, mfn, PAGE_ORDER_4K, t, a);
         if ( rc )
             break;
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index f956aa5..bedf1cd 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -585,9 +585,17 @@ int paging_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
     {
 
     case XEN_DOMCTL_SHADOW_OP_ENABLE:
+        /*
+         * Shadow mem_access mode should only be enabled when mem_access is
+         * enabled in XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE.
+         */
+        if ( sc->mode & XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS )
+            return -EINVAL;
+
         if ( !(sc->mode & XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY) )
             break;
         /* Else fall through... */
+
     case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
         if ( hap_enabled(d) )
             hap_logdirty_init(d);
@@ -622,6 +630,14 @@ void paging_teardown(struct domain *d)
     /* clean up log dirty resources. */
     paging_log_dirty_teardown(d);
 
+    /*
+     * Free the PV mem_access p2m resources in the case where a mem_access
+     * listener is present while the domain is being destroyed or it crashed
+     * without cleaning up.
+     */
+    if ( is_pv_domain(d) )
+        p2m_mem_access_teardown(p2m_get_hostp2m(d));
+
     /* Move populate-on-demand cache back to domain_list for destruction */
     p2m_pod_empty_cache(d);
 }
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 3c803b6..572bd8d 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1356,7 +1356,7 @@ void shadow_prealloc(struct domain *d, u32 type, unsigned int count)
 
 /* Deliberately free all the memory we can: this will tear down all of
  * this domain's shadows */
-static void shadow_blow_tables(struct domain *d) 
+void shadow_blow_tables(struct domain *d)
 {
     struct page_info *sp, *t;
     struct vcpu *v = d->vcpu[0];
@@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
         if ( !(shadow_mode_external(v->domain)
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
-                   == !!is_xen_heap_page(page))) )
+                       == !!is_xen_heap_page(page)))
+                    && !(shadow_mode_mem_access(v->domain)) )
         {
             SHADOW_ERROR("can't find all mappings of mfn %lx: "
                           "c=%08lx t=%08lx\n", mfn_x(gmfn), 
@@ -2953,7 +2954,7 @@ int shadow_enable(struct domain *d, u32 mode)
         paging_unlock(d);
     }
 
-    /* Allow p2m and log-dirty code to borrow shadow memory */
+    /* Allow p2m, log-dirty and mem_access code to borrow shadow memory */
     d->arch.paging.alloc_page = shadow_alloc_p2m_page;
     d->arch.paging.free_page = shadow_free_p2m_page;
 
@@ -3130,6 +3131,9 @@ void shadow_teardown(struct domain *d)
      * calls now that we've torn down the bitmap */
     d->arch.paging.mode &= ~PG_log_dirty;
 
+    /* Clear the mem_access mode bit */
+    d->arch.paging.mode &= ~PG_mem_access;
+
     if (d->arch.hvm_domain.dirty_vram) {
         xfree(d->arch.hvm_domain.dirty_vram->sl1ma);
         xfree(d->arch.hvm_domain.dirty_vram->dirty_bitmap);
@@ -3179,8 +3183,14 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
 {
     ASSERT(paging_locked_by_me(d));
 
-    /* Sanity check the call */
-    if ( d == current->domain || (d->arch.paging.mode & mode) == mode )
+    /*
+     * Sanity check the call.
+     * Do not allow shadow mem_access mode to be enabled when
+     * log-dirty or translate mode is enabled.
+     */
+    if ( d == current->domain || (d->arch.paging.mode & mode) == mode ||
+         ((mode & PG_mem_access) &&
+         (shadow_mode_log_dirty(d) | shadow_mode_translate(d))) )
     {
         return -EINVAL;
     }
@@ -3197,7 +3207,7 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
         }
     }
 
-    /* Allow p2m and log-dirty code to borrow shadow memory */
+    /* Allow p2m, log-dirty and mem_access code to borrow shadow memory */
     d->arch.paging.alloc_page = shadow_alloc_p2m_page;
     d->arch.paging.free_page = shadow_free_p2m_page;
 
@@ -3661,6 +3671,47 @@ out:
 }
 
 /**************************************************************************/
+/* mem_access mode support */
+
+/* Shadow specific code which is called in
+ * XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE for PV guests.
+ * Return 0 on success.
+ */
+int shadow_enable_mem_access(struct domain *d)
+{
+    int ret;
+
+    paging_lock(d);
+
+#if (SHADOW_OPTIMIZATIONS & SHOPT_LINUX_L3_TOPLEVEL)
+    /* 32bit PV guests on 64bit xen behave like older 64bit linux: they
+     * change an l4e instead of cr3 to switch tables.  Give them the
+     * same optimization */
+    if ( is_pv_32on64_domain(d) )
+        d->arch.paging.shadow.opt_flags = SHOPT_LINUX_L3_TOPLEVEL;
+#endif
+
+    ret = shadow_one_bit_enable(d, PG_mem_access);
+    paging_unlock(d);
+
+    return ret;
+}
+
+/* Shadow specific code which is called in
+ * XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE for PV guests.
+ */
+int shadow_disable_mem_access(struct domain *d)
+{
+    int ret;
+
+    paging_lock(d);
+    ret = shadow_one_bit_disable(d, PG_mem_access);
+    paging_unlock(d);
+
+    return ret;
+}
+
+/**************************************************************************/
 /* Shadow-control XEN_DOMCTL dispatcher */
 
 int shadow_domctl(struct domain *d, 
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index c6c9d10..aee061c 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -38,6 +38,8 @@
 #include <asm/hvm/cacheattr.h>
 #include <asm/mtrr.h>
 #include <asm/guest_pt.h>
+#include <asm/mem_event.h>
+#include <asm/mem_access.h>
 #include <public/sched.h>
 #include "private.h"
 #include "types.h"
@@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
             }
     }
 
+    /* Propagate access permissions */
+    if ( unlikely((level == 1) &&
+                  mem_event_check_ring(&d->mem_event->access) &&
+                  !sh_mfn_is_a_page_table(target_mfn)) )
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(d);
+        p2m_access_t a;
+        p2m_type_t t;
+        mfn_t mfn;
+        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);
+        if ( mfn_valid(mfn) )
+            p2m_access_to_flags(&sflags, gflags, a);
+    }
+
     // Set the A&D bits for higher level shadows.
     // Higher level entries do not, strictly speaking, have dirty bits, but
     // since we use shadow linear tables, each of these entries may, at some
@@ -2822,6 +2838,8 @@ static int sh_page_fault(struct vcpu *v,
     int r;
     fetch_type_t ft = 0;
     p2m_type_t p2mt;
+    p2m_access_t p2ma;
+    mem_event_request_t *req_ptr = NULL;
     uint32_t rc;
     int version;
 #if SHADOW_OPTIMIZATIONS & SHOPT_FAST_EMULATION
@@ -3009,7 +3027,80 @@ static int sh_page_fault(struct vcpu *v,
 
     /* What mfn is the guest trying to access? */
     gfn = guest_l1e_get_gfn(gw.l1e);
-    gmfn = get_gfn(d, gfn, &p2mt);
+    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
+        gmfn = get_gfn(d, gfn, &p2mt);
+    /*
+     * A mem_access listener is present, so we will first check if a violation
+     * has occurred.
+     */
+    else
+    {
+        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
+
+        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
+        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
+             && (regs->error_code & PFEC_page_present)
+             && !(regs->error_code & PFEC_reserved_bit) )
+        {
+            int violation = 0;
+            bool_t access_w = !!(regs->error_code & PFEC_write_access);
+            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
+            bool_t access_r = access_x ? 0 : !(access_w);
+
+            /* If the access is against the permissions, then send to mem_event */
+            switch (p2ma)
+            {
+            case p2m_access_r:
+                violation = access_w || access_x;
+                break;
+            case p2m_access_rx:
+            case p2m_access_rx2rw:
+                violation = access_w;
+                break;
+            case p2m_access_rw:
+                violation = access_x;
+                break;
+            case p2m_access_rwx:
+            default:
+                break;
+            }
+
+            /*
+             * Do not police writes to guest memory emanating from the Xen
+             * kernel. Trying to do so will cause the same pagefault to occur
+             * over and over again with an event being sent to the access
+             * listener for each fault. If the access listener's vcpu is not
+             * scheduled during this time, the violation is never resolved and
+             * will eventually end with the host crashing.
+             */
+            if ( (violation && access_w) &&
+                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
+            {
+                violation = 0;
+                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
+                                    p2m_ram_rw, p2m_access_rw);
+            }
+
+            if ( violation )
+            {
+                paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
+                              (va & ((1 << PAGE_SHIFT) - 1));
+                if ( !p2m_mem_access_check(gpa, 1, va,
+                                           access_r, access_w, access_x,
+                                           &req_ptr) )
+                {
+                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" "
+                                  "p2ma: %d\n",
+                                  (access_r ? 'r' : '-'),
+                                  (access_w ? 'w' : '-'),
+                                  (access_x ? 'x' : '-'),
+                                  mfn_x(gmfn), p2ma);
+                    /* Rights not promoted, vcpu paused, work here is done */
+                    goto out_put_gfn;
+                }
+            }
+        }
+    }
 
     if ( shadow_mode_refcounts(d) && 
          ((!p2m_is_valid(p2mt) && !p2m_is_grant(p2mt)) ||
@@ -3214,7 +3305,18 @@ static int sh_page_fault(struct vcpu *v,
     SHADOW_PRINTK("fixed\n");
     shadow_audit_tables(v);
     paging_unlock(d);
+ out_put_gfn:
     put_gfn(d, gfn_x(gfn));
+
+    /* Send access violation to mem_access listener */
+    if ( unlikely(req_ptr != NULL) )
+    {
+        SHADOW_PRINTK("mem_access SEND violation mfn: 0x%"PRI_mfn"\n",
+                      mfn_x(gmfn));
+        mem_access_send_req(d, req_ptr);
+        xfree(req_ptr);
+    }
+
     return EXCRET_fault_fixed;
 
  emulate:
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 86847e9..c20277e 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -283,6 +283,8 @@ struct p2m_domain {
         struct ept_data ept;
         /* NPT-equivalent structure could be added here. */
     };
+    /* Page access lookup table for PV domains */
+    mfn_t *access_lookup_table;
 };
 
 /* get host p2m table */
@@ -665,6 +667,55 @@ void p2m_flush_nestedp2m(struct domain *d);
 void nestedp2m_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
     l1_pgentry_t *p, l1_pgentry_t new, unsigned int level);
 
+/*
+ * Functions specific to the p2m-ma implementation
+ */
+
+/* Set up p2m function pointers and structures for mem_access implementation */
+int p2m_mem_access_init(struct p2m_domain *p2m);
+
+/* Reset p2m function pointers and free mem_access structures */
+void p2m_mem_access_teardown(struct p2m_domain *p2m);
+
+/* Set the default permission for all pages
+ * Unlike an HVM guest, the mem_access listener cannot set access permissions
+ * for all pages since the gfn to mfn translation is done by the guest. All it
+ * can do is set permission for individual pages. This functions blows away the
+ * shadows in lieu of that so that new faults will set the page permissions to
+ * the default value.
+ */
+int p2m_mem_access_set_default(struct p2m_domain *p2m);
+
+/* Convert access restrictions to page table flags */
+void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access);
+
+/* Round it up to 8 bits i.e. sizeof(uint8_t) */
+#define P2M_ACCESS_SIZE sizeof(uint8_t)
+
+/* Number of p2m_access_t in a page */
+#define NR_ACCESS (PAGE_SIZE / P2M_ACCESS_SIZE)
+
+/* Number of access table pages for a PV domain */
+#define get_domain_nr_access_table_pages(d) \
+        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
+
+/*
+ * The mem_access lookup table is an array of mfn_ts. Each mfn points to a page
+ * that holds NR_ACCESS p2m_access_t values, each corresponding to a gfn.
+ * access_lookup_table[0] will point to a mfn that has p2m_access_t values for
+ * gfns 0 to NR_ACCESS - 1.
+ * access_lookup_table[1] will point to a mfn that has p2m_access_t values for
+ * gfns NR_ACCESS to (2 * NR_ACCESS) - 1. And so on...
+ */
+
+/* Index into the mem_access lookup table of the page that holds the
+ * p2m_access_t for the gfn
+ */
+#define MEM_ACCESS_TABLE_IDX(gfn) ((gfn) / NR_ACCESS)
+
+/* Byte offset within that page of the p2m_access_t for the gfn */
+#define MEM_ACCESS_PAGE_IDX(gfn) ((gfn) % NR_ACCESS)
+
 #endif /* _XEN_P2M_H */
 
 /*
diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
index 9b8f8de..958e3ce 100644
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -53,6 +53,8 @@
 /* Xen does not steal address space from the domain for its own booking;
  * requires VT or similar mechanisms */
 #define PG_external    (XEN_DOMCTL_SHADOW_ENABLE_EXTERNAL << PG_mode_shift)
+/* Enable shadow mem_access mode for PV domains */
+#define PG_mem_access  (XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS << PG_mode_shift)
 
 #define paging_mode_enabled(_d)   ((_d)->arch.paging.mode)
 #define paging_mode_shadow(_d)    ((_d)->arch.paging.mode & PG_SH_enable)
@@ -62,6 +64,7 @@
 #define paging_mode_log_dirty(_d) ((_d)->arch.paging.mode & PG_log_dirty)
 #define paging_mode_translate(_d) ((_d)->arch.paging.mode & PG_translate)
 #define paging_mode_external(_d)  ((_d)->arch.paging.mode & PG_external)
+#define paging_mode_mem_access(_d) ((_d)->arch.paging.mode & PG_mem_access)
 
 /* flags used for paging debug */
 #define PAGING_DEBUG_LOGDIRTY 0
diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
index f40cab4..ea2a47c 100644
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -43,6 +43,8 @@
                                    paging_mode_translate(_d))
 #define shadow_mode_external(_d)  (paging_mode_shadow(_d) && \
                                    paging_mode_external(_d))
+#define shadow_mode_mem_access(_d) (paging_mode_shadow(_d) && \
+                                    paging_mode_mem_access(_d))
 
 /*****************************************************************************
  * Entry points into the shadow code */
@@ -86,6 +88,12 @@ int shadow_disable_log_dirty(struct domain *d);
 /* shadow code to call when bitmap is being cleaned */
 void shadow_clean_dirty_bitmap(struct domain *d);
 
+/* shadow code to call when mem_access is enabled */
+int shadow_enable_mem_access(struct domain *d);
+
+/* shadow code to call when mem access is disabled */
+int shadow_disable_mem_access(struct domain *d);
+
 /* Update all the things that are derived from the guest's CR0/CR3/CR4.
  * Called to initialize paging structures if the paging mode
  * has changed, and when bringing up a VCPU for the first time. */
@@ -114,6 +122,9 @@ static inline void shadow_remove_all_shadows(struct vcpu *v, mfn_t gmfn)
 /* Discard _all_ mappings from the domain's shadows. */
 void shadow_blow_tables_per_domain(struct domain *d);
 
+/* Tear down all of this domain's shadows */
+void shadow_blow_tables(struct domain *d);
+
 #endif /* _XEN_SHADOW_H */
 
 /*
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 1b75ab2..209ffdd 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -225,6 +225,10 @@ struct xen_domctl_getpageframeinfo3 {
   * Requires HVM support.
   */
 #define XEN_DOMCTL_SHADOW_ENABLE_EXTERNAL  (1 << 4)
+ /*
+  * Enable shadow mem_access mode for PV domains
+  */
+#define XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS (1 << 5)
 
 struct xen_domctl_shadow_op_stats {
     uint32_t fault_count;
-- 
1.9.1


* [PATCH RFC 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains
  2014-04-29  4:45 [PATCH RFC 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
@ 2014-04-29  4:45 ` Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
  3 siblings, 0 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-04-29  4:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan, Keir Fraser, Jan Beulich

mem_access changes
------------------
Two new memory hypercalls, XENMEM_access_op_create_ring_page and
XENMEM_access_op_get_ring_mfn, have been added. The mem_access listener
makes these calls during setup. The ring page is allocated from the
xenheap and shared with the guest. It is freed when mem_access is
disabled or when the domain is shut down or destroyed.
The XENMEM_access_op_[sg]et_access hypercalls are modified to accommodate
calls for a PV domain. When setting the access permissions for an mfn,
all shadows for that mfn are dropped; they are recreated with the new
permissions on the next page fault for that mfn. To get the permissions
for an mfn, the value is returned from the p2m_access lookup table.
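
From the listener's point of view the existing xc_set_mem_access() and
xc_get_mem_access() calls are reused unchanged; the PV-specific detail is that
the pfn argument is really an mfn, which the hypervisor translates through the
M2P (get_gpfn_from_mfn()) before indexing the access lookup table. A sketch
(made-up helper name, error handling kept minimal):

    /* Sketch: restrict one PV guest page to read-only and read it back.
     * For PV guests the set/get hypercalls expect an mfn in the pfn field. */
    static int restrict_one_page(xc_interface *xch, domid_t domid, uint64_t mfn)
    {
        xenmem_access_t access;
        int rc = xc_set_mem_access(xch, domid, XENMEM_access_r, mfn, 1);

        if ( rc == 0 )
            rc = xc_get_mem_access(xch, domid, mfn, &access);
        return rc;
    }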

mem_event changes
-----------------
The XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE/DISABLE hypercalls are
modified to allow mem_access to work with PV domains. When the access
listener enables mem_access for a PV domain, shadow PG_mem_access
mode is turned on and the p2m structures are initialized. When
disabling, this mode is turned off but the domain continues to run in
shadow mode. The p2m structures are also freed on disabling.

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Keir Fraser <keir@xen.org>
Cc: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/domain.c            |  12 ++++
 xen/arch/x86/mm/mem_access.c     | 146 ++++++++++++++++++++++++++++++++++++---
 xen/arch/x86/mm/mem_event.c      |  77 +++++++++++++++++----
 xen/include/asm-x86/domain.h     |   3 +
 xen/include/asm-x86/mem_access.h |   4 ++
 xen/include/public/memory.h      |   2 +
 6 files changed, 222 insertions(+), 22 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 2a9c6fc..0ea03b4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -57,6 +57,7 @@
 #include <asm/nmi.h>
 #include <asm/mce.h>
 #include <asm/amd.h>
+#include <asm/mem_access.h>
 #include <xen/numa.h>
 #include <xen/iommu.h>
 #include <compat/vcpu.h>
@@ -593,8 +594,11 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags)
         }
     }
     else
+    {
         /* 64-bit PV guest by default. */
         d->arch.is_32bit_pv = d->arch.has_32bit_shinfo = 0;
+        d->arch.pv_domain.access_ring_mfn = _mfn(INVALID_MFN);
+    }
 
     /* initialize default tsc behavior in case tools don't */
     tsc_set_info(d, TSC_MODE_DEFAULT, 0UL, 0, 0);
@@ -632,8 +636,16 @@ void arch_domain_destroy(struct domain *d)
 
     free_perdomain_mappings(d);
     if ( is_pv_domain(d) )
+    {
         free_xenheap_page(d->arch.pv_domain.gdt_ldt_l1tab);
 
+        /*
+         * Free the PV mem_access ring xenheap page in the case where a
+         * mem_access listener is present while the domain is being destroyed.
+         */
+        mem_access_free_pv_ring(d);
+    }
+
     free_xenheap_page(d->shared_info);
     cleanup_domain_irq_mapping(d);
 }
diff --git a/xen/arch/x86/mm/mem_access.c b/xen/arch/x86/mm/mem_access.c
index 462c318..2340a46 100644
--- a/xen/arch/x86/mm/mem_access.c
+++ b/xen/arch/x86/mm/mem_access.c
@@ -26,8 +26,36 @@
 #include <xen/hypercall.h>
 #include <asm/p2m.h>
 #include <asm/mem_event.h>
+#include <asm/shadow.h>
 #include <xsm/xsm.h>
 
+/* Override macros from asm/page.h to make them work with mfn_t */
+#undef mfn_valid
+#define mfn_valid(_mfn) __mfn_valid(mfn_x(_mfn))
+#undef mfn_to_page
+#define mfn_to_page(_m) __mfn_to_page(mfn_x(_m))
+
+static inline bool_t domain_valid_for_mem_access(struct domain *d)
+{
+    if ( is_hvm_domain(d) )
+    {
+        /* Only HAP is supported */
+        if ( !hap_enabled(d) )
+            return 0;
+
+        /* Currently only EPT is supported */
+        if ( !cpu_has_vmx )
+            return 0;
+    }
+    /*
+     * Only PV guests using shadow mem_access mode and running on CPUs with
+     * the NX bit are supported
+     */
+    else if ( !shadow_mode_mem_access(d) || !cpu_has_nx )
+        return 0;
+
+    return 1;
+}
 
 int mem_access_memop(unsigned long cmd,
                      XEN_GUEST_HANDLE_PARAM(xen_mem_access_op_t) arg)
@@ -43,16 +71,14 @@ int mem_access_memop(unsigned long cmd,
     if ( rc )
         return rc;
 
-    rc = -EINVAL;
-    if ( !is_hvm_domain(d) )
-        goto out;
-
     rc = xsm_mem_event_op(XSM_TARGET, d, XENMEM_access_op);
     if ( rc )
         goto out;
 
     rc = -ENODEV;
-    if ( unlikely(!d->mem_event->access.ring_page) )
+    if ( unlikely(!d->mem_event->access.ring_page &&
+                  mao.op != XENMEM_access_op_create_ring_page &&
+                  mao.op != XENMEM_access_op_get_ring_mfn) )
         goto out;
 
     switch ( mao.op )
@@ -65,12 +91,27 @@ int mem_access_memop(unsigned long cmd,
     case XENMEM_access_op_set_access:
     {
         unsigned long start_iter = cmd & ~MEMOP_CMD_MASK;
+        unsigned long pfn = mao.pfn;
 
         rc = -EINVAL;
-        if ( (mao.pfn != ~0ull) &&
+        if ( !domain_valid_for_mem_access(d) )
+            break;
+
+        if ( unlikely(is_pv_domain(d)  && pfn != ~0ull) )
+            pfn = get_gpfn_from_mfn(mao.pfn);
+
+        /*
+         * max_pfn for PV domains is obtained from the shared_info structures
+         * that the guest maintains. It is up to the guest to maintain this and
+         * is not filled in during early boot. So we do not check if we are
+         * crossing max_pfn here and will depend on the checks in
+         * p2m_mem_access_set_entry().
+         */
+        if ( (pfn != ~0ull) &&
              (mao.nr < start_iter ||
-              ((mao.pfn + mao.nr - 1) < mao.pfn) ||
-              ((mao.pfn + mao.nr - 1) > domain_get_maximum_gpfn(d))) )
+              ((pfn + mao.nr - 1) < pfn) ||
+              ((pfn + mao.nr - 1) > domain_get_maximum_gpfn(d) &&
+                !is_pv_domain(d))) )
             break;
 
         rc = p2m_set_mem_access(d, mao.pfn, mao.nr, start_iter,
@@ -89,7 +130,18 @@ int mem_access_memop(unsigned long cmd,
         xenmem_access_t access;
 
         rc = -EINVAL;
-        if ( (mao.pfn > domain_get_maximum_gpfn(d)) && mao.pfn != ~0ull )
+        if ( !domain_valid_for_mem_access(d) )
+            break;
+
+        /*
+         * max_pfn for PV domains is obtained from the shared_info structures
+         * that the guest maintains. It is up to the guest to maintain this and
+         * is not filled in during early boot. So we do not check if we are
+         * crossing max_pfn here and will depend on the checks in
+         * p2m_mem_access_get_entry().
+         */
+        if ( (mao.pfn > domain_get_maximum_gpfn(d) && !is_pv_domain(d)) &&
+             mao.pfn != ~0ull )
             break;
 
         rc = p2m_get_mem_access(d, mao.pfn, &access);
@@ -102,6 +154,67 @@ int mem_access_memop(unsigned long cmd,
         break;
     }
 
+    case XENMEM_access_op_create_ring_page:
+    {
+        void *access_ring_va;
+
+        /*
+         * mem_access listeners for HVM domains need not call
+         * xc_mem_access_create_ring_page() as the special ring page would
+         * have been set up during domain creation.
+         */
+        rc = -ENOSYS;
+        if ( is_hvm_domain(d) )
+            break;
+
+        /*
+         * The ring page was created by a mem_access listener but was not
+         * freed. Do not allow another xenheap page to be allocated.
+         */
+        if ( mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        {
+            rc = -EPERM;
+            break;
+        }
+
+        access_ring_va = alloc_xenheap_page();
+        if ( access_ring_va == NULL )
+        {
+            rc = -ENOMEM;
+            break;
+        }
+
+        clear_page(access_ring_va);
+        share_xen_page_with_guest(virt_to_page(access_ring_va), d,
+                                  XENSHARE_writable);
+
+        d->arch.pv_domain.access_ring_mfn = _mfn(virt_to_mfn(access_ring_va));
+
+        rc = 0;
+        break;
+    }
+
+    case XENMEM_access_op_get_ring_mfn:
+        /*
+         * mem_access listeners for HVM domains should call xc_get_hvm_param()
+         * instead of xc_mem_access_get_ring_mfn().
+         */
+        rc = -ENOSYS;
+        if ( is_hvm_domain(d) )
+            break;
+
+        if ( !mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        {
+            rc = -ENODEV;
+            break;
+        }
+
+        mao.pfn = mfn_x(d->arch.pv_domain.access_ring_mfn);
+        rc = __copy_field_to_guest(arg, &mao, pfn) ? -EFAULT : 0;
+
+        rc = 0;
+        break;
+
     default:
         rc = -ENOSYS;
         break;
@@ -123,6 +236,21 @@ int mem_access_send_req(struct domain *d, mem_event_request_t *req)
     return 0;
 } 
 
+/* Free the xenheap page used for the PV access ring */
+void mem_access_free_pv_ring(struct domain *d)
+{
+    struct page_info *pg = mfn_to_page(d->arch.pv_domain.access_ring_mfn);
+
+    if ( !mfn_valid(d->arch.pv_domain.access_ring_mfn) )
+        return;
+
+    BUG_ON(page_get_owner(pg) != d);
+    if ( test_and_clear_bit(_PGC_allocated, &pg->count_info) )
+        put_page(pg);
+    free_xenheap_page(mfn_to_virt(mfn_x(d->arch.pv_domain.access_ring_mfn)));
+    d->arch.pv_domain.access_ring_mfn = _mfn(INVALID_MFN);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index 36b9dba..9b45504 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -25,6 +25,7 @@
 #include <xen/event.h>
 #include <xen/wait.h>
 #include <asm/p2m.h>
+#include <asm/shadow.h>
 #include <asm/mem_event.h>
 #include <asm/mem_paging.h>
 #include <asm/mem_access.h>
@@ -49,7 +50,12 @@ static int mem_event_enable(
     xen_event_channel_notification_t notification_fn)
 {
     int rc;
-    unsigned long ring_gfn = d->arch.hvm_domain.params[param];
+    unsigned long ring_gfn;
+
+    if ( is_pv_domain(d) && param == HVM_PARAM_ACCESS_RING_PFN )
+        ring_gfn = mfn_x(d->arch.pv_domain.access_ring_mfn);
+    else
+        ring_gfn = d->arch.hvm_domain.params[param];
 
     /* Only one helper at a time. If the helper crashed,
      * the ring is in an undefined state and so is the guest.
@@ -581,28 +587,73 @@ int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec,
         switch( mec->op )
         {
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE:
-        {
             rc = -ENODEV;
-            /* Only HAP is supported */
-            if ( !hap_enabled(d) )
-                break;
+            if ( !is_pv_domain(d) )
+            {
+                /* Only HAP is supported */
+                if ( !hap_enabled(d) )
+                    break;
 
-            /* Currently only EPT is supported */
-            if ( !cpu_has_vmx )
-                break;
+                /* Currently only EPT is supported */
+                if ( !cpu_has_vmx )
+                    break;
+            }
+            /* PV guests use shadow mem_access mode */
+            else
+            {
+                domain_pause(d);
+                if ( !shadow_mode_mem_access(d) )
+                {
+                    rc = shadow_enable_mem_access(d);
+                    if ( rc != 0 )
+                        goto pv_out;
+                }
+
+                rc = p2m_mem_access_init(p2m_get_hostp2m(d));
+                if ( rc != 0 )
+                {
+                    shadow_disable_mem_access(d);
+                    goto pv_out;
+                }
+            }
 
             rc = mem_event_enable(d, mec, med, _VPF_mem_access, 
                                     HVM_PARAM_ACCESS_RING_PFN,
                                     mem_access_notification);
-        }
-        break;
+
+            if ( !is_pv_domain(d) )
+                break;
+
+ pv_out:
+            if ( rc != 0 )
+            {
+                shadow_disable_mem_access(d);
+                p2m_mem_access_teardown(p2m_get_hostp2m(d));
+                mem_access_free_pv_ring(d);
+            }
+            domain_unpause(d);
+
+            break;
 
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE:
-        {
             if ( med->ring_page )
                 rc = mem_event_disable(d, med);
-        }
-        break;
+
+            /*
+             * Even if mem_access was not enabled, some resources could have
+             * been created in the PV case.
+             */
+            if ( is_pv_domain(d) )
+            {
+                domain_pause(d);
+                if ( shadow_mode_mem_access(d) )
+                    shadow_disable_mem_access(d);
+
+                p2m_mem_access_teardown(p2m_get_hostp2m(d));
+                mem_access_free_pv_ring(d);
+                domain_unpause(d);
+            }
+            break;
 
         default:
             rc = -ENOSYS;
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index c5c266f..884b799 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -227,6 +227,9 @@ struct pv_domain
 
     /* map_domain_page() mapping cache. */
     struct mapcache_domain mapcache;
+
+    /* mfn of the mem_access ring page for PV domains */
+    mfn_t access_ring_mfn;
 };
 
 struct arch_domain
diff --git a/xen/include/asm-x86/mem_access.h b/xen/include/asm-x86/mem_access.h
index 5c7c5fd..c7be52c 100644
--- a/xen/include/asm-x86/mem_access.h
+++ b/xen/include/asm-x86/mem_access.h
@@ -27,6 +27,10 @@ int mem_access_memop(unsigned long cmd,
                      XEN_GUEST_HANDLE_PARAM(xen_mem_access_op_t) arg);
 int mem_access_send_req(struct domain *d, mem_event_request_t *req);
 
+
+/* Free the xenheap page used for the access ring */
+void mem_access_free_pv_ring(struct domain *d);
+
 #endif /* _XEN_ASM_MEM_ACCESS_H */
 
 /*
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 5bcd475..1070636 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -380,6 +380,8 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_event_op_t);
 #define XENMEM_access_op_resume             0
 #define XENMEM_access_op_set_access         1
 #define XENMEM_access_op_get_access         2
+#define XENMEM_access_op_create_ring_page   3
+#define XENMEM_access_op_get_ring_mfn       4
 
 typedef enum {
     XENMEM_access_n,
-- 
1.9.1


* [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page
  2014-04-29  4:45 [PATCH RFC 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
  2014-04-29  4:45 ` [PATCH RFC 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
@ 2014-04-29  4:45 ` Aravindh Puthiyaparambil
  2014-05-02 12:41   ` Ian Campbell
  2014-04-29  4:45 ` [PATCH RFC 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
  3 siblings, 1 reply; 22+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-04-29  4:45 UTC (permalink / raw)
  To: xen-devel; +Cc: Ian Jackson, Ian Campbell, Stefano Stabellini

Add two APIs, xc_mem_access_create_ring_page() and
xc_mem_access_get_ring_mfn(). The mem_access listener needs to call
xc_mem_access_create_ring_page() before enabling mem_access for PV
domains. This is not needed for HVM domains, as the page is created
at domain creation time. The listener can then call
xc_mem_access_get_ring_mfn() to get the mfn of the created page so it
can map it in. This is equivalent to
xc_get_hvm_param(HVM_PARAM_ACCESS_RING_PFN) for HVM domains.
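
A short usage sketch, mirroring what the xen-access changes in patch 4 do (the
helper name is made up and error reporting is trimmed):

    #include <sys/mman.h>
    #include <xenctrl.h>

    /* Sketch: create the PV ring page, look up its mfn and map it. */
    static void *map_pv_ring(xc_interface *xch, domid_t domid)
    {
        unsigned long ring_pfn;

        if ( xc_mem_access_create_ring_page(xch, domid) != 0 )
            return NULL;

        if ( xc_mem_access_get_ring_mfn(xch, domid, &ring_pfn) != 0 )
            return NULL;

        /* Same mapping step xen-access already uses for the HVM ring gfn. */
        return xc_map_foreign_batch(xch, domid, PROT_READ | PROT_WRITE,
                                    &ring_pfn, 1);
    }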

Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxc/xc_mem_access.c | 29 +++++++++++++++++++++++++++++
 tools/libxc/xenctrl.h       | 17 ++++++++++++++++-
 2 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/tools/libxc/xc_mem_access.c b/tools/libxc/xc_mem_access.c
index f436e69..b85af0d 100644
--- a/tools/libxc/xc_mem_access.c
+++ b/tools/libxc/xc_mem_access.c
@@ -97,6 +97,35 @@ int xc_get_mem_access(xc_interface *xch,
     return rc;
 }
 
+int xc_mem_access_create_ring_page(xc_interface *xch, domid_t domain_id)
+{
+    xen_mem_access_op_t mao =
+    {
+        .op    = XENMEM_access_op_create_ring_page,
+        .domid = domain_id
+    };
+
+    return do_memory_op(xch, XENMEM_access_op, &mao, sizeof(mao));
+}
+
+int xc_mem_access_get_ring_mfn(xc_interface *xch, domid_t domain_id,
+                               unsigned long *mfn)
+{
+    int rc;
+    xen_mem_access_op_t mao =
+    {
+        .op    = XENMEM_access_op_get_ring_mfn,
+        .domid = domain_id
+    };
+
+    rc = do_memory_op(xch, XENMEM_access_op, &mao, sizeof(mao));
+
+    if ( !rc )
+        *mfn = mao.pfn;
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 02129f7..a1c743d 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -2050,7 +2050,10 @@ int xc_mem_access_resume(xc_interface *xch, domid_t domain_id);
 /*
  * Set a range of memory to a specific access.
  * Allowed types are XENMEM_access_default, XENMEM_access_n, any combination of
- * XENMEM_access_ + (rwx), and XENMEM_access_rx2rw
+ * XENMEM_access_ + (rwx), XENMEM_access_rx2rw and XENMEM_access_n2rwx for HVM
+ * domains.
+ * Allowed types are XENMEM_access_default, XENMEM_access_r, XENMEM_access_rw,
+ * XENMEM_access_rx, XENMEM_access_rwx and XENMEM_access_rx2rw for PV domains.
  */
 int xc_set_mem_access(xc_interface *xch, domid_t domain_id,
                       xenmem_access_t access, uint64_t first_pfn,
@@ -2062,6 +2065,18 @@ int xc_set_mem_access(xc_interface *xch, domid_t domain_id,
 int xc_get_mem_access(xc_interface *xch, domid_t domain_id,
                       uint64_t pfn, xenmem_access_t *access);
 
+/*
+ * Create the ring page for PV domains. This has to be called before calling
+ * xc_mem_access_enable(). This should not be called for HVM domains
+ */
+int xc_mem_access_create_ring_page(xc_interface *xch, domid_t domain_id);
+
+/*
+ * Get the mfn of the ring page for PV domains
+ */
+int xc_mem_access_get_ring_mfn(xc_interface *xch, domid_t domain_id,
+                               unsigned long *mfn);
+
 /***
  * Memory sharing operations.
  *
-- 
1.9.1


* [PATCH RFC 4/4] tool/xen-access: Add support for PV domains
  2014-04-29  4:45 [PATCH RFC 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
                   ` (2 preceding siblings ...)
  2014-04-29  4:45 ` [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page Aravindh Puthiyaparambil
@ 2014-04-29  4:45 ` Aravindh Puthiyaparambil
  2014-05-02 12:43   ` Ian Campbell
  3 siblings, 1 reply; 22+ messages in thread
From: Aravindh Puthiyaparambil @ 2014-04-29  4:45 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Ian Jackson, Ian Campbell, Aravindh Puthiyaprambil

Add support to the xen-access test program for it to work with PV domains.
The main difference is that for PV domains, unlike HVM domains,
xc_mem_access_create_ring_page() has to be called, as the page is not
created at domain creation time. PV domains do not need to set
individual page access permissions during setup and teardown; enabling
and disabling mem_access does that implicitly.

Signed-off-by: Aravindh Puthiyaprambil <aravindp@cisco.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/tests/xen-access/xen-access.c | 122 ++++++++++++++++++++++++------------
 1 file changed, 81 insertions(+), 41 deletions(-)

diff --git a/tools/tests/xen-access/xen-access.c b/tools/tests/xen-access/xen-access.c
index 0a84bd5..f9883f4 100644
--- a/tools/tests/xen-access/xen-access.c
+++ b/tools/tests/xen-access/xen-access.c
@@ -114,7 +114,8 @@ typedef struct xenaccess {
 } xenaccess_t;
 
 static int interrupted;
-bool evtchn_bind = 0, evtchn_open = 0, mem_access_enable = 0;
+bool evtchn_bind = 0, evtchn_open = 0, mem_access_enable = 0, pv_ring_page = 0;
+bool hvm = 0;
 
 static void close_handler(int sig)
 {
@@ -173,7 +174,7 @@ int xenaccess_teardown(xc_interface *xch, xenaccess_t *xenaccess)
     if ( xenaccess->mem_event.ring_page )
         munmap(xenaccess->mem_event.ring_page, XC_PAGE_SIZE);
 
-    if ( mem_access_enable )
+    if ( mem_access_enable  || (!hvm && pv_ring_page) )
     {
         rc = xc_mem_access_disable(xenaccess->xc_handle,
                                    xenaccess->mem_event.domain_id);
@@ -245,9 +246,57 @@ xenaccess_t *xenaccess_init(xc_interface **xch_r, domid_t domain_id)
     /* Initialise lock */
     mem_event_ring_lock_init(&xenaccess->mem_event);
 
+    /* Get domaininfo */
+    xenaccess->domain_info = malloc(sizeof(xc_domaininfo_t));
+    if ( xenaccess->domain_info == NULL )
+    {
+        ERROR("Error allocating memory for domain info");
+        goto err;
+    }
+
+    rc = xc_domain_getinfolist(xenaccess->xc_handle, domain_id, 1,
+                               xenaccess->domain_info);
+    if ( rc != 1 )
+    {
+        ERROR("Error getting domain info");
+        goto err;
+    }
+
+    if ( xenaccess->domain_info->flags & XEN_DOMINF_hvm_guest )
+        hvm = 1;
+
+    DPRINTF("max_pages = %"PRIx64"\n", xenaccess->domain_info->max_pages);
+
+    if ( hvm )
+    {
+        rc = xc_get_hvm_param(xch, xenaccess->mem_event.domain_id,
+                              HVM_PARAM_ACCESS_RING_PFN, &ring_pfn);
+
+    }
+    else
+    {
+        rc = xc_mem_access_create_ring_page(xch, xenaccess->mem_event.domain_id);
+        if ( rc != 0 )
+        {
+            PERROR("Failed to set ring gfn\n");
+            goto err;
+        }
+
+        pv_ring_page = 1;
+
+        rc = xc_mem_access_get_ring_mfn(xch, xenaccess->mem_event.domain_id,
+                                        &ring_pfn);
+    }
+
+    if ( rc != 0 )
+    {
+        PERROR("Failed to get ring gfn\n");
+        goto err;
+    }
+
+    DPRINTF("ring_mfn: 0x%lx\n", ring_pfn);
+
     /* Map the ring page */
-    xc_get_hvm_param(xch, xenaccess->mem_event.domain_id, 
-                        HVM_PARAM_ACCESS_RING_PFN, &ring_pfn);
     mmap_pfn = ring_pfn;
     xenaccess->mem_event.ring_page = 
         xc_map_foreign_batch(xch, xenaccess->mem_event.domain_id, 
@@ -327,24 +376,6 @@ xenaccess_t *xenaccess_init(xc_interface **xch_r, domid_t domain_id)
                     xenaccess->mem_event.domain_id, 1, 0, &ring_pfn) )
         PERROR("Failed to remove ring from guest physmap");
 
-    /* Get domaininfo */
-    xenaccess->domain_info = malloc(sizeof(xc_domaininfo_t));
-    if ( xenaccess->domain_info == NULL )
-    {
-        ERROR("Error allocating memory for domain info");
-        goto err;
-    }
-
-    rc = xc_domain_getinfolist(xenaccess->xc_handle, domain_id, 1,
-                               xenaccess->domain_info);
-    if ( rc != 1 )
-    {
-        ERROR("Error getting domain info");
-        goto err;
-    }
-
-    DPRINTF("max_pages = %"PRIx64"\n", xenaccess->domain_info->max_pages);
-
     return xenaccess;
 
  err:
@@ -526,23 +557,28 @@ int main(int argc, char *argv[])
         goto exit;
     }
 
-    rc = xc_set_mem_access(xch, domain_id, default_access, 0,
-                           xenaccess->domain_info->max_pages);
-    if ( rc < 0 )
+    if ( hvm )
     {
-        ERROR("Error %d setting all memory to access type %d\n", rc,
-              default_access);
-        goto exit;
-    }
+        rc = xc_set_mem_access(xch, domain_id, default_access, 0,
+                               xenaccess->domain_info->max_pages);
+        if ( rc < 0 )
+        {
+            ERROR("Error %d setting all memory to access type %d\n", rc,
+                  default_access);
+            goto exit;
+        }
 
-    if ( int3 )
-        rc = xc_set_hvm_param(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_sync);
-    else
-        rc = xc_set_hvm_param(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_disabled);
-    if ( rc < 0 )
-    {
-        ERROR("Error %d setting int3 mem_event\n", rc);
-        goto exit;
+        if ( int3 )
+            rc = xc_set_hvm_param(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3,
+                                  HVMPME_mode_sync);
+        else
+            rc = xc_set_hvm_param(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3,
+                                  HVMPME_mode_disabled);
+        if ( rc < 0 )
+        {
+            ERROR("Error %d setting int3 mem_event\n", rc);
+            goto exit;
+        }
     }
 
     /* Wait for access */
@@ -554,10 +590,14 @@ int main(int argc, char *argv[])
 
             /* Unregister for every event */
             rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, ~0ull, 0);
-            rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, 0,
-                                   xenaccess->domain_info->max_pages);
-            rc = xc_set_hvm_param(xch, domain_id, HVM_PARAM_MEMORY_EVENT_INT3, HVMPME_mode_disabled);
-
+            if ( hvm )
+            {
+                rc = xc_set_mem_access(xch, domain_id, XENMEM_access_rwx, 0,
+                                       xenaccess->domain_info->max_pages);
+                rc = xc_set_hvm_param(xch, domain_id,
+                                      HVM_PARAM_MEMORY_EVENT_INT3,
+                                      HVMPME_mode_disabled);
+            }
             shutting_down = 1;
         }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
@ 2014-04-29  8:50   ` Jan Beulich
  2014-04-29 23:10     ` Aravindh Puthiyaparambil (aravindp)
  2014-04-29 12:05   ` Tamas Lengyel
  2014-05-01 14:39   ` Tim Deegan
  2 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-04-29  8:50 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 29.04.14 at 06:45, <aravindp@cisco.com> wrote:
> +void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
> +{
> +
> +    /* Restrict with access permissions */
> +    switch (access)
> +    {
> +        case p2m_access_r:
> +            *flags &= ~(_PAGE_RW);
> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
> +            break;
> +        case p2m_access_rx:
> +        case p2m_access_rx2rw:
> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
> +            *flags |= _PAGE_PRESENT;
> +            break;
> +        case p2m_access_rw:
> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +        case p2m_access_rwx:
> +        default:
> +            *flags &= ~(_PAGE_NX_BIT);
> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +    }

Tim will have the final say here, but I'd suggest dropping all these
needless parentheses (I see only one case where they're really
needed).

> +
> +    // Allow more restrictive guest flags to be propagated instead of access
> +    // permissions

Coding style (there are more of these, which I'm not going to
individually comment on).

> +    if ( !(gflags & _PAGE_RW) )
> +        *flags &= ~(_PAGE_RW);
> +
> +    if ( gflags & _PAGE_NX_BIT )
> +        *flags |= _PAGE_NX_BIT;

And now you get things even into inconsistent state wrt parentheses.

> +static int
> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
> +                         unsigned int page_order, p2m_type_t p2mt,
> +                         p2m_access_t p2ma)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
> +    uint table_idx;
> +    uint page_idx;
> +    uint8_t *access_table_page;
> +
> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);

Better split into two ASSERT()s, so that if it triggers one would know
which of the two conditions wasn't met?
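
I.e. presumably simply:

    ASSERT(shadow_mode_mem_access(d));
    ASSERT(access_lookup_table != NULL);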

> +
> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions */
> +    if ( unlikely(p2ma != p2m_access_r &&
> +                  p2ma != p2m_access_rw &&
> +                  p2ma != p2m_access_rx &&
> +                  p2ma != p2m_access_rwx &&
> +                  p2ma != p2m_access_rx2rw) )
> +        return -EINVAL;

Perhaps better done as a switch() statement.
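
E.g. (just a sketch of the shape, keeping the same set of permitted values):

    switch ( p2ma )
    {
    case p2m_access_r:
    case p2m_access_rw:
    case p2m_access_rx:
    case p2m_access_rwx:
    case p2m_access_rx2rw:
        break;
    default:
        return -EINVAL;
    }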

> +
> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
> +        return -ENOENT;
> +
> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));

You get "gfn" passed in and blindly overwrite it here? _If_ you need to
do this lookup, shouldn't you instead check it matches the passed in one?

> +
> +    /*
> +     * Values with the MSB set denote MFNs that aren't really part of the
> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
> +     * Nothing to do here.
> +     */
> +    if ( unlikely(!VALID_M2P(gfn)) )
> +        return 0;
> +
> +    if ( gfn > (d->tot_pages - 1) )
> +        return -EINVAL;

Hardly - a PV domain can easily make its address space sparse, and
iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
on MMIO space. (And as a side note, this would better be "gfn >=
d->tot_pages".)

> +    paging_lock(d);
> +
> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
> +    access_table_page[page_idx] = p2ma;
> +    unmap_domain_page(access_table_page);
> +
> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )

Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-
NULL?

> +static mfn_t
> +p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long gfn,
> +                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
> +                         unsigned int *page_order)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
> +    uint table_idx;
> +    uint page_idx;
> +    uint8_t *access_table_page;
> +    mfn_t mfn = _mfn(gfn); // For PV guests mfn == gfn
> +
> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
> +
> +    /* Not necessarily true, but for non-translated guests, we claim
> +     * it's the most generic kind of memory */

I think you copied this comment verbatim from elsewhere, but I don't
think it's correct as is.

> +    *t = p2m_ram_rw;
> +
> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
> +        return _mfn(INVALID_MFN);
> +
> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));

Same comment as earlier.

> +    if ( gfn > (d->tot_pages - 1) )

Ditto.

> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
> +
> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
> +
> +    /* This is a hint to take the default permissions */
> +    if ( access_table_page[page_idx] == p2m_access_n )
> +        access_table_page[page_idx] = p2m->default_access;

We're in "get" here - why does that modify any global state?

> +void p2m_mem_access_teardown(struct p2m_domain *p2m)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
> +    uint32_t nr_access_table_pages;
> +    uint32_t ctr;
> +
> +    /* Reset the set_entry and get_entry function pointers */
> +    p2m_pt_init(p2m);
> +
> +    if ( !access_lookup_table  )
> +        return;
> +
> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
> +
> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )

No new effectively unbounded loops please.

> +int p2m_mem_access_init(struct p2m_domain *p2m)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table;
> +    uint32_t nr_access_table_pages;
> +    uint32_t ctr;
> +
> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);

This surely is going to be an order > 0 allocation, and you even risk
it being an order > MAX_ORDER one. The former is disallowed at
runtime by convention/agreement, the latter is an outright bug.
(And again just as a side note - you don't appear to need the zero
initialization here.)

> +    if ( !access_lookup_table )
> +        return -ENOMEM;
> +
> +    p2m->access_lookup_table = access_lookup_table;
> +
> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )

Same comment regarding the effective unboundedness of the loop.

> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>      if ( pfn == ~0ul )
>      {
>          p2m->default_access = a;
> +        if ( is_pv_domain(d) )
> +            return p2m_mem_access_set_default(p2m);

Is it really correct to set p2m->default_access _before_ calling that
function? Perhaps it wouldn't be correct doing it the other way
around either - I suppose you need to hold the necessary lock across
both operations.

> @@ -585,9 +585,17 @@ int paging_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
>      {
>  
>      case XEN_DOMCTL_SHADOW_OP_ENABLE:
> +        /*
> +         * Shadow mem_access mode should only be enabled when mem_access is
> +         * enabled in XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE.
> +         */
> +        if ( sc->mode & XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS )
> +            return -EINVAL;
> +
>          if ( !(sc->mode & XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY) )
>              break;
>          /* Else fall through... */
> +
>      case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:

Stray blank line - the fall through here makes it preferable to keep the
two case blocks un-separated.

> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>          if ( !(shadow_mode_external(v->domain)
>                 && (page->count_info & PGC_count_mask) <= 3
>                 && ((page->u.inuse.type_info & PGT_count_mask)
> -                   == !!is_xen_heap_page(page))) )
> +                       == !!is_xen_heap_page(page)))
> +                    && !(shadow_mode_mem_access(v->domain)) )

You're breaking indentation, there are pointless parentheses again,
but most importantly - why?

> @@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
>              }
>      }
>  
> +    /* Propagate access permissions */
> +    if ( unlikely((level == 1) &&
> +                  mem_event_check_ring(&d->mem_event->access) &&
> +                  !sh_mfn_is_a_page_table(target_mfn)) )

Just a general remark here - unlikely() used like this is pointless, it
ought to be used on the individual constituents of && or ||.
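
I.e. something more along the lines of (which constituent deserves the
hint being a judgement call):

    if ( level == 1 &&
         unlikely(mem_event_check_ring(&d->mem_event->access)) &&
         !sh_mfn_is_a_page_table(target_mfn) )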

> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +        p2m_access_t a;
> +        p2m_type_t t;
> +        mfn_t mfn;
> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);

Missing blank line between declarations and statements. Also this
statement could instead be an initializer.
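
I.e. something like:

    struct p2m_domain *p2m = p2m_get_hostp2m(d);
    p2m_access_t a;
    p2m_type_t t;
    mfn_t mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);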

> @@ -2822,6 +2838,8 @@ static int sh_page_fault(struct vcpu *v,
>      int r;
>      fetch_type_t ft = 0;
>      p2m_type_t p2mt;
> +    p2m_access_t p2ma;

This variable ...

> @@ -3009,7 +3027,80 @@ static int sh_page_fault(struct vcpu *v,
>  
>      /* What mfn is the guest trying to access? */
>      gfn = guest_l1e_get_gfn(gw.l1e);
> -    gmfn = get_gfn(d, gfn, &p2mt);
> +    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
> +        gmfn = get_gfn(d, gfn, &p2mt);
> +    /*
> +     * A mem_access listener is present, so we will first check if a violation
> +     * has occurred.
> +     */
> +    else
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);

... seems to be used only in this scope, and with you already adding
scope restricted variables it should go here.

> +
> +        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0, NULL);
> +        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
> +             && (regs->error_code & PFEC_page_present)
> +             && !(regs->error_code & PFEC_reserved_bit) )
> +        {
> +            int violation = 0;
> +            bool_t access_w = !!(regs->error_code & PFEC_write_access);
> +            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
> +            bool_t access_r = access_x ? 0 : !(access_w);

Stray parentheses again. I'm not going to repeat this again - please
just check your code before submitting.

> +            /*
> +             * Do not police writes to guest memory emanating from the Xen
> +             * kernel. Trying to do so will cause the same pagefault to occur
> +             * over and over again with an event being sent to the access
> +             * listener for each fault. If the access listener's vcpu is not
> +             * scheduled during this time, the violation is never resolved and
> +             * will eventually end with the host crashing.
> +             */
> +            if ( (violation && access_w) &&
> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
> +            {
> +                violation = 0;
> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
> +                                    p2m_ram_rw, p2m_access_rw);
> +            }

This looks more like a hack than a proper solution - shouldn't the
listener be allowed to know of hypervisor side accesses?

> +/* Number of access table pages for a PV domain */
> +#define get_domain_nr_access_table_pages(d) \
> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)

And once again. I wonder whether this code was tested on a suitably
big pv-ops PV guest.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
  2014-04-29  8:50   ` Jan Beulich
@ 2014-04-29 12:05   ` Tamas Lengyel
  2014-04-29 23:36     ` Aravindh Puthiyaparambil (aravindp)
  2014-05-01 14:39   ` Tim Deegan
  2 siblings, 1 reply; 22+ messages in thread
From: Tamas Lengyel @ 2014-04-29 12:05 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, Jan Beulich,
	xen-devel


>
> +    /* For PV domains we only support rw, rx, rx2rw, rwx access
> permissions */
>

There is a bit of discrepancy between the comment and the code here. Also,
could you briefly explain why only these permissions are supported?


> +    if ( unlikely(p2ma != p2m_access_r &&
> +                  p2ma != p2m_access_rw &&
> +                  p2ma != p2m_access_rx &&
> +                  p2ma != p2m_access_rwx &&
> +                  p2ma != p2m_access_rx2rw) )
> +        return -EINVAL;
>


> +    /* For PV domains we only support r, rw, rx, rwx access permissions */
>

Code/comment discrepancy again.


> +    if ( p2m->default_access != p2m_access_r &&
> +         p2m->default_access != p2m_access_rw &&
> +         p2m->default_access != p2m_access_rx &&
> +         p2m->default_access != p2m_access_rwx &&
> +         p2m->default_access != p2m_access_rx2rw )
> +        return -EINVAL;
> +
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29  8:50   ` Jan Beulich
@ 2014-04-29 23:10     ` Aravindh Puthiyaparambil (aravindp)
  2014-04-30  6:32       ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-04-29 23:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>> +void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
>> +{
>> +
>> +    /* Restrict with access permissions */
>> +    switch (access)
>> +    {
>> +        case p2m_access_r:
>> +            *flags &= ~(_PAGE_RW);
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rx:
>> +        case p2m_access_rx2rw:
>> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
>> +            *flags |= _PAGE_PRESENT;
>> +            break;
>> +        case p2m_access_rw:
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rwx:
>> +        default:
>> +            *flags &= ~(_PAGE_NX_BIT);
>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +    }
>
>Tim will have the final say here, but I'd suggest dropping all these
>needless parentheses (I see only one case where they're really
>needed).

OK, I will drop all the needless ones. Could you please point out the case where they are needed? I couldn't find anything in CODING_STYLE that dictates what should be done.

>> +
>> +    // Allow more restrictive guest flags to be propagated instead of access
>> +    // permissions
>
>Coding style (there are more of these, which I'm not going to
>individually comment on).

Sorry, I let the C++-style comments slip in. I will convert all of them to C-style comments.

>> +static int
>> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long
>gfn, mfn_t mfn,
>> +                         unsigned int page_order, p2m_type_t p2mt,
>> +                         p2m_access_t p2ma)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table !=
>NULL);
>
>Better split into two ASSERT()s, so that if it triggers one would know
>which of the two conditions wasn't met?

You are right. I should have done that in the first place. 

>> +
>> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions
>*/
>> +    if ( unlikely(p2ma != p2m_access_r &&
>> +                  p2ma != p2m_access_rw &&
>> +                  p2ma != p2m_access_rx &&
>> +                  p2ma != p2m_access_rwx &&
>> +                  p2ma != p2m_access_rx2rw) )
>> +        return -EINVAL;
>
>Perhaps better done as a switch() statement.

Sure. That will definitely be cleaner.

>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return -ENOENT;
>> +
>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>
>You get "gfn" passed in and blindly overwrite it here? _If_ you need to
>do this lookup, shouldn't you instead check it matches the passed in one?

The "gfn" and "mfn" passed in to the function should be the same as it is a PV guest. That is why I am overwriting "gfn" to get the "gpfn" from the machine_to_phys_mapping.

>> +
>> +    /*
>> +     * Values with the MSB set denote MFNs that aren't really part of the
>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>> +     * Nothing to do here.
>> +     */
>> +    if ( unlikely(!VALID_M2P(gfn)) )
>> +        return 0;
>> +
>> +    if ( gfn > (d->tot_pages - 1) )
>> +        return -EINVAL;
>
>Hardly - a PV domain can easily make its address space sparse, and
>iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>on MMIO space. (And as a side note, this would better be "gfn >=
>d->tot_pages".)

I was using this as a guard against going out of bounds in the access lookup table, which is created based on the domain's tot_pages. Please let me know what I should be using instead of tot_pages when doing this.
 
>> +    paging_lock(d);
>> +
>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>> +    access_table_page =
>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>> +    access_table_page[page_idx] = p2ma;
>> +    unmap_domain_page(access_table_page);
>> +
>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>
>Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-
>NULL?

I am not sure. I see shadow_track_dirty_vram() calling this without a check. Maybe I should check and call it only if d->vcpu is not null.

>> +static mfn_t
>> +p2m_mem_access_get_entry(struct p2m_domain *p2m, unsigned long
>gfn,
>> +                         p2m_type_t *t, p2m_access_t *a, p2m_query_t q,
>> +                         unsigned int *page_order)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +    mfn_t mfn = _mfn(gfn); // For PV guests mfn == gfn
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table !=
>NULL);
>> +
>> +    /* Not necessarily true, but for non-translated guests, we claim
>> +     * it's the most generic kind of memory */
>
>I think you copied this comment verbatim from elsewhere, but I don't
>think it's correct as is.

Yes, indeed. I copied this from __get_gfn_type_access(). I will get rid of the comment.

>> +    *t = p2m_ram_rw;
>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return _mfn(INVALID_MFN);
>> +
>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>
>Same comment as earlier.
>
>> +    if ( gfn > (d->tot_pages - 1) )
>
>Dito.

I have the same reasoning here as I did above.

>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>> +
>> +    access_table_page =
>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>> +
>> +    /* This is a hint to take the default permissions */
>> +    if ( access_table_page[page_idx] == p2m_access_n )
>> +        access_table_page[page_idx] = p2m->default_access;
>
>We're in "get" here - why does that modify any global state?

I do not initialize the access lookup table to the default value. But the table is zeroed as part of initialization when each page is allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n" / 0 value as a hint that the default value should be returned, as p2m_access_n is not valid for PV domains. If you prefer that I initialize the table to the default value, I will gladly do so.

>> +void p2m_mem_access_teardown(struct p2m_domain *p2m)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint32_t nr_access_table_pages;
>> +    uint32_t ctr;
>> +
>> +    /* Reset the set_entry and get_entry function pointers */
>> +    p2m_pt_init(p2m);
>> +
>> +    if ( !access_lookup_table  )
>> +        return;
>> +
>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>> +
>> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
>
>No new effectively unbounded loops please.

OK, I will add a check for preemption and add hypercall continuation logic here. 
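
Roughly what I have in mind (a sketch only - "start", the helper name and
the error code are placeholders for whatever the final continuation
plumbing ends up needing):

    for ( ctr = start; ctr < nr_access_table_pages; ctr++ )
    {
        /* Placeholder for the per-page work done in the loop body. */
        process_access_table_page(p2m, ctr);

        /* Bail out so the hypercall can be continued from ctr + 1. */
        if ( hypercall_preempt_check() )
            return -EAGAIN;
    }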

>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>> +{
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table;
>> +    uint32_t nr_access_table_pages;
>> +    uint32_t ctr;
>> +
>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>
>This surely is going to be an order > 0 allocation, and you even risk
>it being an order > MAX_ORDER one. The former is disallowed at
>runtime by convention/agreement, the latter is an outright bug.

Sorry, I was not aware of the convention. What should I be doing here instead?

>(And again just as a side note - you don't appear to need the zero
>initialization here.)
>
>> +    if ( !access_lookup_table )
>> +        return -ENOMEM;
>> +
>> +    p2m->access_lookup_table = access_lookup_table;
>> +
>> +    for ( ctr = 0; ctr < nr_access_table_pages; ctr++ )
>
>Same comment regarding the effective unboundedness of the loop.

OK, I will add a check for preemption and add hypercall continuation logic here too.

>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d,
>unsigned long pfn, uint32_t nr,
>>      if ( pfn == ~0ul )
>>      {
>>          p2m->default_access = a;
>> +        if ( is_pv_domain(d) )
>> +            return p2m_mem_access_set_default(p2m);
>
>Is it really correct to set p2m->default_access _before_ calling that
>function? Perhaps it wouldn't be correct doing it the other way
>around either - I suppose you need to hold the necessary lock across
>both operations.

I do not see the problem here. p2m_mem_access_set_default() just blows the shadows away after taking the paging_lock. There is no actual setting of default permission for all pages in the PV case.

>> @@ -585,9 +585,17 @@ int paging_domctl(struct domain *d,
>xen_domctl_shadow_op_t *sc,
>>      {
>>
>>      case XEN_DOMCTL_SHADOW_OP_ENABLE:
>> +        /*
>> +         * Shadow mem_access mode should only be enabled when
>mem_access is
>> +         * enabled in XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE.
>> +         */
>> +        if ( sc->mode & XEN_DOMCTL_SHADOW_ENABLE_MEM_ACCESS )
>> +            return -EINVAL;
>> +
>>          if ( !(sc->mode & XEN_DOMCTL_SHADOW_ENABLE_LOG_DIRTY) )
>>              break;
>>          /* Else fall through... */
>> +
>>      case XEN_DOMCTL_SHADOW_OP_ENABLE_LOGDIRTY:
>
>Stray blank line - the fall through here makes it preferable to keep the
>two case blocks un-separated.

OK, I will keep them un-separated.

>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>mfn_t gmfn)
>>          if ( !(shadow_mode_external(v->domain)
>>                 && (page->count_info & PGC_count_mask) <= 3
>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>> -                   == !!is_xen_heap_page(page))) )
>> +                       == !!is_xen_heap_page(page)))
>> +                    && !(shadow_mode_mem_access(v->domain)) )
>
>You're breaking indentation, there are pointless parentheses again,
>but most importantly - why?

Sorry, I meant to ask a question about this in my patch message and mark this as a workaround. I am seeing the message on all sh_remove_all_mappings() calls and was not able to figure out why this was happening. I just added this as a workaround. I was hoping you or Tim could shed more light on this.

>> @@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
>>              }
>>      }
>>
>> +    /* Propagate access permissions */
>> +    if ( unlikely((level == 1) &&
>> +                  mem_event_check_ring(&d->mem_event->access) &&
>> +                  !sh_mfn_is_a_page_table(target_mfn)) )
>
>Just a general remark here - unlikely() used like this is pointless, it
>ought to be used on the individual constituents of && or ||.

OK, I will keep the unlikely() just for the mem_event_check_ring() case.

>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +        p2m_access_t a;
>> +        p2m_type_t t;
>> +        mfn_t mfn;
>> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);
>
>Missing blank line between declarations and statements. Also this
>statement could instead be an initializer.

I will make it an initializer.

>> @@ -2822,6 +2838,8 @@ static int sh_page_fault(struct vcpu *v,
>>      int r;
>>      fetch_type_t ft = 0;
>>      p2m_type_t p2mt;
>> +    p2m_access_t p2ma;
>
>This variable ...
>
>> @@ -3009,7 +3027,80 @@ static int sh_page_fault(struct vcpu *v,
>>
>>      /* What mfn is the guest trying to access? */
>>      gfn = guest_l1e_get_gfn(gw.l1e);
>> -    gmfn = get_gfn(d, gfn, &p2mt);
>> +    if ( likely(!mem_event_check_ring(&d->mem_event->access)) )
>> +        gmfn = get_gfn(d, gfn, &p2mt);
>> +    /*
>> +     * A mem_access listener is present, so we will first check if a violation
>> +     * has occurred.
>> +     */
>> +    else
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(v->domain);
>
>... seems to be used only in this scope, and with you already adding
>scope restricted variables it should go here.

OK, I will do that.

>> +
>> +        gmfn = get_gfn_type_access(p2m, gfn_x(gfn), &p2mt, &p2ma, 0,
>NULL);
>> +        if ( mfn_valid(gmfn) && !sh_mfn_is_a_page_table(gmfn)
>> +             && (regs->error_code & PFEC_page_present)
>> +             && !(regs->error_code & PFEC_reserved_bit) )
>> +        {
>> +            int violation = 0;
>> +            bool_t access_w = !!(regs->error_code & PFEC_write_access);
>> +            bool_t access_x = !!(regs->error_code & PFEC_insn_fetch);
>> +            bool_t access_r = access_x ? 0 : !(access_w);
>
>Stray parentheses again. I'm not going to repeat this again - please
>just check your code before submitting.

I am really sorry. I will fix this in the next patch series.

>> +            /*
>> +             * Do not police writes to guest memory emanating from the Xen
>> +             * kernel. Trying to do so will cause the same pagefault to occur
>> +             * over and over again with an event being sent to the access
>> +             * listener for each fault. If the access listener's vcpu is not
>> +             * scheduled during this time, the violation is never resolved and
>> +             * will eventually end with the host crashing.
>> +             */
>> +            if ( (violation && access_w) &&
>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>> +            {
>> +                violation = 0;
>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>> +                                    p2m_ram_rw, p2m_access_rw);
>> +            }
>
>This looks more like a hack than a proper solution - shouldn't the
>listener be allowed to know of hypervisor side accesses?

Ideally the listener should know of the hypervisor-side accesses, but I ran into cases where the listener would not get to run and the pagefault would be retried over and over, eventually causing the host to crash. I guess more detail about the problem is needed. I will send out a separate email regarding this, CCing you and Tim.

>> +/* Number of access table pages for a PV domain */
>> +#define get_domain_nr_access_table_pages(d) \
>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>
>And once again. I wonder whether this code was tested on a suitably
>big pv-ops PV guest.

The largest PV domain I tried this with was 28GB.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29 12:05   ` Tamas Lengyel
@ 2014-04-29 23:36     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-04-29 23:36 UTC (permalink / raw)
  To: Tamas Lengyel
  Cc: Keir Fraser, Ian Campbell, Ian Jackson, Tim Deegan, Jan Beulich,
	xen-devel


+    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions */

There is a bit of discrepancy between the comment and the code here. Also, could you briefly explain why only these permissions are supported?

I will add “r” to the comment.

Only these permissions are supported because regular page tables, unlike extended page tables (EPT), do not have an explicit read bit. They only have a present bit, so there is no easy way to express permissions like execute-only. And of course permissions like write-only or write-execute cannot be supported in either case.

+    if ( unlikely(p2ma != p2m_access_r &&
+                  p2ma != p2m_access_rw &&
+                  p2ma != p2m_access_rx &&
+                  p2ma != p2m_access_rwx &&
+                  p2ma != p2m_access_rx2rw) )
+        return -EINVAL;

+    /* For PV domains we only support r, rw, rx, rwx access permissions */

Code/comment discrepancy again.

I will add “rx2rw” to the comment.

+    if ( p2m->default_access != p2m_access_r &&
+         p2m->default_access != p2m_access_rw &&
+         p2m->default_access != p2m_access_rx &&
+         p2m->default_access != p2m_access_rwx &&
+         p2m->default_access != p2m_access_rx2rw )
+        return -EINVAL;
+


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29 23:10     ` Aravindh Puthiyaparambil (aravindp)
@ 2014-04-30  6:32       ` Jan Beulich
  2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-04-30  6:32 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Tim Deegan

>>> On 30.04.14 at 01:10, <aravindp@cisco.com> wrote:
>>> +        default:
>>> +            *flags &= ~(_PAGE_NX_BIT);
>>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>> +            break;
>>> +    }
>>
>>Tim will have the final say here, but I'd suggest dropping all these
>>needless parentheses (I see only one case where they're really
>>needed).
> 
> OK, I will drop all the needless ones.  Could you please point out the case 
> they are needed? I couldn't find anything in the CODING_STYLE that dictates 
> what should be done.

Convention is to use them when operator precedence isn't considered
obvious. What "obvious" here is differs between people, but at least
all precedence rules defined by school mathematics should generally
not require parenthesizing. Which in the case above means no
parentheses around the right side expression of assignment operators.

>>> +
>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>> +        return -ENOENT;
>>> +
>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>
>>You get "gfn" passed in and blindly overwrite it here? _If_ you need to
>>do this lookup, shouldn't you instead check it matches the passed in one?
> 
> The "gfn" and "mfn" passed in to the function should be the same as it is a 
> PV guest. That is why I am overwriting "gfn" to get the "gpfn" from the 
> machine_to_phys_mapping.

Since this overwrites/discards a function parameter, this surely is
worth a comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).

>>> +
>>> +    /*
>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>>> +     * Nothing to do here.
>>> +     */
>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>> +        return 0;
>>> +
>>> +    if ( gfn > (d->tot_pages - 1) )
>>> +        return -EINVAL;
>>
>>Hardly - a PV domain can easily make its address space sparse, and
>>iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>on MMIO space. (And as a side note, this would better be "gfn >=
>>d->tot_pages".)
> 
> I was using this as a guard against going out of bounds in the access lookup 
> table which is created based on the domains tot_pages. Please let me know 
> what I should be using instead of tot_pages when doing this.

Afaict, for PV guests there's absolutely nothing you can use to bound
the range.


>>> +    paging_lock(d);
>>> +
>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>> +    access_table_page =
>>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>> +    access_table_page[page_idx] = p2ma;
>>> +    unmap_domain_page(access_table_page);
>>> +
>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>
>>Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-
>>NULL?
> 
> I am not sure. I see shadow_track_dirty_vram() calling this without a check. 
> Maybe I should check and call it only if d->vcpu is not null.

You ought to check both, at least via ASSERT().
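
I.e. ahead of the call:

    ASSERT(d->vcpu != NULL);
    ASSERT(d->vcpu[0] != NULL);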

>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>> +
>>> +    access_table_page =
>>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>> +
>>> +    /* This is a hint to take the default permissions */
>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>> +        access_table_page[page_idx] = p2m->default_access;
>>
>>We're in "get" here - why does that modify any global state?
> 
> I do not initialize the access lookup table to the default value. But the 
> table is zeroed as part of initialization when each page is allocated using 
> shadow_alloc_p2m_page(). I then use the "p2m_access_n" / 0 value as hint that 
> the default value should be returned as p2m_access_n is not valid for PV 
> domains. If you prefer that I initialize the table to the default value, I 
> will gladly do so.  

I think implying _n to mean "default" is at least a latent bug anyway
(since that precludes setting that type). And _if_ "get" really does
any modification, that minimally would require a big comment.

Furthermore, with it being possible for the default type to change,
the net effect might easily be different between setting the default
type at initialization time or when the first lookup happens.

>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>> +{
>>> +    struct domain *d = p2m->domain;
>>> +    mfn_t *access_lookup_table;
>>> +    uint32_t nr_access_table_pages;
>>> +    uint32_t ctr;
>>> +
>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>
>>This surely is going to be an order > 0 allocation, and you even risk
>>it being an order > MAX_ORDER one. The former is disallowed at
>>runtime by convention/agreement, the latter is an outright bug.
> 
> Sorry, I was not aware of the convention. What should I be doing here 
> instead?

Especially with the above note about GFN space being effectively
unbounded for PV guests, you need to find a different data structure
to represent the information you need. Perhaps something similar to
what log-dirty mode uses, except that you would need more levels in
the extreme case, and hence it might be worthwhile making the actually
used number of levels dynamic?
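
Just to illustrate the idea (a sketch only, not meant as actual Xen code;
it assumes one byte per GFN at the leaf level and PAGE_SIZE/sizeof(mfn_t)
links per interior page):

    static unsigned int access_table_levels(unsigned long max_gfn)
    {
        unsigned long span = PAGE_SIZE;   /* GFNs covered by one leaf page */
        unsigned int levels = 1;

        while ( span <= max_gfn )
        {
            span *= PAGE_SIZE / sizeof(mfn_t);   /* interior node fan-out */
            levels++;
        }

        return levels;
    }

With such a scheme a 28GB guest would need only three levels, while the
sparse GFN space a PV guest can use could require a couple more - hence
the suggestion to make the number of levels dynamic.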

>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d,
>>unsigned long pfn, uint32_t nr,
>>>      if ( pfn == ~0ul )
>>>      {
>>>          p2m->default_access = a;
>>> +        if ( is_pv_domain(d) )
>>> +            return p2m_mem_access_set_default(p2m);
>>
>>Is it really correct to set p2m->default_access _before_ calling that
>>function? Perhaps it wouldn't be correct doing it the other way
>>around either - I suppose you need to hold the necessary lock across
>>both operations.
> 
> I do not see the problem here. p2m_mem_access_set_default() just blows the 
> shadows away after taking the paging_lock. There is no actual setting of 
> default permission for all pages in the PV case.

But the function having a return value means it might fail, in which
case you shouldn't have set the new default.
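
I.e. (leaving aside the locking question for the moment) something like:

    if ( is_pv_domain(d) )
    {
        long rc = p2m_mem_access_set_default(p2m);

        if ( rc )
            return rc;   /* leave the old default in place */
    }
    p2m->default_access = a;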

>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>mfn_t gmfn)
>>>          if ( !(shadow_mode_external(v->domain)
>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>> -                   == !!is_xen_heap_page(page))) )
>>> +                       == !!is_xen_heap_page(page)))
>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>
>>You're breaking indentation, there are pointless parentheses again,
>>but most importantly - why?
> 
> Sorry, I meant to ask a question about this in my patch message and mark 
> this as a workaround. I am seeing the message on all sh_remove_all_mappings() 
> and I was not able to figure out why this was happening. I just added this as 
> a work around. I was hoping you or Tim would shed more light on this.

I'm afraid that without detail on which pages this triggers on, and
without you at least having spent some time finding out where the
stray/extra references may be coming from, it's going to be hard to help.

>>> +            /*
>>> +             * Do not police writes to guest memory emanating from the Xen
>>> +             * kernel. Trying to do so will cause the same pagefault to occur
>>> +             * over and over again with an event being sent to the access
>>> +             * listener for each fault. If the access listener's vcpu is not
>>> +             * scheduled during this time, the violation is never resolved and
>>> +             * will eventually end with the host crashing.
>>> +             */
>>> +            if ( (violation && access_w) &&
>>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>>> +            {
>>> +                violation = 0;
>>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>> +                                    p2m_ram_rw, p2m_access_rw);
>>> +            }
>>
>>This looks more like a hack than a proper solution - shouldn't the
>>listener be allowed to know of hypervisor side accesses?
> 
> Ideally the listener should know of the hypervisor side accesses but I ran 
> in to cases where the listener would not get to run and the pagefault would 
> be tried over and over, eventually causing the host to crash. I guess more 
> detail about the problem is needed. I will send out a separate email 
> regarding this, CCing you and Tim.

And if it's intended as a workaround until proper resolution only, you
probably should say so explicitly in the comment, avoiding the need
for reviewers to comment on this being a problem.

>>> +/* Number of access table pages for a PV domain */
>>> +#define get_domain_nr_access_table_pages(d) \
>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>
>>And once again. I wonder whether this code was tested on a suitably
>>big pv-ops PV guest.
> 
> The largest PV domain I tried this with was 28GB.

Which ought to have had GFNs larger than ->tot_pages, so I wonder
how this worked.

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-30  6:32       ` Jan Beulich
@ 2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
  2014-05-01  1:11           ` Andres Lagar-Cavilla
                             ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-04-30 22:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Ian Jackson, Tim Deegan, xen-devel

>>>> +        default:
>>>> +            *flags &= ~(_PAGE_NX_BIT);
>>>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>>> +            break;
>>>> +    }
>>>
>>>Tim will have the final say here, but I'd suggest dropping all these
>>>needless parentheses (I see only one case where they're really
>>>needed).
>>
>> OK, I will drop all the needless ones.  Could you please point out the
>> case they are needed? I couldn't find anything in the CODING_STYLE
>> that dictates what should be done.
>
>Convention is to use them when operator precedence isn't considered
>obvious. What "obvious" here is differs between people, but at least all
>precedence rules defined by school mathematics should generally not require
>parenthesizing. Which in the case above means no parentheses around the
>right side expression of assignment operators.

OK, I understand and will fix them appropriately. 

>>>> +
>>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>>> +        return -ENOENT;
>>>> +
>>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>>
>>>You get "gfn" passed in and blindly overwrite it here? _If_ you need
>>>to do this lookup, shouldn't you instead check it matches the passed in
>one?
>>
>> The "gfn" and "mfn" passed in to the function should be the same as it
>> is a PV guest. That is why I am overwriting "gfn" to get the "gpfn"
>> from the machine_to_phys_mapping.
>
>Since this overwrites/discards a function parameter, this surely is worth a
>comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).

I will add the ASSERT and a comment for it.

>>>> +
>>>> +    /*
>>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>>> +     * domain's pseudo-physical memory map (e.g., the shared info
>frame).
>>>> +     * Nothing to do here.
>>>> +     */
>>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>>> +        return 0;
>>>> +
>>>> +    if ( gfn > (d->tot_pages - 1) )
>>>> +        return -EINVAL;
>>>
>>>Hardly - a PV domain can easily make its address space sparse, and
>>>iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>>on MMIO space. (And as a side note, this would better be "gfn >=
>>>d->tot_pages".)
>>
>> I was using this as a guard against going out of bounds in the access
>> lookup table which is created based on the domains tot_pages. Please
>> let me know what I should be using instead of tot_pages when doing this.
>
>Afaict For PV guests there's absolutely nothing you can use to bound the
>range.
>>>> +    paging_lock(d);
>>>> +
>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>> +    access_table_page =
>>>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>> +    access_table_page[page_idx] = p2ma;
>>>> +    unmap_domain_page(access_table_page);
>>>> +
>>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>>
>>>Is there anything guaranteeing d->vcpu and d->vcpu[0] being non- NULL?
>>
>> I am not sure. I see shadow_track_dirty_vram() calling this without a check.
>> Maybe I should check and call it only if d->vcpu is not null.
>
>You ought to check both, at least via ASSERT().

OK, I will add the ASSERTs.

>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>> +
>>>> +    access_table_page =
>>>map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>> +
>>>> +    /* This is a hint to take the default permissions */
>>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>>> +        access_table_page[page_idx] = p2m->default_access;
>>>
>>>We're in "get" here - why does that modify any global state?
>>
>> I do not initialize the access lookup table to the default value. But
>> the table is zeroed as part of initialization when each page is
>> allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n"
>> / 0 value as hint that the default value should be returned as
>> p2m_access_n is not valid for PV domains. If you prefer that I
>> initialize the table to the default value, I will gladly do so.
>
>I think implying _n to mean "default" is at least a latent bug anyway (since that
>precludes setting that type). And _if_ "get" really does any modification, that
>minimally would require a big comment.

For PV domains access_n is precluded.

>Furthermore, with it being possible for the default type to change, the net
>effect might easily be different between setting the default type at
>initialization time or when the first lookup happens.

I am not sure that I follow your concern here. Setting the default access does not mean that the access table gets cleared. So if a particular MFN entry's access value was changed and the default value was also changed, the MFN should retain the access value it was changed to rather than the new default. The new default value will only be returned for MFNs on which set_entry was not called, and I think that is correct.
 
>>>> +int p2m_mem_access_init(struct p2m_domain *p2m) {
>>>> +    struct domain *d = p2m->domain;
>>>> +    mfn_t *access_lookup_table;
>>>> +    uint32_t nr_access_table_pages;
>>>> +    uint32_t ctr;
>>>> +
>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>> +    access_lookup_table = xzalloc_array(mfn_t,
>>>> + nr_access_table_pages);
>>>
>>>This surely is going to be an order > 0 allocation, and you even risk
>>>it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>by convention/agreement, the latter is an outright bug.
>>
>> Sorry, I was not aware of the convention. What should I be doing here
>> instead?
>
>Especially with the above note about GFN space being effectively unbounded
>for PV guests, you need to find a different data structure to represent the
>information you need. Perhaps something similar to what log-dirty mode
>uses, except that you would need more levels in the extreme case, and hence
>it might be worthwhile making the actually used number of levels dynamic?

I am a little confused here. In my initial discussions about this I was told that PV guests have bounded and contiguous memory. This is why I took the approach of an indexable array. 

" OTOH, all you need is a byte per pfn, and the great thing is that in PV domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, etc, which demand the sparse tree structure. So you can allocate an easily indexable array, notwithstanding super page concerns (I think/hope)."

http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html

Did I misunderstand something here? 
PS: I have CCed Andres too so that he can clarify.

>>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d,
>>>unsigned long pfn, uint32_t nr,
>>>>      if ( pfn == ~0ul )
>>>>      {
>>>>          p2m->default_access = a;
>>>> +        if ( is_pv_domain(d) )
>>>> +            return p2m_mem_access_set_default(p2m);
>>>
>>>Is it really correct to set p2m->default_access _before_ calling that
>>>function? Perhaps it wouldn't be correct doing it the other way around
>>>either - I suppose you need to hold the necessary lock across both
>>>operations.
>>
>> I do not see the problem here. p2m_mem_access_set_default() just blows
>> the shadows away after taking the paging_lock. There is no actual
>> setting of default permission for all pages in the PV case.
>
>But the function having a return value means it might fail, in which case you
>shouldn't have set the new default.

Good point, I will fix that.

>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>>mfn_t gmfn)
>>>>          if ( !(shadow_mode_external(v->domain)
>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>> -                   == !!is_xen_heap_page(page))) )
>>>> +                       == !!is_xen_heap_page(page)))
>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>
>>>You're breaking indentation, there are pointless parentheses again,
>>>but most importantly - why?
>>
>> Sorry, I meant to ask a question about this in my patch message and
>> mark this as a workaround. I am seeing the message on all
>> sh_remove_all_mappings() and I was not able to figure out why this was
>> happening. I just added this as a work around. I was hoping you or Tim
>would shed more light on this.
>
>I'm afraid that without detail on which pages the triggers on, and you at least
>having spent some time finding out where the stray/extra references may be
>coming from it's going to be hard to help.

Here are some of the prints I saw. I typically saw it for every set_entry() call.

(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13344: c=8000000000000003 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13449: c=8000000000000028 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1394b: c=8000000000000013 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 137b5: c=8000000000000003 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33fa6: c=8000000000000002 t=7400000000000000
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1345d: c=8000000000000028 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33410: c=8000000000000034 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33412: c=8000000000000039 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33953: c=8000000000000011 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13a63: c=8000000000000028 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f2: c=800000000000003a t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1335b: c=8000000000000028 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bc: c=8000000000000036 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f0: c=8000000000000014 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bd: c=8000000000000008 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1354e: c=8000000000000010 t=7400000000000001
 (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33443: c=8000000000000013 t=7400000000000001
(XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 39f74: c=8000000000000008 t=7400000000000001

>>>> +            /*
>>>> +             * Do not police writes to guest memory emanating from the Xen
>>>> +             * kernel. Trying to do so will cause the same pagefault to occur
>>>> +             * over and over again with an event being sent to the access
>>>> +             * listener for each fault. If the access listener's vcpu is not
>>>> +             * scheduled during this time, the violation is never resolved and
>>>> +             * will eventually end with the host crashing.
>>>> +             */
>>>> +            if ( (violation && access_w) &&
>>>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <=
>XEN_VIRT_END) )
>>>> +            {
>>>> +                violation = 0;
>>>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>>> +                                    p2m_ram_rw, p2m_access_rw);
>>>> +            }
>>>
>>>This looks more like a hack than a proper solution - shouldn't the
>>>listener be allowed to know of hypervisor side accesses?
>>
>> Ideally the listener should know of the hypervisor side accesses but I
>> ran into cases where the listener would not get to run and the
>> pagefault would be tried over and over, eventually causing the host to
>> crash. I guess more detail about the problem is needed. I will send
>> out a separate email regarding this, CCing you and Tim.
>
>And if it's intended as a workaround until proper resolution only, you probably
>should say so explicitly in the comment, avoiding the need for reviewers to
>comment on this being a problem.

Sorry, I will be more explicit about such things in the future.

>>>> +/* Number of access table pages for a PV domain */ #define
>>>> +get_domain_nr_access_table_pages(d) \
>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1),
>>>> +PAGE_SIZE)
>>>
>>>And once again. I wonder whether this code was tested on a suitably
>>>big pv-ops PV guest.
>>
>> The largest PV domain I tried this with was 28GB.
>
>Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>worked.

It would have had MFNs larger than tot_pages. I don't understand how it would have had GFNs larger than tot_pages.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-01  1:11           ` Andres Lagar-Cavilla
  2014-05-01  3:18             ` Aravindh Puthiyaparambil (aravindp)
  2014-05-01 14:18           ` Tim Deegan
  2014-05-02  8:26           ` Jan Beulich
  2 siblings, 1 reply; 22+ messages in thread
From: Andres Lagar-Cavilla @ 2014-05-01  1:11 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Jan Beulich,
	xen-devel


On Apr 30, 2014, at 6:20 PM, "Aravindh Puthiyaparambil (aravindp)" <aravindp@cisco.com> wrote:

>>>>> +        default:
>>>>> +            *flags &= ~(_PAGE_NX_BIT);
>>>>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>>>> +            break;
>>>>> +    }
>>>> 
>>>> Tim will have the final say here, but I'd suggest dropping all these
>>>> needless parentheses (I see only one case where they're really
>>>> needed).
>>> 
>>> OK, I will drop all the needless ones.  Could you please point out the
>>> case they are needed? I couldn't find anything in the CODING_STYLE
>>> that dictates what should be done.
>> 
>> Convention is to use them when operator precedence isn't considered
>> obvious. What "obvious" here is differs between people, but at least all
>> precedence rules defined by school mathematics should generally not require
>> parenthesizing. Which in the case above means no parentheses around the
>> right side expression of assignment operators.
> 
> OK, I understand and will fix them appropriately. 
> 
>>>>> +
>>>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>>>> +        return -ENOENT;
>>>>> +
>>>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>>> 
>>>> You get "gfn" passed in and blindly overwrite it here? _If_ you need
>>>> to do this lookup, shouldn't you instead check it matches the passed in
>> one?
>>> 
>>> The "gfn" and "mfn" passed in to the function should be the same as it
>>> is a PV guest. That is why I am overwriting "gfn" to get the "gpfn"
>>> from the machine_to_phys_mapping.
>> 
>> Since this overwrites/discards a function parameter, this surely is worth a
>> comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).
> 
> I will add the ASSERT and a comment for it.
> 
>>>>> +
>>>>> +    /*
>>>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>>>> +     * domain's pseudo-physical memory map (e.g., the shared info
>> frame).
>>>>> +     * Nothing to do here.
>>>>> +     */
>>>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>>>> +        return 0;
>>>>> +
>>>>> +    if ( gfn > (d->tot_pages - 1) )
>>>>> +        return -EINVAL;
>>>> 
>>>> Hardly - a PV domain can easily make its address space sparse, and
>>>> iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>>>> on MMIO space. (And as a side note, this would better be "gfn >=
>>>> d->tot_pages".)
>>> 
>>> I was using this as a guard against going out of bounds in the access
>>> lookup table which is created based on the domains tot_pages. Please
>>> let me know what I should be using instead of tot_pages when doing this.
>> 
>> Afaict, for PV guests there's absolutely nothing you can use to bound the
>> range.
>>>>> +    paging_lock(d);
>>>>> +
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +    access_table_page =
>>>> map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +    access_table_page[page_idx] = p2ma;
>>>>> +    unmap_domain_page(access_table_page);
>>>>> +
>>>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>>> 
>>>> Is there anything guaranteeing d->vcpu and d->vcpu[0] being non- NULL?
>>> 
>>> I am not sure. I see shadow_track_dirty_vram() calling this without a check.
>>> Maybe I should check and call it only if d->vcpu is not null.
>> 
>> You ought to check both, at least via ASSERT().
> 
> OK, I will add the ASSERTs.
> 
>>>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>>>> +
>>>>> +    access_table_page =
>>>> map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>>>> +
>>>>> +    /* This is a hint to take the default permissions */
>>>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>>>> +        access_table_page[page_idx] = p2m->default_access;
>>>> 
>>>> We're in "get" here - why does that modify any global state?
>>> 
>>> I do not initialize the access lookup table to the default value. But
>>> the table is zeroed as part of initialization when each page is
>>> allocated using shadow_alloc_p2m_page(). I then use the "p2m_access_n"
>>> / 0 value as hint that the default value should be returned as
>>> p2m_access_n is not valid for PV domains. If you prefer that I
>>> initialize the table to the default value, I will gladly do so.
>> 
>> I think implying _n to mean "default" is at least a latent bug anyway (since that
>> precludes setting that type). And _if_ "get" really does any modification, that
>> minimally would require a big comment.
> 
> For PV domains access_n is precluded.
> 
>> Furthermore, with it being possible for the default type to change, the net
>> effect might easily be different between setting the default type at
>> initialization time or when the first lookup happens.
> 
> I am not sure that I follow your concern here. Setting default access does not mean that the access table gets cleared. So if a particular MFN entry's access value changed and the default value was changed, then the MFN should retain the access value it was changed to and not the default value. The new default value will only be returned for MFNs for which set_entry was not called, and I think that is correct.
> 
>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m) {
>>>>> +    struct domain *d = p2m->domain;
>>>>> +    mfn_t *access_lookup_table;
>>>>> +    uint32_t nr_access_table_pages;
>>>>> +    uint32_t ctr;
>>>>> +
>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>> +    access_lookup_table = xzalloc_array(mfn_t,
>>>>> + nr_access_table_pages);
>>>> 
>>>> This surely is going to be an order > 0 allocation, and you even risk
>>>> it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>> by convention/agreement, the latter is an outright bug.
>>> 
>>> Sorry, I was not aware of the convention. What should I be doing here
>>> instead?
>> 
>> Especially with the above note about GFN space being effectively unbounded
>> for PV guests, you need to find a different data structure to represent the
>> information you need. Perhaps something similar to what log-dirty mode
>> uses, except that you would need more levels in the extreme case, and hence
>> it might be worthwhile making the actually used number of levels dynamic?
> 
> I am a little confused here. In my initial discussions about this I was told that PV guests have bounded and contiguous memory. This is why I took the approach of an indexable array. 
> 
> " OTOH, all you need is a byte per pfn, and the great thing is that in PV domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, etc, which demand the sparse tree structure. So you can allocate an easily indexable array, notwithstanding super page concerns (I think/hope)."
> 
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html
> 
> Did I misunderstand something here? 
> PS: I have CCed Andres too so that he can clarify.
I can muddy things up further.

In the good olden days that was the case. It seems that since linux 3.x, pv dom0 is defined with a physmap as large as the host, and plenty of holes in it which are plugged with the m2p override. I am unfortunately unhelpfully hazy on details. A situation like this will lead to max_pfn being very different from tot_pages. This is common in hvm.

Apologies for lack of precision
Andres
> 
>>>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d,
>>>> unsigned long pfn, uint32_t nr,
>>>>>     if ( pfn == ~0ul )
>>>>>     {
>>>>>         p2m->default_access = a;
>>>>> +        if ( is_pv_domain(d) )
>>>>> +            return p2m_mem_access_set_default(p2m);
>>>> 
>>>> Is it really correct to set p2m->default_access _before_ calling that
>>>> function? Perhaps it wouldn't be correct doing it the other way around
>>>> either - I suppose you need to hold the necessary lock across both
>>>> operations.
>>> 
>>> I do not see the problem here. p2m_mem_access_set_default() just blows
>>> the shadows away after taking the paging_lock. There is no actual
>>> setting of default permission for all pages in the PV case.
>> 
>> But the function having a return value means it might fail, in which case you
>> shouldn't have set the new default.
> 
> Good point, I will fix that.
> 
>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>>> mfn_t gmfn)
>>>>>         if ( !(shadow_mode_external(v->domain)
>>>>>                && (page->count_info & PGC_count_mask) <= 3
>>>>>                && ((page->u.inuse.type_info & PGT_count_mask)
>>>>> -                   == !!is_xen_heap_page(page))) )
>>>>> +                       == !!is_xen_heap_page(page)))
>>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>> 
>>>> You're breaking indentation, there are pointless parentheses again,
>>>> but most importantly - why?
>>> 
>>> Sorry, I meant to ask a question about this in my patch message and
>>> mark this as a workaround. I am seeing the message on all
>>> sh_remove_all_mappings() and I was not able to figure out why this was
>>> happening. I just added this as a work around. I was hoping you or Tim
>> would shed more light on this.
>> 
>> I'm afraid that without detail on which pages the triggers on, and you at least
>> having spent some time finding out where the stray/extra references may be
>> coming from it's going to be hard to help.
> 
> Here are some of the prints I saw. I typically saw it for every set_entry() call.
> 
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13344: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13449: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1394b: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 137b5: c=8000000000000003 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33fa6: c=8000000000000002 t=7400000000000000
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1345d: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33410: c=8000000000000034 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33412: c=8000000000000039 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33953: c=8000000000000011 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 13a63: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f2: c=800000000000003a t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1335b: c=8000000000000028 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bc: c=8000000000000036 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 393f0: c=8000000000000014 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134bd: c=8000000000000008 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 1354e: c=8000000000000010 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33443: c=8000000000000013 t=7400000000000001
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 39f74: c=8000000000000008 t=7400000000000001
> 
>>>>> +            /*
>>>>> +             * Do not police writes to guest memory emanating from the Xen
>>>>> +             * kernel. Trying to do so will cause the same pagefault to occur
>>>>> +             * over and over again with an event being sent to the access
>>>>> +             * listener for each fault. If the access listener's vcpu is not
>>>>> +             * scheduled during this time, the violation is never resolved and
>>>>> +             * will eventually end with the host crashing.
>>>>> +             */
>>>>> +            if ( (violation && access_w) &&
>>>>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <=
>> XEN_VIRT_END) )
>>>>> +            {
>>>>> +                violation = 0;
>>>>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>>>> +                                    p2m_ram_rw, p2m_access_rw);
>>>>> +            }
>>>> 
>>>> This looks more like a hack than a proper solution - shouldn't the
>>>> listener be allowed to know of hypervisor side accesses?
>>> 
>>> Ideally the listener should know of the hypervisor side accesses but I
>>> ran into cases where the listener would not get to run and the
>>> pagefault would be tried over and over, eventually causing the host to
>>> crash. I guess more detail about the problem is needed. I will send
>>> out a separate email regarding this, CCing you and Tim.
>> 
>> And if it's intended as a workaround until proper resolution only, you probably
>> should say so explicitly in the comment, avoiding the need for reviewers to
>> comment on this being a problem.
> 
> Sorry, I will be more explicit about such things in the future.
> 
>>>>> +/* Number of access table pages for a PV domain */ #define
>>>>> +get_domain_nr_access_table_pages(d) \
>>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1),
>>>>> +PAGE_SIZE)
>>>> 
>>>> And once again. I wonder whether this code was tested on a suitably
>>>> big pv-ops PV guest.
>>> 
>>> The largest PV domain I tried this with was 28GB.
>> 
>> Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>> worked.
> 
> It would have had MFNs larger than tot_pages. I don't understand how it would have had GFNs larger than tot_pages.
> 
> Thanks,
> Aravindh
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-05-01  1:11           ` Andres Lagar-Cavilla
@ 2014-05-01  3:18             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-01  3:18 UTC (permalink / raw)
  To: Andres Lagar-Cavilla
  Cc: Keir Fraser, Ian Campbell, Tim Deegan, Ian Jackson, Jan Beulich,
	xen-devel

>>> Especially with the above note about GFN space being effectively
>>> unbounded for PV guests, you need to find a different data structure
>>> to represent the information you need. Perhaps something similar to
>>> what log-dirty mode uses, except that you would need more levels in
>>> the extreme case, and hence it might be worthwhile making the actually
>used number of levels dynamic?
>>
>> I am a little confused here. In my initial discussions about this I was told that
>PV guests have bounded and contiguous memory. This is why I took the
>approach of an indexable array.
>>
>> " OTOH, all you need is a byte per pfn, and the great thing is that in PV
>domains, the physmap is bounded and continuous. Unlike HVM and its PCI
>holes, etc, which demand the sparse tree structure. So you can allocate an
>easily indexable array, notwithstanding super page concerns (I think/hope)."
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.h
>> tml
>>
>> Did I misunderstand something here?
>> PS: I have CCed Andres too so that he can clarify.
>I can muddy things up further.
>
>In the good olden days that was the case. It seems that since linux 3.x, pv
>dom0 is defined with a physmap as large as the host, and plenty of holes in it
>which are plugged with the m2p override. I am unfortunately unhelpfully hazy
>on details. A situation like this will lead to max_pfn being very different from
>tot_pages. This is common in hvm.

So let me see if I understand correctly. Say I have a 4GB PV domain on a 32GB host. Its MFNs could be spread out over 10-12GB / 14-16GB. Now if I do get_gpfn_from_mfn() on the range 10-12GB / 14-16GB, the return values need not be 0-4GB but could be 0-3GB / 6-7GB?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
  2014-05-01  1:11           ` Andres Lagar-Cavilla
@ 2014-05-01 14:18           ` Tim Deegan
  2014-05-01 19:14             ` Aravindh Puthiyaparambil (aravindp)
  2014-05-02  8:26           ` Jan Beulich
  2 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-05-01 14:18 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Ian Jackson, Jan Beulich, xen-devel

At 22:20 +0000 on 30 Apr (1398892838), Aravindh Puthiyaparambil (aravindp) wrote:
> >>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
> >>>mfn_t gmfn)
> >>>>          if ( !(shadow_mode_external(v->domain)
> >>>>                 && (page->count_info & PGC_count_mask) <= 3
> >>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
> >>>> -                   == !!is_xen_heap_page(page))) )
> >>>> +                       == !!is_xen_heap_page(page)))
> >>>> +                    && !(shadow_mode_mem_access(v->domain)) )
> >>>
> >>>You're breaking indentation, there are pointless parentheses again,
> >>>but most importantly - why?
> >>
> >> Sorry, I meant to ask a question about this in my patch message and
> >> mark this as a workaround. I am seeing the message on all
> >> sh_remove_all_mappings() and I was not able to figure out why this was
> >> happening. I just added this as a work around. I was hoping you or Tim
> >would shed more light on this.
> >
> >I'm afraid that without detail on which pages the triggers on, and you at least
> >having spent some time finding out where the stray/extra references may be
> >coming from it's going to be hard to help.
> 
> Here are some of the prints I saw. I typically saw it for every set_entry() call.
> 
> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
>  (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of mfn 134b8: c=8000000000000038 t=7400000000000001

This is because you are not in shadow_mode_refcounts() (i.e. the
page's refcount and typecount are based on the _guest_ pagetables
rather than the _shadow_ ones).  That means that sh_remove_all_mappings()
can't use the refcounts to figure out when it's removed the last
shadow mapping.

That also means that sh_remove_all_mappings() will have had to search
every sl1e in the system, which is expensive.  If there will be a lot
of these updates, you might consider batching them and using
shadow_blow_tables() after each batch instead.
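
(Purely to illustrate the batched flow; this is a sketch, not code from
the series.  access_table_store() is a made-up helper that only records
the new access type, and shadow_blow_tables() is assumed to be callable
from this context.)

static void set_access_batch(struct domain *d, struct p2m_domain *p2m,
                             const unsigned long *gfns,
                             const p2m_access_t *access, unsigned int nr)
{
    unsigned int i;

    paging_lock(d);

    /* Record the new access types only; do not touch the shadows yet. */
    for ( i = 0; i < nr; i++ )
        access_table_store(p2m, gfns[i], access[i]);

    /* One full shadow flush instead of nr sl1e searches. */
    shadow_blow_tables(d);

    paging_unlock(d);
}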

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
  2014-04-29  8:50   ` Jan Beulich
  2014-04-29 12:05   ` Tamas Lengyel
@ 2014-05-01 14:39   ` Tim Deegan
  2014-05-01 19:26     ` Aravindh Puthiyaparambil (aravindp)
  2 siblings, 1 reply; 22+ messages in thread
From: Tim Deegan @ 2014-05-01 14:39 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil
  Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich

Hi,

Thanks for your patch.  I have to say, this looks a lot less
terrifying than I thought it would. :)

At 21:45 -0700 on 28 Apr (1398717902), Aravindh Puthiyaparambil wrote:
> Shadow mem_access mode
> ----------------------
> Add a new shadow mode for mem_access. This should only be enabled by a
> mem_access listener when it calls XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
> for a PV domain.

Do we need a specific mode for this?  If the default behaviour is to
allow all accesses until told otherwise, then it can be enabled at all
times.

> P2M changes
> -----------
> Add a new p2m implementation for mem_access. Central to this is the
> access lookup table. This is an array of mfns. Each mfn is a page
> allocated from the domain's shadow memory. The pages hold the
> p2m_access_t values for each guest gmfn. p2m_mem_access_set_entry()
> sets the access value of the mfn given as input and blows the shadow
> entries for the mfn. p2m_mem_access_get_entry() returns the access
> value of the mfn given as input.

I think this array needs to be a trie, at least; it may make sense to reuse
the p2m_pt implementation and just restrict/ignore the types and MFNs.
That way you won't have to reinvent the wheel, esp. around avoiding
long-running operations.

> +/* Convert access restrictions to page table flags */
> +void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
> +{
> +
> +    /* Restrict with access permissions */
> +    switch (access)
> +    {
> +        case p2m_access_r:
> +            *flags &= ~(_PAGE_RW);
> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);

IIUC this function is called to add further restrictions to an
existing set of flags.  In that case, it should never set the
_PAGE_PRESENT bit.  Rather, it should only ever _reduce_ permissions 
(i.e. set _NX or clear other bits).  Then...

> +            break;
> +        case p2m_access_rx:
> +        case p2m_access_rx2rw:
> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
> +            *flags |= _PAGE_PRESENT;
> +            break;
> +        case p2m_access_rw:
> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +        case p2m_access_rwx:
> +        default:
> +            *flags &= ~(_PAGE_NX_BIT);
> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +    }
> +
> +    // Allow more restrictive guest flags to be propagated instead of access
> +    // permissions
> +    if ( !(gflags & _PAGE_RW) )
> +        *flags &= ~(_PAGE_RW);
> +    if ( gflags & _PAGE_NX_BIT )
> +        *flags |= _PAGE_NX_BIT;

..you won't need these either.
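
(For illustration only, a restriction-only variant along those lines
might look like the sketch below; it is not the code in this series.)

static void p2m_access_restrict_flags(u32 *flags, p2m_access_t access)
{
    switch ( access )
    {
    case p2m_access_r:
        *flags &= ~_PAGE_RW;        /* take away write... */
        *flags |= _PAGE_NX_BIT;     /* ...and execute */
        break;
    case p2m_access_rw:
        *flags |= _PAGE_NX_BIT;     /* take away execute only */
        break;
    case p2m_access_rx:
    case p2m_access_rx2rw:
        *flags &= ~_PAGE_RW;        /* take away write only */
        break;
    case p2m_access_rwx:
    default:
        break;                      /* nothing to take away */
    }
    /* _PAGE_PRESENT is never set here, and any restriction the guest
     * already asked for (cleared _PAGE_RW, set _PAGE_NX_BIT) survives. */
}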

> +}
> +
> +/*
> + * Set the page permission of the mfn. This in effect removes all shadow
> + * mappings of that mfn. The access type of that mfn is stored in the access
> + * lookup table.
> + */
> +static int
> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
> +                         unsigned int page_order, p2m_type_t p2mt,
> +                         p2m_access_t p2ma)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
> +    uint table_idx;
> +    uint page_idx;
> +    uint8_t *access_table_page;
> +
> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
> +
> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions */
> +    if ( unlikely(p2ma != p2m_access_r &&
> +                  p2ma != p2m_access_rw &&
> +                  p2ma != p2m_access_rx &&
> +                  p2ma != p2m_access_rwx &&
> +                  p2ma != p2m_access_rx2rw) )
> +        return -EINVAL;
> +
> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
> +        return -ENOENT;

If we're only ever setting access permissions for a guest's own pages
(i.e. not setting them for grant mappings or foreign mappings) then
the access permissions lookup table could potentially be shared among
all PV VMs.  It might even be possible to fold the access-type into
the shadow_flags field in struct page_info and avoid having a lookup
table at all.  If that's possible, I think it will make some things a
_lot_ easier. 
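
(A very rough sketch of what that could look like.  The bit positions
are invented, and whether shadow_flags really has spare bits, and how it
is reached from struct page_info, would need checking.)

/* Hypothetical encoding: keep the p2m_access_t in a spare nibble. */
#define PAGE_ACCESS_SHIFT  28
#define PAGE_ACCESS_MASK   (0xfUL << PAGE_ACCESS_SHIFT)

static inline void page_set_access(struct page_info *pg, p2m_access_t a)
{
    pg->shadow_flags = (pg->shadow_flags & ~PAGE_ACCESS_MASK) |
                       ((unsigned long)a << PAGE_ACCESS_SHIFT);
}

static inline p2m_access_t page_get_access(const struct page_info *pg)
{
    return (pg->shadow_flags & PAGE_ACCESS_MASK) >> PAGE_ACCESS_SHIFT;
}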

(Based on this, and the worries about gfn-space being sparse, I'm not
going to give detailed feedback about the implementation of the lookup
table in this version; Jan's already pointed out some flaws there.)

> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));

Beware: the guest can set this m2p entry to any value it likes...

> +    /*
> +     * Values with the MSB set denote MFNs that aren't really part of the
> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
> +     * Nothing to do here.
> +     */
> +    if ( unlikely(!VALID_M2P(gfn)) )
> +        return 0;

...effectively preventing the tools from setting the access type.
That's also relevant below where you use _rw access for any frame with
a bogus gfn.

> +    if ( gfn > (d->tot_pages - 1) )
> +        return -EINVAL;

Yeah, as others have pointed out, that's not a good check even if the
guest is well-behaved.

> @@ -3179,8 +3183,14 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
>  {
>      ASSERT(paging_locked_by_me(d));
>  
> -    /* Sanity check the call */
> -    if ( d == current->domain || (d->arch.paging.mode & mode) == mode )
> +    /*
> +     * Sanity check the call.
> +     * Do not allow shadow mem_access mode to be enabled when
> +     * log-dirty or translate mode is enabled.

Why not log-dirty?  That would mean you can't use mem-access and live
migration at the same time. 

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -38,6 +38,8 @@
>  #include <asm/hvm/cacheattr.h>
>  #include <asm/mtrr.h>
>  #include <asm/guest_pt.h>
> +#include <asm/mem_event.h>
> +#include <asm/mem_access.h>
>  #include <public/sched.h>
>  #include "private.h"
>  #include "types.h"
> @@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
>              }
>      }
>  
> +    /* Propagate access permissions */
> +    if ( unlikely((level == 1) &&
> +                  mem_event_check_ring(&d->mem_event->access) &&
> +                  !sh_mfn_is_a_page_table(target_mfn)) )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +        p2m_access_t a;
> +        p2m_type_t t;
> +        mfn_t mfn;
> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);

Ah, no.  This access lookup ought to happen as part of the original
p2m lookup that produced target_mfn.  Having a second lookup here is
unpleasant and confusing -- this code keeps the distinction between
gfn and mfn pretty strictly, so passing an mfn into a p2m lookup is
wrong, even though you happen to know they're the same namespace in
this PV guest.

Of course, if you can hide the access type in the struct page_info,
then it's correct to recover it here.
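
(Sketched against the hunk above, reusing the hypothetical helpers from
earlier in this message; the sflags name for the shadow flags variable
in _sh_propagate() is an assumption.)

    /* L1 shadows only: take the access type straight from the page and
     * fold it into the shadow flags, instead of a second p2m lookup. */
    if ( unlikely(level == 1 &&
                  mem_event_check_ring(&d->mem_event->access) &&
                  !sh_mfn_is_a_page_table(target_mfn)) )
        p2m_access_restrict_flags(&sflags,
                                  page_get_access(mfn_to_page(target_mfn)));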

> +            /*
> +             * Do not police writes to guest memory emanating from the Xen
> +             * kernel. Trying to do so will cause the same pagefault to occur
> +             * over and over again with an event being sent to the access
> +             * listener for each fault. If the access listener's vcpu is not
> +             * scheduled during this time, the violation is never resolved and
> +             * will eventually end with the host crashing.
> +             */
> +            if ( (violation && access_w) &&
> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
> +            {
> +                violation = 0;
> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
> +                                    p2m_ram_rw, p2m_access_rw);
> +            }

That's exploitable, surely: the guest just needs to make a hypercall
with a chosen buffer address to give itself _rw access to a page.
(Which I guess is OK for your intended use of rw<->rx but not in
general.)

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-05-01 14:18           ` Tim Deegan
@ 2014-05-01 19:14             ` Aravindh Puthiyaparambil (aravindp)
  2014-05-08 12:44               ` Tim Deegan
  0 siblings, 1 reply; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-01 19:14 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Ian Jackson, Jan Beulich, xen-devel

>> >>>mfn_t gmfn)
>> >>>>          if ( !(shadow_mode_external(v->domain)
>> >>>>                 && (page->count_info & PGC_count_mask) <= 3
>> >>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>> >>>> -                   == !!is_xen_heap_page(page))) )
>> >>>> +                       == !!is_xen_heap_page(page)))
>> >>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>> >>>
>> >>>You're breaking indentation, there are pointless parentheses again,
>> >>>but most importantly - why?
>> >>
>> >> Sorry, I meant to ask a question about this in my patch message and
>> >> mark this as a workaround. I am seeing the message on all
>> >> sh_remove_all_mappings() and I was not able to figure out why this
>> >> was happening. I just added this as a work around. I was hoping you
>> >> or Tim
>> >would shed more light on this.
>> >
>> >I'm afraid that without detail on which pages the triggers on, and
>> >you at least having spent some time finding out where the stray/extra
>> >references may be coming from it's going to be hard to help.
>>
>> Here are some of the prints I saw. I typically saw it for every set_entry() call.
>>
>> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of
>> mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
>>  (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of
>> mfn 134b8: c=8000000000000038 t=7400000000000001
>
>This is because you are not in shadow_mode_refcounts() (i.e. the page's
>refcount and typecount are based on the _guest_ pagetables rather than the
>_shadow_ ones).  That means that sh_remove_all_mappings() can't use the
>refcounts to figure out when it's removed the last shadow mapping.

I take it that I won't be able to run with shadow_mode_refcounts() for PV domains.

>That also means that sh_remove_all_mappings() will have had to search every
>sl1e in the system, which is expensive.  If there will be a lot of these updates,
>you might consider batching them and using
>shadow_blow_tables() after each batch instead.

Does this mean batching them in the hypervisor or in the mem_access listener? Typically one would want the effect of the new permissions to kick in immediately. So I am afraid batching them will delay this. Anyhow, at a minimum I will add a comment here as to why I am adding the check here. Once the feature has gone in I will revisit to see if I can add some performance improvements in this area.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-05-01 14:39   ` Tim Deegan
@ 2014-05-01 19:26     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-01 19:26 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Keir Fraser, Ian Jackson, Ian Campbell, Jan Beulich

>Thanks for your patch.  I have to say, this looks a lot less terrifying than I
>thought it would. :)

I am glad to hear that. I too feared the worst :-) On that note, many thanks to you and Jan for taking the time to give valuable feedback.

>> Shadow mem_access mode
>> ----------------------
>> Add a new shadow mode for mem_access. This should only be enabled by a
>> mem_access listener when it calls
>> XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
>> for a PV domain.
>
>Do we need a specific mode for this?  If the default behaviour is to allow all
>accesses until told otherwise, then it can be enabled at all times.

You might be right. I will look into getting rid of it.

>> P2M changes
>> -----------
>> Add a new p2m implementation for mem_access. Central to this is the
>> access lookup table. This is an array of mfns. Each mfn is a page
>> allocated from the domain's shadow memory. The pages hold the
>> p2m_access_t values for each guest gmfn. p2m_mem_access_set_entry()
>> sets the access value of the mfn given as input and blows the shadow
>> entries for the mfn. p2m_mem_access_get_entry() returns the access
>> value of the mfn given as input.
>
>I think this array needs to be a trie, at least; it may make sense to reuse the
>p2m_pt implementation and just restrict/ignore the types and MFNs.
>That way you won't have to reinvent the wheel, esp. around avoiding long-
>running operations.

I like the idea you mentioned below of stashing the access type in the shadow_flags field of struct page_info. If that does not work out I will look into reusing the p2m_pt implementation.

>> +/* Convert access restrictions to page table flags */ void
>> +p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access) {
>> +
>> +    /* Restrict with access permissions */
>> +    switch (access)
>> +    {
>> +        case p2m_access_r:
>> +            *flags &= ~(_PAGE_RW);
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
>
>IIUC this function is called to add further restrictions to an existing set of flags.
>In that case, it should never set the _PAGE_PRESENT bit.  Rather, it should
>only ever _reduce_ permissions (i.e. set _NX or clear other bits).  Then...
>
>> +            break;
>> +        case p2m_access_rx:
>> +        case p2m_access_rx2rw:
>> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
>> +            *flags |= _PAGE_PRESENT;
>> +            break;
>> +        case p2m_access_rw:
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rwx:
>> +        default:
>> +            *flags &= ~(_PAGE_NX_BIT);
>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +    }
>> +
>> +    // Allow more restrictive guest flags to be propagated instead of access
>> +    // permissions
>> +    if ( !(gflags & _PAGE_RW) )
>> +        *flags &= ~(_PAGE_RW);
>> +    if ( gflags & _PAGE_NX_BIT )
>> +        *flags |= _PAGE_NX_BIT;
>
>..you won't need these either.

That makes sense. I will follow your suggestion.

>> +}
>> +
>> +/*
>> + * Set the page permission of the mfn. This in effect removes all
>> +shadow
>> + * mappings of that mfn. The access type of that mfn is stored in the
>> +access
>> + * lookup table.
>> + */
>> +static int
>> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long
>gfn, mfn_t mfn,
>> +                         unsigned int page_order, p2m_type_t p2mt,
>> +                         p2m_access_t p2ma) {
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table !=
>NULL);
>> +
>> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions
>*/
>> +    if ( unlikely(p2ma != p2m_access_r &&
>> +                  p2ma != p2m_access_rw &&
>> +                  p2ma != p2m_access_rx &&
>> +                  p2ma != p2m_access_rwx &&
>> +                  p2ma != p2m_access_rx2rw) )
>> +        return -EINVAL;
>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return -ENOENT;
>
>If we're only ever setting access permissions for a guest's own pages (i.e. not
>setting them for grant mappings or foreign mappings) then the access
>permissions lookup table could potentially be shared among all PV VMs.  It
>might even be possible to fold the access-type into the shadow_flags field in
>struct page_info and avoid having a lookup table at all.  If that's possible, I
>think it will make some things a _lot_ easier.

This idea looks very promising. I remember mentioning this in some of my earlier emails but got sidetracked thinking PV memory is contiguous and I could use a simple indexed lookup table. So I will resubmit this patch series after I implement a new version of p2m-ma using the shadow_flags or reusing the p2m-pt implementation.

>> -    /* Sanity check the call */
>> -    if ( d == current->domain || (d->arch.paging.mode & mode) == mode )
>> +    /*
>> +     * Sanity check the call.
>> +     * Do not allow shadow mem_access mode to be enabled when
>> +     * log-dirty or translate mode is enabled.
>
>Why not log-dirty?  That would mean you can't use mem-access and live
>migration at the same time.

I was afraid they would conflict and prevent log-dirty from working correctly. But I forgot about it being used during live-migration. Let me see if I can get both of them to live together.

>> +    /* Propagate access permissions */
>> +    if ( unlikely((level == 1) &&
>> +                  mem_event_check_ring(&d->mem_event->access) &&
>> +                  !sh_mfn_is_a_page_table(target_mfn)) )
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +        p2m_access_t a;
>> +        p2m_type_t t;
>> +        mfn_t mfn;
>> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0,
>> + NULL);
>
>Ah, no.  This access lookup ought to happen as part of the original p2m lookup
>that produced target_mfn.  Having a second lookup here is unpleasant and
>confusing -- this code keeps the distinction between gfn and mfn pretty
>strictly, so passing an mfn into a p2m lookup is wrong, even though you
>happen to know they're the same namespace in this PV guest.
>
>Of course, if you can hide the access type in the struct page_info, then it's
>correct to recover it here.

OK, I am hoping to do that here. 

>> +            /*
>> +             * Do not police writes to guest memory emanating from the Xen
>> +             * kernel. Trying to do so will cause the same pagefault to occur
>> +             * over and over again with an event being sent to the access
>> +             * listener for each fault. If the access listener's vcpu is not
>> +             * scheduled during this time, the violation is never resolved and
>> +             * will eventually end with the host crashing.
>> +             */
>> +            if ( (violation && access_w) &&
>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>> +            {
>> +                violation = 0;
>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>> +                                    p2m_ram_rw, p2m_access_rw);
>> +            }
>
>That's exploitable, surely: the guest just needs to make a hypercall with a
>chosen buffer address to give itself _rw access to a page.
>(Which I guess is OK for your intended use of rw<->rx but not in
>general.)

I agree but could not come up with a way to get it to work. I have sent a separate email regarding this issue. 

Cheers,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
  2014-05-01  1:11           ` Andres Lagar-Cavilla
  2014-05-01 14:18           ` Tim Deegan
@ 2014-05-02  8:26           ` Jan Beulich
  2014-05-02 16:28             ` Aravindh Puthiyaparambil (aravindp)
  2 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2014-05-02  8:26 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Tim Deegan, xen-devel, Ian Jackson

>>> On 01.05.14 at 00:20, <aravindp@cisco.com> wrote:
>>I think implying _n to mean "default" is at least a latent bug anyway (since that
>>precludes setting that type). And _if_ "get" really does any modification, that
>>minimally would require a big comment.
> 
> For PV domains access_n is precluded.
> 
>>Furthermore, with it being possible for the default type to change, the net
>>effect might easily be different between setting the default type at
>>initialization time or when the first lookup happens.
> 
> I am not sure that I follow your concern here. Setting default access does 
> not mean that the access table gets cleared. So if a particular MFN entry's 
> access value changed and the default value was changed, then the MFN should 
> retain the access value it was changed to and not the default value. The new 
> default value will only be returned for MFNs for which set_entry was not 
> called, and I think that is correct.

It might still make a difference when the designated access takes effect:
I would assume that a request for the default type should obtain the
default set at the time of the request, not the one in effect at the time
the entry gets first used.

>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m) {
>>>>> +    struct domain *d = p2m->domain;
>>>>> +    mfn_t *access_lookup_table;
>>>>> +    uint32_t nr_access_table_pages;
>>>>> +    uint32_t ctr;
>>>>> +
>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>> +    access_lookup_table = xzalloc_array(mfn_t,
>>>>> + nr_access_table_pages);
>>>>
>>>>This surely is going to be an order > 0 allocation, and you even risk
>>>>it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>>by convention/agreement, the latter is an outright bug.
>>>
>>> Sorry, I was not aware of the convention. What should I be doing here
>>> instead?
>>
>>Especially with the above note about GFN space being effectively unbounded
>>for PV guests, you need to find a different data structure to represent the
>>information you need. Perhaps something similar to what log-dirty mode
>>uses, except that you would need more levels in the extreme case, and hence
>>it might be worthwhile making the actually used number of levels dynamic?
> 
> I am a little confused here. In my initial discussions about this I was told 
> that PV guests have bounded and contiguous memory. This is why I took the 
> approach of an indexable array. 
> 
> " OTOH, all you need is a byte per pfn, and the great thing is that in PV 
> domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, 
> etc, which demand the sparse tree structure. So you can allocate an easily 
> indexable array, notwithstanding super page concerns (I think/hope)."
> 
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html 
> 
> Did I misunderstand something here? 

What he stated is the situation at domain start. Due to ballooning and
other kinds of manipulations (as said, pv-ops re-arranges things to
avoid having memory [PFNs] overlap MMIO [MFNs]) this may change
immediately after the domain started.

And even if it was a plain contiguous linear address space, you still
wouldn't be permitted to allocate a flat one-byte-per-page array, as
for large domains this might still overflow the MAX_ORDER allocation
constraint.
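
(A standalone back-of-the-envelope, compiled outside Xen, just to show
the sizes involved; it assumes one access byte per page and an 8-byte
mfn_t, matching the series.)

#include <stdio.h>

int main(void)
{
    unsigned long tot_pages = 28UL << (30 - 12);           /* 28GB, 4K pages */
    unsigned long data_pages = (tot_pages + 4095) / 4096;  /* 1 byte per pfn */
    unsigned long mfn_array_bytes = data_pages * 8;        /* one mfn_t per data page */

    /* ~1792 data pages and a ~14KB mfn_t array: already an order > 0
     * xenheap allocation, and it only grows with the guest. */
    printf("%lu data pages, %lu byte mfn array\n", data_pages, mfn_array_bytes);
    return 0;
}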

>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>>>mfn_t gmfn)
>>>>>          if ( !(shadow_mode_external(v->domain)
>>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>>> -                   == !!is_xen_heap_page(page))) )
>>>>> +                       == !!is_xen_heap_page(page)))
>>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>>
>>>>You're breaking indentation, there are pointless parentheses again,
>>>>but most importantly - why?
>>>
>>> Sorry, I meant to ask a question about this in my patch message and
>>> mark this as a workaround. I am seeing the message on all
>>> sh_remove_all_mappings() and I was not able to figure out why this was
>>> happening. I just added this as a work around. I was hoping you or Tim
>>would shed more light on this.
>>
>>I'm afraid that without detail on which pages the triggers on, and you at 
> least
>>having spent some time finding out where the stray/extra references may be
>>coming from it's going to be hard to help.
> 
> Here are some of the prints I saw. I typically saw it for every set_entry() 
> call.

There are some rather large usage counts among the examples you
gave - these surely need to be understood rather than blindly ignored.

>>>>> +/* Number of access table pages for a PV domain */ #define
>>>>> +get_domain_nr_access_table_pages(d) \
>>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1),
>>>>> +PAGE_SIZE)
>>>>
>>>>And once again. I wonder whether this code was tested on a suitably
>>>>big pv-ops PV guest.
>>>
>>> The largest PV domain I tried this with was 28GB.
>>
>>Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>>worked.
> 
> It would have had MFNs larger than tot_pages. I don't understand how it 
> would have had GFNs larger than tot_pages.

Once again, because it re-arranges its PFN space to avoid MMIO holes
(plus - applicable to non-pv-ops too - it may be started ballooned or
balloon out memory subsequently).

Jan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page
  2014-04-29  4:45 ` [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page Aravindh Puthiyaparambil
@ 2014-05-02 12:41   ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2014-05-02 12:41 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Mon, 2014-04-28 at 21:45 -0700, Aravindh Puthiyaparambil wrote:
> Add two APIs, xc_mem_access_create_ring_page() and
> xc_mem_access_get_ring_mfn(). The mem_access listener needs to call
> xc_mem_access_create_ring_page() before enabling mem_access for PV
> domains. This is not needed for HVM domains as the page is created
> during domain creation time. It can then call
> xc_mem_access_get_ring_mfn() to get the mfn of the created page to map
> in. This is equivalent to xc_get_hvm_param(HVM_PARAM_ACCESS_RING_PFN)
> for HVM domains.
> 
> Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Seems like a reasonable veneer over the hypercall, so assuming that is
acked by the hypervisor guys, this tools bit:
        Acked-by: Ian Campbell <ian.campbell@citrix.com>
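
(For context, a listener would presumably drive the new calls roughly as
below.  The exact prototypes are not shown in this thread, so the
signatures used here are guesses.)

#include <err.h>
#include <xenctrl.h>

static void setup_pv_ring(xc_interface *xch, domid_t domid)
{
    unsigned long ring_mfn;

    /* PV only: the ring page is not created at domain build time. */
    if ( xc_mem_access_create_ring_page(xch, domid) )        /* assumed */
        err(1, "xc_mem_access_create_ring_page");

    if ( xc_mem_access_get_ring_mfn(xch, domid, &ring_mfn) ) /* assumed */
        err(1, "xc_mem_access_get_ring_mfn");

    /* ring_mfn can now be mapped and mem_access enabled as for HVM. */
}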

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 4/4] tool/xen-access: Add support for PV domains
  2014-04-29  4:45 ` [PATCH RFC 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
@ 2014-05-02 12:43   ` Ian Campbell
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2014-05-02 12:43 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil; +Cc: xen-devel, Ian Jackson, Stefano Stabellini

On Mon, 2014-04-28 at 21:45 -0700, Aravindh Puthiyaparambil wrote:
> Add support to the xen-access test program for it to work with PV domains.
> The main difference is that for PV domains, unlike HVM domains,
> xc_mem_access_create_ring_page() has to be called as the page is not
> created during domain creation time. PV domains do not need to set
> all individual page access permissions during setup and teardown. Enabling
> and disabling mem_access does that indirectly.
> 
> Signed-off-by: Aravindh Puthiyaparambil <aravindp@cisco.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-05-02  8:26           ` Jan Beulich
@ 2014-05-02 16:28             ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 22+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-02 16:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Tim Deegan, xen-devel, Ian Jackson

>>>> On 01.05.14 at 00:20, <aravindp@cisco.com> wrote:
>>>I think implying _n to mean "default" is at least a latent bug anyway
>>>(since that precludes setting that type). And _if_ "get" really does
>>>any modification, that minimally would require a big comment.
>>
>> For PV domains access_n is precluded.
>>
>>>Furthermore, with it being possible for the default type to change,
>>>the net effect might easily be different between setting the default
>>>type at initialization time or when the first lookup happens.
>>
>> I am not sure that I follow your concern here. Setting default access
>> does not mean that the access table gets cleared. So if a particular
>> MFN entry's access value changed and the default value was changed,
>> then the MFN should retain the access value it was changed to and not
>> the default value. The new default value will only be returned for
>> MFNs for which set_entry was not called on and I think that is correct.
>
>It might still make a difference when the designated access takes effect:
>I would assume that a request for the default type should obtain the default
>set at the time of the request, not the one in effect at the time the entry
>gets first used.

Ah, I understand now. Given that I am moving away from using the lookup table, this might not be a concern anymore.

>>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m) {
>>>>>> +    struct domain *d = p2m->domain;
>>>>>> +    mfn_t *access_lookup_table;
>>>>>> +    uint32_t nr_access_table_pages;
>>>>>> +    uint32_t ctr;
>>>>>> +
>>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>>> +    access_lookup_table = xzalloc_array(mfn_t,
>>>>>> + nr_access_table_pages);
>>>>>
>>>>>This surely is going to be an order > 0 allocation, and you even
>>>>>risk it being an order > MAX_ORDER one. The former is disallowed at
>>>>>runtime by convention/agreement, the latter is an outright bug.
>>>>
>>>> Sorry, I was not aware of the convention. What should I be doing
>>>> here instead?
>>>
>>>Especially with the above note about GFN space being effectively
>>>unbounded for PV guests, you need to find a different data structure
>>>to represent the information you need. Perhaps something similar to
>>>what log-dirty mode uses, except that you would need more levels in
>>>the extreme case, and hence it might be worthwhile making the actually
>used number of levels dynamic?
>>
>> I am a little confused here. In my initial discussions about this I
>> was told that PV guests have bounded and contiguous memory. This is
>> why I took the approach of an indexable array.
>>
>> " OTOH, all you need is a byte per pfn, and the great thing is that in
>> PV domains, the physmap is bounded and continuous. Unlike HVM and its
>> PCI holes, etc, which demand the sparse tree structure. So you can
>> allocate an easily indexable array, notwithstanding super page concerns (I
>think/hope)."
>>
>> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.h
>> tml
>>
>> Did I misunderstand something here?
>
>What he stated is the situation at domain start. Due to ballooning and other
>kinds of manipulations (as said, pv-ops re-arranges things to avoid having
>memory [PFNs] overlap MMIO [MFNs]) this may change immediately after
>the domain started.
>
>And even if it was a plain contiguous linear address space, you still wouldn't be
>permitted to allocate a flat one-byte-per-page array, as for large domains this
>might still overflow the MAX_ORDER allocation constraint.

OK, the flat array is a bad idea. I am looking into some of Tim's suggestions to do this now: either stash the values in the shadow_flags or reuse the p2m-pt implementation. I am going to try reusing the shadow_flags first and if that does not work I will go with the p2m-pt implementation.

>>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>>>>mfn_t gmfn)
>>>>>>          if ( !(shadow_mode_external(v->domain)
>>>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>>>> -                   == !!is_xen_heap_page(page))) )
>>>>>> +                       == !!is_xen_heap_page(page)))
>>>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>>>
>>>>>You're breaking indentation, there are pointless parentheses again,
>>>>>but most importantly - why?
>>>>
>>>> Sorry, I meant to ask a question about this in my patch message and
>>>> mark this as a workaround. I am seeing the message on all
>>>> sh_remove_all_mappings() and I was not able to figure out why this
>>>> was happening. I just added this as a work around. I was hoping you
>>>> or Tim
>>>would shed more light on this.
>>>
>>>I'm afraid that without detail on which pages the triggers on, and you
>>>at
>> least
>>>having spent some time finding out where the stray/extra references
>>>may be coming from it's going to be hard to help.
>>
>> Here are some of the prints I saw. I typically saw it for every
>> set_entry() call.
>
>There are some rather large usage counts among the examples you gave -
>these surely need to be understood rather than blindly ignored.

Tim gave me the reason for this:
"This is because you are not in shadow_mode_refcounts() (i.e. the page's refcount and typecount are based on the _guest_ pagetables rather than the _shadow_ ones).  That means that sh_remove_all_mappings() can't use the refcounts to figure out when it's removed the last shadow mapping."

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
  2014-05-01 19:14             ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-08 12:44               ` Tim Deegan
  0 siblings, 0 replies; 22+ messages in thread
From: Tim Deegan @ 2014-05-08 12:44 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Keir Fraser, Ian Campbell,
	Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Ian Jackson, Jan Beulich, xen-devel

At 19:14 +0000 on 01 May (1398968076), Aravindh Puthiyaparambil (aravindp) wrote:
> >> (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of
> >> mfn 33960: c=8000000000000002 t=7400000000000000 <most common>
> >>  (XEN) sh error: sh_remove_all_mappings(): can't find all mappings of
> >> mfn 134b8: c=8000000000000038 t=7400000000000001
> >
> >This is because you are not in shadow_mode_refcounts() (i.e. the page's
> >refcount and typecount are based on the _guest_ pagetables rather than the
> >_shadow_ ones).  That means that sh_remove_all_mappings() can't use the
> >refcounts to figure out when it's removed the last shadow mapping.
> 
> I take it that I won't be able to run with shadow_mode_refcounts() for PV domains.

Nope, that's not an option. :(  The PV MM code uses typecounts in
various places that aren't visible from just the pagetables, so we
can't let the shadow pagetable code control them.

> >That also means that sh_remove_all_mappings() will have had to search every
> >sl1e in the system, which is expensive.  If there will be a lot of these updates,
> >you might consider batching them and using
> >shadow_blow_tables() after each batch instead.
> 
> Does this mean batching them in the hypervisor or in the mem_access
> listener?

Either, I suppose.

> Typically one would want the effect of the new permissions
> to kick in immediately. So I am afraid batching them will delay
> this. Anyhow, at a minimum I will add a comment here as to why I am
> adding the check here. Once the feature has gone in I will revisit
> to see if I can add some performance improvements in this area.

Righto.  

Tim.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2014-05-08 12:45 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-29  4:45 [PATCH RFC 0/4] Add mem_access support for PV domains Aravindh Puthiyaparambil
2014-04-29  4:45 ` [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access Aravindh Puthiyaparambil
2014-04-29  8:50   ` Jan Beulich
2014-04-29 23:10     ` Aravindh Puthiyaparambil (aravindp)
2014-04-30  6:32       ` Jan Beulich
2014-04-30 22:20         ` Aravindh Puthiyaparambil (aravindp)
2014-05-01  1:11           ` Andres Lagar-Cavilla
2014-05-01  3:18             ` Aravindh Puthiyaparambil (aravindp)
2014-05-01 14:18           ` Tim Deegan
2014-05-01 19:14             ` Aravindh Puthiyaparambil (aravindp)
2014-05-08 12:44               ` Tim Deegan
2014-05-02  8:26           ` Jan Beulich
2014-05-02 16:28             ` Aravindh Puthiyaparambil (aravindp)
2014-04-29 12:05   ` Tamas Lengyel
2014-04-29 23:36     ` Aravindh Puthiyaparambil (aravindp)
2014-05-01 14:39   ` Tim Deegan
2014-05-01 19:26     ` Aravindh Puthiyaparambil (aravindp)
2014-04-29  4:45 ` [PATCH RFC 2/4] x86/mem_access: mem_access and mem_event changes to support PV domains Aravindh Puthiyaparambil
2014-04-29  4:45 ` [PATCH RFC 3/4] tools/libxc: Add APIs to create and get the PV ring page Aravindh Puthiyaparambil
2014-05-02 12:41   ` Ian Campbell
2014-04-29  4:45 ` [PATCH RFC 4/4] tool/xen-access: Add support for PV domains Aravindh Puthiyaparambil
2014-05-02 12:43   ` Ian Campbell
