* [PATCH v8 00/16] acquire_resource size and external IPT monitoring
@ 2021-01-30  2:58 Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 01/16] xen/memory: Reject out-of-range resource 'frame' values Andrew Cooper
                   ` (16 more replies)
  0 siblings, 17 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel

Combined series (as they are dependent).  First, the resource size fixes, and
then the external IPT monitoring built on top.

Posting in full for reference, but several patches are ready to go in.  Those
in need of review are patches 6, 8 and 12.

See individual patches for changes.  The major work was rebasing over the
ARM/IOREQ series, which moved a large amount of code that this series was
bugfixing.

Andrew Cooper (7):
  xen/memory: Reject out-of-range resource 'frame' values
  xen/gnttab: Rework resource acquisition
  xen/memory: Fix acquire_resource size semantics
  xen/memory: Improve compat XENMEM_acquire_resource handling
  xen/memory: Indent part of acquire_resource()
  xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace

Michał Leszczyński (7):
  xen/domain: Add vmtrace_size domain creation parameter
  tools/[lib]xl: Add vmtrace_buf_size parameter
  xen/memory: Add a vmtrace_buf resource type
  x86/vmx: Add Intel Processor Trace support
  xen/domctl: Add XEN_DOMCTL_vmtrace_op
  tools/libxc: Add xc_vmtrace_* functions
  tools/misc: Add xen-vmtrace tool

Tamas K Lengyel (2):
  xen/vmtrace: support for VM forks
  x86/vm_event: Carry the vmtrace buffer position in vm_event

 docs/man/xl.cfg.5.pod.in                    |   9 +
 tools/golang/xenlight/helpers.gen.go        |   4 +
 tools/golang/xenlight/types.gen.go          |   2 +
 tools/include/libxl.h                       |  14 ++
 tools/include/xenctrl.h                     |  73 ++++++++
 tools/libs/ctrl/Makefile                    |   1 +
 tools/libs/ctrl/xc_vmtrace.c                | 128 +++++++++++++
 tools/libs/light/libxl.c                    |   2 +
 tools/libs/light/libxl_cpuid.c              |   1 +
 tools/libs/light/libxl_create.c             |   1 +
 tools/libs/light/libxl_types.idl            |   5 +
 tools/misc/.gitignore                       |   1 +
 tools/misc/Makefile                         |   7 +
 tools/misc/xen-cpuid.c                      |   2 +-
 tools/misc/xen-vmtrace.c                    | 154 ++++++++++++++++
 tools/ocaml/libs/xc/xenctrl.ml              |   1 +
 tools/ocaml/libs/xc/xenctrl.mli             |   1 +
 tools/xl/xl_info.c                          |   5 +-
 tools/xl/xl_parse.c                         |   4 +
 xen/arch/x86/domain.c                       |  23 +++
 xen/arch/x86/domctl.c                       |  55 ++++++
 xen/arch/x86/hvm/vmx/vmcs.c                 |  19 +-
 xen/arch/x86/hvm/vmx/vmx.c                  | 200 +++++++++++++++++++-
 xen/arch/x86/mm/mem_sharing.c               |   3 +
 xen/arch/x86/vm_event.c                     |   3 +
 xen/common/compat/memory.c                  | 147 +++++++++++----
 xen/common/domain.c                         |  81 ++++++++
 xen/common/grant_table.c                    | 112 ++++++++----
 xen/common/ioreq.c                          |   2 +-
 xen/common/memory.c                         | 274 +++++++++++++++++++---------
 xen/common/sysctl.c                         |   2 +
 xen/include/asm-x86/cpufeature.h            |   1 +
 xen/include/asm-x86/hvm/hvm.h               |  72 ++++++++
 xen/include/asm-x86/hvm/vmx/vmcs.h          |   4 +
 xen/include/asm-x86/msr.h                   |  32 ++++
 xen/include/public/arch-x86/cpufeatureset.h |   1 +
 xen/include/public/domctl.h                 |  38 ++++
 xen/include/public/memory.h                 |  18 +-
 xen/include/public/sysctl.h                 |   3 +-
 xen/include/public/vm_event.h               |   7 +
 xen/include/xen/domain.h                    |   2 +
 xen/include/xen/grant_table.h               |  21 ++-
 xen/include/xen/ioreq.h                     |   2 +-
 xen/include/xen/sched.h                     |   6 +
 xen/xsm/flask/hooks.c                       |   1 +
 45 files changed, 1366 insertions(+), 178 deletions(-)
 create mode 100644 tools/libs/ctrl/xc_vmtrace.c
 create mode 100644 tools/misc/xen-vmtrace.c

-- 
2.11.0




* [PATCH v8 01/16] xen/memory: Reject out-of-range resource 'frame' values
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 02/16] xen/gnttab: Rework resource acquisition Andrew Cooper
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Stefano Stabellini, Julien Grall, Volodymyr Babchuk,
	Paul Durrant, Oleksandr Tyshchenko

The ABI is unfortunate: frame being 64 bits wide leads to all kinds of
problems when performing correct overflow checks.

Reject out-of-range values and combinations which overflow, and use unsigned
int consistently elsewhere.  This fixes several truncation bugs in the grant
call tree, as the underlying limits are expressed with unsigned int to begin
with.
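
For illustration only: a hypothetical caller tripping the new check might look
like the sketch below (assuming a Linux-style HYPERVISOR_memory_op() wrapper,
with guest_domid and frame_list provided by the caller; this is not part of
the patch):

    struct xen_mem_acquire_resource xmar = {
        .domid     = guest_domid,       /* assumed variable */
        .type      = XENMEM_resource_grant_table,
        .id        = XENMEM_resource_grant_table_id_shared,
        .frame     = 1ULL << 40,        /* offset doesn't fit in 32 bits */
        .nr_frames = 1,
    };

    set_xen_guest_handle(xmar.frame_list, frame_list);

    /* Previously truncated deep in the grant call tree; now fails -EINVAL. */
    rc = HYPERVISOR_memory_op(XENMEM_acquire_resource, &xmar);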

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien@xen.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Paul Durrant <paul@xen.org>
CC: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>

v8:
 * Push long=>int conversion all the way down into gnttab/ioreq
 * Rebase over ARM/IOREQ series

v7.5:
 * New
---
 xen/common/grant_table.c      | 14 +++++++-------
 xen/common/ioreq.c            |  2 +-
 xen/common/memory.c           | 17 +++++++++++++++--
 xen/include/xen/grant_table.h |  8 ++++----
 xen/include/xen/ioreq.h       |  2 +-
 5 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index f6f5acd300..4ecec35f28 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3918,7 +3918,7 @@ int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
 
 /* caller must hold write lock */
 static int gnttab_get_status_frame_mfn(struct domain *d,
-                                       unsigned long idx, mfn_t *mfn)
+                                       unsigned int idx, mfn_t *mfn)
 {
     const struct grant_table *gt = d->grant_table;
 
@@ -3929,8 +3929,8 @@ static int gnttab_get_status_frame_mfn(struct domain *d,
 
     if ( idx >= nr_status_frames(gt) )
     {
-        unsigned long nr_status;
-        unsigned long nr_grant;
+        unsigned int nr_status;
+        unsigned int nr_grant;
 
         nr_status = idx + 1; /* sufficient frames to make idx valid */
 
@@ -3958,7 +3958,7 @@ static int gnttab_get_status_frame_mfn(struct domain *d,
 
 /* caller must hold write lock */
 static int gnttab_get_shared_frame_mfn(struct domain *d,
-                                       unsigned long idx, mfn_t *mfn)
+                                       unsigned int idx, mfn_t *mfn)
 {
     const struct grant_table *gt = d->grant_table;
 
@@ -3966,7 +3966,7 @@ static int gnttab_get_shared_frame_mfn(struct domain *d,
 
     if ( idx >= nr_grant_frames(gt) )
     {
-        unsigned long nr_grant;
+        unsigned int nr_grant;
 
         nr_grant = idx + 1; /* sufficient frames to make idx valid */
 
@@ -4021,7 +4021,7 @@ int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn, mfn_t *mfn)
     return rc;
 }
 
-int gnttab_get_shared_frame(struct domain *d, unsigned long idx,
+int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
                             mfn_t *mfn)
 {
     struct grant_table *gt = d->grant_table;
@@ -4034,7 +4034,7 @@ int gnttab_get_shared_frame(struct domain *d, unsigned long idx,
     return rc;
 }
 
-int gnttab_get_status_frame(struct domain *d, unsigned long idx,
+int gnttab_get_status_frame(struct domain *d, unsigned int idx,
                             mfn_t *mfn)
 {
     struct grant_table *gt = d->grant_table;
diff --git a/xen/common/ioreq.c b/xen/common/ioreq.c
index a36137d41d..90ed2e0302 100644
--- a/xen/common/ioreq.c
+++ b/xen/common/ioreq.c
@@ -771,7 +771,7 @@ static int ioreq_server_get_info(struct domain *d, ioservid_t id,
 }
 
 int ioreq_server_get_frame(struct domain *d, ioservid_t id,
-                           unsigned long idx, mfn_t *mfn)
+                           unsigned int idx, mfn_t *mfn)
 {
     struct ioreq_server *s;
     int rc;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 33296e65cb..4d681597a5 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1055,7 +1055,7 @@ static long xatp_permission_check(struct domain *d, unsigned int space)
 }
 
 static int acquire_grant_table(struct domain *d, unsigned int id,
-                               unsigned long frame,
+                               unsigned int frame,
                                unsigned int nr_frames,
                                xen_pfn_t mfn_list[])
 {
@@ -1094,7 +1094,7 @@ static int acquire_grant_table(struct domain *d, unsigned int id,
 
 static int acquire_ioreq_server(struct domain *d,
                                 unsigned int id,
-                                unsigned long frame,
+                                unsigned int frame,
                                 unsigned int nr_frames,
                                 xen_pfn_t mfn_list[])
 {
@@ -1164,6 +1164,19 @@ static int acquire_resource(
     if ( xmar.nr_frames > ARRAY_SIZE(mfn_list) )
         return -E2BIG;
 
+    /*
+     * The ABI is rather unfortunate.  nr_frames (and therefore the total size
+     * of the resource) is 32bit, while frame (the offset within the resource
+     * we'd like to start at) is 64bit.
+     *
+     * Reject values outside the range of nr_frames, as well as combinations
+     * of frame and nr_frames which overflow, to simplify the rest of the
+     * logic.
+     */
+    if ( (xmar.frame >> 32) ||
+         ((xmar.frame + xmar.nr_frames) >> 32) )
+        return -EINVAL;
+
     rc = rcu_lock_remote_domain_by_id(xmar.domid, &d);
     if ( rc )
         return rc;
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 8876f1f28e..6d14fe2526 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -55,9 +55,9 @@ int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
 
 int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn,
                      mfn_t *mfn);
-int gnttab_get_shared_frame(struct domain *d, unsigned long idx,
+int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
                             mfn_t *mfn);
-int gnttab_get_status_frame(struct domain *d, unsigned long idx,
+int gnttab_get_status_frame(struct domain *d, unsigned int idx,
                             mfn_t *mfn);
 
 #else
@@ -92,13 +92,13 @@ static inline int gnttab_map_frame(struct domain *d, unsigned long idx,
     return -EINVAL;
 }
 
-static inline int gnttab_get_shared_frame(struct domain *d, unsigned long idx,
+static inline int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
                                           mfn_t *mfn)
 {
     return -EINVAL;
 }
 
-static inline int gnttab_get_status_frame(struct domain *d, unsigned long idx,
+static inline int gnttab_get_status_frame(struct domain *d, unsigned int idx,
                                           mfn_t *mfn)
 {
     return -EINVAL;
diff --git a/xen/include/xen/ioreq.h b/xen/include/xen/ioreq.h
index 2d635e9432..a54a637bef 100644
--- a/xen/include/xen/ioreq.h
+++ b/xen/include/xen/ioreq.h
@@ -90,7 +90,7 @@ bool vcpu_ioreq_handle_completion(struct vcpu *v);
 bool is_ioreq_server_page(struct domain *d, const struct page_info *page);
 
 int ioreq_server_get_frame(struct domain *d, ioservid_t id,
-                           unsigned long idx, mfn_t *mfn);
+                           unsigned int idx, mfn_t *mfn);
 int ioreq_server_map_mem_type(struct domain *d, ioservid_t id,
                               uint32_t type, uint32_t flags);
 
-- 
2.11.0




* [PATCH v8 02/16] xen/gnttab: Rework resource acquisition
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 01/16] xen/memory: Reject out-of-range resource 'frame' values Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 03/16] xen/memory: Fix acquire_resource size semantics Andrew Cooper
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

The existing logic doesn't function in the general case for mapping a guest's
grant table, due to the arbitrary 32 frame limit, while the default grant
table limit for guests is 64.

In order to start addressing this, rework the existing grant table logic by
implementing a single gnttab_acquire_resource().  This is far more efficient
than the previous acquire_grant_table() in memory.c because it doesn't take
the grant table write lock, and attempt to grow the table, for every single
frame.

The new gnttab_acquire_resource() function subsumes the previous two
gnttab_get_{shared,status}_frame() helpers.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Ian Jackson <iwj@xenproject.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Paul Durrant <paul@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Rebase over 'frame' truncation/overflow changes.
 * Rebase over ARM/IOREQ series.

v3:
 * Fold switch statements in gnttab_acquire_resource()
---
 xen/common/grant_table.c      | 76 ++++++++++++++++++++++++++++---------------
 xen/common/memory.c           | 42 ++----------------------
 xen/include/xen/grant_table.h | 19 ++++-------
 3 files changed, 58 insertions(+), 79 deletions(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 4ecec35f28..235bf88daf 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3987,6 +3987,55 @@ static int gnttab_get_shared_frame_mfn(struct domain *d,
     return 0;
 }
 
+int gnttab_acquire_resource(
+    struct domain *d, unsigned int id, unsigned int frame,
+    unsigned int nr_frames, xen_pfn_t mfn_list[])
+{
+    struct grant_table *gt = d->grant_table;
+    unsigned int i, final_frame;
+    mfn_t tmp;
+    void **vaddrs;
+    int rc = -EINVAL;
+
+    if ( !nr_frames )
+        return rc;
+
+    final_frame = frame + nr_frames - 1;
+
+    /* Grow table if necessary. */
+    grant_write_lock(gt);
+    switch ( id )
+    {
+    case XENMEM_resource_grant_table_id_shared:
+        vaddrs = gt->shared_raw;
+        rc = gnttab_get_shared_frame_mfn(d, final_frame, &tmp);
+        break;
+
+    case XENMEM_resource_grant_table_id_status:
+        if ( gt->gt_version != 2 )
+            break;
+
+        /* Check that void ** is a suitable representation for gt->status. */
+        BUILD_BUG_ON(!__builtin_types_compatible_p(
+                         typeof(gt->status), grant_status_t **));
+        vaddrs = (void **)gt->status;
+        rc = gnttab_get_status_frame_mfn(d, final_frame, &tmp);
+        break;
+    }
+
+    /* Any errors?  Bad id, or from growing the table? */
+    if ( rc )
+        goto out;
+
+    for ( i = 0; i < nr_frames; ++i )
+        mfn_list[i] = virt_to_mfn(vaddrs[frame + i]);
+
+ out:
+    grant_write_unlock(gt);
+
+    return rc;
+}
+
 int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn, mfn_t *mfn)
 {
     int rc = 0;
@@ -4021,33 +4070,6 @@ int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn, mfn_t *mfn)
     return rc;
 }
 
-int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
-                            mfn_t *mfn)
-{
-    struct grant_table *gt = d->grant_table;
-    int rc;
-
-    grant_write_lock(gt);
-    rc = gnttab_get_shared_frame_mfn(d, idx, mfn);
-    grant_write_unlock(gt);
-
-    return rc;
-}
-
-int gnttab_get_status_frame(struct domain *d, unsigned int idx,
-                            mfn_t *mfn)
-{
-    struct grant_table *gt = d->grant_table;
-    int rc;
-
-    grant_write_lock(gt);
-    rc = (gt->gt_version == 2) ?
-        gnttab_get_status_frame_mfn(d, idx, mfn) : -EINVAL;
-    grant_write_unlock(gt);
-
-    return rc;
-}
-
 static void gnttab_usage_print(struct domain *rd)
 {
     int first = 1;
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 4d681597a5..b36c28af63 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1054,44 +1054,6 @@ static long xatp_permission_check(struct domain *d, unsigned int space)
     return xsm_add_to_physmap(XSM_TARGET, current->domain, d);
 }
 
-static int acquire_grant_table(struct domain *d, unsigned int id,
-                               unsigned int frame,
-                               unsigned int nr_frames,
-                               xen_pfn_t mfn_list[])
-{
-    unsigned int i = nr_frames;
-
-    /* Iterate backwards in case table needs to grow */
-    while ( i-- != 0 )
-    {
-        mfn_t mfn = INVALID_MFN;
-        int rc;
-
-        switch ( id )
-        {
-        case XENMEM_resource_grant_table_id_shared:
-            rc = gnttab_get_shared_frame(d, frame + i, &mfn);
-            break;
-
-        case XENMEM_resource_grant_table_id_status:
-            rc = gnttab_get_status_frame(d, frame + i, &mfn);
-            break;
-
-        default:
-            rc = -EINVAL;
-            break;
-        }
-
-        if ( rc )
-            return rc;
-
-        ASSERT(!mfn_eq(mfn, INVALID_MFN));
-        mfn_list[i] = mfn_x(mfn);
-    }
-
-    return 0;
-}
-
 static int acquire_ioreq_server(struct domain *d,
                                 unsigned int id,
                                 unsigned int frame,
@@ -1188,8 +1150,8 @@ static int acquire_resource(
     switch ( xmar.type )
     {
     case XENMEM_resource_grant_table:
-        rc = acquire_grant_table(d, xmar.id, xmar.frame, xmar.nr_frames,
-                                 mfn_list);
+        rc = gnttab_acquire_resource(d, xmar.id, xmar.frame, xmar.nr_frames,
+                                     mfn_list);
         break;
 
     case XENMEM_resource_ioreq_server:
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 6d14fe2526..14973de734 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -55,10 +55,10 @@ int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
 
 int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn,
                      mfn_t *mfn);
-int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
-                            mfn_t *mfn);
-int gnttab_get_status_frame(struct domain *d, unsigned int idx,
-                            mfn_t *mfn);
+
+int gnttab_acquire_resource(
+    struct domain *d, unsigned int id, unsigned int frame,
+    unsigned int nr_frames, xen_pfn_t mfn_list[]);
 
 #else
 
@@ -92,14 +92,9 @@ static inline int gnttab_map_frame(struct domain *d, unsigned long idx,
     return -EINVAL;
 }
 
-static inline int gnttab_get_shared_frame(struct domain *d, unsigned int idx,
-                                          mfn_t *mfn)
-{
-    return -EINVAL;
-}
-
-static inline int gnttab_get_status_frame(struct domain *d, unsigned int idx,
-                                          mfn_t *mfn)
+static inline int gnttab_acquire_resource(
+    struct domain *d, unsigned int id, unsigned int frame,
+    unsigned int nr_frames, xen_pfn_t mfn_list[])
 {
     return -EINVAL;
 }
-- 
2.11.0




* [PATCH v8 03/16] xen/memory: Fix acquire_resource size semantics
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 01/16] xen/memory: Reject out-of-range resource 'frame' values Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 02/16] xen/gnttab: Rework resource acquisition Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 04/16] xen/memory: Improve compat XENMEM_acquire_resource handling Andrew Cooper
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

Calling XENMEM_acquire_resource with a NULL frame_list is a request for the
size of the resource, but the returned value of 32 is bogus.

If someone tries to follow it for XENMEM_resource_ioreq_server, the acquire
call will fail, as IOREQ servers currently top out at 2 frames, and 32 is only
half the default grant table limit for guests.

Also, no users actually request a resource size, because it was never wired up
in the sole implementation of resource acquisition in Linux.

Introduce a new resource_max_frames() to calculate the size of a resource, and
implement it in the IOREQ and grant subsystems.

It is impossible to guarantee that a mapping call following a successful size
call will succeed (e.g. the target IOREQ server gets destroyed, or the domain
switches from grant v2 to v1).  Document the restriction, and use this
flexibility to make the size paths lockless.
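
As a rough caller-side sketch of the new size query semantics (hypothetical,
assuming a Linux-style HYPERVISOR_memory_op() wrapper and a caller-provided
guest_domid; no in-tree consumer issues this query yet):

    struct xen_mem_acquire_resource xmar = {
        .domid = guest_domid,           /* assumed variable */
        .type  = XENMEM_resource_grant_table,
        .id    = XENMEM_resource_grant_table_id_shared,
        /* frame_list left NULL and nr_frames left 0 => size query. */
    };

    rc = HYPERVISOR_memory_op(XENMEM_acquire_resource, &xmar);
    if ( rc == 0 )
        total_frames = xmar.nr_frames;  /* full size of the resource */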

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Ian Jackson <iwj@xenproject.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Paul Durrant <paul@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Rebase over ARM/IOREQ series

v3:
 * Use const struct domain *
 * Fold goto out paths
v2:
 * Spelling fixes
 * Add more local variables.
 * Don't return any status frames on ARM where v2 support is compiled out.
---
 xen/common/grant_table.c      | 23 +++++++++++++++
 xen/common/memory.c           | 66 ++++++++++++++++++++++++++++++++++---------
 xen/include/public/memory.h   | 17 ++++++++---
 xen/include/xen/grant_table.h |  8 ++++++
 4 files changed, 97 insertions(+), 17 deletions(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 235bf88daf..280b7969b6 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -3987,6 +3987,29 @@ static int gnttab_get_shared_frame_mfn(struct domain *d,
     return 0;
 }
 
+unsigned int gnttab_resource_max_frames(const struct domain *d, unsigned int id)
+{
+    const struct grant_table *gt = d->grant_table;
+    unsigned int nr = 0;
+
+    /* Don't need the grant lock.  This limit is fixed at domain create time. */
+    switch ( id )
+    {
+    case XENMEM_resource_grant_table_id_shared:
+        nr = gt->max_grant_frames;
+        break;
+
+    case XENMEM_resource_grant_table_id_status:
+        if ( GNTTAB_MAX_VERSION < 2 )
+            break;
+
+        nr = grant_to_status_frames(gt->max_grant_frames);
+        break;
+    }
+
+    return nr;
+}
+
 int gnttab_acquire_resource(
     struct domain *d, unsigned int id, unsigned int frame,
     unsigned int nr_frames, xen_pfn_t mfn_list[])
diff --git a/xen/common/memory.c b/xen/common/memory.c
index b36c28af63..9fb34eb4d0 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1054,6 +1054,42 @@ static long xatp_permission_check(struct domain *d, unsigned int space)
     return xsm_add_to_physmap(XSM_TARGET, current->domain, d);
 }
 
+unsigned int ioreq_server_max_frames(const struct domain *d)
+{
+    unsigned int nr = 0;
+
+#ifdef CONFIG_IOREQ_SERVER
+    if ( is_hvm_domain(d) )
+        /* One frame for the buf-ioreq ring, and one frame per 128 vcpus. */
+        nr = 1 + DIV_ROUND_UP(d->max_vcpus * sizeof(struct ioreq), PAGE_SIZE);
+#endif
+
+    return nr;
+}
+
+/*
+ * Return 0 on any kind of error.  Caller converts to -EINVAL.
+ *
+ * All nonzero values should be repeatable (i.e. derived from some fixed
+ * property of the domain), and describe the full resource (i.e. mapping the
+ * result of this call will be the entire resource).
+ */
+static unsigned int resource_max_frames(const struct domain *d,
+                                        unsigned int type, unsigned int id)
+{
+    switch ( type )
+    {
+    case XENMEM_resource_grant_table:
+        return gnttab_resource_max_frames(d, id);
+
+    case XENMEM_resource_ioreq_server:
+        return ioreq_server_max_frames(d);
+
+    default:
+        return -EOPNOTSUPP;
+    }
+}
+
 static int acquire_ioreq_server(struct domain *d,
                                 unsigned int id,
                                 unsigned int frame,
@@ -1099,6 +1135,7 @@ static int acquire_resource(
      * use-cases then per-CPU arrays or heap allocations may be required.
      */
     xen_pfn_t mfn_list[32];
+    unsigned int max_frames;
     int rc;
 
     if ( !arch_acquire_resource_check(currd) )
@@ -1110,19 +1147,6 @@ static int acquire_resource(
     if ( xmar.pad != 0 )
         return -EINVAL;
 
-    if ( guest_handle_is_null(xmar.frame_list) )
-    {
-        if ( xmar.nr_frames )
-            return -EINVAL;
-
-        xmar.nr_frames = ARRAY_SIZE(mfn_list);
-
-        if ( __copy_field_to_guest(arg, &xmar, nr_frames) )
-            return -EFAULT;
-
-        return 0;
-    }
-
     if ( xmar.nr_frames > ARRAY_SIZE(mfn_list) )
         return -E2BIG;
 
@@ -1147,6 +1171,22 @@ static int acquire_resource(
     if ( rc )
         goto out;
 
+    max_frames = resource_max_frames(d, xmar.type, xmar.id);
+
+    rc = -EINVAL;
+    if ( !max_frames )
+        goto out;
+
+    if ( guest_handle_is_null(xmar.frame_list) )
+    {
+        if ( xmar.nr_frames )
+            goto out;
+
+        xmar.nr_frames = max_frames;
+        rc = __copy_field_to_guest(arg, &xmar, nr_frames) ? -EFAULT : 0;
+        goto out;
+    }
+
     switch ( xmar.type )
     {
     case XENMEM_resource_grant_table:
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 7ff56d5f28..020c79d757 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -639,10 +639,19 @@ struct xen_mem_acquire_resource {
 #define XENMEM_resource_grant_table_id_status 1
 
     /*
-     * IN/OUT - As an IN parameter number of frames of the resource
-     *          to be mapped. However, if the specified value is 0 and
-     *          frame_list is NULL then this field will be set to the
-     *          maximum value supported by the implementation on return.
+     * IN/OUT
+     *
+     * As an IN parameter number of frames of the resource to be mapped.
+     * This value may be updated over the course of the operation.
+     *
+     * When frame_list is NULL and nr_frames is 0, this is interpreted as a
+     * request for the size of the resource, which shall be returned in the
+     * nr_frames field.
+     *
+     * The size of a resource will never be zero, but a nonzero result doesn't
+     * guarantee that a subsequent mapping request will be successful.  There
+     * are further type/id specific constraints which may change between the
+     * two calls.
      */
     uint32_t nr_frames;
     uint32_t pad;
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 14973de734..63b6dc78f4 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -56,6 +56,8 @@ int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref,
 int gnttab_map_frame(struct domain *d, unsigned long idx, gfn_t gfn,
                      mfn_t *mfn);
 
+unsigned int gnttab_resource_max_frames(const struct domain *d, unsigned int id);
+
 int gnttab_acquire_resource(
     struct domain *d, unsigned int id, unsigned int frame,
     unsigned int nr_frames, xen_pfn_t mfn_list[]);
@@ -92,6 +94,12 @@ static inline int gnttab_map_frame(struct domain *d, unsigned long idx,
     return -EINVAL;
 }
 
+static inline unsigned int gnttab_resource_max_frames(
+    const struct domain *d, unsigned int id)
+{
+    return 0;
+}
+
 static inline int gnttab_acquire_resource(
     struct domain *d, unsigned int id, unsigned int frame,
     unsigned int nr_frames, xen_pfn_t mfn_list[])
-- 
2.11.0




* [PATCH v8 04/16] xen/memory: Improve compat XENMEM_acquire_resource handling
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (2 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 03/16] xen/memory: Fix acquire_resource size semantics Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 05/16] xen/memory: Indent part of acquire_resource() Andrew Cooper
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

The frame_list is an input, or an output, depending on whether the calling
domain is translated or not.  The array does not need marshalling in both
directions.

Furthermore, the copy-in loop was very inefficient, copying 4 bytes at a
time.  Rewrite it to copy in all nr_frames at once, and then expand
compat_pfn_t to xen_pfn_t in place.

Re-position the copy-in loop to simplify continuation support in a future
patch, and reduce the scope of certain variables.

No change in guest observed behaviour.
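
The in-place expansion is safe because the loop runs backwards: 64bit slot i
is written only after 32bit slot i has been read, and never overlaps a
lower-numbered 32bit slot which still needs reading.  A minimal standalone
sketch of the idiom (not the patch's exact code; buf is a void * sized for
n 64bit values):

    uint32_t *narrow = buf;             /* currently holds n 32bit values */
    uint64_t *wide   = buf;             /* aliases the same buffer */
    int i;

    for ( i = n - 1; i >= 0; --i )
        wide[i] = narrow[i];            /* widen element i in place */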

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Ian Jackson <iwj@xenproject.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Paul Durrant <paul@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>
---
 xen/common/compat/memory.c | 65 ++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index ed92e05b08..834c5e19d1 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -55,6 +55,8 @@ static int get_reserved_device_memory(xen_pfn_t start, xen_ulong_t nr,
 
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
+    struct vcpu *curr = current;
+    struct domain *currd = curr->domain;
     int split, op = cmd & MEMOP_CMD_MASK;
     long rc;
     unsigned int start_extent = cmd >> MEMOP_EXTENT_SHIFT;
@@ -399,7 +401,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 
         case XENMEM_acquire_resource:
         {
-            xen_pfn_t *xen_frame_list;
+            xen_pfn_t *xen_frame_list = NULL;
             unsigned int max_nr_frames;
 
             if ( copy_from_guest(&cmp.mar, compat, 1) )
@@ -417,28 +419,10 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             if ( cmp.mar.nr_frames > max_nr_frames )
                 return -E2BIG;
 
-            if ( compat_handle_is_null(cmp.mar.frame_list) )
-                xen_frame_list = NULL;
-            else
-            {
+            /* Marshal the frame list in the remainder of the xlat space. */
+            if ( !compat_handle_is_null(cmp.mar.frame_list) )
                 xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
 
-                if ( !compat_handle_okay(cmp.mar.frame_list,
-                                         cmp.mar.nr_frames) )
-                    return -EFAULT;
-
-                for ( i = 0; i < cmp.mar.nr_frames; i++ )
-                {
-                    compat_pfn_t frame;
-
-                    if ( __copy_from_compat_offset(
-                             &frame, cmp.mar.frame_list, i, 1) )
-                        return -EFAULT;
-
-                    xen_frame_list[i] = frame;
-                }
-            }
-
 #define XLAT_mem_acquire_resource_HNDL_frame_list(_d_, _s_) \
             set_xen_guest_handle((_d_)->frame_list, xen_frame_list)
 
@@ -446,6 +430,31 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 
 #undef XLAT_mem_acquire_resource_HNDL_frame_list
 
+            if ( xen_frame_list && cmp.mar.nr_frames )
+            {
+                /*
+                 * frame_list is an input for translated guests, and an output
+                 * for untranslated guests.  Only copy in for translated guests.
+                 */
+                if ( paging_mode_translate(currd) )
+                {
+                    compat_pfn_t *compat_frame_list = (void *)xen_frame_list;
+
+                    if ( !compat_handle_okay(cmp.mar.frame_list,
+                                             cmp.mar.nr_frames) ||
+                         __copy_from_compat_offset(
+                             compat_frame_list, cmp.mar.frame_list,
+                             0, cmp.mar.nr_frames) )
+                        return -EFAULT;
+
+                    /*
+                     * Iterate backwards over compat_frame_list[] expanding
+                     * compat_pfn_t to xen_pfn_t in place.
+                     */
+                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
+                        xen_frame_list[x] = compat_frame_list[x];
+                }
+            }
             break;
         }
         default:
@@ -590,8 +599,6 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 
         case XENMEM_acquire_resource:
         {
-            const xen_pfn_t *xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
-            compat_pfn_t *compat_frame_list = (compat_pfn_t *)(nat.mar + 1);
             DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
 
             if ( compat_handle_is_null(cmp.mar.frame_list) )
@@ -601,9 +608,18 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                                            compat_mem_acquire_resource_t),
                          nat.mar, nr_frames) )
                     return -EFAULT;
+                break;
             }
-            else
+
+            /*
+             * frame_list is an input for translated guests, and an output for
+             * untranslated guests.  Only copy out for untranslated guests.
+             */
+            if ( !paging_mode_translate(currd) )
             {
+                const xen_pfn_t *xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
+                compat_pfn_t *compat_frame_list = (compat_pfn_t *)(nat.mar + 1);
+
                 /*
                  * NOTE: the smaller compat array overwrites the native
                  *       array.
@@ -625,7 +641,6 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                                              cmp.mar.nr_frames) )
                     return -EFAULT;
             }
-
             break;
         }
 
-- 
2.11.0




* [PATCH v8 05/16] xen/memory: Indent part of acquire_resource()
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (3 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 04/16] xen/memory: Improve compat XENMEM_acquire_resource handling Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource Andrew Cooper
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

Indent the middle of acquire_resource() inside a do {} while ( 0 ) loop.  This
is broken out specifically to make the following change readable.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Ian Jackson <iwj@xenproject.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Paul Durrant <paul@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Rebase over ARM/IOREQ series.
---
 xen/common/memory.c | 72 +++++++++++++++++++++++++++--------------------------
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 9fb34eb4d0..01cab7e493 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1187,48 +1187,50 @@ static int acquire_resource(
         goto out;
     }
 
-    switch ( xmar.type )
-    {
-    case XENMEM_resource_grant_table:
-        rc = gnttab_acquire_resource(d, xmar.id, xmar.frame, xmar.nr_frames,
-                                     mfn_list);
-        break;
+    do {
+        switch ( xmar.type )
+        {
+        case XENMEM_resource_grant_table:
+            rc = gnttab_acquire_resource(d, xmar.id, xmar.frame, xmar.nr_frames,
+                                         mfn_list);
+            break;
 
-    case XENMEM_resource_ioreq_server:
-        rc = acquire_ioreq_server(d, xmar.id, xmar.frame, xmar.nr_frames,
-                                  mfn_list);
-        break;
+        case XENMEM_resource_ioreq_server:
+            rc = acquire_ioreq_server(d, xmar.id, xmar.frame, xmar.nr_frames,
+                                      mfn_list);
+            break;
 
-    default:
-        rc = -EOPNOTSUPP;
-        break;
-    }
+        default:
+            rc = -EOPNOTSUPP;
+            break;
+        }
 
-    if ( rc )
-        goto out;
+        if ( rc )
+            goto out;
 
-    if ( !paging_mode_translate(currd) )
-    {
-        if ( copy_to_guest(xmar.frame_list, mfn_list, xmar.nr_frames) )
-            rc = -EFAULT;
-    }
-    else
-    {
-        xen_pfn_t gfn_list[ARRAY_SIZE(mfn_list)];
-        unsigned int i;
+        if ( !paging_mode_translate(currd) )
+        {
+            if ( copy_to_guest(xmar.frame_list, mfn_list, xmar.nr_frames) )
+                rc = -EFAULT;
+        }
+        else
+        {
+            xen_pfn_t gfn_list[ARRAY_SIZE(mfn_list)];
+            unsigned int i;
 
-        if ( copy_from_guest(gfn_list, xmar.frame_list, xmar.nr_frames) )
-            rc = -EFAULT;
+            if ( copy_from_guest(gfn_list, xmar.frame_list, xmar.nr_frames) )
+                rc = -EFAULT;
 
-        for ( i = 0; !rc && i < xmar.nr_frames; i++ )
-        {
-            rc = set_foreign_p2m_entry(currd, d, gfn_list[i],
-                                       _mfn(mfn_list[i]));
-            /* rc should be -EIO for any iteration other than the first */
-            if ( rc && i )
-                rc = -EIO;
+            for ( i = 0; !rc && i < xmar.nr_frames; i++ )
+            {
+                rc = set_foreign_p2m_entry(currd, d, gfn_list[i],
+                                           _mfn(mfn_list[i]));
+                /* rc should be -EIO for any iteration other than the first */
+                if ( rc && i )
+                    rc = -EIO;
+            }
         }
-    }
+    } while ( 0 );
 
  out:
     rcu_unlock_domain(d);
-- 
2.11.0




* [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (4 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 05/16] xen/memory: Indent part of acquire_resource() Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-02-01 10:10   ` Roger Pau Monné
  2021-01-30  2:58 ` [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace Andrew Cooper
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

A guest's default number of grant frames is 64, and XENMEM_acquire_resource
will reject an attempt to map more than 32 frames.  This limit is caused by
the size of mfn_list[] on the stack.

Fix mapping of arbitrary size requests by looping over batches of 32 in
acquire_resource(), and using hypercall continuations when necessary.

To start with, break _acquire_resource() out of acquire_resource() to cope
with type-specific dispatching, and update the return semantics to indicate
the number of mfns returned.  Update gnttab_acquire_resource() and x86's
arch_acquire_resource() to match these new semantics.

Have do_memory_op() pass start_extent into acquire_resource() so it can pick
up where it left off after a continuation, and loop over batches of 32 until
all the work is done, or a continuation needs to occur.

compat_memory_op() is a bit more complicated, because it also has to marshal
frame_list in the XLAT buffer.  Have it account for continuation information
itself and hide details from the upper layer, so it can marshal the buffer in
chunks if necessary.

With these fixes in place, it is now possible to map the whole grant table for
a guest.
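
For reference, the continuation state travels in the hypercall's cmd argument
using the existing XENMEM_* convention: the op stays in the low bits, and the
number of frames already completed is packed above MEMOP_EXTENT_SHIFT.  The
decode half already exists at the top of do_memory_op()/compat_memory_op() and
therefore doesn't appear in the diff; roughly:

    /* On (re)entry: recover the op and the progress made so far. */
    op           = cmd & MEMOP_CMD_MASK;
    start_extent = cmd >> MEMOP_EXTENT_SHIFT;

    /* When more work remains (this patch): re-encode the updated progress. */
    rc = hypercall_create_continuation(
        __HYPERVISOR_memory_op, "lh",
        XENMEM_acquire_resource | (start_extent << MEMOP_EXTENT_SHIFT), arg);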

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: George Dunlap <George.Dunlap@eu.citrix.com>
CC: Ian Jackson <iwj@xenproject.org>
CC: Jan Beulich <JBeulich@suse.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Wei Liu <wl@xen.org>
CC: Julien Grall <julien@xen.org>
CC: Paul Durrant <paul@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * nat => cmp change in the start_extent check.
 * Rebase over 'frame' and ARM/IOREQ series.

v3:
 * Spelling fixes
---
 xen/common/compat/memory.c |  94 +++++++++++++++++++++++++++-------
 xen/common/grant_table.c   |   3 ++
 xen/common/memory.c        | 124 +++++++++++++++++++++++++++++++++------------
 3 files changed, 169 insertions(+), 52 deletions(-)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index 834c5e19d1..4c9cd9c05a 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -402,23 +402,10 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
         case XENMEM_acquire_resource:
         {
             xen_pfn_t *xen_frame_list = NULL;
-            unsigned int max_nr_frames;
 
             if ( copy_from_guest(&cmp.mar, compat, 1) )
                 return -EFAULT;
 
-            /*
-             * The number of frames handled is currently limited to a
-             * small number by the underlying implementation, so the
-             * scratch space should be sufficient for bouncing the
-             * frame addresses.
-             */
-            max_nr_frames = (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
-                sizeof(*xen_frame_list);
-
-            if ( cmp.mar.nr_frames > max_nr_frames )
-                return -E2BIG;
-
             /* Marshal the frame list in the remainder of the xlat space. */
             if ( !compat_handle_is_null(cmp.mar.frame_list) )
                 xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
@@ -432,6 +419,28 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 
             if ( xen_frame_list && cmp.mar.nr_frames )
             {
+                unsigned int xlat_max_frames =
+                    (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
+                    sizeof(*xen_frame_list);
+
+                if ( start_extent >= cmp.mar.nr_frames )
+                    return -EINVAL;
+
+                /*
+                 * Adjust nat to account for work done on previous
+                 * continuations, leaving cmp pristine.  Hide the continuation
+                 * from the native code to prevent double accounting.
+                 */
+                nat.mar->nr_frames -= start_extent;
+                nat.mar->frame += start_extent;
+                cmd &= MEMOP_CMD_MASK;
+
+                /*
+                 * If there are too many frames to fit within the xlat buffer,
+                 * we'll need to loop to marshal them all.
+                 */
+                nat.mar->nr_frames = min(nat.mar->nr_frames, xlat_max_frames);
+
                 /*
                  * frame_list is an input for translated guests, and an output
                  * for untranslated guests.  Only copy in for translated guests.
@@ -444,14 +453,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                                              cmp.mar.nr_frames) ||
                          __copy_from_compat_offset(
                              compat_frame_list, cmp.mar.frame_list,
-                             0, cmp.mar.nr_frames) )
+                             start_extent, nat.mar->nr_frames) )
                         return -EFAULT;
 
                     /*
                      * Iterate backwards over compat_frame_list[] expanding
                      * compat_pfn_t to xen_pfn_t in place.
                      */
-                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
+                    for ( int x = nat.mar->nr_frames - 1; x >= 0; --x )
                         xen_frame_list[x] = compat_frame_list[x];
                 }
             }
@@ -600,9 +609,11 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
         case XENMEM_acquire_resource:
         {
             DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
+            unsigned int done;
 
             if ( compat_handle_is_null(cmp.mar.frame_list) )
             {
+                ASSERT(split == 0 && rc == 0);
                 if ( __copy_field_to_guest(
                          guest_handle_cast(compat,
                                            compat_mem_acquire_resource_t),
@@ -611,6 +622,21 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                 break;
             }
 
+            if ( split < 0 )
+            {
+                /* Continuation occurred. */
+                ASSERT(rc != XENMEM_acquire_resource);
+                done = cmd >> MEMOP_EXTENT_SHIFT;
+            }
+            else
+            {
+                /* No continuation. */
+                ASSERT(rc == 0);
+                done = nat.mar->nr_frames;
+            }
+
+            ASSERT(done <= nat.mar->nr_frames);
+
             /*
              * frame_list is an input for translated guests, and an output for
              * untranslated guests.  Only copy out for untranslated guests.
@@ -626,7 +652,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                  */
                 BUILD_BUG_ON(sizeof(compat_pfn_t) > sizeof(xen_pfn_t));
 
-                for ( i = 0; i < cmp.mar.nr_frames; i++ )
+                for ( i = 0; i < done; i++ )
                 {
                     compat_pfn_t frame = xen_frame_list[i];
 
@@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
                     compat_frame_list[i] = frame;
                 }
 
-                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
+                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
                                              compat_frame_list,
-                                             cmp.mar.nr_frames) )
+                                             done) )
                     return -EFAULT;
             }
-            break;
+
+            start_extent += done;
+
+            /* Completely done. */
+            if ( start_extent == cmp.mar.nr_frames )
+                break;
+
+            /*
+             * Done a "full" batch, but we were limited by space in the xlat
+             * area.  Go around the loop again without necessarily returning
+             * to guest context.
+             */
+            if ( done == nat.mar->nr_frames )
+            {
+                split = 1;
+                break;
+            }
+
+            /* Explicit continuation request from a higher level. */
+            if ( done < nat.mar->nr_frames )
+                return hypercall_create_continuation(
+                    __HYPERVISOR_memory_op, "ih",
+                    op | (start_extent << MEMOP_EXTENT_SHIFT), compat);
+
+            /*
+             * Well... Something's gone wrong with the two levels of chunking.
+             * My condolences to whoever next has to debug this mess.
+             */
+            ASSERT_UNREACHABLE();
+            goto crash;
         }
 
         default:
+        crash:
             domain_crash(current->domain);
             split = 0;
             break;
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 280b7969b6..b95403695f 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -4053,6 +4053,9 @@ int gnttab_acquire_resource(
     for ( i = 0; i < nr_frames; ++i )
         mfn_list[i] = virt_to_mfn(vaddrs[frame + i]);
 
+    /* Success.  Passed nr_frames back to the caller. */
+    rc = nr_frames;
+
  out:
     grant_write_unlock(gt);
 
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 01cab7e493..128718b31c 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1118,23 +1118,41 @@ static int acquire_ioreq_server(struct domain *d,
         mfn_list[i] = mfn_x(mfn);
     }
 
-    return 0;
+    /* Success.  Passed nr_frames back to the caller. */
+    return nr_frames;
 #else
     return -EOPNOTSUPP;
 #endif
 }
 
+/*
+ * Returns -errno on error, or positive in the range [1, nr_frames] on
+ * success.  Returning less than nr_frames constitutes a request for a
+ * continuation.  Callers can depend on frame + nr_frames not overflowing.
+ */
+static int _acquire_resource(
+    struct domain *d, unsigned int type, unsigned int id, unsigned int frame,
+    unsigned int nr_frames, xen_pfn_t mfn_list[])
+{
+    switch ( type )
+    {
+    case XENMEM_resource_grant_table:
+        return gnttab_acquire_resource(d, id, frame, nr_frames, mfn_list);
+
+    case XENMEM_resource_ioreq_server:
+        return acquire_ioreq_server(d, id, frame, nr_frames, mfn_list);
+
+    default:
+        return -EOPNOTSUPP;
+    }
+}
+
 static int acquire_resource(
-    XEN_GUEST_HANDLE_PARAM(xen_mem_acquire_resource_t) arg)
+    XEN_GUEST_HANDLE_PARAM(xen_mem_acquire_resource_t) arg,
+    unsigned long start_extent)
 {
     struct domain *d, *currd = current->domain;
     xen_mem_acquire_resource_t xmar;
-    /*
-     * The mfn_list and gfn_list (below) arrays are ok on stack for the
-     * moment since they are small, but if they need to grow in future
-     * use-cases then per-CPU arrays or heap allocations may be required.
-     */
-    xen_pfn_t mfn_list[32];
     unsigned int max_frames;
     int rc;
 
@@ -1147,9 +1165,6 @@ static int acquire_resource(
     if ( xmar.pad != 0 )
         return -EINVAL;
 
-    if ( xmar.nr_frames > ARRAY_SIZE(mfn_list) )
-        return -E2BIG;
-
     /*
      * The ABI is rather unfortunate.  nr_frames (and therefore the total size
      * of the resource) is 32bit, while frame (the offset within the resource
@@ -1179,7 +1194,7 @@ static int acquire_resource(
 
     if ( guest_handle_is_null(xmar.frame_list) )
     {
-        if ( xmar.nr_frames )
+        if ( xmar.nr_frames || start_extent )
             goto out;
 
         xmar.nr_frames = max_frames;
@@ -1187,30 +1202,47 @@ static int acquire_resource(
         goto out;
     }
 
-    do {
-        switch ( xmar.type )
-        {
-        case XENMEM_resource_grant_table:
-            rc = gnttab_acquire_resource(d, xmar.id, xmar.frame, xmar.nr_frames,
-                                         mfn_list);
-            break;
+    /*
+     * Limiting nr_frames at (UINT_MAX >> MEMOP_EXTENT_SHIFT) isn't ideal.  If
+     * it ever becomes a practical problem, we can switch to mutating
+     * xmar.{frame,nr_frames,frame_list} in guest memory.
+     */
+    rc = -EINVAL;
+    if ( start_extent >= xmar.nr_frames ||
+         xmar.nr_frames > (UINT_MAX >> MEMOP_EXTENT_SHIFT) )
+        goto out;
 
-        case XENMEM_resource_ioreq_server:
-            rc = acquire_ioreq_server(d, xmar.id, xmar.frame, xmar.nr_frames,
-                                      mfn_list);
-            break;
+    /* Adjust for work done on previous continuations. */
+    xmar.nr_frames -= start_extent;
+    xmar.frame += start_extent;
+    guest_handle_add_offset(xmar.frame_list, start_extent);
 
-        default:
-            rc = -EOPNOTSUPP;
-            break;
-        }
+    do {
+        /*
+         * Arbitrary size.  Not too much stack space, and a reasonable stride
+         * for continuation checks.
+         */
+        xen_pfn_t mfn_list[32];
+        unsigned int todo = MIN(ARRAY_SIZE(mfn_list), xmar.nr_frames), done;
 
-        if ( rc )
+        rc = _acquire_resource(d, xmar.type, xmar.id, xmar.frame,
+                               todo, mfn_list);
+        if ( rc < 0 )
+            goto out;
+
+        done = rc;
+        rc = 0;
+        if ( done == 0 || done > todo )
+        {
+            ASSERT_UNREACHABLE();
+            rc = -EINVAL;
             goto out;
+        }
 
+        /* Adjust guest frame_list appropriately. */
         if ( !paging_mode_translate(currd) )
         {
-            if ( copy_to_guest(xmar.frame_list, mfn_list, xmar.nr_frames) )
+            if ( copy_to_guest(xmar.frame_list, mfn_list, done) )
                 rc = -EFAULT;
         }
         else
@@ -1218,10 +1250,10 @@ static int acquire_resource(
             xen_pfn_t gfn_list[ARRAY_SIZE(mfn_list)];
             unsigned int i;
 
-            if ( copy_from_guest(gfn_list, xmar.frame_list, xmar.nr_frames) )
+            if ( copy_from_guest(gfn_list, xmar.frame_list, done) )
                 rc = -EFAULT;
 
-            for ( i = 0; !rc && i < xmar.nr_frames; i++ )
+            for ( i = 0; !rc && i < done; i++ )
             {
                 rc = set_foreign_p2m_entry(currd, d, gfn_list[i],
                                            _mfn(mfn_list[i]));
@@ -1230,7 +1262,32 @@ static int acquire_resource(
                     rc = -EIO;
             }
         }
-    } while ( 0 );
+
+        if ( rc )
+            goto out;
+
+        xmar.nr_frames -= done;
+        xmar.frame += done;
+        guest_handle_add_offset(xmar.frame_list, done);
+        start_extent += done;
+
+        /*
+         * Explicit continuation request from _acquire_resource(), or we've
+         * still got more work to do.
+         */
+        if ( done < todo ||
+             (xmar.nr_frames && hypercall_preempt_check()) )
+        {
+            rc = hypercall_create_continuation(
+                __HYPERVISOR_memory_op, "lh",
+                XENMEM_acquire_resource | (start_extent << MEMOP_EXTENT_SHIFT),
+                arg);
+            goto out;
+        }
+
+    } while ( xmar.nr_frames );
+
+    rc = 0;
 
  out:
     rcu_unlock_domain(d);
@@ -1697,7 +1754,8 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
     case XENMEM_acquire_resource:
         rc = acquire_resource(
-            guest_handle_cast(arg, xen_mem_acquire_resource_t));
+            guest_handle_cast(arg, xen_mem_acquire_resource_t),
+            start_extent);
         break;
 
     default:
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (5 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-02-01 10:32   ` Roger Pau Monné
  2021-01-30  2:58 ` [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter Andrew Cooper
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Michał Leszczyński,
	Tamas K Lengyel

We're about to introduce support for Intel Processor Trace, but similar
functionality exists on other platforms.

Aspects of vmtrace can reasonably be common, so start with
XEN_SYSCTL_PHYSCAP_vmtrace and plumb the signal from Xen all the way down into
`xl info`.
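
For illustration only (not part of the patch), a toolstack consumer holding a
libxl_ctx could probe the new capability roughly as follows.  This is a
minimal sketch with error handling elided:

    #include <stdbool.h>
    #include <libxl.h>

    /* Sketch: true if the host advertises vmtrace support. */
    static bool host_has_vmtrace(libxl_ctx *ctx)
    {
        libxl_physinfo info;
        bool avail = false;

        libxl_physinfo_init(&info);
        if ( !libxl_get_physinfo(ctx, &info) )
            avail = info.cap_vmtrace;
        libxl_physinfo_dispose(&info);

        return avail;
    }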

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Ian Jackson <iwj@xenproject.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Bump XEN_SYSCTL_PHYSCAP_MAX, fix Ocaml stubs
v7:
 * New
---
 tools/golang/xenlight/helpers.gen.go | 2 ++
 tools/golang/xenlight/types.gen.go   | 1 +
 tools/include/libxl.h                | 7 +++++++
 tools/libs/light/libxl.c             | 2 ++
 tools/libs/light/libxl_types.idl     | 1 +
 tools/ocaml/libs/xc/xenctrl.ml       | 1 +
 tools/ocaml/libs/xc/xenctrl.mli      | 1 +
 tools/xl/xl_info.c                   | 5 +++--
 xen/common/domain.c                  | 2 ++
 xen/common/sysctl.c                  | 2 ++
 xen/include/public/sysctl.h          | 3 ++-
 xen/include/xen/domain.h             | 2 ++
 12 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/golang/xenlight/helpers.gen.go b/tools/golang/xenlight/helpers.gen.go
index e0681ee14e..63e2876463 100644
--- a/tools/golang/xenlight/helpers.gen.go
+++ b/tools/golang/xenlight/helpers.gen.go
@@ -3348,6 +3348,7 @@ x.CapHvmDirectio = bool(xc.cap_hvm_directio)
 x.CapHap = bool(xc.cap_hap)
 x.CapShadow = bool(xc.cap_shadow)
 x.CapIommuHapPtShare = bool(xc.cap_iommu_hap_pt_share)
+x.CapVmtrace = bool(xc.cap_vmtrace)
 
  return nil}
 
@@ -3378,6 +3379,7 @@ xc.cap_hvm_directio = C.bool(x.CapHvmDirectio)
 xc.cap_hap = C.bool(x.CapHap)
 xc.cap_shadow = C.bool(x.CapShadow)
 xc.cap_iommu_hap_pt_share = C.bool(x.CapIommuHapPtShare)
+xc.cap_vmtrace = C.bool(x.CapVmtrace)
 
  return nil
  }
diff --git a/tools/golang/xenlight/types.gen.go b/tools/golang/xenlight/types.gen.go
index ac78dea1af..5851c38057 100644
--- a/tools/golang/xenlight/types.gen.go
+++ b/tools/golang/xenlight/types.gen.go
@@ -1000,6 +1000,7 @@ CapHvmDirectio bool
 CapHap bool
 CapShadow bool
 CapIommuHapPtShare bool
+CapVmtrace bool
 }
 
 type Connectorinfo struct {
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index 547ddd3085..f48d0c5e8a 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -482,6 +482,13 @@
 #define LIBXL_HAVE_DEVICE_PCI_NAME 1
 
 /*
+ * LIBXL_HAVE_PHYSINFO_CAP_VMTRACE indicates that libxl_physinfo has a
+ * cap_vmtrace field, which indicates the availability of platform tracing
+ * functionality.
+ */
+#define LIBXL_HAVE_PHYSINFO_CAP_VMTRACE 1
+
+/*
  * libxl ABI compatibility
  *
  * The only guarantee which libxl makes regarding ABI compatibility
diff --git a/tools/libs/light/libxl.c b/tools/libs/light/libxl.c
index d2a87157a2..204eb0be2d 100644
--- a/tools/libs/light/libxl.c
+++ b/tools/libs/light/libxl.c
@@ -402,6 +402,8 @@ int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo)
         !!(xcphysinfo.capabilities & XEN_SYSCTL_PHYSCAP_shadow);
     physinfo->cap_iommu_hap_pt_share =
         !!(xcphysinfo.capabilities & XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share);
+    physinfo->cap_vmtrace =
+        !!(xcphysinfo.capabilities & XEN_SYSCTL_PHYSCAP_vmtrace);
 
     GC_FREE;
     return 0;
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index 37fe61f3ec..dacb7df6b7 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -1053,6 +1053,7 @@ libxl_physinfo = Struct("physinfo", [
     ("cap_hap", bool),
     ("cap_shadow", bool),
     ("cap_iommu_hap_pt_share", bool),
+    ("cap_vmtrace", bool),
     ], dir=DIR_OUT)
 
 libxl_connectorinfo = Struct("connectorinfo", [
diff --git a/tools/ocaml/libs/xc/xenctrl.ml b/tools/ocaml/libs/xc/xenctrl.ml
index e0a47c4769..a02e26b27f 100644
--- a/tools/ocaml/libs/xc/xenctrl.ml
+++ b/tools/ocaml/libs/xc/xenctrl.ml
@@ -116,6 +116,7 @@ type physinfo_cap_flag =
 	| CAP_HAP
 	| CAP_Shadow
 	| CAP_IOMMU_HAP_PT_SHARE
+	| CAP_Vmtrace
 
 type physinfo =
 {
diff --git a/tools/ocaml/libs/xc/xenctrl.mli b/tools/ocaml/libs/xc/xenctrl.mli
index 84311fa33d..d2a312e273 100644
--- a/tools/ocaml/libs/xc/xenctrl.mli
+++ b/tools/ocaml/libs/xc/xenctrl.mli
@@ -101,6 +101,7 @@ type physinfo_cap_flag =
   | CAP_HAP
   | CAP_Shadow
   | CAP_IOMMU_HAP_PT_SHARE
+  | CAP_Vmtrace
 
 type physinfo = {
   threads_per_core : int;
diff --git a/tools/xl/xl_info.c b/tools/xl/xl_info.c
index ca417df8e8..8383e4a6df 100644
--- a/tools/xl/xl_info.c
+++ b/tools/xl/xl_info.c
@@ -210,14 +210,15 @@ static void output_physinfo(void)
          info.hw_cap[4], info.hw_cap[5], info.hw_cap[6], info.hw_cap[7]
         );
 
-    maybe_printf("virt_caps              :%s%s%s%s%s%s%s\n",
+    maybe_printf("virt_caps              :%s%s%s%s%s%s%s%s\n",
          info.cap_pv ? " pv" : "",
          info.cap_hvm ? " hvm" : "",
          info.cap_hvm && info.cap_hvm_directio ? " hvm_directio" : "",
          info.cap_pv && info.cap_hvm_directio ? " pv_directio" : "",
          info.cap_hap ? " hap" : "",
          info.cap_shadow ? " shadow" : "",
-         info.cap_iommu_hap_pt_share ? " iommu_hap_pt_share" : ""
+         info.cap_iommu_hap_pt_share ? " iommu_hap_pt_share" : "",
+         info.cap_vmtrace ? " vmtrace" : ""
         );
 
     vinfo = libxl_get_version_info(ctx);
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 2b461655c3..d1e94d88cf 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -82,6 +82,8 @@ struct vcpu *idle_vcpu[NR_CPUS] __read_mostly;
 
 vcpu_info_t dummy_vcpu_info;
 
+bool __read_mostly vmtrace_available;
+
 static void __domain_finalise_shutdown(struct domain *d)
 {
     struct vcpu *v;
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index ec916424e5..3558641cd9 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -277,6 +277,8 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
             if ( iommu_hap_pt_share )
                 pi->capabilities |= XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share;
         }
+        if ( vmtrace_available )
+            pi->capabilities |= XEN_SYSCTL_PHYSCAP_vmtrace;
 
         if ( copy_to_guest(u_sysctl, op, 1) )
             ret = -EFAULT;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
index a073647117..039ccf885c 100644
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -100,9 +100,10 @@ struct xen_sysctl_tbuf_op {
 #define _XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share 5
 #define XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share  \
     (1u << _XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share)
+#define XEN_SYSCTL_PHYSCAP_vmtrace       (1 << 6)
 
 /* Max XEN_SYSCTL_PHYSCAP_* constant.  Used for ABI checking. */
-#define XEN_SYSCTL_PHYSCAP_MAX XEN_SYSCTL_PHYSCAP_iommu_hap_pt_share
+#define XEN_SYSCTL_PHYSCAP_MAX XEN_SYSCTL_PHYSCAP_vmtrace
 
 struct xen_sysctl_physinfo {
     uint32_t threads_per_core;
diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
index cde0d9c7fe..1708c36964 100644
--- a/xen/include/xen/domain.h
+++ b/xen/include/xen/domain.h
@@ -131,4 +131,6 @@ void vnuma_destroy(struct vnuma_info *vnuma);
 static inline void vnuma_destroy(struct vnuma_info *vnuma) { ASSERT(!vnuma); }
 #endif
 
+extern bool vmtrace_available;
+
 #endif /* __XEN_DOMAIN_H__ */
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (6 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-02-01 10:51   ` Roger Pau Monné
  2021-02-01 13:18   ` Jan Beulich
  2021-01-30  2:58 ` [PATCH v8 09/16] tools/[lib]xl: Add vmtrace_buf_size parameter Andrew Cooper
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Jan Beulich,
	Roger Pau Monné,
	Wei Liu, Anthony PERARD, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

To use vmtrace, buffers of a suitable size need allocating, and different
tasks will want different sizes.

Add a domain creation parameter, and audit it appropriately in the
{arch_,}sanitise_domain_config() functions.

For now, the x86 specific auditing is tuned to Processor Trace running in
Single Output mode, which requires a single contiguous range of memory.

The size is given an arbitrary limit of 64M which is expected to be enough for
anticipated usecases, but not large enough to get into long-running-hypercall
problems.
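
The x86 audit boils down to the following size rule (a sketch only; the
authoritative check lives in arch_sanitise_domain_config() below):

    /* Sketch: one contiguous power-of-2 buffer, 4k <= size <= 64M. */
    static bool vmtrace_size_ok(uint32_t size)
    {
        return size >= 4096 && size <= (64u << 20) &&
               !(size & (size - 1));
    }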

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Anthony PERARD <anthony.perard@citrix.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

When support for later generations of IPT gets added, we can in principle start
to use ToPA (a scatter list of smaller trace regions) if we need to massively
increase the available buffer size.

v8:
 * Rename vmtrace_frames to vmtrace_size.  Reposition to fill a hole.
 * Rename vmtrace.buf to vmtrace.pg.
 * Rework the refcounting logic and comment it *very* clearly.

v7:
 * Major chop&change within the series.
 * Use the name 'vmtrace' consistently.
 * Use the (new) common vcpu_teardown() functionality, rather than leaving a
   latent memory leak on ARM.
---
 xen/arch/x86/domain.c       | 23 +++++++++++++
 xen/common/domain.c         | 79 +++++++++++++++++++++++++++++++++++++++++++++
 xen/include/public/domctl.h |  3 ++
 xen/include/xen/sched.h     |  6 ++++
 4 files changed, 111 insertions(+)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b9ba04633e..6c7ee25f3b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -660,6 +660,29 @@ int arch_sanitise_domain_config(struct xen_domctl_createdomain *config)
         return -EINVAL;
     }
 
+    if ( config->vmtrace_size )
+    {
+        unsigned int size = config->vmtrace_size;
+
+        ASSERT(vmtrace_available); /* Checked by common code. */
+
+        /*
+         * For now, vmtrace is restricted to HVM guests, and using a
+         * power-of-2 buffer between 4k and 64M in size.
+         */
+        if ( !hvm )
+        {
+            dprintk(XENLOG_INFO, "vmtrace not supported for PV\n");
+            return -EINVAL;
+        }
+
+        if ( size < PAGE_SIZE || size > MB(64) || (size & (size - 1)) )
+        {
+            dprintk(XENLOG_INFO, "Unsupported vmtrace size: %#x\n", size);
+            return -EINVAL;
+        }
+    }
+
     return 0;
 }
 
diff --git a/xen/common/domain.c b/xen/common/domain.c
index d1e94d88cf..491b32812e 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
     v->vcpu_info_mfn = INVALID_MFN;
 }
 
+static void vmtrace_free_buffer(struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    struct page_info *pg = v->vmtrace.pg;
+    unsigned int i;
+
+    if ( !pg )
+        return;
+
+    v->vmtrace.pg = NULL;
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+    {
+        put_page_alloc_ref(&pg[i]);
+        put_page_and_type(&pg[i]);
+    }
+}
+
+static int vmtrace_alloc_buffer(struct vcpu *v)
+{
+    struct domain *d = v->domain;
+    struct page_info *pg;
+    unsigned int i;
+
+    if ( !d->vmtrace_size )
+        return 0;
+
+    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
+                             MEMF_no_refcount);
+    if ( !pg )
+        return -ENOMEM;
+
+    /*
+     * Getting the reference counting correct here is hard.
+     *
+     * All pages are now on the domlist.  They, or subranges within, will be
+     * freed when their reference count drops to zero, which may happen any
+     * time between now and the domain teardown path.
+     */
+
+    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
+        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
+            goto refcnt_err;
+
+    /*
+     * We must only let vmtrace_free_buffer() take any action in the success
+     * case when we've taken all the refs it intends to drop.
+     */
+    v->vmtrace.pg = pg;
+
+    return 0;
+
+ refcnt_err:
+    /*
+     * In the failure case, we must drop all the acquired typerefs thus far,
+     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
+     * drop the alloc refs on any remaining pages - some pages could already
+     * have been freed behind our backs.
+     */
+    while ( i-- )
+        put_page_and_type(&pg[i]);
+
+    return -ENODATA;
+}
+
 /*
  * Release resources held by a vcpu.  There may or may not be live references
  * to the vcpu, and it may or may not be fully constructed.
@@ -140,6 +205,8 @@ static void vcpu_info_reset(struct vcpu *v)
  */
 static int vcpu_teardown(struct vcpu *v)
 {
+    vmtrace_free_buffer(v);
+
     return 0;
 }
 
@@ -201,6 +268,9 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
     if ( sched_init_vcpu(v) != 0 )
         goto fail_wq;
 
+    if ( vmtrace_alloc_buffer(v) != 0 )
+        goto fail_wq;
+
     if ( arch_vcpu_create(v) != 0 )
         goto fail_sched;
 
@@ -449,6 +519,12 @@ static int sanitise_domain_config(struct xen_domctl_createdomain *config)
         }
     }
 
+    if ( config->vmtrace_size && !vmtrace_available )
+    {
+        dprintk(XENLOG_INFO, "vmtrace requested but not available\n");
+        return -EINVAL;
+    }
+
     return arch_sanitise_domain_config(config);
 }
 
@@ -474,7 +550,10 @@ struct domain *domain_create(domid_t domid,
     ASSERT(is_system_domain(d) ? config == NULL : config != NULL);
 
     if ( config )
+    {
         d->options = config->flags;
+        d->vmtrace_size = config->vmtrace_size;
+    }
 
     /* Sort out our idea of is_control_domain(). */
     d->is_privileged = is_priv;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 666aeb71bf..88a5b1ef5d 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -95,6 +95,9 @@ struct xen_domctl_createdomain {
     int32_t max_grant_frames;
     int32_t max_maptrack_frames;
 
+    /* Per-vCPU buffer size in bytes.  0 to disable. */
+    uint32_t vmtrace_size;
+
     struct xen_arch_domainconfig arch;
 };
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 06dba1a397..bc78a09a53 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -272,6 +272,10 @@ struct vcpu
     /* vPCI per-vCPU area, used to store data for long running operations. */
     struct vpci_vcpu vpci;
 
+    struct {
+        struct page_info *pg; /* One contiguous allocation of d->vmtrace_size */
+    } vmtrace;
+
     struct arch_vcpu arch;
 
 #ifdef CONFIG_IOREQ_SERVER
@@ -547,6 +551,8 @@ struct domain
         unsigned int guest_request_sync          : 1;
     } monitor;
 
+    unsigned int vmtrace_size; /* Buffer size in bytes, or 0 to disable. */
+
 #ifdef CONFIG_ARGO
     /* Argo interdomain communication support */
     struct argo_domain *argo;
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 09/16] tools/[lib]xl: Add vmtrace_buf_size parameter
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (7 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 10/16] xen/memory: Add a vmtrace_buf resource type Andrew Cooper
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Ian Jackson,
	Wei Liu, Anthony PERARD, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Allow specifying the size of the per-vCPU trace buffer upon domain
creation.  This is zero by default (meaning: not enabled).
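
For example, setting vmtrace_buf_kb=64 in the guest config requests a 64 KiB
trace buffer per vCPU; libxl rounds the value up to whole pages and hands it
to Xen as the vmtrace_size domain creation parameter.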

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
CC: Ian Jackson <iwj@xenproject.org>
CC: Wei Liu <wl@xen.org>
CC: Anthony PERARD <anthony.perard@citrix.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Rebase over vmtrace_size change.
v7:
 * Use the name 'vmtrace' consistently.
---
 docs/man/xl.cfg.5.pod.in             | 9 +++++++++
 tools/golang/xenlight/helpers.gen.go | 2 ++
 tools/golang/xenlight/types.gen.go   | 1 +
 tools/include/libxl.h                | 7 +++++++
 tools/libs/light/libxl_create.c      | 1 +
 tools/libs/light/libxl_types.idl     | 4 ++++
 tools/xl/xl_parse.c                  | 4 ++++
 7 files changed, 28 insertions(+)

diff --git a/docs/man/xl.cfg.5.pod.in b/docs/man/xl.cfg.5.pod.in
index 7cdb8595d3..040374dcd6 100644
--- a/docs/man/xl.cfg.5.pod.in
+++ b/docs/man/xl.cfg.5.pod.in
@@ -681,6 +681,15 @@ Windows).
 
 If this option is not specified then it will default to B<false>.
 
+=item B<vmtrace_buf_kb=KBYTES>
+
+Specifies the size of the vmtrace buffer to be allocated for each
+vCPU belonging to this domain.  Disabled (i.e. B<vmtrace_buf_kb=0>) by
+default.
+
+B<NOTE>: Acceptable values are platform specific.  For Intel Processor
+Trace, this value must be a power of 2 between 4k and 64M.
+
 =back
 
 =head2 Devices
diff --git a/tools/golang/xenlight/helpers.gen.go b/tools/golang/xenlight/helpers.gen.go
index 63e2876463..4c60d27a9c 100644
--- a/tools/golang/xenlight/helpers.gen.go
+++ b/tools/golang/xenlight/helpers.gen.go
@@ -1114,6 +1114,7 @@ return fmt.Errorf("invalid union key '%v'", x.Type)}
 x.ArchArm.GicVersion = GicVersion(xc.arch_arm.gic_version)
 x.ArchArm.Vuart = VuartType(xc.arch_arm.vuart)
 x.Altp2M = Altp2MMode(xc.altp2m)
+x.VmtraceBufKb = int(xc.vmtrace_buf_kb)
 
  return nil}
 
@@ -1589,6 +1590,7 @@ return fmt.Errorf("invalid union key '%v'", x.Type)}
 xc.arch_arm.gic_version = C.libxl_gic_version(x.ArchArm.GicVersion)
 xc.arch_arm.vuart = C.libxl_vuart_type(x.ArchArm.Vuart)
 xc.altp2m = C.libxl_altp2m_mode(x.Altp2M)
+xc.vmtrace_buf_kb = C.int(x.VmtraceBufKb)
 
  return nil
  }
diff --git a/tools/golang/xenlight/types.gen.go b/tools/golang/xenlight/types.gen.go
index 5851c38057..cb13002fdb 100644
--- a/tools/golang/xenlight/types.gen.go
+++ b/tools/golang/xenlight/types.gen.go
@@ -514,6 +514,7 @@ GicVersion GicVersion
 Vuart VuartType
 }
 Altp2M Altp2MMode
+VmtraceBufKb int
 }
 
 type domainBuildInfoTypeUnion interface {
diff --git a/tools/include/libxl.h b/tools/include/libxl.h
index f48d0c5e8a..a7b673e89d 100644
--- a/tools/include/libxl.h
+++ b/tools/include/libxl.h
@@ -489,6 +489,13 @@
 #define LIBXL_HAVE_PHYSINFO_CAP_VMTRACE 1
 
 /*
+ * LIBXL_HAVE_VMTRACE_BUF_KB indicates that libxl_domain_build_info has a
+ * vmtrace_buf_kb parameter, which allows pre-allocation of processor
+ * tracing buffers of the given size.
+ */
+#define LIBXL_HAVE_VMTRACE_BUF_KB 1
+
+/*
  * libxl ABI compatibility
  *
  * The only guarantee which libxl makes regarding ABI compatibility
diff --git a/tools/libs/light/libxl_create.c b/tools/libs/light/libxl_create.c
index 9848d65f36..46f68da697 100644
--- a/tools/libs/light/libxl_create.c
+++ b/tools/libs/light/libxl_create.c
@@ -607,6 +607,7 @@ int libxl__domain_make(libxl__gc *gc, libxl_domain_config *d_config,
             .max_evtchn_port = b_info->event_channels,
             .max_grant_frames = b_info->max_grant_frames,
             .max_maptrack_frames = b_info->max_maptrack_frames,
+            .vmtrace_size = ROUNDUP(b_info->vmtrace_buf_kb << 10, XC_PAGE_SHIFT),
         };
 
         if (info->type != LIBXL_DOMAIN_TYPE_PV) {
diff --git a/tools/libs/light/libxl_types.idl b/tools/libs/light/libxl_types.idl
index dacb7df6b7..5b85a7419f 100644
--- a/tools/libs/light/libxl_types.idl
+++ b/tools/libs/light/libxl_types.idl
@@ -648,6 +648,10 @@ libxl_domain_build_info = Struct("domain_build_info",[
     # supported by x86 HVM and ARM support is planned.
     ("altp2m", libxl_altp2m_mode),
 
+    # Size of preallocated vmtrace buffers (in KBYTES).
+    # Use zero value to disable this feature.
+    ("vmtrace_buf_kb", integer),
+
     ], dir=DIR_IN,
        copy_deprecated_fn="libxl__domain_build_info_copy_deprecated",
 )
diff --git a/tools/xl/xl_parse.c b/tools/xl/xl_parse.c
index 867e4d068a..1893cfc086 100644
--- a/tools/xl/xl_parse.c
+++ b/tools/xl/xl_parse.c
@@ -1863,6 +1863,10 @@ void parse_config_data(const char *config_source,
         }
     }
 
+    if (!xlu_cfg_get_long(config, "vmtrace_buf_kb", &l, 1) && l) {
+        b_info->vmtrace_buf_kb = l;
+    }
+
     if (!xlu_cfg_get_list(config, "ioports", &ioports, &num_ioports, 0)) {
         b_info->num_ioports = num_ioports;
         b_info->ioports = calloc(num_ioports, sizeof(*b_info->ioports));
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 10/16] xen/memory: Add a vmtrace_buf resource type
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (8 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 09/16] tools/[lib]xl: Add vmtrace_buf_size parameter Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 11/16] x86/vmx: Add Intel Processor Trace support Andrew Cooper
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Jan Beulich,
	Roger Pau Monné,
	Wei Liu, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Allow mapping the processor trace buffer using acquire_resource().
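
A privileged foreign mapper would typically consume this through the stable
xenforeignmemory interface.  A minimal sketch (assuming 'domid' and the number
of trace frames, i.e. vmtrace_size >> PAGE_SHIFT, are already known to the
caller; error handling elided):

    #include <sys/mman.h>
    #include <xenctrl.h>
    #include <xenforeignmemory.h>

    xenforeignmemory_handle *fh = xenforeignmemory_open(NULL, 0);
    void *buf = NULL;

    /* Map vCPU 0's whole trace buffer read-only into our address space. */
    xenforeignmemory_resource_handle *res =
        xenforeignmemory_map_resource(fh, domid, XENMEM_resource_vmtrace_buf,
                                      /* id (vcpu) */ 0, /* frame */ 0,
                                      nr_frames, &buf, PROT_READ, 0);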

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Rebase over 'fault' and ARM/IOREQ series.

v7:
 * Rebase over changes elsewhere in the series
---
 xen/common/memory.c         | 29 +++++++++++++++++++++++++++++
 xen/include/public/memory.h |  1 +
 2 files changed, 30 insertions(+)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 128718b31c..fada97a79f 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1085,6 +1085,9 @@ static unsigned int resource_max_frames(const struct domain *d,
     case XENMEM_resource_ioreq_server:
         return ioreq_server_max_frames(d);
 
+    case XENMEM_resource_vmtrace_buf:
+        return d->vmtrace_size >> PAGE_SHIFT;
+
     default:
         return -EOPNOTSUPP;
     }
@@ -1125,6 +1128,29 @@ static int acquire_ioreq_server(struct domain *d,
 #endif
 }
 
+static int acquire_vmtrace_buf(
+    struct domain *d, unsigned int id, unsigned int frame,
+    unsigned int nr_frames, xen_pfn_t mfn_list[])
+{
+    const struct vcpu *v = domain_vcpu(d, id);
+    unsigned int i;
+    mfn_t mfn;
+
+    if ( !v )
+        return -ENOENT;
+
+    if ( !v->vmtrace.pg ||
+         (frame + nr_frames) > (d->vmtrace_size >> PAGE_SHIFT) )
+        return -EINVAL;
+
+    mfn = page_to_mfn(v->vmtrace.pg);
+
+    for ( i = 0; i < nr_frames; i++ )
+        mfn_list[i] = mfn_x(mfn) + frame + i;
+
+    return nr_frames;
+}
+
 /*
  * Returns -errno on error, or positive in the range [1, nr_frames] on
  * success.  Returning less than nr_frames constitutes a request for a
@@ -1142,6 +1168,9 @@ static int _acquire_resource(
     case XENMEM_resource_ioreq_server:
         return acquire_ioreq_server(d, id, frame, nr_frames, mfn_list);
 
+    case XENMEM_resource_vmtrace_buf:
+        return acquire_vmtrace_buf(d, id, frame, nr_frames, mfn_list);
+
     default:
         return -EOPNOTSUPP;
     }
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 020c79d757..50e73eef98 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -625,6 +625,7 @@ struct xen_mem_acquire_resource {
 
 #define XENMEM_resource_ioreq_server 0
 #define XENMEM_resource_grant_table 1
+#define XENMEM_resource_vmtrace_buf 2
 
     /*
      * IN - a type-specific resource identifier, which must be zero
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 11/16] x86/vmx: Add Intel Processor Trace support
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (9 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 10/16] xen/memory: Add a vmtrace_buf resource type Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op Andrew Cooper
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Jan Beulich,
	Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Add CPUID/MSR enumeration details for Processor Trace.  For now, we will only
support its use inside VMX operation.  Fill in the vmtrace_available boolean
to activate the newly introduced common infrastructure for allocating trace
buffers.

For now, Processor Trace is going to be operated in Single Output mode behind
the guest's back.  Add the MSRs to struct vcpu_msrs, and set up the buffer
limit in vmx_init_ipt() as it is fixed for the lifetime of the domain.
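
(As a concrete figure: with a 64 KiB vmtrace_size, vmx_init_ipt() below sets
rtit.output_limit to 0xffff, i.e. size - 1.)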

Context switch most of the MSRs in and out of vCPU context, but the main
control register needs to reside in the MSR load/save lists.  Explicitly pull
the msrs pointer out into a local variable, because the optimiser cannot keep
it live across the memory clobbers in the MSR accesses.

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Plumb bsp boolean down into vmx_init_vmcs_config()
 * Rename vmx_init_pt() to vmx_init_ipt()
 * Rebase over vmtrace_size/pg renames.

v7:
 * Major chop&change within the series.
 * Move MSRs to vcpu_msrs, which is where they'll definitely want to live when
   we offer PT to VMs for their own use.
---
 tools/libs/light/libxl_cpuid.c              |  1 +
 tools/misc/xen-cpuid.c                      |  2 +-
 xen/arch/x86/hvm/vmx/vmcs.c                 | 19 +++++++++++++---
 xen/arch/x86/hvm/vmx/vmx.c                  | 34 ++++++++++++++++++++++++++++-
 xen/include/asm-x86/cpufeature.h            |  1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h          |  4 ++++
 xen/include/asm-x86/msr.h                   | 32 +++++++++++++++++++++++++++
 xen/include/public/arch-x86/cpufeatureset.h |  1 +
 8 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index 259612834e..289c59c742 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -188,6 +188,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *cpuid, const char* str)
         {"avx512-ifma",  0x00000007,  0, CPUID_REG_EBX, 21,  1},
         {"clflushopt",   0x00000007,  0, CPUID_REG_EBX, 23,  1},
         {"clwb",         0x00000007,  0, CPUID_REG_EBX, 24,  1},
+        {"proc-trace",   0x00000007,  0, CPUID_REG_EBX, 25,  1},
         {"avx512pf",     0x00000007,  0, CPUID_REG_EBX, 26,  1},
         {"avx512er",     0x00000007,  0, CPUID_REG_EBX, 27,  1},
         {"avx512cd",     0x00000007,  0, CPUID_REG_EBX, 28,  1},
diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
index c81aa93055..2d04162d8d 100644
--- a/tools/misc/xen-cpuid.c
+++ b/tools/misc/xen-cpuid.c
@@ -106,7 +106,7 @@ static const char *const str_7b0[32] =
     [18] = "rdseed",   [19] = "adx",
     [20] = "smap",     [21] = "avx512-ifma",
     [22] = "pcommit",  [23] = "clflushopt",
-    [24] = "clwb",     [25] = "pt",
+    [24] = "clwb",     [25] = "proc-trace",
     [26] = "avx512pf", [27] = "avx512er",
     [28] = "avx512cd", [29] = "sha",
     [30] = "avx512bw", [31] = "avx512vl",
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 164535f8f0..f9f9bc18cd 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -243,7 +243,7 @@ static bool_t cap_check(const char *name, u32 expected, u32 saw)
     return saw != expected;
 }
 
-static int vmx_init_vmcs_config(void)
+static int vmx_init_vmcs_config(bool bsp)
 {
     u32 vmx_basic_msr_low, vmx_basic_msr_high, min, opt;
     u32 _vmx_pin_based_exec_control;
@@ -291,6 +291,20 @@ static int vmx_init_vmcs_config(void)
         _vmx_cpu_based_exec_control &=
             ~(CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING);
 
+    rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
+
+    /* Check whether IPT is supported in VMX operation. */
+    if ( bsp )
+        vmtrace_available = cpu_has_proc_trace &&
+                            (_vmx_misc_cap & VMX_MISC_PROC_TRACE);
+    else if ( vmtrace_available &&
+              !(_vmx_misc_cap & VMX_MISC_PROC_TRACE) )
+    {
+        printk("VMX: IPT capabilities differ between CPU%u and BSP\n",
+               smp_processor_id());
+        return -EINVAL;
+    }
+
     if ( _vmx_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS )
     {
         min = 0;
@@ -305,7 +319,6 @@ static int vmx_init_vmcs_config(void)
                SECONDARY_EXEC_ENABLE_VIRT_EXCEPTIONS |
                SECONDARY_EXEC_XSAVES |
                SECONDARY_EXEC_TSC_SCALING);
-        rdmsrl(MSR_IA32_VMX_MISC, _vmx_misc_cap);
         if ( _vmx_misc_cap & VMX_MISC_VMWRITE_ALL )
             opt |= SECONDARY_EXEC_ENABLE_VMCS_SHADOWING;
         if ( opt_vpid_enabled )
@@ -715,7 +728,7 @@ static int _vmx_cpu_up(bool bsp)
         wrmsr(MSR_IA32_FEATURE_CONTROL, eax, 0);
     }
 
-    if ( (rc = vmx_init_vmcs_config()) != 0 )
+    if ( (rc = vmx_init_vmcs_config(bsp)) != 0 )
         return rc;
 
     INIT_LIST_HEAD(&this_cpu(active_vmcs_list));
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 2d4475ee3d..12b961113e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -428,6 +428,20 @@ static void vmx_domain_relinquish_resources(struct domain *d)
     vmx_free_vlapic_mapping(d);
 }
 
+static void vmx_init_ipt(struct vcpu *v)
+{
+    unsigned int size = v->domain->vmtrace_size;
+
+    if ( !size )
+        return;
+
+    /* Checked by domain creation logic. */
+    ASSERT(v->vmtrace.pg);
+    ASSERT(size >= PAGE_SIZE && (size & (size - 1)) == 0);
+
+    v->arch.msrs->rtit.output_limit = size - 1;
+}
+
 static int vmx_vcpu_initialise(struct vcpu *v)
 {
     int rc;
@@ -470,6 +484,7 @@ static int vmx_vcpu_initialise(struct vcpu *v)
     }
 
     vmx_install_vlapic_mapping(v);
+    vmx_init_ipt(v);
 
     return 0;
 }
@@ -508,22 +523,39 @@ static void vmx_restore_host_msrs(void)
 
 static void vmx_save_guest_msrs(struct vcpu *v)
 {
+    struct vcpu_msrs *msrs = v->arch.msrs;
+
     /*
      * We cannot cache SHADOW_GS_BASE while the VCPU runs, as it can
      * be updated at any time via SWAPGS, which we cannot trap.
      */
     v->arch.hvm.vmx.shadow_gs = read_gs_shadow();
+
+    if ( v->arch.hvm.vmx.ipt_active )
+    {
+        rdmsrl(MSR_RTIT_OUTPUT_MASK, msrs->rtit.output_mask);
+        rdmsrl(MSR_RTIT_STATUS, msrs->rtit.status);
+    }
 }
 
 static void vmx_restore_guest_msrs(struct vcpu *v)
 {
+    const struct vcpu_msrs *msrs = v->arch.msrs;
+
     write_gs_shadow(v->arch.hvm.vmx.shadow_gs);
     wrmsrl(MSR_STAR,           v->arch.hvm.vmx.star);
     wrmsrl(MSR_LSTAR,          v->arch.hvm.vmx.lstar);
     wrmsrl(MSR_SYSCALL_MASK,   v->arch.hvm.vmx.sfmask);
 
     if ( cpu_has_msr_tsc_aux )
-        wrmsr_tsc_aux(v->arch.msrs->tsc_aux);
+        wrmsr_tsc_aux(msrs->tsc_aux);
+
+    if ( v->arch.hvm.vmx.ipt_active )
+    {
+        wrmsrl(MSR_RTIT_OUTPUT_BASE, page_to_maddr(v->vmtrace.pg));
+        wrmsrl(MSR_RTIT_OUTPUT_MASK, msrs->rtit.output_mask);
+        wrmsrl(MSR_RTIT_STATUS, msrs->rtit.status);
+    }
 }
 
 void vmx_update_cpu_exec_control(struct vcpu *v)
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index f62e526a96..33b2257888 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -105,6 +105,7 @@
 #define cpu_has_clwb            boot_cpu_has(X86_FEATURE_CLWB)
 #define cpu_has_avx512er        boot_cpu_has(X86_FEATURE_AVX512ER)
 #define cpu_has_avx512cd        boot_cpu_has(X86_FEATURE_AVX512CD)
+#define cpu_has_proc_trace      boot_cpu_has(X86_FEATURE_PROC_TRACE)
 #define cpu_has_sha             boot_cpu_has(X86_FEATURE_SHA)
 #define cpu_has_avx512bw        boot_cpu_has(X86_FEATURE_AVX512BW)
 #define cpu_has_avx512vl        boot_cpu_has(X86_FEATURE_AVX512VL)
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 906810592f..8073af323b 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -156,6 +156,9 @@ struct vmx_vcpu {
     /* Do we need to tolerate a spurious EPT_MISCONFIG VM exit? */
     bool_t               ept_spurious_misconfig;
 
+    /* Processor Trace configured and enabled for the vcpu. */
+    bool                 ipt_active;
+
     /* Is the guest in real mode? */
     uint8_t              vmx_realmode;
     /* Are we emulating rather than VMENTERing? */
@@ -283,6 +286,7 @@ extern u32 vmx_secondary_exec_control;
 #define VMX_VPID_INVVPID_SINGLE_CONTEXT_RETAINING_GLOBAL 0x80000000000ULL
 extern u64 vmx_ept_vpid_cap;
 
+#define VMX_MISC_PROC_TRACE                     0x00004000
 #define VMX_MISC_CR3_TARGET                     0x01ff0000
 #define VMX_MISC_VMWRITE_ALL                    0x20000000
 
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 16f95e7344..1d3eca9063 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -306,6 +306,38 @@ struct vcpu_msrs
         };
     } misc_features_enables;
 
+    /*
+     * 0x00000560 ... 57x - MSR_RTIT_*
+     *
+     * "Real Time Instruction Trace", now called Processor Trace.
+     *
+     * These MSRs are not exposed to guests.  They are controlled by Xen
+     * behind the scenes, when vmtrace is enabled for the domain.
+     *
+     * MSR_RTIT_OUTPUT_BASE not stored here.  It is fixed per vcpu, and
+     * derived from v->vmtrace.pg.
+     */
+    struct {
+        /*
+         * Placed in the MSR load/save lists.  Only modified by hypercall in
+         * the common case.
+         */
+        uint64_t ctl;
+
+        /*
+         * Updated by hardware in non-root mode.  Synchronised here on vcpu
+         * context switch.
+         */
+        uint64_t status;
+        union {
+            uint64_t output_mask;
+            struct {
+                uint32_t output_limit;
+                uint32_t output_offset;
+            };
+        };
+    } rtit;
+
     /* 0x00000da0 - MSR_IA32_XSS */
     struct {
         uint64_t raw;
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 6f7efaad6d..a501479820 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -217,6 +217,7 @@ XEN_CPUFEATURE(SMAP,          5*32+20) /*S  Supervisor Mode Access Prevention */
 XEN_CPUFEATURE(AVX512_IFMA,   5*32+21) /*A  AVX-512 Integer Fused Multiply Add */
 XEN_CPUFEATURE(CLFLUSHOPT,    5*32+23) /*A  CLFLUSHOPT instruction */
 XEN_CPUFEATURE(CLWB,          5*32+24) /*A  CLWB instruction */
+XEN_CPUFEATURE(PROC_TRACE,    5*32+25) /*   Processor Trace */
 XEN_CPUFEATURE(AVX512PF,      5*32+26) /*A  AVX-512 Prefetch Instructions */
 XEN_CPUFEATURE(AVX512ER,      5*32+27) /*A  AVX-512 Exponent & Reciprocal Instrs */
 XEN_CPUFEATURE(AVX512CD,      5*32+28) /*A  AVX-512 Conflict Detection Instrs */
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (10 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 11/16] x86/vmx: Add Intel Processor Trace support Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-02-01 12:01   ` Roger Pau Monné
  2021-01-30  2:58 ` [PATCH v8 13/16] tools/libxc: Add xc_vmtrace_* functions Andrew Cooper
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Jan Beulich,
	Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Implement an interface to configure and control tracing operations.  Reuse the
existing SETDEBUGGING flask vector rather than inventing a new one.

Userspace using this interface is going to need platform specific knowledge
anyway to interpret the contents of the trace buffer.  While some operations
(e.g. enable/disable) can reasonably be generic, others cannot.  Provide an
explicitly-platform specific pair of get/set operations to reduce API churn as
new options get added/enabled.

For the VMX specific Processor Trace implementation, tolerate reading and
modifying a safe subset of bits in CTL, STATUS and OUTPUT_MASK.  This permits
userspace to control the content which gets logged, but prevents modification
of details such as the position/size of the output buffer.
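
For example, with Intel PT a monitoring agent can use the set_option call on
MSR_RTIT_CTL to flip the whitelisted content bits (BranchEn, OS, USR, etc.)
and so choose what gets logged, whereas values with bits outside RTIT_CTL_MASK
are rejected with -EINVAL.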

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Reposition mask constants.

v7:
 * Major chop&change within the series.
---
 xen/arch/x86/domctl.c         |  55 +++++++++++++++
 xen/arch/x86/hvm/vmx/vmx.c    | 155 ++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/hvm.h |  63 +++++++++++++++++
 xen/include/public/domctl.h   |  35 ++++++++++
 xen/xsm/flask/hooks.c         |   1 +
 5 files changed, 309 insertions(+)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index b28cfe9817..b464465230 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -155,6 +155,55 @@ void arch_get_domain_info(const struct domain *d,
     info->arch_config.emulation_flags = d->arch.emulation_flags;
 }
 
+static int do_vmtrace_op(struct domain *d, struct xen_domctl_vmtrace_op *op,
+                         XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
+{
+    struct vcpu *v;
+    int rc;
+
+    if ( !d->vmtrace_size || d == current->domain /* No vcpu_pause() */ )
+        return -EINVAL;
+
+    ASSERT(is_hvm_domain(d)); /* Restricted by domain creation logic. */
+
+    v = domain_vcpu(d, op->vcpu);
+    if ( !v )
+        return -ENOENT;
+
+    vcpu_pause(v);
+    switch ( op->cmd )
+    {
+    case XEN_DOMCTL_vmtrace_enable:
+    case XEN_DOMCTL_vmtrace_disable:
+    case XEN_DOMCTL_vmtrace_reset_and_enable:
+        rc = hvm_vmtrace_control(
+            v, op->cmd != XEN_DOMCTL_vmtrace_disable,
+            op->cmd == XEN_DOMCTL_vmtrace_reset_and_enable);
+        break;
+
+    case XEN_DOMCTL_vmtrace_output_position:
+        rc = hvm_vmtrace_output_position(v, &op->value);
+        if ( rc >= 0 )
+            rc = 0;
+        break;
+
+    case XEN_DOMCTL_vmtrace_get_option:
+        rc = hvm_vmtrace_get_option(v, op->key, &op->value);
+        break;
+
+    case XEN_DOMCTL_vmtrace_set_option:
+        rc = hvm_vmtrace_set_option(v, op->key, op->value);
+        break;
+
+    default:
+        rc = -EOPNOTSUPP;
+        break;
+    }
+    vcpu_unpause(v);
+
+    return rc;
+}
+
 #define MAX_IOPORTS 0x10000
 
 long arch_do_domctl(
@@ -1320,6 +1369,12 @@ long arch_do_domctl(
         domain_unpause(d);
         break;
 
+    case XEN_DOMCTL_vmtrace_op:
+        ret = do_vmtrace_op(d, &domctl->u.vmtrace_op, u_domctl);
+        if ( !ret )
+            copyback = true;
+        break;
+
     default:
         ret = iommu_do_domctl(domctl, d, u_domctl);
         break;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 12b961113e..a64c4e4177 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2261,6 +2261,157 @@ static bool vmx_get_pending_event(struct vcpu *v, struct x86_event *info)
     return true;
 }
 
+/*
+ * We only let vmtrace agents see and modify a subset of bits in MSR_RTIT_CTL.
+ * These all pertain to data-emitted into the trace buffer(s).  Must not
+ * include controls pertaining to the structure/position of the trace
+ * buffer(s).
+ */
+#define RTIT_CTL_MASK                                                   \
+    (RTIT_CTL_TRACE_EN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TSC_EN | \
+     RTIT_CTL_DIS_RETC | RTIT_CTL_BRANCH_EN)
+
+/*
+ * Status bits restricted to the first-gen subset (i.e. no further CPUID
+ * requirements.)
+ */
+#define RTIT_STATUS_MASK                                                \
+    (RTIT_STATUS_FILTER_EN | RTIT_STATUS_CONTEXT_EN | RTIT_STATUS_TRIGGER_EN | \
+     RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED)
+
+static int vmtrace_get_option(struct vcpu *v, uint64_t key, uint64_t *output)
+{
+    const struct vcpu_msrs *msrs = v->arch.msrs;
+
+    switch ( key )
+    {
+    case MSR_RTIT_OUTPUT_MASK:
+        *output = msrs->rtit.output_mask;
+        break;
+
+    case MSR_RTIT_CTL:
+        *output = msrs->rtit.ctl & RTIT_CTL_MASK;
+        break;
+
+    case MSR_RTIT_STATUS:
+        *output = msrs->rtit.status & RTIT_STATUS_MASK;
+        break;
+
+    default:
+        *output = 0;
+        return -EINVAL;
+    }
+    return 0;
+}
+
+static int vmtrace_set_option(struct vcpu *v, uint64_t key, uint64_t value)
+{
+    struct vcpu_msrs *msrs = v->arch.msrs;
+    bool new_en, old_en = msrs->rtit.ctl & RTIT_CTL_TRACE_EN;
+
+    switch ( key )
+    {
+    case MSR_RTIT_OUTPUT_MASK:
+        /*
+         * MSR_RTIT_OUTPUT_MASK, when using Single Output mode, has a limit
+         * field in the lower 32 bits, and an offset in the upper 32 bits.
+         *
+         * Limit is fixed by the vmtrace buffer size and must not be
+         * controlled by userspace, while offset must be within the limit.
+         *
+         * Drop writes to the limit field to simplify userspace wanting to reset
+         * the offset by writing 0.
+         */
+        if ( (value >> 32) > msrs->rtit.output_limit )
+            return -EINVAL;
+        msrs->rtit.output_offset = value >> 32;
+        break;
+
+    case MSR_RTIT_CTL:
+        if ( value & ~RTIT_CTL_MASK )
+            return -EINVAL;
+
+        msrs->rtit.ctl &= ~RTIT_CTL_MASK;
+        msrs->rtit.ctl |= (value & RTIT_CTL_MASK);
+        break;
+
+    case MSR_RTIT_STATUS:
+        if ( value & ~RTIT_STATUS_MASK )
+            return -EINVAL;
+
+        msrs->rtit.status &= ~RTIT_STATUS_MASK;
+        msrs->rtit.status |= (value & RTIT_STATUS_MASK);
+        break;
+
+    default:
+        return -EINVAL;
+    }
+
+    new_en = msrs->rtit.ctl & RTIT_CTL_TRACE_EN;
+
+    /* ctl.trace_en changed => update MSR load/save lists appropriately. */
+    if ( !old_en && new_en )
+    {
+        if ( vmx_add_guest_msr(v, MSR_RTIT_CTL, msrs->rtit.ctl) ||
+             vmx_add_host_load_msr(v, MSR_RTIT_CTL, 0) )
+        {
+            /*
+             * The only failure cases here are failing the
+             * singleton-per-domain memory allocation, or exceeding the space
+             * in the allocation.  We could unwind in principle, but there is
+             * nothing userspace can usefully do to continue using this VM.
+             */
+            domain_crash(v->domain);
+            return -ENXIO;
+        }
+    }
+    else if ( old_en && !new_en )
+    {
+        vmx_del_msr(v, MSR_RTIT_CTL, VMX_MSR_GUEST);
+        vmx_del_msr(v, MSR_RTIT_CTL, VMX_MSR_HOST);
+    }
+
+    return 0;
+}
+
+static int vmtrace_control(struct vcpu *v, bool enable, bool reset)
+{
+    struct vcpu_msrs *msrs = v->arch.msrs;
+    uint64_t new_ctl;
+    int rc;
+
+    /*
+     * Absolutely nothing good will come of Xen's and userspace's idea of
+     * whether ipt is enabled getting out of sync.
+     */
+    if ( v->arch.hvm.vmx.ipt_active == enable )
+        return -EINVAL;
+
+    if ( reset )
+    {
+        msrs->rtit.status = 0;
+        msrs->rtit.output_offset = 0;
+    }
+
+    new_ctl = msrs->rtit.ctl & ~RTIT_CTL_TRACE_EN;
+    if ( enable )
+        new_ctl |= RTIT_CTL_TRACE_EN;
+
+    rc = vmtrace_set_option(v, MSR_RTIT_CTL, new_ctl);
+    if ( rc )
+        return rc;
+
+    v->arch.hvm.vmx.ipt_active = enable;
+
+    return 0;
+}
+
+static int vmtrace_output_position(struct vcpu *v, uint64_t *pos)
+{
+    *pos = v->arch.msrs->rtit.output_offset;
+    return v->arch.hvm.vmx.ipt_active;
+}
+
 static struct hvm_function_table __initdata vmx_function_table = {
     .name                 = "VMX",
     .cpu_up_prepare       = vmx_cpu_up_prepare,
@@ -2316,6 +2467,10 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .altp2m_vcpu_update_vmfunc_ve = vmx_vcpu_update_vmfunc_ve,
     .altp2m_vcpu_emulate_ve = vmx_vcpu_emulate_ve,
     .altp2m_vcpu_emulate_vmfunc = vmx_vcpu_emulate_vmfunc,
+    .vmtrace_control = vmtrace_control,
+    .vmtrace_output_position = vmtrace_output_position,
+    .vmtrace_set_option = vmtrace_set_option,
+    .vmtrace_get_option = vmtrace_get_option,
     .tsc_scaling = {
         .max_ratio = VMX_TSC_MULTIPLIER_MAX,
     },
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 334bd573b9..960ec03917 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -214,6 +214,12 @@ struct hvm_function_table {
     bool_t (*altp2m_vcpu_emulate_ve)(struct vcpu *v);
     int (*altp2m_vcpu_emulate_vmfunc)(const struct cpu_user_regs *regs);
 
+    /* vmtrace */
+    int (*vmtrace_control)(struct vcpu *v, bool enable, bool reset);
+    int (*vmtrace_output_position)(struct vcpu *v, uint64_t *pos);
+    int (*vmtrace_set_option)(struct vcpu *v, uint64_t key, uint64_t value);
+    int (*vmtrace_get_option)(struct vcpu *v, uint64_t key, uint64_t *value);
+
     /*
      * Parameters and callbacks for hardware-assisted TSC scaling,
      * which are valid only when the hardware feature is available.
@@ -655,6 +661,41 @@ static inline bool altp2m_vcpu_emulate_ve(struct vcpu *v)
     return false;
 }
 
+static inline int hvm_vmtrace_control(struct vcpu *v, bool enable, bool reset)
+{
+    if ( hvm_funcs.vmtrace_control )
+        return hvm_funcs.vmtrace_control(v, enable, reset);
+
+    return -EOPNOTSUPP;
+}
+
+/* Returns -errno, or a boolean of whether tracing is currently active. */
+static inline int hvm_vmtrace_output_position(struct vcpu *v, uint64_t *pos)
+{
+    if ( hvm_funcs.vmtrace_output_position )
+        return hvm_funcs.vmtrace_output_position(v, pos);
+
+    return -EOPNOTSUPP;
+}
+
+static inline int hvm_vmtrace_set_option(
+    struct vcpu *v, uint64_t key, uint64_t value)
+{
+    if ( hvm_funcs.vmtrace_set_option )
+        return hvm_funcs.vmtrace_set_option(v, key, value);
+
+    return -EOPNOTSUPP;
+}
+
+static inline int hvm_vmtrace_get_option(
+    struct vcpu *v, uint64_t key, uint64_t *value)
+{
+    if ( hvm_funcs.vmtrace_get_option )
+        return hvm_funcs.vmtrace_get_option(v, key, value);
+
+    return -EOPNOTSUPP;
+}
+
 /*
  * This must be defined as a macro instead of an inline function,
  * because it uses 'struct vcpu' and 'struct domain' which have
@@ -751,6 +792,28 @@ static inline bool hvm_has_set_descriptor_access_exiting(void)
     return false;
 }
 
+static inline int hvm_vmtrace_control(struct vcpu *v, bool enable, bool reset)
+{
+    return -EOPNOTSUPP;
+}
+
+static inline int hvm_vmtrace_output_position(struct vcpu *v, uint64_t *pos)
+{
+    return -EOPNOTSUPP;
+}
+
+static inline int hvm_vmtrace_set_option(
+    struct vcpu *v, uint64_t key, uint64_t value)
+{
+    return -EOPNOTSUPP;
+}
+
+static inline int hvm_vmtrace_get_option(
+    struct vcpu *v, uint64_t key, uint64_t *value)
+{
+    return -EOPNOTSUPP;
+}
+
 #define is_viridian_domain(d) ((void)(d), false)
 #define is_viridian_vcpu(v) ((void)(v), false)
 #define has_viridian_time_ref_count(d) ((void)(d), false)
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 88a5b1ef5d..4dbf107785 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -1135,6 +1135,39 @@ struct xen_domctl_vuart_op {
                                  */
 };
 
+/* XEN_DOMCTL_vmtrace_op: Perform VM tracing operations. */
+struct xen_domctl_vmtrace_op {
+    uint32_t cmd;           /* IN */
+    uint32_t vcpu;          /* IN */
+    uint64_aligned_t key;   /* IN     - @cmd specific data. */
+    uint64_aligned_t value; /* IN/OUT - @cmd specific data. */
+
+    /*
+     * General enable/disable of tracing.
+     *
+     * XEN_DOMCTL_vmtrace_reset_and_enable is provided as an optimisation for
+     * common usecases, which want to reset status and position information
+     * when turning tracing back on.
+     */
+#define XEN_DOMCTL_vmtrace_enable             1
+#define XEN_DOMCTL_vmtrace_disable            2
+#define XEN_DOMCTL_vmtrace_reset_and_enable   3
+
+    /* Obtain the current output position within the buffer.  Fills @value. */
+#define XEN_DOMCTL_vmtrace_output_position    4
+
+    /*
+     * Get/Set platform specific configuration.
+     *
+     * For Intel Processor Trace, @key/@value are interpreted as MSR
+     * reads/writes to MSR_RTIT_*, filtered to a safe subset.
+     */
+#define XEN_DOMCTL_vmtrace_get_option         5
+#define XEN_DOMCTL_vmtrace_set_option         6
+};
+typedef struct xen_domctl_vmtrace_op xen_domctl_vmtrace_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_vmtrace_op_t);
+
 struct xen_domctl {
     uint32_t cmd;
 #define XEN_DOMCTL_createdomain                   1
@@ -1219,6 +1252,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_vuart_op                      81
 #define XEN_DOMCTL_get_cpu_policy                82
 #define XEN_DOMCTL_set_cpu_policy                83
+#define XEN_DOMCTL_vmtrace_op                    84
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -1279,6 +1313,7 @@ struct xen_domctl {
         struct xen_domctl_monitor_op        monitor_op;
         struct xen_domctl_psr_alloc         psr_alloc;
         struct xen_domctl_vuart_op          vuart_op;
+        struct xen_domctl_vmtrace_op        vmtrace_op;
         uint8_t                             pad[128];
     } u;
 };
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
index 11784d7425..3b7313b949 100644
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -703,6 +703,7 @@ static int flask_domctl(struct domain *d, int cmd)
         return current_has_perm(d, SECCLASS_DOMAIN2, DOMAIN2__VM_EVENT);
 
     case XEN_DOMCTL_debug_op:
+    case XEN_DOMCTL_vmtrace_op:
     case XEN_DOMCTL_gdbsx_guestmemio:
     case XEN_DOMCTL_gdbsx_pausevcpu:
     case XEN_DOMCTL_gdbsx_unpausevcpu:
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 13/16] tools/libxc: Add xc_vmtrace_* functions
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (11 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 14/16] tools/misc: Add xen-vmtrace tool Andrew Cooper
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Ian Jackson,
	Wei Liu, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Add functions in libxc that use the new XEN_DOMCTL_vmtrace interface.
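
As a usage illustration (a sketch only, assuming an open xc_interface plus a
domain created with vmtrace_size set; error handling elided), a typical
tracing session built on these wrappers looks like:

    uint64_t pos;

    xc_vmtrace_reset_and_enable(xch, domid, vcpu);
    /* ... let the guest run for a while ... */
    xc_vmtrace_output_position(xch, domid, vcpu, &pos);
    /* 'pos' bytes of trace data now sit at the start of the mapped buffer. */
    xc_vmtrace_disable(xch, domid, vcpu);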

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
CC: Ian Jackson <iwj@xenproject.org>
CC: Wei Liu <wl@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v7:
 * Use the name 'vmtrace' consistently.
---
 tools/include/xenctrl.h      |  73 ++++++++++++++++++++++++
 tools/libs/ctrl/Makefile     |   1 +
 tools/libs/ctrl/xc_vmtrace.c | 128 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 202 insertions(+)
 create mode 100644 tools/libs/ctrl/xc_vmtrace.c

diff --git a/tools/include/xenctrl.h b/tools/include/xenctrl.h
index 3796425e1e..0efcdae8b4 100644
--- a/tools/include/xenctrl.h
+++ b/tools/include/xenctrl.h
@@ -1583,6 +1583,79 @@ int xc_tbuf_set_cpu_mask(xc_interface *xch, xc_cpumap_t mask);
 
 int xc_tbuf_set_evt_mask(xc_interface *xch, uint32_t mask);
 
+/**
+ * Enable vmtrace for a given vCPU.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_enable(xc_interface *xch, uint32_t domid, uint32_t vcpu);
+
+/**
+ * Disable vmtrace for a given vCPU.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_disable(xc_interface *xch, uint32_t domid, uint32_t vcpu);
+
+/**
+ * Enable vmtrace for a given vCPU, along with resetting status/offset
+ * details.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_reset_and_enable(xc_interface *xch, uint32_t domid,
+                                uint32_t vcpu);
+
+/**
+ * Get current output position inside the trace buffer.
+ *
+ * Repeated calls will return different values if tracing is enabled.  It is
+ * platform specific what happens when the buffer fills completely.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @parm pos current output position in bytes
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_output_position(xc_interface *xch, uint32_t domid,
+                               uint32_t vcpu, uint64_t *pos);
+
+/**
+ * Get platform specific vmtrace options.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @parm key platform-specific input
+ * @parm value platform-specific output
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_get_option(xc_interface *xch, uint32_t domid,
+                          uint32_t vcpu, uint64_t key, uint64_t *value);
+
+/**
+ * Set platform specific vmtrace options.
+ *
+ * @parm xch a handle to an open hypervisor interface
+ * @parm domid domain identifier
+ * @parm vcpu vcpu identifier
+ * @parm key platform-specific input
+ * @parm value platform-specific input
+ * @return 0 on success, -1 on failure
+ */
+int xc_vmtrace_set_option(xc_interface *xch, uint32_t domid,
+                          uint32_t vcpu, uint64_t key, uint64_t value);
+
 int xc_domctl(xc_interface *xch, struct xen_domctl *domctl);
 int xc_sysctl(xc_interface *xch, struct xen_sysctl *sysctl);
 
diff --git a/tools/libs/ctrl/Makefile b/tools/libs/ctrl/Makefile
index 6106e36c49..ce9ecae710 100644
--- a/tools/libs/ctrl/Makefile
+++ b/tools/libs/ctrl/Makefile
@@ -22,6 +22,7 @@ SRCS-y       += xc_pm.c
 SRCS-y       += xc_cpu_hotplug.c
 SRCS-y       += xc_resume.c
 SRCS-y       += xc_vm_event.c
+SRCS-y       += xc_vmtrace.c
 SRCS-y       += xc_monitor.c
 SRCS-y       += xc_mem_paging.c
 SRCS-y       += xc_mem_access.c
diff --git a/tools/libs/ctrl/xc_vmtrace.c b/tools/libs/ctrl/xc_vmtrace.c
new file mode 100644
index 0000000000..602502367f
--- /dev/null
+++ b/tools/libs/ctrl/xc_vmtrace.c
@@ -0,0 +1,128 @@
+/******************************************************************************
+ * xc_vmtrace.c
+ *
+ * API for manipulating hardware tracing features
+ *
+ * Copyright (c) 2020, Michal Leszczynski
+ *
+ * Copyright 2020 CERT Polska. All rights reserved.
+ * Use is subject to license terms.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation;
+ * version 2.1 of the License.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "xc_private.h"
+
+int xc_vmtrace_enable(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_enable,
+            .vcpu = vcpu,
+        },
+    };
+
+    return do_domctl(xch, &domctl);
+}
+
+int xc_vmtrace_disable(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_disable,
+            .vcpu = vcpu,
+        },
+    };
+
+    return do_domctl(xch, &domctl);
+}
+
+int xc_vmtrace_reset_and_enable(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_reset_and_enable,
+            .vcpu = vcpu,
+        },
+    };
+
+    return do_domctl(xch, &domctl);
+}
+
+int xc_vmtrace_output_position(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu, uint64_t *pos)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_output_position,
+            .vcpu = vcpu,
+        },
+    };
+    int rc = do_domctl(xch, &domctl);
+
+    if ( !rc )
+        *pos = domctl.u.vmtrace_op.value;
+
+    return rc;
+}
+
+int xc_vmtrace_get_option(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu,
+    uint64_t key, uint64_t *value)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_get_option,
+            .vcpu = vcpu,
+            .key = key,
+        },
+    };
+    int rc = do_domctl(xch, &domctl);
+
+    if ( !rc )
+        *value = domctl.u.vmtrace_op.value;
+
+    return rc;
+}
+
+int xc_vmtrace_set_option(
+    xc_interface *xch, uint32_t domid, uint32_t vcpu,
+    uint64_t key, uint64_t value)
+{
+    struct xen_domctl domctl = {
+        .cmd = XEN_DOMCTL_vmtrace_op,
+        .domain = domid,
+        .u.vmtrace_op = {
+            .cmd = XEN_DOMCTL_vmtrace_set_option,
+            .vcpu = vcpu,
+            .key = key,
+            .value = value,
+        },
+    };
+
+    return do_domctl(xch, &domctl);
+}
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 14/16] tools/misc: Add xen-vmtrace tool
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (12 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 13/16] tools/libxc: Add xc_vmtrace_* functions Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 15/16] xen/vmtrace: support for VM forks Andrew Cooper
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Michał Leszczyński, Andrew Cooper, Ian Jackson,
	Wei Liu, Tamas K Lengyel

From: Michał Leszczyński <michal.leszczynski@cert.pl>

Add a demonstration tool that uses the xc_vmtrace_* calls to manage
external IPT monitoring for a DomU.

Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
---
CC: Ian Jackson <iwj@xenproject.org>
CC: Wei Liu <wl@xen.org>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Switch to being a build-only target
---
 tools/misc/.gitignore    |   1 +
 tools/misc/Makefile      |   7 +++
 tools/misc/xen-vmtrace.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 162 insertions(+)
 create mode 100644 tools/misc/xen-vmtrace.c

diff --git a/tools/misc/.gitignore b/tools/misc/.gitignore
index b2c3b21f57..ce6f937d0c 100644
--- a/tools/misc/.gitignore
+++ b/tools/misc/.gitignore
@@ -1,3 +1,4 @@
 xen-access
 xen-memshare
 xen-ucode
+xen-vmtrace
diff --git a/tools/misc/Makefile b/tools/misc/Makefile
index 912c5d4f0e..2b683819d4 100644
--- a/tools/misc/Makefile
+++ b/tools/misc/Makefile
@@ -50,6 +50,10 @@ TARGETS_COPY += xenpvnetboot
 # Everything which needs to be built
 TARGETS_BUILD := $(filter-out $(TARGETS_COPY),$(TARGETS_ALL))
 
+# ... including build-only targets
+TARGETS_BUILD-$(CONFIG_X86)    += xen-vmtrace
+TARGETS_BUILD += $(TARGETS_BUILD-y)
+
 .PHONY: all build
 all build: $(TARGETS_BUILD)
 
@@ -90,6 +94,9 @@ xen-hvmcrash: xen-hvmcrash.o
 xen-memshare: xen-memshare.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
+xen-vmtrace: xen-vmtrace.o
+	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(LDLIBS_libxenforeignmemory) $(APPEND_LDFLAGS)
+
 xenperf: xenperf.o
 	$(CC) $(LDFLAGS) -o $@ $< $(LDLIBS_libxenctrl) $(APPEND_LDFLAGS)
 
diff --git a/tools/misc/xen-vmtrace.c b/tools/misc/xen-vmtrace.c
new file mode 100644
index 0000000000..47fea871cf
--- /dev/null
+++ b/tools/misc/xen-vmtrace.c
@@ -0,0 +1,154 @@
+/******************************************************************************
+ * tools/misc/xen-vmtrace.c
+ *
+ * Demonstrative tool for collecting Intel Processor Trace data from Xen.
+ *  Can be used to externally monitor a given vCPU in a given DomU.
+ *
+ * Copyright (C) 2020 by CERT Polska - NASK PIB
+ *
+ * Authors: Michał Leszczyński, michal.leszczynski@cert.pl
+ * Date:    June, 2020
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <err.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/mman.h>
+
+#include <xenctrl.h>
+#include <xenforeignmemory.h>
+
+#define MSR_RTIT_CTL                        0x00000570
+#define  RTIT_CTL_OS                        (1 <<  2)
+#define  RTIT_CTL_USR                       (1 <<  3)
+#define  RTIT_CTL_BRANCH_EN                 (1 << 13)
+
+static volatile int interrupted = 0;
+
+static xc_interface *xch;
+static xenforeignmemory_handle *fh;
+
+void int_handler(int signum)
+{
+    interrupted = 1;
+}
+
+int main(int argc, char **argv)
+{
+    uint32_t domid, vcpu;
+    int rc, exit = 1;
+    size_t size;
+    char *buf = NULL;
+    xenforeignmemory_resource_handle *fres = NULL;
+    uint64_t last_offset = 0;
+
+    if ( signal(SIGINT, int_handler) == SIG_ERR )
+        err(1, "Failed to register signal handler\n");
+
+    if ( argc != 3 )
+    {
+        fprintf(stderr, "Usage: %s <domid> <vcpu_id>\n", argv[0]);
+        fprintf(stderr, "It's recommended to redirect thisprogram's output to file\n");
+        fprintf(stderr, "or to pipe it's output to xxd or other program.\n");
+        return 1;
+    }
+
+    domid = atoi(argv[1]);
+    vcpu  = atoi(argv[2]);
+
+    xch = xc_interface_open(NULL, NULL, 0);
+    fh = xenforeignmemory_open(NULL, 0);
+
+    if ( !xch )
+        err(1, "xc_interface_open()");
+    if ( !fh )
+        err(1, "xenforeignmemory_open()");
+
+    rc = xenforeignmemory_resource_size(
+        fh, domid, XENMEM_resource_vmtrace_buf, vcpu, &size);
+    if ( rc )
+        err(1, "xenforeignmemory_resource_size()");
+
+    fres = xenforeignmemory_map_resource(
+        fh, domid, XENMEM_resource_vmtrace_buf, vcpu,
+        0, size >> XC_PAGE_SHIFT, (void **)&buf, PROT_READ, 0);
+    if ( !fres )
+        err(1, "xenforeignmemory_map_resource()");
+
+    if ( xc_vmtrace_set_option(
+             xch, domid, vcpu, MSR_RTIT_CTL,
+             RTIT_CTL_BRANCH_EN | RTIT_CTL_USR | RTIT_CTL_OS) )
+    {
+        perror("xc_vmtrace_set_option()");
+        goto out;
+    }
+
+    if ( xc_vmtrace_enable(xch, domid, vcpu) )
+    {
+        perror("xc_vmtrace_enable()");
+        goto out;
+    }
+
+    while ( !interrupted )
+    {
+        xc_dominfo_t dominfo;
+        uint64_t offset;
+
+        if ( xc_vmtrace_output_position(xch, domid, vcpu, &offset) )
+        {
+            perror("xc_vmtrace_output_position()");
+            goto out;
+        }
+
+        if ( offset > last_offset )
+            fwrite(buf + last_offset, offset - last_offset, 1, stdout);
+        else if ( offset < last_offset )
+        {
+            /* buffer wrapped */
+            fwrite(buf + last_offset, size - last_offset, 1, stdout);
+            fwrite(buf, offset, 1, stdout);
+        }
+
+        last_offset = offset;
+        usleep(1000 * 100);
+
+        if ( xc_domain_getinfo(xch, domid, 1, &dominfo) != 1 ||
+             dominfo.domid != domid || dominfo.shutdown )
+            break;
+    }
+
+    exit = 0;
+
+ out:
+    if ( xc_vmtrace_disable(xch, domid, vcpu) )
+        perror("xc_vmtrace_disable()");
+
+    if ( fres && xenforeignmemory_unmap_resource(fh, fres) )
+        perror("xenforeignmemory_unmap_resource()");
+
+    return exit;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 15/16] xen/vmtrace: support for VM forks
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (13 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 14/16] tools/misc: Add xen-vmtrace tool Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-01-30  2:58 ` [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event Andrew Cooper
  2021-02-01 12:34 ` [PATCH v8 00/16] acquire_resource size and external IPT monitoring Oleksandr
  16 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Tamas K Lengyel, Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Michał Leszczyński,
	Tamas K Lengyel

From: Tamas K Lengyel <tamas.lengyel@intel.com>

Implement the vmtrace_reset function, and properly set IPT state for
VM forks.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v7:
 * New
---
 xen/arch/x86/hvm/vmx/vmx.c    | 11 +++++++++++
 xen/arch/x86/mm/mem_sharing.c |  3 +++
 xen/include/asm-x86/hvm/hvm.h |  9 +++++++++
 3 files changed, 23 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a64c4e4177..5a33d8024e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2412,6 +2412,16 @@ static int vmtrace_output_position(struct vcpu *v, uint64_t *pos)
     return v->arch.hvm.vmx.ipt_active;
 }
 
+static int vmtrace_reset(struct vcpu *v)
+{
+    if ( !v->arch.hvm.vmx.ipt_active )
+        return -EINVAL;
+
+    v->arch.msrs->rtit.output_offset = 0;
+    v->arch.msrs->rtit.status = 0;
+    return 0;
+}
+
 static struct hvm_function_table __initdata vmx_function_table = {
     .name                 = "VMX",
     .cpu_up_prepare       = vmx_cpu_up_prepare,
@@ -2471,6 +2481,7 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .vmtrace_output_position = vmtrace_output_position,
     .vmtrace_set_option = vmtrace_set_option,
     .vmtrace_get_option = vmtrace_get_option,
+    .vmtrace_reset = vmtrace_reset,
     .tsc_scaling = {
         .max_ratio = VMX_TSC_MULTIPLIER_MAX,
     },
diff --git a/xen/arch/x86/mm/mem_sharing.c b/xen/arch/x86/mm/mem_sharing.c
index adaeab4612..00ada05c10 100644
--- a/xen/arch/x86/mm/mem_sharing.c
+++ b/xen/arch/x86/mm/mem_sharing.c
@@ -1632,6 +1632,8 @@ static int copy_vcpu_settings(struct domain *cd, const struct domain *d)
             copy_domain_page(new_vcpu_info_mfn, vcpu_info_mfn);
         }
 
+        hvm_vmtrace_reset(cd_vcpu);
+
         /*
          * TODO: to support VMs with PV interfaces copy additional
          * settings here, such as PV timers.
@@ -1782,6 +1784,7 @@ static int fork(struct domain *cd, struct domain *d)
         cd->max_pages = d->max_pages;
         *cd->arch.cpuid = *d->arch.cpuid;
         *cd->arch.msr = *d->arch.msr;
+        cd->vmtrace_size = d->vmtrace_size;
         cd->parent = d;
     }
 
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 960ec03917..150746de66 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -219,6 +219,7 @@ struct hvm_function_table {
     int (*vmtrace_output_position)(struct vcpu *v, uint64_t *pos);
     int (*vmtrace_set_option)(struct vcpu *v, uint64_t key, uint64_t value);
     int (*vmtrace_get_option)(struct vcpu *v, uint64_t key, uint64_t *value);
+    int (*vmtrace_reset)(struct vcpu *v);
 
     /*
      * Parameters and callbacks for hardware-assisted TSC scaling,
@@ -696,6 +697,14 @@ static inline int hvm_vmtrace_get_option(
     return -EOPNOTSUPP;
 }
 
+static inline int hvm_vmtrace_reset(struct vcpu *v)
+{
+    if ( hvm_funcs.vmtrace_reset )
+        return hvm_funcs.vmtrace_reset(v);
+
+    return -EOPNOTSUPP;
+}
+
 /*
  * This must be defined as a macro instead of an inline function,
  * because it uses 'struct vcpu' and 'struct domain' which have
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (14 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 15/16] xen/vmtrace: support for VM forks Andrew Cooper
@ 2021-01-30  2:58 ` Andrew Cooper
  2021-02-01  9:51   ` Jan Beulich
  2021-02-01 12:34 ` [PATCH v8 00/16] acquire_resource size and external IPT monitoring Oleksandr
  16 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-01-30  2:58 UTC (permalink / raw)
  To: Xen-devel
  Cc: Tamas K Lengyel, Andrew Cooper, Jan Beulich, Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Michał Leszczyński,
	Tamas K Lengyel

From: Tamas K Lengyel <tamas.lengyel@intel.com>

Add a vmtrace_pos field to the x86 regs in vm_event, initialized to ~0
if vmtrace is not in use.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Jun Nakajima <jun.nakajima@intel.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Michał Leszczyński <michal.leszczynski@cert.pl>
CC: Tamas K Lengyel <tamas@tklengyel.com>

v8:
 * Use 'vmtrace' consistently.

v7:
 * New
---
 xen/arch/x86/vm_event.c       | 3 +++
 xen/include/public/vm_event.h | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/xen/arch/x86/vm_event.c b/xen/arch/x86/vm_event.c
index 848d69c1b0..36272e9316 100644
--- a/xen/arch/x86/vm_event.c
+++ b/xen/arch/x86/vm_event.c
@@ -251,6 +251,9 @@ void vm_event_fill_regs(vm_event_request_t *req)
 
     req->data.regs.x86.shadow_gs = ctxt.shadow_gs;
     req->data.regs.x86.dr6 = ctxt.dr6;
+
+    if ( hvm_vmtrace_output_position(curr, &req->data.regs.x86.vmtrace_pos) != 1 )
+        req->data.regs.x86.vmtrace_pos = ~0;
 #endif
 }
 
diff --git a/xen/include/public/vm_event.h b/xen/include/public/vm_event.h
index 141ea024a3..147dc3ea73 100644
--- a/xen/include/public/vm_event.h
+++ b/xen/include/public/vm_event.h
@@ -223,6 +223,13 @@ struct vm_event_regs_x86 {
      */
     uint64_t npt_base;
 
+    /*
+     * Current position in the vmtrace buffer, or ~0 if vmtrace is not active.
+     *
+     * For Intel Processor Trace, it is the upper half of MSR_RTIT_OUTPUT_MASK.
+     */
+    uint64_t vmtrace_pos;
+
     uint32_t cs_base;
     uint32_t ss_base;
     uint32_t ds_base;
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event
  2021-01-30  2:58 ` [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event Andrew Cooper
@ 2021-02-01  9:51   ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-02-01  9:51 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tamas K Lengyel, Roger Pau Monné,
	Wei Liu, Jun Nakajima, Kevin Tian, Michał Leszczyński,
	Tamas K Lengyel, Xen-devel

On 30.01.2021 03:58, Andrew Cooper wrote:
> From: Tamas K Lengyel <tamas.lengyel@intel.com>
> 
> Add vmtrace_pos field to x86 regs in vm_event. Initialized to ~0 if
> vmtrace is not in use.
> 
> Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-01-30  2:58 ` [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource Andrew Cooper
@ 2021-02-01 10:10   ` Roger Pau Monné
  2021-02-01 11:11     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 10:10 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
> A guest's default number of grant frames is 64, and XENMEM_acquire_resource
> will reject an attempt to map more than 32 frames.  This limit is caused by
> the size of mfn_list[] on the stack.
> 
> Fix mapping of arbitrary size requests by looping over batches of 32 in
> acquire_resource(), and using hypercall continuations when necessary.
> 
> To start with, break _acquire_resource() out of acquire_resource() to cope
> with type-specific dispatching, and update the return semantics to indicate
> the number of mfns returned.  Update gnttab_acquire_resource() and x86's
> arch_acquire_resource() to match these new semantics.
> 
> Have do_memory_op() pass start_extent into acquire_resource() so it can pick
> up where it left off after a continuation, and loop over batches of 32 until
> all the work is done, or a continuation needs to occur.
> 
> compat_memory_op() is a bit more complicated, because it also has to marshal
> frame_list in the XLAT buffer.  Have it account for continuation information
> itself and hide details from the upper layer, so it can marshal the buffer in
> chunks if necessary.
> 
> With these fixes in place, it is now possible to map the whole grant table for
> a guest.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Just one comment/question regarding a continuation below.

I have to admit I had a hard time reviewing this, all this compat code
plus the continuation stuff is quite hard to follow.

> ---
> CC: George Dunlap <George.Dunlap@eu.citrix.com>
> CC: Ian Jackson <iwj@xenproject.org>
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Stefano Stabellini <sstabellini@kernel.org>
> CC: Wei Liu <wl@xen.org>
> CC: Julien Grall <julien@xen.org>
> CC: Paul Durrant <paul@xen.org>
> CC: Michał Leszczyński <michal.leszczynski@cert.pl>
> CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
> CC: Tamas K Lengyel <tamas@tklengyel.com>
> 
> v8:
>  * nat => cmp change in the start_extent check.
>  * Rebase over 'frame' and ARM/IOREQ series.
> 
> v3:
>  * Spelling fixes
> ---
>  xen/common/compat/memory.c |  94 +++++++++++++++++++++++++++-------
>  xen/common/grant_table.c   |   3 ++
>  xen/common/memory.c        | 124 +++++++++++++++++++++++++++++++++------------
>  3 files changed, 169 insertions(+), 52 deletions(-)
> 
> diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
> index 834c5e19d1..4c9cd9c05a 100644
> --- a/xen/common/compat/memory.c
> +++ b/xen/common/compat/memory.c
> @@ -402,23 +402,10 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>          case XENMEM_acquire_resource:
>          {
>              xen_pfn_t *xen_frame_list = NULL;
> -            unsigned int max_nr_frames;
>  
>              if ( copy_from_guest(&cmp.mar, compat, 1) )
>                  return -EFAULT;
>  
> -            /*
> -             * The number of frames handled is currently limited to a
> -             * small number by the underlying implementation, so the
> -             * scratch space should be sufficient for bouncing the
> -             * frame addresses.
> -             */
> -            max_nr_frames = (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
> -                sizeof(*xen_frame_list);
> -
> -            if ( cmp.mar.nr_frames > max_nr_frames )
> -                return -E2BIG;
> -
>              /* Marshal the frame list in the remainder of the xlat space. */
>              if ( !compat_handle_is_null(cmp.mar.frame_list) )
>                  xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
> @@ -432,6 +419,28 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>  
>              if ( xen_frame_list && cmp.mar.nr_frames )
>              {
> +                unsigned int xlat_max_frames =

Could be made const static I think?

> +                    (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
> +                    sizeof(*xen_frame_list);
> +
> +                if ( start_extent >= cmp.mar.nr_frames )
> +                    return -EINVAL;
> +
> +                /*
> +                 * Adjust nat to account for work done on previous
> +                 * continuations, leaving cmp pristine.  Hide the continaution
> +                 * from the native code to prevent double accounting.
> +                 */
> +                nat.mar->nr_frames -= start_extent;
> +                nat.mar->frame += start_extent;
> +                cmd &= MEMOP_CMD_MASK;
> +
> +                /*
> +                 * If there are two many frames to fit within the xlat buffer,
> +                 * we'll need to loop to marshal them all.
> +                 */
> +                nat.mar->nr_frames = min(nat.mar->nr_frames, xlat_max_frames);
> +
>                  /*
>                   * frame_list is an input for translated guests, and an output
>                   * for untranslated guests.  Only copy in for translated guests.
> @@ -444,14 +453,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>                                               cmp.mar.nr_frames) ||
>                           __copy_from_compat_offset(
>                               compat_frame_list, cmp.mar.frame_list,
> -                             0, cmp.mar.nr_frames) )
> +                             start_extent, nat.mar->nr_frames) )
>                          return -EFAULT;
>  
>                      /*
>                       * Iterate backwards over compat_frame_list[] expanding
>                       * compat_pfn_t to xen_pfn_t in place.
>                       */
> -                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
> +                    for ( int x = nat.mar->nr_frames - 1; x >= 0; --x )
>                          xen_frame_list[x] = compat_frame_list[x];

Unrelated question, but I don't really see the point of iterating
backwards, wouldn't it be easier to use the existing 'i' loop
counter with a for ( i = 0; i < nat.mar->nr_frames; i++ ) loop?

(Not that you need to fix it here, just curious about why we use that
construct instead).

>                  }
>              }
> @@ -600,9 +609,11 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>          case XENMEM_acquire_resource:
>          {
>              DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
> +            unsigned int done;
>  
>              if ( compat_handle_is_null(cmp.mar.frame_list) )
>              {
> +                ASSERT(split == 0 && rc == 0);
>                  if ( __copy_field_to_guest(
>                           guest_handle_cast(compat,
>                                             compat_mem_acquire_resource_t),
> @@ -611,6 +622,21 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>                  break;
>              }
>  
> +            if ( split < 0 )
> +            {
> +                /* Continuation occurred. */
> +                ASSERT(rc != XENMEM_acquire_resource);
> +                done = cmd >> MEMOP_EXTENT_SHIFT;
> +            }
> +            else
> +            {
> +                /* No continuation. */
> +                ASSERT(rc == 0);
> +                done = nat.mar->nr_frames;
> +            }
> +
> +            ASSERT(done <= nat.mar->nr_frames);
> +
>              /*
>               * frame_list is an input for translated guests, and an output for
>               * untranslated guests.  Only copy out for untranslated guests.
> @@ -626,7 +652,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>                   */
>                  BUILD_BUG_ON(sizeof(compat_pfn_t) > sizeof(xen_pfn_t));
>  
> -                for ( i = 0; i < cmp.mar.nr_frames; i++ )
> +                for ( i = 0; i < done; i++ )
>                  {
>                      compat_pfn_t frame = xen_frame_list[i];
>  
> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>                      compat_frame_list[i] = frame;
>                  }
>  
> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>                                               compat_frame_list,
> -                                             cmp.mar.nr_frames) )
> +                                             done) )
>                      return -EFAULT;

Is it fine to return with a possibly pending continuation already
encoded here?

Other places seem to crash the domain when that happens.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace
  2021-01-30  2:58 ` [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace Andrew Cooper
@ 2021-02-01 10:32   ` Roger Pau Monné
  0 siblings, 0 replies; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 10:32 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Wei Liu, Ian Jackson,
	Michał Leszczyński, Tamas K Lengyel

On Sat, Jan 30, 2021 at 02:58:43AM +0000, Andrew Cooper wrote:
> We're about to introduce support for Intel Processor Trace, but similar
> functionality exists in other platforms.
> 
> Aspects of vmtrace can reasonably can be common, so start with
> XEN_SYSCTL_PHYSCAP_vmtrace and plumb the signal from Xen all the way down into
> `xl info`.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-01-30  2:58 ` [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter Andrew Cooper
@ 2021-02-01 10:51   ` Roger Pau Monné
  2021-02-01 11:19     ` Andrew Cooper
  2021-02-01 13:18   ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 10:51 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Michał Leszczyński, Jan Beulich, Wei Liu,
	Anthony PERARD, Tamas K Lengyel

On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
> From: Michał Leszczyński <michal.leszczynski@cert.pl>
> 
> To use vmtrace, buffers of a suitable size need allocating, and different
> tasks will want different sizes.
> 
> Add a domain creation parameter, and audit it appropriately in the
> {arch_,}sanitise_domain_config() functions.
> 
> For now, the x86 specific auditing is tuned to Processor Trace running in
> Single Output mode, which requires a single contiguous range of memory.
> 
> The size is given an arbitrary limit of 64M which is expected to be enough for
> anticipated usecases, but not large enough to get into long-running-hypercall
> problems.
> 
> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index d1e94d88cf..491b32812e 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>      v->vcpu_info_mfn = INVALID_MFN;
>  }
>  
> +static void vmtrace_free_buffer(struct vcpu *v)
> +{
> +    const struct domain *d = v->domain;
> +    struct page_info *pg = v->vmtrace.pg;
> +    unsigned int i;
> +
> +    if ( !pg )
> +        return;
> +
> +    v->vmtrace.pg = NULL;
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +    {
> +        put_page_alloc_ref(&pg[i]);
> +        put_page_and_type(&pg[i]);
> +    }
> +}
> +
> +static int vmtrace_alloc_buffer(struct vcpu *v)

You might as well make this return true/false, as the error code is
ignored by the caller (at least in this patch).

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 10:10   ` Roger Pau Monné
@ 2021-02-01 11:11     ` Andrew Cooper
  2021-02-01 12:07       ` Roger Pau Monné
  2021-02-01 13:03       ` Jan Beulich
  0 siblings, 2 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 11:11 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

On 01/02/2021 10:10, Roger Pau Monné wrote:
> On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
>> A guest's default number of grant frames is 64, and XENMEM_acquire_resource
>> will reject an attempt to map more than 32 frames.  This limit is caused by
>> the size of mfn_list[] on the stack.
>>
>> Fix mapping of arbitrary size requests by looping over batches of 32 in
>> acquire_resource(), and using hypercall continuations when necessary.
>>
>> To start with, break _acquire_resource() out of acquire_resource() to cope
>> with type-specific dispatching, and update the return semantics to indicate
>> the number of mfns returned.  Update gnttab_acquire_resource() and x86's
>> arch_acquire_resource() to match these new semantics.
>>
>> Have do_memory_op() pass start_extent into acquire_resource() so it can pick
>> up where it left off after a continuation, and loop over batches of 32 until
>> all the work is done, or a continuation needs to occur.
>>
>> compat_memory_op() is a bit more complicated, because it also has to marshal
>> frame_list in the XLAT buffer.  Have it account for continuation information
>> itself and hide details from the upper layer, so it can marshal the buffer in
>> chunks if necessary.
>>
>> With these fixes in place, it is now possible to map the whole grant table for
>> a guest.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Just one comment/question regarding a continuation below.
>
> I have to admit I had a hard time reviewing this, all this compat code
> plus the continuation stuff is quite hard to follow.
>
>> ---
>> CC: George Dunlap <George.Dunlap@eu.citrix.com>
>> CC: Ian Jackson <iwj@xenproject.org>
>> CC: Jan Beulich <JBeulich@suse.com>
>> CC: Stefano Stabellini <sstabellini@kernel.org>
>> CC: Wei Liu <wl@xen.org>
>> CC: Julien Grall <julien@xen.org>
>> CC: Paul Durrant <paul@xen.org>
>> CC: Michał Leszczyński <michal.leszczynski@cert.pl>
>> CC: Hubert Jasudowicz <hubert.jasudowicz@cert.pl>
>> CC: Tamas K Lengyel <tamas@tklengyel.com>
>>
>> v8:
>>  * nat => cmp change in the start_extent check.
>>  * Rebase over 'frame' and ARM/IOREQ series.
>>
>> v3:
>>  * Spelling fixes
>> ---
>>  xen/common/compat/memory.c |  94 +++++++++++++++++++++++++++-------
>>  xen/common/grant_table.c   |   3 ++
>>  xen/common/memory.c        | 124 +++++++++++++++++++++++++++++++++------------
>>  3 files changed, 169 insertions(+), 52 deletions(-)
>>
>> diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
>> index 834c5e19d1..4c9cd9c05a 100644
>> --- a/xen/common/compat/memory.c
>> +++ b/xen/common/compat/memory.c
>> @@ -402,23 +402,10 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>          case XENMEM_acquire_resource:
>>          {
>>              xen_pfn_t *xen_frame_list = NULL;
>> -            unsigned int max_nr_frames;
>>  
>>              if ( copy_from_guest(&cmp.mar, compat, 1) )
>>                  return -EFAULT;
>>  
>> -            /*
>> -             * The number of frames handled is currently limited to a
>> -             * small number by the underlying implementation, so the
>> -             * scratch space should be sufficient for bouncing the
>> -             * frame addresses.
>> -             */
>> -            max_nr_frames = (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
>> -                sizeof(*xen_frame_list);
>> -
>> -            if ( cmp.mar.nr_frames > max_nr_frames )
>> -                return -E2BIG;
>> -
>>              /* Marshal the frame list in the remainder of the xlat space. */
>>              if ( !compat_handle_is_null(cmp.mar.frame_list) )
>>                  xen_frame_list = (xen_pfn_t *)(nat.mar + 1);
>> @@ -432,6 +419,28 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>  
>>              if ( xen_frame_list && cmp.mar.nr_frames )
>>              {
>> +                unsigned int xlat_max_frames =
> Could be made const static I think?

It is a compile time constant, but the compiler can already figure that
out.  static is definitely out of the question.

>
>> +                    (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
>> +                    sizeof(*xen_frame_list);
>> +
>> +                if ( start_extent >= cmp.mar.nr_frames )
>> +                    return -EINVAL;
>> +
>> +                /*
>> +                 * Adjust nat to account for work done on previous
>> +                 * continuations, leaving cmp pristine.  Hide the continaution
>> +                 * from the native code to prevent double accounting.
>> +                 */
>> +                nat.mar->nr_frames -= start_extent;
>> +                nat.mar->frame += start_extent;
>> +                cmd &= MEMOP_CMD_MASK;
>> +
>> +                /*
>> +                 * If there are two many frames to fit within the xlat buffer,
>> +                 * we'll need to loop to marshal them all.
>> +                 */
>> +                nat.mar->nr_frames = min(nat.mar->nr_frames, xlat_max_frames);
>> +
>>                  /*
>>                   * frame_list is an input for translated guests, and an output
>>                   * for untranslated guests.  Only copy in for translated guests.
>> @@ -444,14 +453,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>                                               cmp.mar.nr_frames) ||
>>                           __copy_from_compat_offset(
>>                               compat_frame_list, cmp.mar.frame_list,
>> -                             0, cmp.mar.nr_frames) )
>> +                             start_extent, nat.mar->nr_frames) )
>>                          return -EFAULT;
>>  
>>                      /*
>>                       * Iterate backwards over compat_frame_list[] expanding
>>                       * compat_pfn_t to xen_pfn_t in place.
>>                       */
>> -                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
>> +                    for ( int x = nat.mar->nr_frames - 1; x >= 0; --x )
>>                          xen_frame_list[x] = compat_frame_list[x];
> Unrelated question, but I don't really see the point of iterating
> backwards, wouldn't it be easier to use the existing 'i' loop
> counter with a for ( i = 0; i < nat.mar->nr_frames; i++ ) loop?
>
> (Not that you need to fix it here, just curious about why we use that
> construct instead).

Iterating backwards is totally critical.

xen_frame_list and compat_frame_list are the same numerical pointer. 
We've just filled it 50% full with compat_pfn_t's, and need to turn
these into xen_pfn_t's which are double the size.

Iterating forwards would clobber every entry but the first.
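
As a standalone illustration (not the hypercall code itself, just the
aliasing at work - both pointers refer to the same storage, like
compat_frame_list/xen_frame_list do):

    #include <stdint.h>

    static uint64_t storage[32];

    static void widen_in_place(unsigned int nr)
    {
        uint32_t *narrow = (uint32_t *)storage; /* stand-in for compat_pfn_t[] */
        uint64_t *wide   = storage;             /* stand-in for xen_pfn_t[] */

        /*
         * wide[i] occupies the bytes of narrow[2i] and narrow[2i+1].  Going
         * forwards, writing wide[0] already clobbers narrow[1] before it has
         * been read; going backwards only clobbers entries already consumed.
         */
        for ( int i = nr - 1; i >= 0; --i )
            wide[i] = narrow[i];
    }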

>
>>                  }
>>              }
>> @@ -600,9 +609,11 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>          case XENMEM_acquire_resource:
>>          {
>>              DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
>> +            unsigned int done;
>>  
>>              if ( compat_handle_is_null(cmp.mar.frame_list) )
>>              {
>> +                ASSERT(split == 0 && rc == 0);
>>                  if ( __copy_field_to_guest(
>>                           guest_handle_cast(compat,
>>                                             compat_mem_acquire_resource_t),
>> @@ -611,6 +622,21 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>                  break;
>>              }
>>  
>> +            if ( split < 0 )
>> +            {
>> +                /* Continuation occurred. */
>> +                ASSERT(rc != XENMEM_acquire_resource);
>> +                done = cmd >> MEMOP_EXTENT_SHIFT;
>> +            }
>> +            else
>> +            {
>> +                /* No continuation. */
>> +                ASSERT(rc == 0);
>> +                done = nat.mar->nr_frames;
>> +            }
>> +
>> +            ASSERT(done <= nat.mar->nr_frames);
>> +
>>              /*
>>               * frame_list is an input for translated guests, and an output for
>>               * untranslated guests.  Only copy out for untranslated guests.
>> @@ -626,7 +652,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>                   */
>>                  BUILD_BUG_ON(sizeof(compat_pfn_t) > sizeof(xen_pfn_t));
>>  
>> -                for ( i = 0; i < cmp.mar.nr_frames; i++ )
>> +                for ( i = 0; i < done; i++ )
>>                  {
>>                      compat_pfn_t frame = xen_frame_list[i];
>>  
>> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>                      compat_frame_list[i] = frame;
>>                  }
>>  
>> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
>> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>>                                               compat_frame_list,
>> -                                             cmp.mar.nr_frames) )
>> +                                             done) )
>>                      return -EFAULT;
> Is it fine to return with a possibly pending continuation already
> encoded here?
>
> Other places seem to crash the domain when that happens.

Hmm.  This is all a total mess.  (Elsewhere the error handling is also
broken - a caller who receives an error can't figure out how to recover)

But yes - I think you're right - the only thing we can do here is `goto
crash;` and woe betide any 32bit kernel which passes a pointer to a
read-only buffer.
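
i.e. roughly this shape (sketch only, modulo the exact structure of the
surrounding error paths):

    if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
                                 compat_frame_list, done) )
    {
        if ( split < 0 )
            /* Cannot cancel a continuation which has already been set up. */
            domain_crash(current->domain);

        return -EFAULT;
    }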

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-02-01 10:51   ` Roger Pau Monné
@ 2021-02-01 11:19     ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 11:19 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, Michał Leszczyński, Jan Beulich, Wei Liu,
	Anthony PERARD, Tamas K Lengyel

On 01/02/2021 10:51, Roger Pau Monné wrote:
> On Sat, Jan 30, 2021 at 02:58:44AM +0000, Andrew Cooper wrote:
>> From: Michał Leszczyński <michal.leszczynski@cert.pl>
>>
>> To use vmtrace, buffers of a suitable size need allocating, and different
>> tasks will want different sizes.
>>
>> Add a domain creation parameter, and audit it appropriately in the
>> {arch_,}sanitise_domain_config() functions.
>>
>> For now, the x86 specific auditing is tuned to Processor Trace running in
>> Single Output mode, which requires a single contiguous range of memory.
>>
>> The size is given an arbitrary limit of 64M which is expected to be enough for
>> anticipated usecases, but not large enough to get into long-running-hypercall
>> problems.
>>
>> Signed-off-by: Michał Leszczyński <michal.leszczynski@cert.pl>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index d1e94d88cf..491b32812e 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -132,6 +132,71 @@ static void vcpu_info_reset(struct vcpu *v)
>>      v->vcpu_info_mfn = INVALID_MFN;
>>  }
>>  
>> +static void vmtrace_free_buffer(struct vcpu *v)
>> +{
>> +    const struct domain *d = v->domain;
>> +    struct page_info *pg = v->vmtrace.pg;
>> +    unsigned int i;
>> +
>> +    if ( !pg )
>> +        return;
>> +
>> +    v->vmtrace.pg = NULL;
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +    {
>> +        put_page_alloc_ref(&pg[i]);
>> +        put_page_and_type(&pg[i]);
>> +    }
>> +}
>> +
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
> You might as well make this return true/false, as the error code is
> ignored by the caller (at least in this patch).

At some point vcpu_create() needs to be fixed not to lose the error
code.  That is the real bug here.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op
  2021-01-30  2:58 ` [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op Andrew Cooper
@ 2021-02-01 12:01   ` Roger Pau Monné
  2021-02-01 13:00     ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 12:01 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Michał Leszczyński, Jan Beulich, Wei Liu,
	Jun Nakajima, Kevin Tian, Tamas K Lengyel

On Sat, Jan 30, 2021 at 02:58:48AM +0000, Andrew Cooper wrote:
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 12b961113e..a64c4e4177 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -2261,6 +2261,157 @@ static bool vmx_get_pending_event(struct vcpu *v, struct x86_event *info)
>      return true;
>  }
>  
> +/*
> + * We only let vmtrace agents see and modify a subset of bits in MSR_RTIT_CTL.
> + * These all pertain to data-emitted into the trace buffer(s).  Must not
> + * include controls pertaining to the structure/position of the trace
> + * buffer(s).
> + */
> +#define RTIT_CTL_MASK                                                   \
> +    (RTIT_CTL_TRACE_EN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TSC_EN | \
> +     RTIT_CTL_DIS_RETC | RTIT_CTL_BRANCH_EN)
> +
> +/*
> + * Status bits restricted to the first-gen subset (i.e. no further CPUID
> + * requirements.)
> + */
> +#define RTIT_STATUS_MASK                                                \
> +    (RTIT_STATUS_FILTER_EN | RTIT_STATUS_CONTEXT_EN | RTIT_STATUS_TRIGGER_EN | \
> +     RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED)
> +
> +static int vmtrace_get_option(struct vcpu *v, uint64_t key, uint64_t *output)
> +{
> +    const struct vcpu_msrs *msrs = v->arch.msrs;
> +
> +    switch ( key )
> +    {
> +    case MSR_RTIT_OUTPUT_MASK:

Is there any value in returning the raw value of this MSR instead of
just using XEN_DOMCTL_vmtrace_output_position?

The size of the buffer should be known to user-space, and then setting
the offset could be done by adding a XEN_DOMCTL_vmtrace_set_output_position?

Also the contents of this MSR depend on whether ToPA mode is used, and
that's not under the control of the guest. So if Xen is switched to
use ToPA mode at some point the value of this MSR might not be what a
user of the interface expects.

From an interface PoV it might be better to offer:

XEN_DOMCTL_vmtrace_get_limit
XEN_DOMCTL_vmtrace_get_output_position
XEN_DOMCTL_vmtrace_set_output_position

IMO, as that would be compatible with ToPA if we ever switch to it.
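
For comparison, with the raw-MSR interface a consumer ends up open-coding
the Single Output layout itself, something like the sketch below (layout as
described in the SDM; the MSR index macro isn't provided by xenctrl.h, so a
caller would have to define it, as xen-vmtrace.c does for MSR_RTIT_CTL):

    /*
     * MSR_RTIT_OUTPUT_MASK in Single Output mode:
     *   bits 31:0  - mask, i.e. buffer size - 1
     *   bits 63:32 - current output offset
     */
    static int get_limit_and_pos(xc_interface *xch, uint32_t domid,
                                 uint32_t vcpu, uint64_t *limit, uint64_t *pos)
    {
        uint64_t raw;
        int rc = xc_vmtrace_get_option(xch, domid, vcpu,
                                       MSR_RTIT_OUTPUT_MASK, &raw);

        if ( rc )
            return rc;

        *limit = (raw & 0xffffffff) + 1;
        *pos   = raw >> 32;

        return 0;
    }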

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 11:11     ` Andrew Cooper
@ 2021-02-01 12:07       ` Roger Pau Monné
  2021-02-01 12:10         ` Andrew Cooper
  2021-02-01 13:03       ` Jan Beulich
  1 sibling, 1 reply; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 12:07 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

On Mon, Feb 01, 2021 at 11:11:37AM +0000, Andrew Cooper wrote:
> On 01/02/2021 10:10, Roger Pau Monné wrote:
> > On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
> >> +                    (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
> >> +                    sizeof(*xen_frame_list);
> >> +
> >> +                if ( start_extent >= cmp.mar.nr_frames )
> >> +                    return -EINVAL;
> >> +
> >> +                /*
> >> +                 * Adjust nat to account for work done on previous
> >> +                 * continuations, leaving cmp pristine.  Hide the continaution
> >> +                 * from the native code to prevent double accounting.
> >> +                 */
> >> +                nat.mar->nr_frames -= start_extent;
> >> +                nat.mar->frame += start_extent;
> >> +                cmd &= MEMOP_CMD_MASK;
> >> +
> >> +                /*
> >> +                 * If there are two many frames to fit within the xlat buffer,
> >> +                 * we'll need to loop to marshal them all.
> >> +                 */
> >> +                nat.mar->nr_frames = min(nat.mar->nr_frames, xlat_max_frames);
> >> +
> >>                  /*
> >>                   * frame_list is an input for translated guests, and an output
> >>                   * for untranslated guests.  Only copy in for translated guests.
> >> @@ -444,14 +453,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
> >>                                               cmp.mar.nr_frames) ||
> >>                           __copy_from_compat_offset(
> >>                               compat_frame_list, cmp.mar.frame_list,
> >> -                             0, cmp.mar.nr_frames) )
> >> +                             start_extent, nat.mar->nr_frames) )
> >>                          return -EFAULT;
> >>  
> >>                      /*
> >>                       * Iterate backwards over compat_frame_list[] expanding
> >>                       * compat_pfn_t to xen_pfn_t in place.
> >>                       */
> >> -                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
> >> +                    for ( int x = nat.mar->nr_frames - 1; x >= 0; --x )
> >>                          xen_frame_list[x] = compat_frame_list[x];
> > Unrelated question, but I don't really see the point of iterating
> > backwards, wouldn't it be easier to use the existing 'i' loop
> > counter with a for ( i = 0; i < nat.mar->nr_frames; i++ ) loop?
> >
> > (Not that you need to fix it here, just curious about why we use that
> > construct instead).
> 
> Iterating backwards is totally critical.
> 
> xen_frame_list and compat_frame_list are the same numerical pointer. 
> We've just filled it 50% full with compat_pfn_t's, and need to turn
> these into xen_pfn_t's which are double the size.
> 
> Iterating forwards would clobber every entry but the first.

Oh, I didn't realize they point to the same address. A comment would
help (not that you need to add it now).

> >
> >>                  }
> >>              }
> >> @@ -600,9 +609,11 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
> >>          case XENMEM_acquire_resource:
> >>          {
> >>              DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
> >> +            unsigned int done;
> >>  
> >>              if ( compat_handle_is_null(cmp.mar.frame_list) )
> >>              {
> >> +                ASSERT(split == 0 && rc == 0);
> >>                  if ( __copy_field_to_guest(
> >>                           guest_handle_cast(compat,
> >>                                             compat_mem_acquire_resource_t),
> >> @@ -611,6 +622,21 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
> >>                  break;
> >>              }
> >>  
> >> +            if ( split < 0 )
> >> +            {
> >> +                /* Continuation occurred. */
> >> +                ASSERT(rc != XENMEM_acquire_resource);
> >> +                done = cmd >> MEMOP_EXTENT_SHIFT;
> >> +            }
> >> +            else
> >> +            {
> >> +                /* No continuation. */
> >> +                ASSERT(rc == 0);
> >> +                done = nat.mar->nr_frames;
> >> +            }
> >> +
> >> +            ASSERT(done <= nat.mar->nr_frames);
> >> +
> >>              /*
> >>               * frame_list is an input for translated guests, and an output for
> >>               * untranslated guests.  Only copy out for untranslated guests.
> >> @@ -626,7 +652,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
> >>                   */
> >>                  BUILD_BUG_ON(sizeof(compat_pfn_t) > sizeof(xen_pfn_t));
> >>  
> >> -                for ( i = 0; i < cmp.mar.nr_frames; i++ )
> >> +                for ( i = 0; i < done; i++ )
> >>                  {
> >>                      compat_pfn_t frame = xen_frame_list[i];
> >>  
> >> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
> >>                      compat_frame_list[i] = frame;
> >>                  }
> >>  
> >> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
> >> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
> >>                                               compat_frame_list,
> >> -                                             cmp.mar.nr_frames) )
> >> +                                             done) )
> >>                      return -EFAULT;
> > Is it fine to return with a possibly pending continuation already
> > encoded here?
> >
> > Other places seem to crash the domain when that happens.
> 
> Hmm.  This is all a total mess.  (Elsewhere the error handling is also
> broken - a caller who receives an error can't figure out how to recover)
> 
> But yes - I think you're right - the only thing we can do here is `goto
> crash;` and woe betide any 32bit kernel which passes a pointer to a
> read-only buffer.

With that added:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 12:07       ` Roger Pau Monné
@ 2021-02-01 12:10         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 12:10 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, George Dunlap, Ian Jackson, Jan Beulich,
	Stefano Stabellini, Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel

On 01/02/2021 12:07, Roger Pau Monné wrote:
> On Mon, Feb 01, 2021 at 11:11:37AM +0000, Andrew Cooper wrote:
>> On 01/02/2021 10:10, Roger Pau Monné wrote:
>>> On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
>>>> +                    (COMPAT_ARG_XLAT_SIZE - sizeof(*nat.mar)) /
>>>> +                    sizeof(*xen_frame_list);
>>>> +
>>>> +                if ( start_extent >= cmp.mar.nr_frames )
>>>> +                    return -EINVAL;
>>>> +
>>>> +                /*
>>>> +                 * Adjust nat to account for work done on previous
>>>> +                 * continuations, leaving cmp pristine.  Hide the continaution
>>>> +                 * from the native code to prevent double accounting.
>>>> +                 */
>>>> +                nat.mar->nr_frames -= start_extent;
>>>> +                nat.mar->frame += start_extent;
>>>> +                cmd &= MEMOP_CMD_MASK;
>>>> +
>>>> +                /*
>>>> +                 * If there are two many frames to fit within the xlat buffer,
>>>> +                 * we'll need to loop to marshal them all.
>>>> +                 */
>>>> +                nat.mar->nr_frames = min(nat.mar->nr_frames, xlat_max_frames);
>>>> +
>>>>                  /*
>>>>                   * frame_list is an input for translated guests, and an output
>>>>                   * for untranslated guests.  Only copy in for translated guests.
>>>> @@ -444,14 +453,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>                                               cmp.mar.nr_frames) ||
>>>>                           __copy_from_compat_offset(
>>>>                               compat_frame_list, cmp.mar.frame_list,
>>>> -                             0, cmp.mar.nr_frames) )
>>>> +                             start_extent, nat.mar->nr_frames) )
>>>>                          return -EFAULT;
>>>>  
>>>>                      /*
>>>>                       * Iterate backwards over compat_frame_list[] expanding
>>>>                       * compat_pfn_t to xen_pfn_t in place.
>>>>                       */
>>>> -                    for ( int x = cmp.mar.nr_frames - 1; x >= 0; --x )
>>>> +                    for ( int x = nat.mar->nr_frames - 1; x >= 0; --x )
>>>>                          xen_frame_list[x] = compat_frame_list[x];
>>> Unrelated question, but I don't really see the point of iterating
>>> backwards, wouldn't it be easier to use the existing 'i' loop
>>> counter in a for ( i = 0; i < nat.mar->nr_frames; i++ )?
>>>
>>> (Not that you need to fix it here, just curious about why we use that
>>> construct instead).
>> Iterating backwards is totally critical.
>>
>> xen_frame_list and compat_frame_list are the same numerical pointer. 
>> We've just filled it 50% full with compat_pfn_t's, and need to turn
>> these into xen_pfn_t's which are double the size.
>>
>> Iterating forwards would clobber every entry but the first.
> Oh, I didn't realize they point to the same address. A comment would
> help (not that you need to add it now).

Well - that's what "expand ... in place" means in the existing comment. 
Suggestions for how to make it clearer?
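
To make the "expand ... in place" point concrete, here is a minimal
standalone sketch, with uint32_t/uint64_t standing in for compat_pfn_t
and xen_pfn_t (an illustration only, not the Xen code itself):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* One buffer, viewed through both the narrow and the wide type. */
        uint64_t buf[4];
        uint32_t *narrow = (uint32_t *)buf;   /* same numerical pointer */
        uint64_t *wide = buf;
        unsigned int nr = 4;

        /* The buffer starts half full of narrow (compat) entries... */
        for ( unsigned int i = 0; i < nr; i++ )
            narrow[i] = 100 + i;

        /*
         * ...and is widened in place.  Walking backwards is safe because
         * writing wide[i] only clobbers narrow[2i] and narrow[2i+1], which
         * have already been read (or lie past the end of the input).  A
         * forward walk would clobber every entry but the first.  (The
         * type-punning wants -fno-strict-aliasing or memcpy() in real code.)
         */
        for ( int i = nr - 1; i >= 0; --i )
            wide[i] = narrow[i];

        for ( unsigned int i = 0; i < nr; i++ )
            printf("%lu\n", (unsigned long)wide[i]);   /* 100 101 102 103 */

        return 0;
    }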

>
>>>>                  }
>>>>              }
>>>> @@ -600,9 +609,11 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>          case XENMEM_acquire_resource:
>>>>          {
>>>>              DEFINE_XEN_GUEST_HANDLE(compat_mem_acquire_resource_t);
>>>> +            unsigned int done;
>>>>  
>>>>              if ( compat_handle_is_null(cmp.mar.frame_list) )
>>>>              {
>>>> +                ASSERT(split == 0 && rc == 0);
>>>>                  if ( __copy_field_to_guest(
>>>>                           guest_handle_cast(compat,
>>>>                                             compat_mem_acquire_resource_t),
>>>> @@ -611,6 +622,21 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>                  break;
>>>>              }
>>>>  
>>>> +            if ( split < 0 )
>>>> +            {
>>>> +                /* Continuation occurred. */
>>>> +                ASSERT(rc != XENMEM_acquire_resource);
>>>> +                done = cmd >> MEMOP_EXTENT_SHIFT;
>>>> +            }
>>>> +            else
>>>> +            {
>>>> +                /* No continuation. */
>>>> +                ASSERT(rc == 0);
>>>> +                done = nat.mar->nr_frames;
>>>> +            }
>>>> +
>>>> +            ASSERT(done <= nat.mar->nr_frames);
>>>> +
>>>>              /*
>>>>               * frame_list is an input for translated guests, and an output for
>>>>               * untranslated guests.  Only copy out for untranslated guests.
>>>> @@ -626,7 +652,7 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>                   */
>>>>                  BUILD_BUG_ON(sizeof(compat_pfn_t) > sizeof(xen_pfn_t));
>>>>  
>>>> -                for ( i = 0; i < cmp.mar.nr_frames; i++ )
>>>> +                for ( i = 0; i < done; i++ )
>>>>                  {
>>>>                      compat_pfn_t frame = xen_frame_list[i];
>>>>  
>>>> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>                      compat_frame_list[i] = frame;
>>>>                  }
>>>>  
>>>> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
>>>> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>>>>                                               compat_frame_list,
>>>> -                                             cmp.mar.nr_frames) )
>>>> +                                             done) )
>>>>                      return -EFAULT;
>>> Is it fine to return with a possibly pending continuation already
>>> encoded here?
>>>
>>> Other places seem to crash the domain when that happens.
>> Hmm.  This is all a total mess.  (Elsewhere the error handling is also
>> broken - a caller who receives an error can't figure out how to recover)
>>
>> But yes - I think you're right - the only thing we can do here is `goto
>> crash;` and woe betide any 32bit kernel which passes a pointer to a
>> read-only buffer.
> With that added:
>
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 00/16] acquire_resource size and external IPT monitoring
  2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
                   ` (15 preceding siblings ...)
  2021-01-30  2:58 ` [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event Andrew Cooper
@ 2021-02-01 12:34 ` Oleksandr
  2021-02-01 13:07   ` Andrew Cooper
  16 siblings, 1 reply; 40+ messages in thread
From: Oleksandr @ 2021-02-01 12:34 UTC (permalink / raw)
  To: Andrew Cooper, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel


On 30.01.21 04:58, Andrew Cooper wrote:

Hi Andrew

> Combined series (as they are dependent).  First, the resource size fixes, and
> then the external IPT monitoring built on top.
>
> Posting in full for reference, but several patches are ready to go in.  Those
> in need of review are patch 6, 8 and 12.
>
> See individual patches for changes.  The major work was rebasing over the
> ARM/IOREQ series which moved a load of code which this series was bugfixing.

Looks like some of these patches have already been merged, so I did
preliminary testing of the current staging
(9dc687f155a57216b83b17f9cde55dd43e06b0cd x86/debug: fix page-overflow 
bug in dbg_rw_guest_mem) on Arm *with* IOREQ enabled.

I didn't notice any regressions with IOREQ on Arm))


>
> Andrew Cooper (7):
>    xen/memory: Reject out-of-range resource 'frame' values
>    xen/gnttab: Rework resource acquisition
>    xen/memory: Fix acquire_resource size semantics
>    xen/memory: Improve compat XENMEM_acquire_resource handling
>    xen/memory: Indent part of acquire_resource()
>    xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
>    xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace
>
> Michał Leszczyński (7):
>    xen/domain: Add vmtrace_size domain creation parameter
>    tools/[lib]xl: Add vmtrace_buf_size parameter
>    xen/memory: Add a vmtrace_buf resource type
>    x86/vmx: Add Intel Processor Trace support
>    xen/domctl: Add XEN_DOMCTL_vmtrace_op
>    tools/libxc: Add xc_vmtrace_* functions
>    tools/misc: Add xen-vmtrace tool
>
> Tamas K Lengyel (2):
>    xen/vmtrace: support for VM forks
>    x86/vm_event: Carry the vmtrace buffer position in vm_event
>
>   docs/man/xl.cfg.5.pod.in                    |   9 +
>   tools/golang/xenlight/helpers.gen.go        |   4 +
>   tools/golang/xenlight/types.gen.go          |   2 +
>   tools/include/libxl.h                       |  14 ++
>   tools/include/xenctrl.h                     |  73 ++++++++
>   tools/libs/ctrl/Makefile                    |   1 +
>   tools/libs/ctrl/xc_vmtrace.c                | 128 +++++++++++++
>   tools/libs/light/libxl.c                    |   2 +
>   tools/libs/light/libxl_cpuid.c              |   1 +
>   tools/libs/light/libxl_create.c             |   1 +
>   tools/libs/light/libxl_types.idl            |   5 +
>   tools/misc/.gitignore                       |   1 +
>   tools/misc/Makefile                         |   7 +
>   tools/misc/xen-cpuid.c                      |   2 +-
>   tools/misc/xen-vmtrace.c                    | 154 ++++++++++++++++
>   tools/ocaml/libs/xc/xenctrl.ml              |   1 +
>   tools/ocaml/libs/xc/xenctrl.mli             |   1 +
>   tools/xl/xl_info.c                          |   5 +-
>   tools/xl/xl_parse.c                         |   4 +
>   xen/arch/x86/domain.c                       |  23 +++
>   xen/arch/x86/domctl.c                       |  55 ++++++
>   xen/arch/x86/hvm/vmx/vmcs.c                 |  19 +-
>   xen/arch/x86/hvm/vmx/vmx.c                  | 200 +++++++++++++++++++-
>   xen/arch/x86/mm/mem_sharing.c               |   3 +
>   xen/arch/x86/vm_event.c                     |   3 +
>   xen/common/compat/memory.c                  | 147 +++++++++++----
>   xen/common/domain.c                         |  81 ++++++++
>   xen/common/grant_table.c                    | 112 ++++++++----
>   xen/common/ioreq.c                          |   2 +-
>   xen/common/memory.c                         | 274 +++++++++++++++++++---------
>   xen/common/sysctl.c                         |   2 +
>   xen/include/asm-x86/cpufeature.h            |   1 +
>   xen/include/asm-x86/hvm/hvm.h               |  72 ++++++++
>   xen/include/asm-x86/hvm/vmx/vmcs.h          |   4 +
>   xen/include/asm-x86/msr.h                   |  32 ++++
>   xen/include/public/arch-x86/cpufeatureset.h |   1 +
>   xen/include/public/domctl.h                 |  38 ++++
>   xen/include/public/memory.h                 |  18 +-
>   xen/include/public/sysctl.h                 |   3 +-
>   xen/include/public/vm_event.h               |   7 +
>   xen/include/xen/domain.h                    |   2 +
>   xen/include/xen/grant_table.h               |  21 ++-
>   xen/include/xen/ioreq.h                     |   2 +-
>   xen/include/xen/sched.h                     |   6 +
>   xen/xsm/flask/hooks.c                       |   1 +
>   45 files changed, 1366 insertions(+), 178 deletions(-)
>   create mode 100644 tools/libs/ctrl/xc_vmtrace.c
>   create mode 100644 tools/misc/xen-vmtrace.c
>
-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op
  2021-02-01 12:01   ` Roger Pau Monné
@ 2021-02-01 13:00     ` Andrew Cooper
  2021-02-01 14:27       ` Roger Pau Monné
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 13:00 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Xen-devel, Michał Leszczyński, Jan Beulich, Wei Liu,
	Jun Nakajima, Kevin Tian, Tamas K Lengyel

On 01/02/2021 12:01, Roger Pau Monné wrote:
> On Sat, Jan 30, 2021 at 02:58:48AM +0000, Andrew Cooper wrote:
>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>> index 12b961113e..a64c4e4177 100644
>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>> @@ -2261,6 +2261,157 @@ static bool vmx_get_pending_event(struct vcpu *v, struct x86_event *info)
>>      return true;
>>  }
>>  
>> +/*
>> + * We only let vmtrace agents see and modify a subset of bits in MSR_RTIT_CTL.
>> + * These all pertain to data-emitted into the trace buffer(s).  Must not
>> + * include controls pertaining to the structure/position of the trace
>> + * buffer(s).
>> + */
>> +#define RTIT_CTL_MASK                                                   \
>> +    (RTIT_CTL_TRACE_EN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TSC_EN | \
>> +     RTIT_CTL_DIS_RETC | RTIT_CTL_BRANCH_EN)
>> +
>> +/*
>> + * Status bits restricted to the first-gen subset (i.e. no further CPUID
>> + * requirements.)
>> + */
>> +#define RTIT_STATUS_MASK                                                \
>> +    (RTIT_STATUS_FILTER_EN | RTIT_STATUS_CONTEXT_EN | RTIT_STATUS_TRIGGER_EN | \
>> +     RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED)
>> +
>> +static int vmtrace_get_option(struct vcpu *v, uint64_t key, uint64_t *output)
>> +{
>> +    const struct vcpu_msrs *msrs = v->arch.msrs;
>> +
>> +    switch ( key )
>> +    {
>> +    case MSR_RTIT_OUTPUT_MASK:
> Is there any value in returning the raw value of this MSR instead of
> just using XEN_DOMCTL_vmtrace_output_position?

Yes, but for interface reasons.

There are deliberately some common interfaces (for the subset of options
expected to be useful), and some platform-specific ones (because there's
no possible way we can encode all of the options in some "common" interface).

Yes - there is some overlap between the two sets - that is unavoidable
IMO.  A user of this interface already needs platform specific knowledge
because it has to interpret the contents of the trace buffer.

Future extensions to this interface would be setting up the CR3 filter
and range filters, which definitely shouldn't be common, and can be
added without new subops in the current model.

> The size of the buffer should be known to user-space, and then setting
> the offset could be done by adding a XEN_DOMCTL_vmtrace_set_output_position?
>
> Also the contents of this MSR depend on whether ToPA mode is used, and
> that's not under the control of the guest. So if Xen is switched to
> use ToPA mode at some point the value of this MSR might not be what a
> user of the interface expects.
>
> From an interface PoV it might be better to offer:
>
> XEN_DOMCTL_vmtrace_get_limit
> XEN_DOMCTL_vmtrace_get_output_position
> XEN_DOMCTL_vmtrace_set_output_position
>
> IMO, as that would be compatible with ToPA if we ever switch to it.

ToPA is definitely more complicated.  We'd need to stitch the disparate
buffers back together into one logical view, at which point
get_output_position becomes more complicated.

As for set_output_position, that's not useful.  You either want to keep
the position as-is, or reset back to 0, hence having a platform-neutral
reset option.

However, based on this reasoning, I think I should drop access to
MSR_RTIT_OUTPUT_MASK entirely.  Neither half is useful for userspace to
access in a platform-specific way, and disallowing access entirely will
simplify adding ToPA support in the future.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 11:11     ` Andrew Cooper
  2021-02-01 12:07       ` Roger Pau Monné
@ 2021-02-01 13:03       ` Jan Beulich
  2021-02-01 14:04         ` Andrew Cooper
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-02-01 13:03 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel,
	Roger Pau Monné

On 01.02.2021 12:11, Andrew Cooper wrote:
> On 01/02/2021 10:10, Roger Pau Monné wrote:
>> On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
>>> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>                      compat_frame_list[i] = frame;
>>>                  }
>>>  
>>> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
>>> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>>>                                               compat_frame_list,
>>> -                                             cmp.mar.nr_frames) )
>>> +                                             done) )
>>>                      return -EFAULT;
>> Is it fine to return with a possibly pending continuation already
>> encoded here?
>>
>> Other places seem to crash the domain when that happens.
> 
> Hmm.  This is all a total mess.  (Elsewhere the error handling is also
> broken - a caller who receives an error can't figure out how to recover)
> 
> But yes - I think you're right - the only thing we can do here is `goto
> crash;` and woe betide any 32bit kernel which passes a pointer to a
> read-only buffer.

I'd like to ask you to reconsider the "goto crash", both the one
you mention above and the other one already present in the patch.
Wiring all the cases where we mean to crash the guest into a
single domain_crash() invocation has the downside that when
observing such a case one can't remotely know which path has led
there. Therefore I'd like to suggest individual domain_crash()
invocations on every affected path. Elsewhere in the file there
already is such an instance, commented "Cannot cancel the
continuation...".
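
To illustrate the suggestion (a sketch only, not the eventual patch, and
the log message is made up): instead of funnelling into one shared label,
each failing path would carry its own diagnostic, e.g.:

    if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
                                 compat_frame_list, done) )
    {
        /*
         * Cannot cancel the continuation we may already have encoded;
         * log this specific path so it is distinguishable from other
         * crash sites, then take the domain down.
         */
        gprintk(XENLOG_ERR,
                "Copy-out of acquire_resource frame_list failed\n");
        domain_crash(current->domain);
        return -EFAULT;
    }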

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 00/16] acquire_resource size and external IPT monitoring
  2021-02-01 12:34 ` [PATCH v8 00/16] acquire_resource size and external IPT monitoring Oleksandr
@ 2021-02-01 13:07   ` Andrew Cooper
  2021-02-01 13:47     ` Oleksandr
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 13:07 UTC (permalink / raw)
  To: Oleksandr, Xen-devel
  Cc: Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel

On 01/02/2021 12:34, Oleksandr wrote:
>
> On 30.01.21 04:58, Andrew Cooper wrote:
>
> Hi Andrew
>
>> Combined series (as they are dependent).  First, the resource size
>> fixes, and
>> then the external IPT monitoring built on top.
>>
>> Posting in full for reference, but several patches are ready to go
>> in.  Those
>> in need of review are patch 6, 8 and 12.
>>
>> See individual patches for changes.  The major work was rebasing over
>> the
>> ARM/IOREQ series which moved a load of code which this series was
>> bugfixing.
>
> Looks like some of these patches have already been merged, so I did
> preliminary testing of the current staging
> (9dc687f155a57216b83b17f9cde55dd43e06b0cd x86/debug: fix page-overflow
> bug in dbg_rw_guest_mem) on Arm *with* IOREQ enabled.
>
> I didn't notice any regressions with IOREQ on Arm))

Fantastic!

Tamas and I did do extended testing on the subset which got committed,
before it went in, and it is all fixing of corner cases, rather than
fundamentally changing how things worked.


One query I did leave on IRC, and hasn't had an answer.

What is the maximum number of vcpus in an ARM guest?  You moved an
x86-ism "max 128 vcpus" into common code.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-01-30  2:58 ` [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter Andrew Cooper
  2021-02-01 10:51   ` Roger Pau Monné
@ 2021-02-01 13:18   ` Jan Beulich
  2021-02-01 14:22     ` Andrew Cooper
  1 sibling, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-02-01 13:18 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Michał Leszczyński, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Tamas K Lengyel, Xen-devel

On 30.01.2021 03:58, Andrew Cooper wrote:
> +static int vmtrace_alloc_buffer(struct vcpu *v)
> +{
> +    struct domain *d = v->domain;
> +    struct page_info *pg;
> +    unsigned int i;
> +
> +    if ( !d->vmtrace_size )
> +        return 0;
> +
> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
> +                             MEMF_no_refcount);
> +    if ( !pg )
> +        return -ENOMEM;
> +
> +    /*
> +     * Getting the reference counting correct here is hard.
> +     *
> +     * All pages are now on the domlist.  They, or subranges within, will be

"domlist" is too imprecise, as there's no list with this name. It's
extra_page_list in this case (see also below).

> +     * freed when their reference count drops to zero, which may happen any time
> +     * between now and the domain teardown path.
> +     */
> +
> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
> +            goto refcnt_err;
> +
> +    /*
> +     * We must only let vmtrace_free_buffer() take any action in the success
> +     * case when we've taken all the refs it intends to drop.
> +     */
> +    v->vmtrace.pg = pg;
> +
> +    return 0;
> +
> + refcnt_err:
> +    /*
> +     * In the failure case, we must drop all the acquired typerefs thus far,
> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
> +     * drop the alloc refs on any remaining pages - some pages could already
> +     * have been freed behind our backs.
> +     */
> +    while ( i-- )
> +        put_page_and_type(&pg[i]);
> +
> +    return -ENODATA;
> +}

As said in reply on the other thread, PGC_extra pages don't get
freed automatically. I too initially thought they would, but
(re-)learned otherwise when trying to repro your claims on that
other thread. For all pages you've managed to get the writable
ref, freeing is easily done by prefixing the loop body above by
put_page_alloc_ref(). For all other pages best you can do (I
think; see the debugging patches I had sent on that other
thread) is to try get_page() - if it succeeds, calling
put_page_alloc_ref() is allowed. Otherwise we can only leak the
respective page (unless going to further extents with trying to
recover from the "impossible"), or assume the failure here was
because it did get freed already.
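
For illustration only - this is a sketch of the recovery just described,
not the eventual patch, and 'nr' and 'fail' are hypothetical locals for
the total page count and the index the typeref loop stopped at:

 refcnt_err:
    /* Pages [0, fail) hold our writable typeref; free them outright. */
    for ( i = 0; i < fail; i++ )
    {
        put_page_alloc_ref(&pg[i]);
        put_page_and_type(&pg[i]);
    }

    /*
     * Pages [fail, nr) hold no typeref of ours.  A successful get_page()
     * proves the page is still live, so dropping the allocation ref is
     * safe; on failure the page has presumably been freed behind our
     * back already, and the best we can do is leak it.
     */
    for ( i = fail; i < nr; i++ )
    {
        if ( get_page(&pg[i], d) )
        {
            put_page_alloc_ref(&pg[i]);
            put_page(&pg[i]);
        }
    }

    return -ENODATA;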

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 00/16] acquire_resource size and external IPT monitoring
  2021-02-01 13:07   ` Andrew Cooper
@ 2021-02-01 13:47     ` Oleksandr
  2021-02-01 14:00       ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Oleksandr @ 2021-02-01 13:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel


On 01.02.21 15:07, Andrew Cooper wrote:

Hi Andrew

> On 01/02/2021 12:34, Oleksandr wrote:
>> On 30.01.21 04:58, Andrew Cooper wrote:
>>
>> Hi Andrew
>>
>>> Combined series (as they are dependent).  First, the resource size
>>> fixes, and
>>> then the external IPT monitoring built on top.
>>>
>>> Posting in full for reference, but several patches are ready to go
>>> in.  Those
>>> in need of review are patch 6, 8 and 12.
>>>
>>> See individual patches for changes.  The major work was rebasing over
>>> the
>>> ARM/IOREQ series which moved a load of code which this series was
>>> bugfixing.
>> Looks like some of these patches have already been merged, so I did
>> preliminary testing of the current staging
>> (9dc687f155a57216b83b17f9cde55dd43e06b0cd x86/debug: fix page-overflow
>> bug in dbg_rw_guest_mem) on Arm *with* IOREQ enabled.
>>
>> I didn't notice any regressions with IOREQ on Arm))
> Fantastic!
>
> Tamas and I did do extended testing on the subset which got committed,
> before it went in, and it is all fixing of corner cases, rather than
> fundamentally changing how things worked.
>
>
> One query I did leave on IRC, and hasn't had an answer.
>
> What is the maximum number of vcpus in an ARM guest?

public/arch-arm.h says that the currently supported maximum number of
guest VCPUs is 128.


> You moved an
> x86-ism "max 128 vcpus" into common code.

Ooh, I am not sure I understand where exactly. Could you please clarify 
in which patch?


>
> ~Andrew

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 00/16] acquire_resource size and external IPT monitoring
  2021-02-01 13:47     ` Oleksandr
@ 2021-02-01 14:00       ` Andrew Cooper
  2021-02-02 12:09         ` Oleksandr
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 14:00 UTC (permalink / raw)
  To: Oleksandr
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel

On 01/02/2021 13:47, Oleksandr wrote:
>
> On 01.02.21 15:07, Andrew Cooper wrote:
>
> Hi Andrew
>
>> On 01/02/2021 12:34, Oleksandr wrote:
>>> On 30.01.21 04:58, Andrew Cooper wrote:
>>
>> One query I did leave on IRC, and hasn't had an answer.
>>
>> What is the maximum number of vcpus in an ARM guest?
>
> public/arch-arm.h says that the currently supported maximum number of
> guest VCPUs is 128.
>
>
>> You moved an
>> x86-ism "max 128 vcpus" into common code.
>
> Ooh, I am not sure I understand where exactly. Could you please
> clarify in which patch?

ioreq_server_get_frame() hardcodes "there is exactly one non-bufioreq
frame", which in practice means there is 128 vcpu's work of struct
ioreqs contained within the mapping.

I've coded ioreq_server_max_frames() to perform the calculation
correctly, but ioreq_server_get_frame() will need fixing by whoever
supports more than 128 vcpus with ioreq servers first.
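
For anyone following along, the arithmetic behind those numbers can be
sketched standalone, assuming the 32-byte struct ioreq of the public ABI
and 4k frames (the function name below is just a stand-in, not the one
in the series):

    #include <stdio.h>

    #define PAGE_SIZE  4096u
    #define IOREQ_SIZE   32u   /* sizeof(struct ioreq) in the public ABI */
    #define DIV_ROUND_UP(x, y) (((x) + (y) - 1) / (y))

    /* One bufioreq frame, plus enough frames for one struct ioreq per vcpu. */
    static unsigned int ioreq_frames(unsigned int max_vcpus)
    {
        return 1 + DIV_ROUND_UP(max_vcpus * IOREQ_SIZE, PAGE_SIZE);
    }

    int main(void)
    {
        printf("128 vcpus -> %u frames\n", ioreq_frames(128));  /* 2 */
        printf("129 vcpus -> %u frames\n", ioreq_frames(129));  /* 3 */
        return 0;
    }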

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 13:03       ` Jan Beulich
@ 2021-02-01 14:04         ` Andrew Cooper
  2021-02-01 14:32           ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 14:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Xen-devel, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel,
	Roger Pau Monné

On 01/02/2021 13:03, Jan Beulich wrote:
> On 01.02.2021 12:11, Andrew Cooper wrote:
>> On 01/02/2021 10:10, Roger Pau Monné wrote:
>>> On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
>>>> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>                      compat_frame_list[i] = frame;
>>>>                  }
>>>>  
>>>> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
>>>> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>>>>                                               compat_frame_list,
>>>> -                                             cmp.mar.nr_frames) )
>>>> +                                             done) )
>>>>                      return -EFAULT;
>>> Is it fine to return with a possibly pending continuation already
>>> encoded here?
>>>
>>> Other places seem to crash the domain when that happens.
>> Hmm.  This is all a total mess.  (Elsewhere the error handling is also
>> broken - a caller who receives an error can't figure out how to recover)
>>
>> But yes - I think you're right - the only thing we can do here is `goto
>> crash;` and woe betide any 32bit kernel which passes a pointer to a
>> read-only buffer.
> I'd like to ask you to reconsider the "goto crash", both the one
> you mention above and the other one already present in the patch.
> Wiring all the cases where we mean to crash the guest into a
> single domain_crash() invocation has the downside that when
> observing such a case one can't remotely know which path has led
> there. Therefore I'd like to suggest individual domain_crash()
> invocations on every affected path. Elsewhere in the file there
> already is such an instance, commented "Cannot cancel the
> continuation...".

But they're all logically the same, are they not?

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-02-01 13:18   ` Jan Beulich
@ 2021-02-01 14:22     ` Andrew Cooper
  2021-02-01 14:36       ` Jan Beulich
  0 siblings, 1 reply; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 14:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Michał Leszczyński, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Tamas K Lengyel, Xen-devel

On 01/02/2021 13:18, Jan Beulich wrote:
> On 30.01.2021 03:58, Andrew Cooper wrote:
>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>> +{
>> +    struct domain *d = v->domain;
>> +    struct page_info *pg;
>> +    unsigned int i;
>> +
>> +    if ( !d->vmtrace_size )
>> +        return 0;
>> +
>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>> +                             MEMF_no_refcount);
>> +    if ( !pg )
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Getting the reference counting correct here is hard.
>> +     *
>> +     * All pages are now on the domlist.  They, or subranges within, will be
> "domlist" is too imprecise, as there's no list with this name. It's
> extra_page_list in this case (see also below).
>
>> +     * freed when their reference count drops to zero, which may happen any time
>> +     * between now and the domain teardown path.
>> +     */
>> +
>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>> +            goto refcnt_err;
>> +
>> +    /*
>> +     * We must only let vmtrace_free_buffer() take any action in the success
>> +     * case when we've taken all the refs it intends to drop.
>> +     */
>> +    v->vmtrace.pg = pg;
>> +
>> +    return 0;
>> +
>> + refcnt_err:
>> +    /*
>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>> +     * drop the alloc refs on any remaining pages - some pages could already
>> +     * have been freed behind our backs.
>> +     */
>> +    while ( i-- )
>> +        put_page_and_type(&pg[i]);
>> +
>> +    return -ENODATA;
>> +}
> As said in reply on the other thread, PGC_extra pages don't get
> freed automatically. I too initially thought they would, but
> (re-)learned otherwise when trying to repro your claims on that
> other thread. For all pages you've managed to get the writable
> ref, freeing is easily done by prefixing the loop body above by
> put_page_alloc_ref(). For all other pages best you can do (I
> think; see the debugging patches I had sent on that other
> thread) is to try get_page() - if it succeeds, calling
> put_page_alloc_ref() is allowed. Otherwise we can only leak the
> respective page (unless going to further extents with trying to
> recover from the "impossible"), or assume the failure here was
> because it did get freed already.

Right - I'm going to insist on breaking apart orthogonal issues.

This refcounting issue isn't introduced by this series - this series
uses an established pattern, in which we've found a corner case.

The corner case is theoretical, not practical - it is not possible for a
malicious PV domain to take 2^43 refs on any of the pages in this
allocation.  Doing so would require an hours-long SMI, or equivalent,
and even then all malicious activity would be paused after 1s for the
time calibration rendezvous which would livelock the system until the
watchdog kicked in.


I will drop the comments, because in light of this discovery, they're
not correct.

We should fix the corner case, but that should be a separate patch. 
Whatever we do needs to start by writing down the refcounting rules
first, because it's totally clear that no one understands them, and then
adjusting all examples of this pattern (if necessary).

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op
  2021-02-01 13:00     ` Andrew Cooper
@ 2021-02-01 14:27       ` Roger Pau Monné
  0 siblings, 0 replies; 40+ messages in thread
From: Roger Pau Monné @ 2021-02-01 14:27 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Michał Leszczyński, Jan Beulich, Wei Liu,
	Jun Nakajima, Kevin Tian, Tamas K Lengyel

On Mon, Feb 01, 2021 at 01:00:47PM +0000, Andrew Cooper wrote:
> On 01/02/2021 12:01, Roger Pau Monné wrote:
> > On Sat, Jan 30, 2021 at 02:58:48AM +0000, Andrew Cooper wrote:
> >> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> >> index 12b961113e..a64c4e4177 100644
> >> --- a/xen/arch/x86/hvm/vmx/vmx.c
> >> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >> @@ -2261,6 +2261,157 @@ static bool vmx_get_pending_event(struct vcpu *v, struct x86_event *info)
> >>      return true;
> >>  }
> >>  
> >> +/*
> >> + * We only let vmtrace agents see and modify a subset of bits in MSR_RTIT_CTL.
> >> + * These all pertain to data-emitted into the trace buffer(s).  Must not
> >> + * include controls pertaining to the structure/position of the trace
> >> + * buffer(s).
> >> + */
> >> +#define RTIT_CTL_MASK                                                   \
> >> +    (RTIT_CTL_TRACE_EN | RTIT_CTL_OS | RTIT_CTL_USR | RTIT_CTL_TSC_EN | \
> >> +     RTIT_CTL_DIS_RETC | RTIT_CTL_BRANCH_EN)
> >> +
> >> +/*
> >> + * Status bits restricted to the first-gen subset (i.e. no further CPUID
> >> + * requirements.)
> >> + */
> >> +#define RTIT_STATUS_MASK                                                \
> >> +    (RTIT_STATUS_FILTER_EN | RTIT_STATUS_CONTEXT_EN | RTIT_STATUS_TRIGGER_EN | \
> >> +     RTIT_STATUS_ERROR | RTIT_STATUS_STOPPED)
> >> +
> >> +static int vmtrace_get_option(struct vcpu *v, uint64_t key, uint64_t *output)
> >> +{
> >> +    const struct vcpu_msrs *msrs = v->arch.msrs;
> >> +
> >> +    switch ( key )
> >> +    {
> >> +    case MSR_RTIT_OUTPUT_MASK:
> > Is there any value in returning the raw value of this MSR instead of
> > just using XEN_DOMCTL_vmtrace_output_position?
> 
> Yes, but for interface reasons.
> 
> There are deliberately some common interfaces (for the subset of options
> expected to be useful), and some platform-specific ones (because there's
> no possible way we can encode all of the options in some "common" interface).
> 
> Yes - there is some overlap between the two sets - that is unavoidable
> IMO.  A user of this interface already needs platform specific knowledge
> because it has to interpret the contents of the trace buffer.
> 
> Future extensions to this interface would be setting up the CR3 filter
> and range filters, which definitely shouldn't be common, and can be
> added without new subops in the current model.
> 
> > The size of the buffer should be known to user-space, and then setting
> > the offset could be done by adding a XEN_DOMCTL_vmtrace_set_output_position?
> >
> > Also the contents of this MSR depend on whether ToPA mode is used, and
> > that's not under the control of the guest. So if Xen is switched to
> > use ToPA mode at some point the value of this MSR might not be what a
> > user of the interface expects.
> >
> > From an interface PoV it might be better to offer:
> >
> > XEN_DOMCTL_vmtrace_get_limit
> > XEN_DOMCTL_vmtrace_get_output_position
> > XEN_DOMCTL_vmtrace_set_output_position
> >
> > IMO, as that would be compatible with ToPA if we ever switch to it.
> 
> ToPA is definitely more complicated.  We'd need to stitch the disparate
> buffers back together into one logical view, at which point
> get_output_position becomes more complicated.
> 
> As for set_output_position, that's not useful.  You either want to keep
> the position as-is, or reset back to 0, hence having a platform-neutral
> reset option.
> 
> However, based on this reasoning, I think I should drop access to
> MSR_RTIT_OUTPUT_MASK entirely.  Neither half is useful for userspace to
> access in a platform-specific way, and disallowing access entirely will
> simplify adding ToPA support in the future.

Exactly. Dropping access to MSR_RTIT_OUTPUT_MASK would indeed solve my
concerns. I somehow assumed that setting the offset was needed for the
users of the interface. With that dropped you can add:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource
  2021-02-01 14:04         ` Andrew Cooper
@ 2021-02-01 14:32           ` Jan Beulich
  0 siblings, 0 replies; 40+ messages in thread
From: Jan Beulich @ 2021-02-01 14:32 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, George Dunlap, Ian Jackson, Stefano Stabellini,
	Wei Liu, Julien Grall, Paul Durrant,
	Michał Leszczyński, Hubert Jasudowicz, Tamas K Lengyel,
	Roger Pau Monné

On 01.02.2021 15:04, Andrew Cooper wrote:
> On 01/02/2021 13:03, Jan Beulich wrote:
>> On 01.02.2021 12:11, Andrew Cooper wrote:
>>> On 01/02/2021 10:10, Roger Pau Monné wrote:
>>>> On Sat, Jan 30, 2021 at 02:58:42AM +0000, Andrew Cooper wrote:
>>>>> @@ -636,15 +662,45 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
>>>>>                      compat_frame_list[i] = frame;
>>>>>                  }
>>>>>  
>>>>> -                if ( __copy_to_compat_offset(cmp.mar.frame_list, 0,
>>>>> +                if ( __copy_to_compat_offset(cmp.mar.frame_list, start_extent,
>>>>>                                               compat_frame_list,
>>>>> -                                             cmp.mar.nr_frames) )
>>>>> +                                             done) )
>>>>>                      return -EFAULT;
>>>> Is it fine to return with a possibly pending continuation already
>>>> encoded here?
>>>>
>>>> Other places seem to crash the domain when that happens.
>>> Hmm.  This is all a total mess.  (Elsewhere the error handling is also
>>> broken - a caller who receives an error can't figure out how to recover)
>>>
>>> But yes - I think you're right - the only thing we can do here is `goto
>>> crash;` and woe betide any 32bit kernel which passes a pointer to a
>>> read-only buffer.
>> I'd like to ask you to reconsider the "goto crash", both the one
>> you mention above and the other one already present in the patch.
>> Wiring all the cases where we mean to crash the guest into a
>> single domain_crash() invocation has the downside that when
>> observing such a case one can't remotely know which path has led
>> there. Therefore I'd like to suggest individual domain_crash()
>> invocations on every affected path. Elsewhere in the file there
>> already is such an instance, commented "Cannot cancel the
>> continuation...".
> 
> But they're all logically the same, are they not?

Depends on what "logically the same" here means. To me different
paths and different causes aren't necessarily "the same".

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-02-01 14:22     ` Andrew Cooper
@ 2021-02-01 14:36       ` Jan Beulich
  2021-02-01 22:14         ` Andrew Cooper
  0 siblings, 1 reply; 40+ messages in thread
From: Jan Beulich @ 2021-02-01 14:36 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Michał Leszczyński, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Tamas K Lengyel, Xen-devel

On 01.02.2021 15:22, Andrew Cooper wrote:
> On 01/02/2021 13:18, Jan Beulich wrote:
>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>> +{
>>> +    struct domain *d = v->domain;
>>> +    struct page_info *pg;
>>> +    unsigned int i;
>>> +
>>> +    if ( !d->vmtrace_size )
>>> +        return 0;
>>> +
>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>> +                             MEMF_no_refcount);
>>> +    if ( !pg )
>>> +        return -ENOMEM;
>>> +
>>> +    /*
>>> +     * Getting the reference counting correct here is hard.
>>> +     *
>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>> "domlist" is too imprecise, as there's no list with this name. It's
>> extra_page_list in this case (see also below).
>>
>>> +     * freed when their reference count drops to zero, which may happen any time
>>> +     * between now and the domain teardown path.
>>> +     */
>>> +
>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>> +            goto refcnt_err;
>>> +
>>> +    /*
>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>> +     * case when we've taken all the refs it intends to drop.
>>> +     */
>>> +    v->vmtrace.pg = pg;
>>> +
>>> +    return 0;
>>> +
>>> + refcnt_err:
>>> +    /*
>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>> +     * have been freed behind our backs.
>>> +     */
>>> +    while ( i-- )
>>> +        put_page_and_type(&pg[i]);
>>> +
>>> +    return -ENODATA;
>>> +}
>> As said in reply on the other thread, PGC_extra pages don't get
>> freed automatically. I too initially thought they would, but
>> (re-)learned otherwise when trying to repro your claims on that
>> other thread. For all pages you've managed to get the writable
>> ref, freeing is easily done by prefixing the loop body above by
>> put_page_alloc_ref(). For all other pages best you can do (I
>> think; see the debugging patches I had sent on that other
>> thread) is to try get_page() - if it succeeds, calling
>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>> respective page (unless going to further extents with trying to
>> recover from the "impossible"), or assume the failure here was
>> because it did get freed already.
> 
> Right - I'm going to insist on breaking apart orthogonal issues.
> 
> This refcounting issue isn't introduced by this series - this series
> uses an established pattern, in which we've found a corner case.
> 
> The corner case is theoretical, not practical - it is not possible for a
> malicious PV domain to take 2^43 refs on any of the pages in this
> allocation.  Doing so would require an hours-long SMI, or equivalent,
> and even then all malicious activity would be paused after 1s for the
> time calibration rendezvous which would livelock the system until the
> watchdog kicked in.

Actually an overflow is only one of the possible reasons here.
Another, which may be more "practical", is that another entity
has already managed to free the page (by dropping its alloc-ref,
and of course implying it did guess at the MFN).

Jan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter
  2021-02-01 14:36       ` Jan Beulich
@ 2021-02-01 22:14         ` Andrew Cooper
  0 siblings, 0 replies; 40+ messages in thread
From: Andrew Cooper @ 2021-02-01 22:14 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Michał Leszczyński, Roger Pau Monné,
	Wei Liu, Anthony PERARD, Tamas K Lengyel, Xen-devel

On 01/02/2021 14:36, Jan Beulich wrote:
> On 01.02.2021 15:22, Andrew Cooper wrote:
>> On 01/02/2021 13:18, Jan Beulich wrote:
>>> On 30.01.2021 03:58, Andrew Cooper wrote:
>>>> +static int vmtrace_alloc_buffer(struct vcpu *v)
>>>> +{
>>>> +    struct domain *d = v->domain;
>>>> +    struct page_info *pg;
>>>> +    unsigned int i;
>>>> +
>>>> +    if ( !d->vmtrace_size )
>>>> +        return 0;
>>>> +
>>>> +    pg = alloc_domheap_pages(d, get_order_from_bytes(d->vmtrace_size),
>>>> +                             MEMF_no_refcount);
>>>> +    if ( !pg )
>>>> +        return -ENOMEM;
>>>> +
>>>> +    /*
>>>> +     * Getting the reference counting correct here is hard.
>>>> +     *
>>>> +     * All pages are now on the domlist.  They, or subranges within, will be
>>> "domlist" is too imprecise, as there's no list with this name. It's
>>> extra_page_list in this case (see also below).
>>>
>>>> +     * freed when their reference count drops to zero, which may happen any time
>>>> +     * between now and the domain teardown path.
>>>> +     */
>>>> +
>>>> +    for ( i = 0; i < (d->vmtrace_size >> PAGE_SHIFT); i++ )
>>>> +        if ( unlikely(!get_page_and_type(&pg[i], d, PGT_writable_page)) )
>>>> +            goto refcnt_err;
>>>> +
>>>> +    /*
>>>> +     * We must only let vmtrace_free_buffer() take any action in the success
>>>> +     * case when we've taken all the refs it intends to drop.
>>>> +     */
>>>> +    v->vmtrace.pg = pg;
>>>> +
>>>> +    return 0;
>>>> +
>>>> + refcnt_err:
>>>> +    /*
>>>> +     * In the failure case, we must drop all the acquired typerefs thus far,
>>>> +     * skip vmtrace_free_buffer(), and leave domain_relinquish_resources() to
>>>> +     * drop the alloc refs on any remaining pages - some pages could already
>>>> +     * have been freed behind our backs.
>>>> +     */
>>>> +    while ( i-- )
>>>> +        put_page_and_type(&pg[i]);
>>>> +
>>>> +    return -ENODATA;
>>>> +}
>>> As said in reply on the other thread, PGC_extra pages don't get
>>> freed automatically. I too initially thought they would, but
>>> (re-)learned otherwise when trying to repro your claims on that
>>> other thread. For all pages you've managed to get the writable
>>> ref, freeing is easily done by prefixing the loop body above by
>>> put_page_alloc_ref(). For all other pages best you can do (I
>>> think; see the debugging patches I had sent on that other
>>> thread) is to try get_page() - if it succeeds, calling
>>> put_page_alloc_ref() is allowed. Otherwise we can only leak the
>>> respective page (unless going to further extents with trying to
>>> recover from the "impossible"), or assume the failure here was
>>> because it did get freed already.
>> Right - I'm going to insist on breaking apart orthogonal issues.
>>
>> This refcounting issue isn't introduced by this series - this series
>> uses an established pattern, in which we've found a corner case.
>>
>> The corner case is theoretical, not practical - it is not possible for a
>> malicious PV domain to take 2^43 refs on any of the pages in this
>> allocation.  Doing so would require an hours-long SMI, or equivalent,
>> and even then all malicious activity would be paused after 1s for the
>> time calibration rendezvous which would livelock the system until the
>> watchdog kicked in.
> Actually an overflow is only one of the possible reasons here.
> Another, which may be more "practical", is that another entity
> has already managed to free the page (by dropping its alloc-ref,
> and of course implying it did guess at the MFN).

Yes, but in this case it did get dropped from the extra page list, in
which case looping over the remaining ones in relinquish_resource would
be safe.

~Andrew


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v8 00/16] acquire_resource size and external IPT monitoring
  2021-02-01 14:00       ` Andrew Cooper
@ 2021-02-02 12:09         ` Oleksandr
  0 siblings, 0 replies; 40+ messages in thread
From: Oleksandr @ 2021-02-02 12:09 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Xen-devel, Jan Beulich, Roger Pau Monné,
	Wei Liu, Ian Jackson, Anthony PERARD, Jun Nakajima, Kevin Tian,
	Michał Leszczyński, Tamas K Lengyel


On 01.02.21 16:00, Andrew Cooper wrote:

Hi Andrew

> On 01/02/2021 13:47, Oleksandr wrote:
>> On 01.02.21 15:07, Andrew Cooper wrote:
>>
>> Hi Andrew
>>
>>> On 01/02/2021 12:34, Oleksandr wrote:
>>>> On 30.01.21 04:58, Andrew Cooper wrote:
>>> One query I did leave on IRC, and hasn't had an answer.
>>>
>>> What is the maximum number of vcpus in an ARM guest?
>> public/arch-arm.h says that the currently supported maximum number of
>> guest VCPUs is 128.
>>
>>
>>> You moved an
>>> x86-ism "max 128 vcpus" into common code.
>> Ooh, I am not sure I understand where exactly. Could you please
>> clarify in which patch?
> ioreq_server_get_frame() hardcodes "there is exactly one non-bufioreq
> frame", which in practice means there is 128 vcpu's work of struct
> ioreqs contained within the mapping.
>
> I've coded ioreq_server_max_frames() to perform the calculation
> correctly, but ioreq_server_get_frame() will need fixing by whoever
> supports more than 128 vcpus with ioreq servers first.

Thank you for the explanation. Now it is clear what you meant.

-- 
Regards,

Oleksandr Tyshchenko



^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread

Thread overview: 40+ messages
2021-01-30  2:58 [PATCH v8 00/16] acquire_resource size and external IPT monitoring Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 01/16] xen/memory: Reject out-of-range resource 'frame' values Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 02/16] xen/gnttab: Rework resource acquisition Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 03/16] xen/memory: Fix acquire_resource size semantics Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 04/16] xen/memory: Improve compat XENMEM_acquire_resource handling Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 05/16] xen/memory: Indent part of acquire_resource() Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 06/16] xen/memory: Fix mapping grant tables with XENMEM_acquire_resource Andrew Cooper
2021-02-01 10:10   ` Roger Pau Monné
2021-02-01 11:11     ` Andrew Cooper
2021-02-01 12:07       ` Roger Pau Monné
2021-02-01 12:10         ` Andrew Cooper
2021-02-01 13:03       ` Jan Beulich
2021-02-01 14:04         ` Andrew Cooper
2021-02-01 14:32           ` Jan Beulich
2021-01-30  2:58 ` [PATCH v8 07/16] xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vmtrace Andrew Cooper
2021-02-01 10:32   ` Roger Pau Monné
2021-01-30  2:58 ` [PATCH v8 08/16] xen/domain: Add vmtrace_size domain creation parameter Andrew Cooper
2021-02-01 10:51   ` Roger Pau Monné
2021-02-01 11:19     ` Andrew Cooper
2021-02-01 13:18   ` Jan Beulich
2021-02-01 14:22     ` Andrew Cooper
2021-02-01 14:36       ` Jan Beulich
2021-02-01 22:14         ` Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 09/16] tools/[lib]xl: Add vmtrace_buf_size parameter Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 10/16] xen/memory: Add a vmtrace_buf resource type Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 11/16] x86/vmx: Add Intel Processor Trace support Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 12/16] xen/domctl: Add XEN_DOMCTL_vmtrace_op Andrew Cooper
2021-02-01 12:01   ` Roger Pau Monné
2021-02-01 13:00     ` Andrew Cooper
2021-02-01 14:27       ` Roger Pau Monné
2021-01-30  2:58 ` [PATCH v8 13/16] tools/libxc: Add xc_vmtrace_* functions Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 14/16] tools/misc: Add xen-vmtrace tool Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 15/16] xen/vmtrace: support for VM forks Andrew Cooper
2021-01-30  2:58 ` [PATCH v8 16/16] x86/vm_event: Carry the vmtrace buffer position in vm_event Andrew Cooper
2021-02-01  9:51   ` Jan Beulich
2021-02-01 12:34 ` [PATCH v8 00/16] acquire_resource size and external IPT monitoring Oleksandr
2021-02-01 13:07   ` Andrew Cooper
2021-02-01 13:47     ` Oleksandr
2021-02-01 14:00       ` Andrew Cooper
2021-02-02 12:09         ` Oleksandr
