* [PATCH V6 0/5] Basic guest memory introspection support
@ 2014-09-09 10:28 Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 1/5] xen: Emulate with no writes Razvan Cojocaru
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, stefano.stabellini, eddie.dong,
	ian.jackson, tim, jbeulich, jun.nakajima, andrew.cooper3

We need to be able to detect rootkits in HVM guests, in a way that
allows an application that runs in dom0 (or a similarly privileged
domain) to control what the guest is allowed to do once a threat is
detected. This is done via the mem_event mechanism.

Here is a summary of the series:

a  1/5 xen: Emulate with no writes
a  2/5 xen: Optimize introspection access to guest state
A  3/5 xen, libxc: Force-enable relevant MSR events
A* 4/5 xen, libxc: Request page fault injection via libxc
a  5/5 xen: Handle resumed instruction based on previous mem_event reply

Key to symbols:
 *   Updated in this version of the series.
 +   New patch in this version.
 /   Updated but only to remove changes into a separate patch.
 -   Updated with style changes only.
 a   Acked/reviewed by one reviewer.
 A   Acked/reviewed by more than one reviewer.

This new version adds checks against CR3 == ~0, the only allowed
value in the per-VCPU case. If CR3 == ~0 in the per-domain case,
the trap is injected without checking whether CR3 matches. It also
adds spin_lock-based synchronization for the trap injection data
(patch 4/5).

We needed to (see the usage sketch after this list):

1. Be able to execute the current instruction without allowing it to
write to memory. This is done based on new mem_event response
fields sent from the controlling application.

2. Keep the guest as responsive as possible amid all the processing,
so some guest state is cached and sent along with each mem_event.

3. Not allow the hypervisor to disable sending information about
interesting MSR events.

4. Add an additional mem_event type, namely a VMCALL event: in order
to do its work, our application occasionally triggers VMCALLs in the
guest (not included in the current series, but part of the initial
RFC series).

5. Extend libxc to allow triggering page faults in the guest (in order
to bring back in pages swapped out by the guest OS).


Thanks,
Razvan Cojocaru


* [PATCH V6 1/5] xen: Emulate with no writes
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
@ 2014-09-09 10:28 ` Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 2/5] xen: Optimize introspection access to guest state Razvan Cojocaru
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, Razvan Cojocaru,
	stefano.stabellini, eddie.dong, ian.jackson, tim, jbeulich,
	jun.nakajima, andrew.cooper3

Added support for emulating an instruction with no memory writes, via
hvm_emulate_one_no_write(). Additionally, introduced
hvm_mem_event_emulate_one(), which inspects possible return values
from the hvm_emulate_one() functions (EXCEPTION, UNHANDLEABLE) and
acts on them.
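
To illustrate how the new entry points are meant to be used, here is a
sketch only (the real caller is added in patch 5/5):

    struct hvm_emulate_ctxt ctx = {{ 0 }};
    int rc;

    hvm_emulate_prepare(&ctx, guest_cpu_user_regs());

    /* Same path as hvm_emulate_one(), but every write-side callback
     * (write, cmpxchg, rep_ins/outs/movs, write_io, write_msr,
     * wbinvd) is a stub returning X86EMUL_OKAY without touching
     * guest state. Follow up with hvm_emulate_writeback(&ctx). */
    rc = hvm_emulate_one_no_write(&ctx);

    /* Alternatively, the wrapper handles prepare/writeback itself and
     * acts on UNHANDLEABLE (injecting trapnr/errcode) and EXCEPTION: */
    hvm_mem_event_emulate_one(1 /* nowrite */, TRAP_invalid_op,
                              HVM_DELIVER_NO_ERROR_CODE);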

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/hvm/emulate.c        |  175 ++++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/hvm/emulate.h |    5 ++
 2 files changed, 177 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index 86cf432..6ab06e0 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -690,6 +690,94 @@ static int hvmemul_write(
     return X86EMUL_OKAY;
 }
 
+static int hvmemul_write_discard(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_data,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    /* Discarding the write. */
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_rep_ins_discard(
+    uint16_t src_port,
+    enum x86_segment dst_seg,
+    unsigned long dst_offset,
+    unsigned int bytes_per_rep,
+    unsigned long *reps,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_rep_movs_discard(
+   enum x86_segment src_seg,
+   unsigned long src_offset,
+   enum x86_segment dst_seg,
+   unsigned long dst_offset,
+   unsigned int bytes_per_rep,
+   unsigned long *reps,
+   struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_rep_outs_discard(
+    enum x86_segment src_seg,
+    unsigned long src_offset,
+    uint16_t dst_port,
+    unsigned int bytes_per_rep,
+    unsigned long *reps,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_cmpxchg_discard(
+    enum x86_segment seg,
+    unsigned long offset,
+    void *p_old,
+    void *p_new,
+    unsigned int bytes,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_read_io_discard(
+    unsigned int port,
+    unsigned int bytes,
+    unsigned long *val,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_write_io_discard(
+    unsigned int port,
+    unsigned int bytes,
+    unsigned long val,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_write_msr_discard(
+    unsigned long reg,
+    uint64_t val,
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
+static int hvmemul_wbinvd_discard(
+    struct x86_emulate_ctxt *ctxt)
+{
+    return X86EMUL_OKAY;
+}
+
 static int hvmemul_cmpxchg(
     enum x86_segment seg,
     unsigned long offset,
@@ -1140,8 +1228,33 @@ static const struct x86_emulate_ops hvm_emulate_ops = {
     .invlpg        = hvmemul_invlpg
 };
 
-int hvm_emulate_one(
-    struct hvm_emulate_ctxt *hvmemul_ctxt)
+static const struct x86_emulate_ops hvm_emulate_ops_no_write = {
+    .read          = hvmemul_read,
+    .insn_fetch    = hvmemul_insn_fetch,
+    .write         = hvmemul_write_discard,
+    .cmpxchg       = hvmemul_cmpxchg_discard,
+    .rep_ins       = hvmemul_rep_ins_discard,
+    .rep_outs      = hvmemul_rep_outs_discard,
+    .rep_movs      = hvmemul_rep_movs_discard,
+    .read_segment  = hvmemul_read_segment,
+    .write_segment = hvmemul_write_segment,
+    .read_io       = hvmemul_read_io_discard,
+    .write_io      = hvmemul_write_io_discard,
+    .read_cr       = hvmemul_read_cr,
+    .write_cr      = hvmemul_write_cr,
+    .read_msr      = hvmemul_read_msr,
+    .write_msr     = hvmemul_write_msr_discard,
+    .wbinvd        = hvmemul_wbinvd_discard,
+    .cpuid         = hvmemul_cpuid,
+    .inject_hw_exception = hvmemul_inject_hw_exception,
+    .inject_sw_interrupt = hvmemul_inject_sw_interrupt,
+    .get_fpu       = hvmemul_get_fpu,
+    .put_fpu       = hvmemul_put_fpu,
+    .invlpg        = hvmemul_invlpg
+};
+
+static int _hvm_emulate_one(struct hvm_emulate_ctxt *hvmemul_ctxt,
+    const struct x86_emulate_ops *ops)
 {
     struct cpu_user_regs *regs = hvmemul_ctxt->ctxt.regs;
     struct vcpu *curr = current;
@@ -1193,7 +1306,7 @@ int hvm_emulate_one(
     vio->mmio_retrying = vio->mmio_retry;
     vio->mmio_retry = 0;
 
-    rc = x86_emulate(&hvmemul_ctxt->ctxt, &hvm_emulate_ops);
+    rc = x86_emulate(&hvmemul_ctxt->ctxt, ops);
 
     if ( rc == X86EMUL_OKAY && vio->mmio_retry )
         rc = X86EMUL_RETRY;
@@ -1241,6 +1354,62 @@ int hvm_emulate_one(
     return X86EMUL_OKAY;
 }
 
+int hvm_emulate_one(
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    return _hvm_emulate_one(hvmemul_ctxt, &hvm_emulate_ops);
+}
+
+int hvm_emulate_one_no_write(
+    struct hvm_emulate_ctxt *hvmemul_ctxt)
+{
+    return _hvm_emulate_one(hvmemul_ctxt, &hvm_emulate_ops_no_write);
+}
+
+void hvm_mem_event_emulate_one(bool_t nowrite, unsigned int trapnr,
+    unsigned int errcode)
+{
+    struct hvm_emulate_ctxt ctx = {{ 0 }};
+    int rc;
+
+    hvm_emulate_prepare(&ctx, guest_cpu_user_regs());
+
+    if ( nowrite )
+        rc = hvm_emulate_one_no_write(&ctx);
+    else
+        rc = hvm_emulate_one(&ctx);
+
+    switch ( rc )
+    {
+    case X86EMUL_RETRY:
+        /*
+         * This function is called when handling an EPT-related mem_event
+         * reply. As such, nothing else needs to be done here, since simply
+         * returning makes the current instruction cause a page fault again,
+         * consistent with X86EMUL_RETRY.
+         */
+        return;
+    case X86EMUL_UNHANDLEABLE:
+        gdprintk(XENLOG_DEBUG, "Emulation failed @ %04x:%lx: "
+               "%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
+               hvmemul_get_seg_reg(x86_seg_cs, &ctx)->sel,
+               ctx.insn_buf_eip,
+               ctx.insn_buf[0], ctx.insn_buf[1],
+               ctx.insn_buf[2], ctx.insn_buf[3],
+               ctx.insn_buf[4], ctx.insn_buf[5],
+               ctx.insn_buf[6], ctx.insn_buf[7],
+               ctx.insn_buf[8], ctx.insn_buf[9]);
+        hvm_inject_hw_exception(trapnr, errcode);
+        break;
+    case X86EMUL_EXCEPTION:
+        if ( ctx.exn_pending )
+            hvm_inject_hw_exception(ctx.exn_vector, ctx.exn_error_code);
+        break;
+    }
+
+    hvm_emulate_writeback(&ctx);
+}
+
 void hvm_emulate_prepare(
     struct hvm_emulate_ctxt *hvmemul_ctxt,
     struct cpu_user_regs *regs)
diff --git a/xen/include/asm-x86/hvm/emulate.h b/xen/include/asm-x86/hvm/emulate.h
index 00a06cc..efff97e 100644
--- a/xen/include/asm-x86/hvm/emulate.h
+++ b/xen/include/asm-x86/hvm/emulate.h
@@ -37,6 +37,11 @@ struct hvm_emulate_ctxt {
 
 int hvm_emulate_one(
     struct hvm_emulate_ctxt *hvmemul_ctxt);
+int hvm_emulate_one_no_write(
+    struct hvm_emulate_ctxt *hvmemul_ctxt);
+void hvm_mem_event_emulate_one(bool_t nowrite,
+    unsigned int trapnr,
+    unsigned int errcode);
 void hvm_emulate_prepare(
     struct hvm_emulate_ctxt *hvmemul_ctxt,
     struct cpu_user_regs *regs);
-- 
1.7.9.5


* [PATCH V6 2/5] xen: Optimize introspection access to guest state
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 1/5] xen: Emulate with no writes Razvan Cojocaru
@ 2014-09-09 10:28 ` Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 3/5] xen, libxc: Force-enable relevant MSR events Razvan Cojocaru
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, Razvan Cojocaru,
	stefano.stabellini, eddie.dong, ian.jackson, tim, jbeulich,
	jun.nakajima, andrew.cooper3

Speed optimization for introspection purposes: a handful of registers
are sent along with each mem_event. This requires enlargement of the
mem_event_request / mem_event_response structures, and additional code
to fill in relevant values. Since the EPT event processing code needs
more data than the CR3 or MSR event processors, hvm_mem_event_fill_regs()
fills in less data than p2m_mem_event_fill_regs(), in order to avoid
overhead. Using struct hvm_hw_cpu instead of the custom struct
mem_event_regs_x86 was considered, but its size would fill up the
mem_event ring buffer too quickly.
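
On the consumer side, guest state can now be read straight out of the
request, avoiding a per-event xc_vcpu_getcontext() round-trip. A
sketch, assuming the usual mem_event ring dequeuing code and
<inttypes.h>:

    mem_event_request_t req;

    /* ... dequeue req from the mem_event ring ... */

    /* Registers were captured by Xen at event time. */
    printf("vcpu%" PRIu32 ": rip=%016" PRIx64 " cr3=%016" PRIx64 "\n",
           req.vcpu_id, req.x86_regs.rip, req.x86_regs.cr3);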

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/hvm/hvm.c         |   33 +++++++++++++++++++++++
 xen/arch/x86/mm/p2m.c          |   57 ++++++++++++++++++++++++++++++++++++++++
 xen/include/public/mem_event.h |   39 +++++++++++++++++++++++++++
 3 files changed, 129 insertions(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 8d905d3..bb45593 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -6149,6 +6149,38 @@ int hvm_debug_op(struct vcpu *v, int32_t op)
     return rc;
 }
 
+static void hvm_mem_event_fill_regs(mem_event_request_t *req)
+{
+    const struct cpu_user_regs *regs = guest_cpu_user_regs();
+    const struct vcpu *curr = current;
+
+    req->x86_regs.rax = regs->eax;
+    req->x86_regs.rcx = regs->ecx;
+    req->x86_regs.rdx = regs->edx;
+    req->x86_regs.rbx = regs->ebx;
+    req->x86_regs.rsp = regs->esp;
+    req->x86_regs.rbp = regs->ebp;
+    req->x86_regs.rsi = regs->esi;
+    req->x86_regs.rdi = regs->edi;
+
+    req->x86_regs.r8  = regs->r8;
+    req->x86_regs.r9  = regs->r9;
+    req->x86_regs.r10 = regs->r10;
+    req->x86_regs.r11 = regs->r11;
+    req->x86_regs.r12 = regs->r12;
+    req->x86_regs.r13 = regs->r13;
+    req->x86_regs.r14 = regs->r14;
+    req->x86_regs.r15 = regs->r15;
+
+    req->x86_regs.rflags = regs->eflags;
+    req->x86_regs.rip    = regs->eip;
+
+    req->x86_regs.msr_efer = curr->arch.hvm_vcpu.guest_efer;
+    req->x86_regs.cr0 = curr->arch.hvm_vcpu.guest_cr[0];
+    req->x86_regs.cr3 = curr->arch.hvm_vcpu.guest_cr[3];
+    req->x86_regs.cr4 = curr->arch.hvm_vcpu.guest_cr[4];
+}
+
 static int hvm_memory_event_traps(long p, uint32_t reason,
                                   unsigned long value, unsigned long old, 
                                   bool_t gla_valid, unsigned long gla) 
@@ -6193,6 +6225,7 @@ static int hvm_memory_event_traps(long p, uint32_t reason,
         req.gla = old;
     }
     
+    hvm_mem_event_fill_regs(&req);
     mem_event_put_request(d, &d->mem_event->access, &req);
     
     return 1;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 32776c3..d0962aa 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1327,6 +1327,61 @@ void p2m_mem_paging_resume(struct domain *d)
     }
 }
 
+static void p2m_mem_event_fill_regs(mem_event_request_t *req)
+{
+    const struct cpu_user_regs *regs = guest_cpu_user_regs();
+    struct segment_register seg;
+    struct hvm_hw_cpu ctxt;
+    struct vcpu *curr = current;
+
+    /* Architecture-specific vmcs/vmcb bits */
+    hvm_funcs.save_cpu_ctxt(curr, &ctxt);
+
+    req->x86_regs.rax = regs->eax;
+    req->x86_regs.rcx = regs->ecx;
+    req->x86_regs.rdx = regs->edx;
+    req->x86_regs.rbx = regs->ebx;
+    req->x86_regs.rsp = regs->esp;
+    req->x86_regs.rbp = regs->ebp;
+    req->x86_regs.rsi = regs->esi;
+    req->x86_regs.rdi = regs->edi;
+
+    req->x86_regs.r8  = regs->r8;
+    req->x86_regs.r9  = regs->r9;
+    req->x86_regs.r10 = regs->r10;
+    req->x86_regs.r11 = regs->r11;
+    req->x86_regs.r12 = regs->r12;
+    req->x86_regs.r13 = regs->r13;
+    req->x86_regs.r14 = regs->r14;
+    req->x86_regs.r15 = regs->r15;
+
+    req->x86_regs.rflags = regs->eflags;
+    req->x86_regs.rip    = regs->eip;
+
+    req->x86_regs.dr7 = curr->arch.debugreg[7];
+    req->x86_regs.cr0 = ctxt.cr0;
+    req->x86_regs.cr2 = ctxt.cr2;
+    req->x86_regs.cr3 = ctxt.cr3;
+    req->x86_regs.cr4 = ctxt.cr4;
+
+    req->x86_regs.sysenter_cs = ctxt.sysenter_cs;
+    req->x86_regs.sysenter_esp = ctxt.sysenter_esp;
+    req->x86_regs.sysenter_eip = ctxt.sysenter_eip;
+
+    req->x86_regs.msr_efer = ctxt.msr_efer;
+    req->x86_regs.msr_star = ctxt.msr_star;
+    req->x86_regs.msr_lstar = ctxt.msr_lstar;
+
+    hvm_get_segment_register(curr, x86_seg_fs, &seg);
+    req->x86_regs.fs_base = seg.base;
+
+    hvm_get_segment_register(curr, x86_seg_gs, &seg);
+    req->x86_regs.gs_base = seg.base;
+
+    hvm_get_segment_register(curr, x86_seg_cs, &seg);
+    req->x86_regs.cs_arbytes = seg.attr.bytes;
+}
+
 bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla,
                             struct npfec npfec,
                             mem_event_request_t **req_ptr)
@@ -1417,6 +1472,8 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla,
         req->access_w = npfec.write_access;
         req->access_x = npfec.insn_fetch;
         req->vcpu_id = v->vcpu_id;
+
+        p2m_mem_event_fill_regs(req);
     }
 
     /* Pause the current VCPU */
diff --git a/xen/include/public/mem_event.h b/xen/include/public/mem_event.h
index fc12697..d3dd9c6 100644
--- a/xen/include/public/mem_event.h
+++ b/xen/include/public/mem_event.h
@@ -48,6 +48,44 @@
 #define MEM_EVENT_REASON_MSR         7    /* MSR was hit: gfn is MSR value, gla is MSR address;
                                              does NOT honour HVMPME_onchangeonly */
 
+/* Using a custom struct (not hvm_hw_cpu) so as to not fill
+ * the mem_event ring buffer too quickly. */
+struct mem_event_regs_x86 {
+    uint64_t rax;
+    uint64_t rcx;
+    uint64_t rdx;
+    uint64_t rbx;
+    uint64_t rsp;
+    uint64_t rbp;
+    uint64_t rsi;
+    uint64_t rdi;
+    uint64_t r8;
+    uint64_t r9;
+    uint64_t r10;
+    uint64_t r11;
+    uint64_t r12;
+    uint64_t r13;
+    uint64_t r14;
+    uint64_t r15;
+    uint64_t rflags;
+    uint64_t dr7;
+    uint64_t rip;
+    uint64_t cr0;
+    uint64_t cr2;
+    uint64_t cr3;
+    uint64_t cr4;
+    uint64_t sysenter_cs;
+    uint64_t sysenter_esp;
+    uint64_t sysenter_eip;
+    uint64_t msr_efer;
+    uint64_t msr_star;
+    uint64_t msr_lstar;
+    uint64_t fs_base;
+    uint64_t gs_base;
+    uint32_t cs_arbytes;
+    uint32_t _pad;
+};
+
 typedef struct mem_event_st {
     uint32_t flags;
     uint32_t vcpu_id;
@@ -67,6 +105,7 @@ typedef struct mem_event_st {
     uint16_t available:10;
 
     uint16_t reason;
+    struct mem_event_regs_x86 x86_regs;
 } mem_event_request_t, mem_event_response_t;
 
 DEFINE_RING_TYPES(mem_event, mem_event_request_t, mem_event_response_t);
-- 
1.7.9.5


* [PATCH V6 3/5] xen, libxc: Force-enable relevant MSR events
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 1/5] xen: Emulate with no writes Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 2/5] xen: Optimize introspection access to guest state Razvan Cojocaru
@ 2014-09-09 10:28 ` Razvan Cojocaru
  2014-09-09 10:28 ` [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc Razvan Cojocaru
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, Razvan Cojocaru,
	stefano.stabellini, eddie.dong, ian.jackson, tim, jbeulich,
	jun.nakajima, andrew.cooper3

vmx_disable_intercept_for_msr() will now refuse to disable interception
of MSRs needed for memory introspection. It is not possible to gate this
on mem_access being active for the domain, since by the time mem_access
becomes active, interception for the interesting MSRs has already been
disabled (vmx_disable_intercept_for_msr() runs very early on).
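
With interception force-enabled, a monitoring application can rely on
MSR events for these registers always arriving. Roughly (a sketch;
field semantics as documented in mem_event.h, and handle_msr_write()
is a hypothetical consumer function):

    if ( req.reason == MEM_EVENT_REASON_MSR )
    {
        /* For MSR events, gfn carries the MSR value and gla the MSR
         * address. With introspection enabled, write intercepts for
         * SYSENTER_EIP/ESP/CS, MC0_CTL, STAR and LSTAR can no longer
         * be silently disabled. */
        handle_msr_write(req.gla /* MSR address */,
                         req.gfn /* MSR value */);
    }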

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
---
 tools/libxc/xc_mem_access.c        |   10 +++++++++-
 tools/libxc/xc_mem_event.c         |    7 +++++--
 tools/libxc/xc_private.h           |    2 +-
 tools/libxc/xenctrl.h              |    2 ++
 xen/arch/x86/hvm/vmx/vmcs.c        |   25 +++++++++++++++++++++++++
 xen/arch/x86/hvm/vmx/vmx.c         |   13 +++++++++++++
 xen/arch/x86/mm/mem_event.c        |   11 +++++++++++
 xen/include/asm-x86/hvm/domain.h   |    1 +
 xen/include/asm-x86/hvm/hvm.h      |    2 ++
 xen/include/asm-x86/hvm/vmx/vmcs.h |    7 +++++++
 xen/include/public/domctl.h        |    7 ++++---
 11 files changed, 80 insertions(+), 7 deletions(-)

diff --git a/tools/libxc/xc_mem_access.c b/tools/libxc/xc_mem_access.c
index 461f0e9..55d0e9f 100644
--- a/tools/libxc/xc_mem_access.c
+++ b/tools/libxc/xc_mem_access.c
@@ -26,7 +26,15 @@
 
 void *xc_mem_access_enable(xc_interface *xch, domid_t domain_id, uint32_t *port)
 {
-    return xc_mem_event_enable(xch, domain_id, HVM_PARAM_ACCESS_RING_PFN, port);
+    return xc_mem_event_enable(xch, domain_id, HVM_PARAM_ACCESS_RING_PFN,
+                               port, 0);
+}
+
+void *xc_mem_access_enable_introspection(xc_interface *xch, domid_t domain_id,
+                                         uint32_t *port)
+{
+    return xc_mem_event_enable(xch, domain_id, HVM_PARAM_ACCESS_RING_PFN,
+                               port, 1);
 }
 
 int xc_mem_access_disable(xc_interface *xch, domid_t domain_id)
diff --git a/tools/libxc/xc_mem_event.c b/tools/libxc/xc_mem_event.c
index faf1cc6..8c0be4e 100644
--- a/tools/libxc/xc_mem_event.c
+++ b/tools/libxc/xc_mem_event.c
@@ -57,7 +57,7 @@ int xc_mem_event_memop(xc_interface *xch, domid_t domain_id,
 }
 
 void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
-                          uint32_t *port)
+                          uint32_t *port, int enable_introspection)
 {
     void *ring_page = NULL;
     uint64_t pfn;
@@ -120,7 +120,10 @@ void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
         break;
 
     case HVM_PARAM_ACCESS_RING_PFN:
-        op = XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE;
+        if ( enable_introspection )
+            op = XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE_INTROSPECTION;
+        else
+            op = XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE;
         mode = XEN_DOMCTL_MEM_EVENT_OP_ACCESS;
         break;
 
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index c50a7c9..94df688 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -381,6 +381,6 @@ int xc_mem_event_memop(xc_interface *xch, domid_t domain_id,
  * param can be HVM_PARAM_PAGING/ACCESS/SHARING_RING_PFN
  */
 void *xc_mem_event_enable(xc_interface *xch, domid_t domain_id, int param,
-                          uint32_t *port);
+                          uint32_t *port, int enable_introspection);
 
 #endif /* __XC_PRIVATE_H__ */
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 1c8aa42..514b241 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -2273,6 +2273,8 @@ int xc_mem_paging_load(xc_interface *xch, domid_t domain_id,
  * Caller has to unmap this page when done.
  */
 void *xc_mem_access_enable(xc_interface *xch, domid_t domain_id, uint32_t *port);
+void *xc_mem_access_enable_introspection(xc_interface *xch, domid_t domain_id,
+                                         uint32_t *port);
 int xc_mem_access_disable(xc_interface *xch, domid_t domain_id);
 int xc_mem_access_resume(xc_interface *xch, domid_t domain_id);
 
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 4a4f4e1..fc1f882 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -39,6 +39,7 @@
 #include <xen/keyhandler.h>
 #include <asm/shadow.h>
 #include <asm/tboot.h>
+#include <asm/mem_event.h>
 
 static bool_t __read_mostly opt_vpid_enabled = 1;
 boolean_param("vpid", opt_vpid_enabled);
@@ -71,6 +72,18 @@ u32 vmx_vmexit_control __read_mostly;
 u32 vmx_vmentry_control __read_mostly;
 u64 vmx_ept_vpid_cap __read_mostly;
 
+const u32 vmx_introspection_force_enabled_msrs[] = {
+    MSR_IA32_SYSENTER_EIP,
+    MSR_IA32_SYSENTER_ESP,
+    MSR_IA32_SYSENTER_CS,
+    MSR_IA32_MC0_CTL,
+    MSR_STAR,
+    MSR_LSTAR
+};
+
+const unsigned int vmx_introspection_force_enabled_msrs_size =
+    ARRAY_SIZE(vmx_introspection_force_enabled_msrs);
+
 static DEFINE_PER_CPU_READ_MOSTLY(struct vmcs_struct *, vmxon_region);
 static DEFINE_PER_CPU(struct vmcs_struct *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, active_vmcs_list);
@@ -695,11 +708,23 @@ static void vmx_set_host_env(struct vcpu *v)
 void vmx_disable_intercept_for_msr(struct vcpu *v, u32 msr, int type)
 {
     unsigned long *msr_bitmap = v->arch.hvm_vmx.msr_bitmap;
+    struct domain *d = v->domain;
 
     /* VMX MSR bitmap supported? */
     if ( msr_bitmap == NULL )
         return;
 
+    if ( unlikely(d->arch.hvm_domain.introspection_enabled) &&
+         mem_event_check_ring(&d->mem_event->access) )
+    {
+        unsigned int i;
+
+        /* Filter out MSR-s needed for memory introspection */
+        for ( i = 0; i < vmx_introspection_force_enabled_msrs_size; i++ )
+            if ( msr == vmx_introspection_force_enabled_msrs[i] )
+                return;
+    }
+
     /*
      * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
      * have the write-low and read-high bitmap offsets the wrong way round.
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 26b1ad5..b08664e 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1683,6 +1683,18 @@ void vmx_hypervisor_cpuid_leaf(uint32_t sub_idx,
         *eax |= XEN_HVM_CPUID_X2APIC_VIRT;
 }
 
+static void vmx_enable_msr_exit_interception(struct domain *d)
+{
+    struct vcpu *v;
+    unsigned int i;
+
+    /* Enable interception for MSRs needed for memory introspection. */
+    for_each_vcpu ( d, v )
+        for ( i = 0; i < vmx_introspection_force_enabled_msrs_size; i++ )
+            vmx_enable_intercept_for_msr(v, vmx_introspection_force_enabled_msrs[i],
+                                         MSR_TYPE_W);
+}
+
 static struct hvm_function_table __initdata vmx_function_table = {
     .name                 = "VMX",
     .cpu_up_prepare       = vmx_cpu_up_prepare,
@@ -1741,6 +1753,7 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .handle_eoi           = vmx_handle_eoi,
     .nhvm_hap_walk_L1_p2m = nvmx_hap_walk_L1_p2m,
     .hypervisor_cpuid_leaf = vmx_hypervisor_cpuid_leaf,
+    .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
 };
 
 const struct hvm_function_table * __init start_vmx(void)
diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index ba7e71e..fdd5ff6 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -587,6 +587,7 @@ int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec,
         switch( mec->op )
         {
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE:
+        case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE_INTROSPECTION:
         {
             rc = -ENODEV;
             /* Only HAP is supported */
@@ -600,13 +601,23 @@ int mem_event_domctl(struct domain *d, xen_domctl_mem_event_op_t *mec,
             rc = mem_event_enable(d, mec, med, _VPF_mem_access, 
                                     HVM_PARAM_ACCESS_RING_PFN,
                                     mem_access_notification);
+
+            if ( mec->op != XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE &&
+                 rc == 0 && hvm_funcs.enable_msr_exit_interception )
+            {
+                d->arch.hvm_domain.introspection_enabled = 1;
+                hvm_funcs.enable_msr_exit_interception(d);
+            }
         }
         break;
 
         case XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE:
         {
             if ( med->ring_page )
+            {
                 rc = mem_event_disable(d, med);
+                d->arch.hvm_domain.introspection_enabled = 0;
+            }
         }
         break;
 
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 291a2e0..30d4aa3 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -134,6 +134,7 @@ struct hvm_domain {
     bool_t                 mem_sharing_enabled;
     bool_t                 qemu_mapcache_invalidate;
     bool_t                 is_s3_suspended;
+    bool_t                 introspection_enabled;
 
     /*
      * TSC value that VCPUs use to calculate their tsc_offset value.
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 1123857..121d053 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -205,6 +205,8 @@ struct hvm_function_table {
     void (*hypervisor_cpuid_leaf)(uint32_t sub_idx,
                                   uint32_t *eax, uint32_t *ebx,
                                   uint32_t *ecx, uint32_t *edx);
+
+    void (*enable_msr_exit_interception)(struct domain *d);
 };
 
 extern struct hvm_function_table hvm_funcs;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 215d93c..6a99dca 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -471,6 +471,13 @@ enum vmcs_field {
     HOST_RIP                        = 0x00006c16,
 };
 
+/*
+ * A set of MSR-s that need to be enabled for memory introspection
+ * to work.
+ */
+extern const u32 vmx_introspection_force_enabled_msrs[];
+extern const unsigned int vmx_introspection_force_enabled_msrs_size;
+
 #define VMCS_VPID_WIDTH 16
 
 #define MSR_TYPE_R 1
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 69a8b44..cfa39b3 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -773,10 +773,11 @@ struct xen_domctl_gdbsx_domstatus {
  * ENODEV - host lacks HAP support (EPT/NPT) or HAP is disabled in guest
  * EBUSY  - guest has or had access enabled, ring buffer still active
  */
-#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS            2
+#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS                        2
 
-#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE     0
-#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE    1
+#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE                 0
+#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS_DISABLE                1
+#define XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE_INTROSPECTION   2
 
 /*
  * Sharing ENOMEM helper.
-- 
1.7.9.5


* [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
                   ` (2 preceding siblings ...)
  2014-09-09 10:28 ` [PATCH V6 3/5] xen, libxc: Force-enable relevant MSR events Razvan Cojocaru
@ 2014-09-09 10:28 ` Razvan Cojocaru
  2014-09-10 14:38   ` Jan Beulich
  2014-09-09 10:28 ` [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply Razvan Cojocaru
  2014-09-09 10:34 ` [PATCH V6 0/5] Basic guest memory introspection support Jan Beulich
  5 siblings, 1 reply; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, Razvan Cojocaru,
	stefano.stabellini, eddie.dong, ian.jackson, tim, jbeulich,
	jun.nakajima, andrew.cooper3

Extended HVMOP_inject_trap to allow requesting that the trap be
injected by the first available VCPU found running in user mode with
a CR3 matching that of an interesting application inside the guest.
This mechanism allows bringing swapped-out pages back in for inspection.
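
A hypothetical caller, to make the semantics concrete (xch, domid,
gva and target_cr3 are placeholders):

    /* Ask that a #PF for gva be delivered by the first VCPU found
     * running in user mode with CR3 == target_cr3. vcpu == -1 is the
     * wildcard, and cr3 == ~0 would skip the CR3 check entirely. */
    rc = xc_hvm_inject_trap(xch, domid, -1 /* any vcpu */,
                            14 /* TRAP_page_fault */,
                            HVMOP_TRAP_hw_exc,
                            0x4 /* error code: user, not-present */,
                            0 /* insn_len */, gva /* cr2 */,
                            target_cr3);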

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>

---
Changes since V5:
 - Now synchronizing access to the trap injection data.
 - Only enforcing the CR3 match for cr3 != ~0.
 - Updated the xen-access.c test.

Changes since V4:
 - Allow per-domain trap injection without checking if CR3 matches
   for traps other than page faults.
 - Added comments to explain that trap injection requests are not
   serialized.

Changes since V3:
 - Now returning -EBUSY if the hypercall tries to set per-domain
   injection data when there's a pending injection on any of the
   domain's CPUs.
 - Minor code cleanup.

Changes since V2:
 - Renamed hvm_is_pf_requested() to hvm_can_inject_domain_pf().
 - Modified comments to better reflect the introspection-related
   purpose of added code.
---
 tools/libxc/xc_misc.c               |    5 +-
 tools/libxc/xenctrl.h               |    3 +-
 tools/tests/xen-access/xen-access.c |    2 +-
 xen/arch/x86/hvm/hvm.c              |  111 ++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/hvm/domain.h    |    4 ++
 xen/include/asm-x86/hvm/hvm.h       |    1 +
 xen/include/public/hvm/hvm_op.h     |    7 +++
 7 files changed, 115 insertions(+), 18 deletions(-)

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index e253a58..6773446 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -597,7 +597,7 @@ int xc_hvm_set_mem_type(
 int xc_hvm_inject_trap(
     xc_interface *xch, domid_t dom, int vcpu, uint32_t vector,
     uint32_t type, uint32_t error_code, uint32_t insn_len,
-    uint64_t cr2)
+    uint64_t cr2, uint64_t cr3)
 {
     DECLARE_HYPERCALL;
     DECLARE_HYPERCALL_BUFFER(struct xen_hvm_inject_trap, arg);
@@ -611,12 +611,13 @@ int xc_hvm_inject_trap(
     }
 
     arg->domid       = dom;
-    arg->vcpuid      = vcpu;
+    arg->vcpuid      = (vcpu == -1 ? (uint32_t)~0 : vcpu);
     arg->vector      = vector;
     arg->type        = type;
     arg->error_code  = error_code;
     arg->insn_len    = insn_len;
     arg->cr2         = cr2;
+    arg->cr3         = cr3;
 
     hypercall.op     = __HYPERVISOR_hvm_op;
     hypercall.arg[0] = HVMOP_inject_trap;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 514b241..0a0281c 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -1826,11 +1826,12 @@ int xc_hvm_set_mem_type(
 /*
  * Injects a hardware/software CPU trap, to take effect the next time the HVM 
  * resumes. 
+ * Cr3 is only taken into account if vcpu == -1 (wildcard for "any vcpu").
  */
 int xc_hvm_inject_trap(
     xc_interface *xch, domid_t dom, int vcpu, uint32_t vector,
     uint32_t type, uint32_t error_code, uint32_t insn_len,
-    uint64_t cr2);
+    uint64_t cr2, uint64_t cr3);
 
 /*
  *  LOGGING AND ERROR REPORTING
diff --git a/tools/tests/xen-access/xen-access.c b/tools/tests/xen-access/xen-access.c
index 6cb382d..87f9e00 100644
--- a/tools/tests/xen-access/xen-access.c
+++ b/tools/tests/xen-access/xen-access.c
@@ -604,7 +604,7 @@ int main(int argc, char *argv[])
                 /* Reinject */
                 rc = xc_hvm_inject_trap(
                     xch, domain_id, req.vcpu_id, 3,
-                    HVMOP_TRAP_sw_exc, -1, 0, 0);
+                    HVMOP_TRAP_sw_exc, -1, 0, 0, ~0);
                 if (rc < 0)
                 {
                     ERROR("Error %d injecting int3\n", rc);
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bb45593..4cc1160 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -420,6 +420,31 @@ static bool_t hvm_wait_for_io(struct hvm_ioreq_vcpu *sv, ioreq_t *p)
     return 1;
 }
 
+static bool_t hvm_can_inject_domain_pf(struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    struct segment_register seg;
+    uint64_t mask;
+
+    hvm_get_segment_register(v, x86_seg_ss, &seg);
+
+    if ( seg.attr.fields.dpl != 3 ) /* Guest is not in user mode */
+        return 0;
+
+    if ( hvm_long_mode_enabled(v) )
+        mask = PADDR_MASK & PAGE_MASK; /* Bits 51:12. */
+    else if ( hvm_pae_enabled(v) )
+        mask = 0x00000000ffffffe0; /* Bits 31:5. */
+    else
+        mask = (uint32_t)PAGE_MASK; /* Bits 31:12. */
+
+    if ( (v->arch.hvm_vcpu.guest_cr[3] & mask) !=
+         (d->arch.hvm_domain.inject_trap.cr3 & mask) )
+        return 0;
+
+    return 1;
+}
+
 void hvm_do_resume(struct vcpu *v)
 {
     struct domain *d = v->domain;
@@ -450,12 +475,30 @@ void hvm_do_resume(struct vcpu *v)
         }
     }
 
+    spin_lock(&d->arch.hvm_domain.inject_trap_lock);
+
     /* Inject pending hw/sw trap */
-    if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ) 
+    if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
     {
         hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
         v->arch.hvm_vcpu.inject_trap.vector = -1;
     }
+    /*
+     * Inject per-domain pending hw/sw trap (this will most likely
+     * be a page fault injected by memory introspection code).
+     */
+    else if ( d->arch.hvm_domain.inject_trap.vector != -1 )
+    {
+        if ( d->arch.hvm_domain.inject_trap.vector != TRAP_page_fault ||
+             d->arch.hvm_domain.inject_trap.cr3 == ~0 ||
+             hvm_can_inject_domain_pf(v) )
+        {
+            hvm_inject_trap(&d->arch.hvm_domain.inject_trap);
+            d->arch.hvm_domain.inject_trap.vector = -1;
+        }
+    }
+
+    spin_unlock(&d->arch.hvm_domain.inject_trap_lock);
 }
 
 static int hvm_alloc_ioreq_gmfn(struct domain *d, unsigned long *gmfn)
@@ -1473,9 +1516,11 @@ int hvm_domain_initialise(struct domain *d)
             printk(XENLOG_G_INFO "PVH guest must have HAP on\n");
             return -EINVAL;
         }
-
     }
 
+    d->arch.hvm_domain.inject_trap.vector = -1;
+    spin_lock_init(&d->arch.hvm_domain.inject_trap_lock);
+
     spin_lock_init(&d->arch.hvm_domain.ioreq_server.lock);
     INIT_LIST_HEAD(&d->arch.hvm_domain.ioreq_server.list);
     spin_lock_init(&d->arch.hvm_domain.irq_lock);
@@ -6086,20 +6131,58 @@ long do_hvm_op(unsigned long op, XEN_GUEST_HANDLE_PARAM(void) arg)
             goto param_fail8;
 
         rc = -ENOENT;
-        if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
-            goto param_fail8;
-        
-        if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
-            rc = -EBUSY;
-        else 
+
+        spin_lock(&d->arch.hvm_domain.inject_trap_lock);
+
+        if ( tr.vcpuid == (uint32_t)~0 ) /* Any VCPU. */
         {
-            v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
-            v->arch.hvm_vcpu.inject_trap.type = tr.type;
-            v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
-            v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
-            v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
-            rc = 0;
+            unsigned int i;
+
+            for ( i = 0; i < d->max_vcpus; i++ )
+                if ( (v = d->vcpu[i]) != NULL &&
+                     v->arch.hvm_vcpu.inject_trap.vector != -1 )
+                {
+                    rc = -EBUSY;
+                    break;
+                }
+
+            if ( d->arch.hvm_domain.inject_trap.vector != -1 )
+                rc = -EBUSY;
+
+            if ( rc != -EBUSY )
+            {
+                d->arch.hvm_domain.inject_trap.vector = tr.vector;
+                d->arch.hvm_domain.inject_trap.type = tr.type;
+                d->arch.hvm_domain.inject_trap.error_code = tr.error_code;
+                d->arch.hvm_domain.inject_trap.insn_len = tr.insn_len;
+                d->arch.hvm_domain.inject_trap.cr2 = tr.cr2;
+                d->arch.hvm_domain.inject_trap.cr3 = tr.cr3;
+                rc = 0;
+            }
         }
+        else /* Specific VCPU. */
+        {
+            if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
+                goto param_fail8;
+
+            if ( v->arch.hvm_vcpu.inject_trap.cr3 != ~0 )
+                rc = -EOPNOTSUPP;
+            if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ||
+                 d->arch.hvm_domain.inject_trap.vector != -1 )
+                rc = -EBUSY;
+            else
+            {
+                v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
+                v->arch.hvm_vcpu.inject_trap.type = tr.type;
+                v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
+                v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
+                v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
+                v->arch.hvm_vcpu.inject_trap.cr3 = tr.cr3;
+                rc = 0;
+            }
+        }
+
+        spin_unlock(&d->arch.hvm_domain.inject_trap_lock);
 
     param_fail8:
         rcu_unlock_domain(d);
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index 30d4aa3..8ab6a38 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -146,6 +146,10 @@ struct hvm_domain {
         struct vmx_domain vmx;
         struct svm_domain svm;
     };
+
+    /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
+    struct hvm_trap     inject_trap;
+    spinlock_t          inject_trap_lock;
 };
 
 #define hap_enabled(d)  ((d)->arch.hvm_domain.hap_enabled)
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 121d053..3b0bde9 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -78,6 +78,7 @@ struct hvm_trap {
     int           error_code;   /* HVM_DELIVER_NO_ERROR_CODE if n/a */
     int           insn_len;     /* Instruction length */ 
     unsigned long cr2;          /* Only for TRAP_page_fault h/w exception */
+    unsigned long cr3;          /* Only for TRAP_page_fault h/w exception */
 };
 
 /*
diff --git a/xen/include/public/hvm/hvm_op.h b/xen/include/public/hvm/hvm_op.h
index eeb0a60..eb018d4 100644
--- a/xen/include/public/hvm/hvm_op.h
+++ b/xen/include/public/hvm/hvm_op.h
@@ -197,6 +197,13 @@ struct xen_hvm_inject_trap {
     uint32_t insn_len;
     /* CR2 for page faults */
     uint64_aligned_t cr2;
+    /*
+     * Currently only used if vcpuid == ~0 (wildcard for any VCPU).
+     * In that case, injection data is set per-domain, and any VCPU
+     * running will inject the trap. For TRAP_page_fault, the CR3
+     * of the currently running process will have to match cr3.
+     */
+    uint64_aligned_t cr3;
 };
 typedef struct xen_hvm_inject_trap xen_hvm_inject_trap_t;
 DEFINE_XEN_GUEST_HANDLE(xen_hvm_inject_trap_t);
-- 
1.7.9.5


* [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
                   ` (3 preceding siblings ...)
  2014-09-09 10:28 ` [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc Razvan Cojocaru
@ 2014-09-09 10:28 ` Razvan Cojocaru
  2014-09-10 16:03   ` George Dunlap
  2014-09-09 10:34 ` [PATCH V6 0/5] Basic guest memory introspection support Jan Beulich
  5 siblings, 1 reply; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:28 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, ian.campbell, Razvan Cojocaru,
	stefano.stabellini, eddie.dong, ian.jackson, tim, jbeulich,
	jun.nakajima, andrew.cooper3

In a scenario where a page fault occurred that triggered a mem_event,
p2m_mem_access_check() will now be able to either 1) emulate the
current instruction, or 2) emulate it, but without allowing it to
perform any writes.
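
From the monitoring application's side, the choice is expressed in the
mem_event reply, roughly like this (a sketch; note that
MEM_EVENT_FLAG_EMULATE must also be set for the NOWRITE variant to
take effect, since p2m_mem_access_resume() keys off the former):

    mem_event_response_t rsp;

    memset(&rsp, 0, sizeof(rsp));
    rsp.vcpu_id = req.vcpu_id;
    rsp.gfn     = req.gfn;

    /* Resume by emulating the faulting instruction, discarding any
     * writes it would have performed. */
    rsp.flags = MEM_EVENT_FLAG_VCPU_PAUSED | MEM_EVENT_FLAG_EMULATE |
                MEM_EVENT_FLAG_EMULATE_NOWRITE;

    /* ... put rsp on the ring and kick the event channel ... */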

Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>

---
Changes since V4:
 - Removed vmx_exited_by_nested_pagefault() (now using npfec.kind).
---
 xen/arch/x86/domain.c          |    3 ++
 xen/arch/x86/mm/p2m.c          |   83 ++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/domain.h   |    9 +++++
 xen/include/public/mem_event.h |   12 +++---
 4 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index f7e0e78..7b1dfe6 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -415,6 +415,9 @@ int vcpu_initialise(struct vcpu *v)
 
     v->arch.flags = TF_kernel_mode;
 
+    /* By default, do not emulate */
+    v->arch.mem_event.emulate_flags = 0;
+
     rc = mapcache_vcpu_init(v);
     if ( rc )
         return rc;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index d0962aa..c9ede2b 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1395,6 +1395,7 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla,
     p2m_access_t p2ma;
     mem_event_request_t *req;
     int rc;
+    unsigned long eip = guest_cpu_user_regs()->eip;
 
     /* First, handle rx2rw conversion automatically.
      * These calls to p2m->set_entry() must succeed: we have the gfn
@@ -1447,6 +1448,35 @@ bool_t p2m_mem_access_check(paddr_t gpa, unsigned long gla,
             return 1;
         }
     }
+    else if ( v->arch.mem_event.emulate_flags == 0 &&
+              npfec.kind != npfec_kind_with_gla ) /* don't send a mem_event */
+    {
+        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
+        v->arch.mem_event.gpa = gpa;
+        v->arch.mem_event.eip = eip;
+    }
+
+    /* The previous mem_event reply does not match the current state. */
+    if ( v->arch.mem_event.gpa != gpa || v->arch.mem_event.eip != eip )
+    {
+        /* Don't emulate the current instruction, send a new mem_event. */
+        v->arch.mem_event.emulate_flags = 0;
+
+        /* Make sure to mark the current state to match it again against
+         * the new mem_event about to be sent. */
+        v->arch.mem_event.gpa = gpa;
+        v->arch.mem_event.eip = eip;
+    }
+
+    if ( v->arch.mem_event.emulate_flags )
+    {
+        hvm_mem_event_emulate_one((v->arch.mem_event.emulate_flags &
+                                   MEM_EVENT_FLAG_EMULATE_NOWRITE) != 0,
+                                  TRAP_invalid_op, HVM_DELIVER_NO_ERROR_CODE);
+
+        v->arch.mem_event.emulate_flags = 0;
+        return 1;
+    }
 
     *req_ptr = NULL;
     req = xzalloc(mem_event_request_t);
@@ -1502,6 +1532,59 @@ void p2m_mem_access_resume(struct domain *d)
 
         v = d->vcpu[rsp.vcpu_id];
 
+        /* Mark vcpu for skipping one instruction upon rescheduling */
+        if ( rsp.flags & MEM_EVENT_FLAG_EMULATE )
+        {
+            xenmem_access_t access;
+            bool_t violation = 1;
+
+            v->arch.mem_event.emulate_flags = 0;
+
+            if ( p2m_get_mem_access(d, rsp.gfn, &access) == 0 )
+            {
+                switch ( access )
+                {
+                case XENMEM_access_n:
+                case XENMEM_access_n2rwx:
+                default:
+                    violation = rsp.access_r || rsp.access_w || rsp.access_x;
+                    break;
+
+                case XENMEM_access_r:
+                    violation = rsp.access_w || rsp.access_x;
+                    break;
+
+                case XENMEM_access_w:
+                    violation = rsp.access_r || rsp.access_x;
+                    break;
+
+                case XENMEM_access_x:
+                    violation = rsp.access_r || rsp.access_w;
+                    break;
+
+                case XENMEM_access_rx:
+                case XENMEM_access_rx2rw:
+                    violation = rsp.access_w;
+                    break;
+
+                case XENMEM_access_wx:
+                    violation = rsp.access_r;
+                    break;
+
+                case XENMEM_access_rw:
+                    violation = rsp.access_x;
+                    break;
+
+                case XENMEM_access_rwx:
+                    violation = 0;
+                    break;
+                }
+            }
+
+            if ( violation )
+                v->arch.mem_event.emulate_flags = rsp.flags;
+        }
+
         /* Unpause domain */
         if ( rsp.flags & MEM_EVENT_FLAG_VCPU_PAUSED )
             mem_event_vcpu_unpause(v);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 83329ed..440aa81 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -458,6 +458,15 @@ struct arch_vcpu
 
     /* A secondary copy of the vcpu time info. */
     XEN_GUEST_HANDLE(vcpu_time_info_t) time_info_guest;
+
+    /* Should we emulate the next matching instruction on VCPU resume
+     * after a mem_event? */
+    struct {
+        uint32_t emulate_flags;
+        unsigned long gpa;
+        unsigned long eip;
+    } mem_event;
+
 } __cacheline_aligned;
 
 smap_check_policy_t smap_policy_change(struct vcpu *v,
diff --git a/xen/include/public/mem_event.h b/xen/include/public/mem_event.h
index d3dd9c6..92c063c 100644
--- a/xen/include/public/mem_event.h
+++ b/xen/include/public/mem_event.h
@@ -31,11 +31,13 @@
 #include "io/ring.h"
 
 /* Memory event flags */
-#define MEM_EVENT_FLAG_VCPU_PAUSED  (1 << 0)
-#define MEM_EVENT_FLAG_DROP_PAGE    (1 << 1)
-#define MEM_EVENT_FLAG_EVICT_FAIL   (1 << 2)
-#define MEM_EVENT_FLAG_FOREIGN      (1 << 3)
-#define MEM_EVENT_FLAG_DUMMY        (1 << 4)
+#define MEM_EVENT_FLAG_VCPU_PAUSED     (1 << 0)
+#define MEM_EVENT_FLAG_DROP_PAGE       (1 << 1)
+#define MEM_EVENT_FLAG_EVICT_FAIL      (1 << 2)
+#define MEM_EVENT_FLAG_FOREIGN         (1 << 3)
+#define MEM_EVENT_FLAG_DUMMY           (1 << 4)
+#define MEM_EVENT_FLAG_EMULATE         (1 << 5)
+#define MEM_EVENT_FLAG_EMULATE_NOWRITE (1 << 6)
 
 /* Reasons for the memory event request */
 #define MEM_EVENT_REASON_UNKNOWN     0    /* typical reason */
-- 
1.7.9.5


* Re: [PATCH V6 0/5] Basic guest memory introspection support
  2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
                   ` (4 preceding siblings ...)
  2014-09-09 10:28 ` [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply Razvan Cojocaru
@ 2014-09-09 10:34 ` Jan Beulich
  2014-09-09 10:39   ` Razvan Cojocaru
  5 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2014-09-09 10:34 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: kevin.tian, keir, ian.campbell, stefano.stabellini,
	andrew.cooper3, eddie.dong, tim, jun.nakajima, xen-devel,
	ian.jackson

>>> On 09.09.14 at 12:28, <rcojocaru@bitdefender.com> wrote:

So v5 came only 3 hours ago. I think you need to really consider
whether at some point people like me are going to discard the
flood you're producing as spam. Please allow some time between
revisions, even more so with others also wanting to get their
stuff rushed in (and hence looked at).

Jan


* Re: [PATCH V6 0/5] Basic guest memory introspection support
  2014-09-09 10:34 ` [PATCH V6 0/5] Basic guest memory introspection support Jan Beulich
@ 2014-09-09 10:39   ` Razvan Cojocaru
  0 siblings, 0 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-09 10:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: kevin.tian, keir, ian.campbell, stefano.stabellini,
	andrew.cooper3, eddie.dong, tim, jun.nakajima, xen-devel,
	ian.jackson

On 09/09/2014 01:34 PM, Jan Beulich wrote:
>>>> On 09.09.14 at 12:28, <rcojocaru@bitdefender.com> wrote:
> 
> So v5 came only 3 hours ago. I think you need to really consider
> whether at some point people like me are going to discard the
> flood you're producing as spam. Please allow some time between
> revisions, even more so with others also wanting to get their
> stuff rushed in (and hence looked at).

I'm sorry, certainly there was no intention to spam the list. Just
thought it would help the process to move as fast as possible, but you
obviously make a good point.


Thanks,
Razvan Cojocaru


* Re: [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc
  2014-09-09 10:28 ` [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc Razvan Cojocaru
@ 2014-09-10 14:38   ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2014-09-10 14:38 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: kevin.tian, keir, ian.campbell, stefano.stabellini,
	andrew.cooper3, eddie.dong, tim, jun.nakajima, xen-devel,
	ian.jackson

>>> On 09.09.14 at 12:28, <rcojocaru@bitdefender.com> wrote:
> Changes since V5:
>  - Now synchronizing access to the trap injection data.
>  - Only enforcing the CR3 match for cr3 != ~0.
>  - Updated the xen-access.c test.

Leaving aside the question of whether this is the right mechanism,
still a couple of comments:

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -420,6 +420,31 @@ static bool_t hvm_wait_for_io(struct hvm_ioreq_vcpu *sv, ioreq_t *p)
>      return 1;
>  }
>  
> +static bool_t hvm_can_inject_domain_pf(struct vcpu *v)
> +{
> +    const struct domain *d = v->domain;
> +    struct segment_register seg;
> +    uint64_t mask;
> +
> +    hvm_get_segment_register(v, x86_seg_ss, &seg);
> +
> +    if ( seg.attr.fields.dpl != 3 ) /* Guest is not in user mode */
> +        return 0;
> +
> +    if ( hvm_long_mode_enabled(v) )
> +        mask = PADDR_MASK & PAGE_MASK; /* Bits 51:12. */
> +    else if ( hvm_pae_enabled(v) )
> +        mask = 0x00000000ffffffe0; /* Bits 31:5. */
> +    else
> +        mask = (uint32_t)PAGE_MASK; /* Bits 31:12. */
> +
> +    if ( (v->arch.hvm_vcpu.guest_cr[3] & mask) !=
> +         (d->arch.hvm_domain.inject_trap.cr3 & mask) )

If you take off the masking here ...

> @@ -450,12 +475,30 @@ void hvm_do_resume(struct vcpu *v)
>          }
>      }
>  
> +    spin_lock(&d->arch.hvm_domain.inject_trap_lock);
> +
>      /* Inject pending hw/sw trap */
> -    if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ) 
> +    if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
>      {
>          hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
>          v->arch.hvm_vcpu.inject_trap.vector = -1;
>      }
> +    /*
> +     * Inject per-domain pending hw/sw trap (this will most likely
> +     * be a page fault injected by memory introspection code).
> +     */
> +    else if ( d->arch.hvm_domain.inject_trap.vector != -1 )
> +    {
> +        if ( d->arch.hvm_domain.inject_trap.vector != TRAP_page_fault ||
> +             d->arch.hvm_domain.inject_trap.cr3 == ~0 ||

... you can avoid this check (which is suspicious anyway due to the
hard coded number - it'd be better if that value got a manifest
constant assigned in the public header).

> @@ -1473,9 +1516,11 @@ int hvm_domain_initialise(struct domain *d)
>              printk(XENLOG_G_INFO "PVH guest must have HAP on\n");
>              return -EINVAL;
>          }
> -
>      }
>  
> +    d->arch.hvm_domain.inject_trap.vector = -1;
> +    spin_lock_init(&d->arch.hvm_domain.inject_trap_lock);
> +
>      spin_lock_init(&d->arch.hvm_domain.ioreq_server.lock);
>      INIT_LIST_HEAD(&d->arch.hvm_domain.ioreq_server.list);
>      spin_lock_init(&d->arch.hvm_domain.irq_lock);
> @@ -6086,20 +6131,58 @@ long do_hvm_op(unsigned long op, 
> XEN_GUEST_HANDLE_PARAM(void) arg)
>              goto param_fail8;
>  
>          rc = -ENOENT;
> -        if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
> -            goto param_fail8;
> -        
> -        if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
> -            rc = -EBUSY;
> -        else 
> +
> +        spin_lock(&d->arch.hvm_domain.inject_trap_lock);
> +
> +        if ( tr.vcpuid == (uint32_t)~0 ) /* Any VCPU. */
>          {
> -            v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
> -            v->arch.hvm_vcpu.inject_trap.type = tr.type;
> -            v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
> -            v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
> -            v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
> -            rc = 0;
> +            unsigned int i;
> +
> +            for ( i = 0; i < d->max_vcpus; i++ )
> +                if ( (v = d->vcpu[i]) != NULL &&
> +                     v->arch.hvm_vcpu.inject_trap.vector != -1 )
> +                {
> +                    rc = -EBUSY;
> +                    break;
> +                }
> +
> +            if ( d->arch.hvm_domain.inject_trap.vector != -1 )
> +                rc = -EBUSY;
> +
> +            if ( rc != -EBUSY )
> +            {
> +                d->arch.hvm_domain.inject_trap.vector = tr.vector;
> +                d->arch.hvm_domain.inject_trap.type = tr.type;
> +                d->arch.hvm_domain.inject_trap.error_code = tr.error_code;
> +                d->arch.hvm_domain.inject_trap.insn_len = tr.insn_len;
> +                d->arch.hvm_domain.inject_trap.cr2 = tr.cr2;
> +                d->arch.hvm_domain.inject_trap.cr3 = tr.cr3;
> +                rc = 0;
> +            }
>          }
> +        else /* Specific VCPU. */
> +        {
> +            if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
> +                goto param_fail8;

You can't do this anymore with a spin lock held.

> +
> +            if ( v->arch.hvm_vcpu.inject_trap.cr3 != ~0 )
> +                rc = -EOPNOTSUPP;
> +            if ( v->arch.hvm_vcpu.inject_trap.vector != -1 ||

This ought to be "else if".

> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -146,6 +146,10 @@ struct hvm_domain {
>          struct vmx_domain vmx;
>          struct svm_domain svm;
>      };
> +
> +    /* Pending hw/sw interrupt (.vector = -1 means nothing pending). */
> +    struct hvm_trap     inject_trap;
> +    spinlock_t          inject_trap_lock;

Are you sure you want a spin lock here, not an rw one?

Jan


* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-09 10:28 ` [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply Razvan Cojocaru
@ 2014-09-10 16:03   ` George Dunlap
  2014-09-10 16:12     ` Razvan Cojocaru
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2014-09-10 16:03 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Ian Jackson, Dong, Eddie, Tim Deegan, Jan Beulich,
	Andrew Cooper, xen-devel

On Tue, Sep 9, 2014 at 11:28 AM, Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
> In a scenario where a page fault that triggered a mem_event occurred,
> p2m_mem_access_check() will now be able to either 1) emulate the
> current instruction, or 2) emulate it, but don't allow it to perform
> any writes.
>
> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Acked-by: Kevin Tian <kevin.tian@intel.com>
[snip]

> +    else if ( v->arch.mem_event.emulate_flags == 0 &&
> +              npfec.kind != npfec_kind_with_gla ) /* don't send a mem_event */
> +    {
> +        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
> +        v->arch.mem_event.gpa = gpa;
> +        v->arch.mem_event.eip = eip;
> +    }

It looks like, if the previous if() is true, it will never get to
this point (because it will either return 0 or 1 depending on whether
p2m->access_required is set).  So you don't need to make this an
"else" here -- you should just add a blank line and make this a normal
if().
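
I.e. something like (sketch; the first block is the pre-existing check):

    if ( /* previous condition */ )
        return rc;   /* 0 or 1, depending on p2m->access_required */

    if ( v->arch.mem_event.emulate_flags == 0 &&
         npfec.kind != npfec_kind_with_gla )
    {
        /* ... set emulate_flags, gpa and eip as in the patch ... */
    }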

Also, maybe it's just because I'm not familiar with the mem_event
interface, but I don't really see what this code is doing.  It seems
to be changing the behavior even for clients that aren't using
MEM_EVENT_FLAG_EMULATE*.  Is that intended?  In any case it seems like
there could be a better comment here.

 -George

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-10 16:03   ` George Dunlap
@ 2014-09-10 16:12     ` Razvan Cojocaru
  2014-09-10 16:38       ` George Dunlap
  0 siblings, 1 reply; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-10 16:12 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini, Dong,
	Eddie, Ian Jackson, Tim Deegan, Jan Beulich, Jun Nakajima,
	Andrew Cooper, xen-devel

On 09/10/2014 07:03 PM, George Dunlap wrote:
> On Tue, Sep 9, 2014 at 11:28 AM, Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
>> In a scenario where a page fault that triggered a mem_event occurred,
>> p2m_mem_access_check() will now be able to either 1) emulate the
>> current instruction, or 2) emulate it, but don't allow it to perform
>> any writes.
>>
>> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
>> Acked-by: Kevin Tian <kevin.tian@intel.com>
> [snip]
> 
>> +    else if ( v->arch.mem_event.emulate_flags == 0 &&
>> +              npfec.kind != npfec_kind_with_gla ) /* don't send a mem_event */
>> +    {
>> +        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
>> +        v->arch.mem_event.gpa = gpa;
>> +        v->arch.mem_event.eip = eip;
>> +    }
> 
> It looks like, if the previous if() is true, it will never get to
> this point (because it will either return 0 or 1 depending on whether
> p2m->access_required is set).  So you don't need to make this an
> "else" here -- you should just add a blank line and make this a normal
> if().
> 
> Also, maybe it's just because I'm not familiar with the mem_event
> interface, but I don't really see what this code is doing.  It seems
> to be changing the behavior even for clients that aren't using
> MEM_EVENT_FLAG_EMULATE*.  Is that intended?  In any case it seems like
> there could be a better comment here.

Thanks, those are very good points. I'll make that a regular if(), and
test also if introspection monitoring is enabled (please see patch 3/5:
d->arch.hvm_domain.introspection_enabled) before setting the emulate
flag, that way we won't alter the behaviour for other clients.
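
Roughly (a sketch of the reworked condition; d being v->domain):

    if ( v->arch.mem_event.emulate_flags == 0 &&
         d->arch.hvm_domain.introspection_enabled &&
         npfec.kind != npfec_kind_with_gla )
    {
        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
        v->arch.mem_event.gpa = gpa;
        v->arch.mem_event.eip = eip;
    }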

As for the previous if, I think that if it holds then it won't be
possible to send a mem_event anyway, hence the else.


Thanks,
Razvan Cojocaru

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-10 16:12     ` Razvan Cojocaru
@ 2014-09-10 16:38       ` George Dunlap
       [not found]         ` <CAJu=L59+C7+byC0UJVcriERf-cueiz8CYcCeBOZfXX8ZLjTBRQ@mail.gmail.com>
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2014-09-10 16:38 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini, Dong,
	Eddie, Andres Lagar-Cavilla, Ian Jackson, Tim Deegan,
	Jan Beulich, Jun Nakajima, Andrew Cooper, xen-devel

On Wed, Sep 10, 2014 at 5:12 PM, Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
> On 09/10/2014 07:03 PM, George Dunlap wrote:
>> On Tue, Sep 9, 2014 at 11:28 AM, Razvan Cojocaru
>> <rcojocaru@bitdefender.com> wrote:
>>> In a scenario where a page fault that triggered a mem_event occurred,
>>> p2m_mem_access_check() will now be able to either 1) emulate the
>>> current instruction, or 2) emulate it, but don't allow it to perform
>>> any writes.
>>>
>>> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
>>> Acked-by: Kevin Tian <kevin.tian@intel.com>
>> [snip]
>>
>>> +    else if ( v->arch.mem_event.emulate_flags == 0 &&
>>> +              npfec.kind != npfec_kind_with_gla ) /* don't send a mem_event */
>>> +    {
>>> +        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
>>> +        v->arch.mem_event.gpa = gpa;
>>> +        v->arch.mem_event.eip = eip;
>>> +    }
>>
>> It looks like, if the previous if() is true, it will never get to
>> this point (because it will either return 0 or 1 depending on whether
>> p2m->access_required is set).  So you don't need to make this an
>> "else" here -- you should just add a blank line and make this a normal
>> if().
>>
>> Also, maybe it's just because I'm not familiar with the mem_event
>> interface, but I don't really see what this code is doing.  It seems
>> to be changing the behavior even for clients that aren't using
>> MEM_EVENT_FLAG_EMULATE*.  Is that intended?  In any case it seems like
>> there could be a better comment here.
>
> Thanks, those are very good points. I'll make that a regular if(), and
> test also if introspection monitoring is enabled (please see patch 3/5:
> d->arch.hvm_domain.introspection_enabled) before setting the emulate
> flag, that way we won't alter the behaviour for other clients.

...and you should also put a comment there explaining why someone with
introspection enabled wouldn't want an event here (something I'm still
not clear on).

Are you *sure* that everyone who enables introspection will want that
event suppressed (not just you), and that no one else will?
Otherwise, it might make more sense to add some kind of flag to enable
or disable it, rather than gating it on introspection.  Or it's
possible everyone actually does want that event suppressed -- in which
case making it universal is the best option.

Andres, any opinions here?

> As for the previous if, I think that if it holds then it won't be
> possible to send a mem_event anyway, hence the else.

Sure, it won't be possible to send a mem event, because that code will
not be executed at all. :-)

Putting an "else" there sort of implies to someone reading the code
that you think the if() block might be executed and then continue
executing, which is misleading.  In your patch it's even more
misleading, because the else only covers the first if() and not the
subsequent conditionals you've added right after; which implies that
the if() block might be executed and then execute the conditionals
below this one, but not this one.

The less things a programmer has to remember / figure out /  keep in
her head, the less likely she is to make a mistake. :-)

 -George

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
       [not found]         ` <CAJu=L59+C7+byC0UJVcriERf-cueiz8CYcCeBOZfXX8ZLjTBRQ@mail.gmail.com>
@ 2014-09-10 18:28           ` Andres Lagar Cavilla
  2014-09-10 21:28             ` Razvan Cojocaru
  2014-09-11 14:40             ` Tamas K Lengyel
  0 siblings, 2 replies; 21+ messages in thread
From: Andres Lagar Cavilla @ 2014-09-10 18:28 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini, Dong,
	Eddie, Ian Jackson, Tim Deegan, Jan Beulich, Jun Nakajima,
	Andrew Cooper, xen-devel


>
> From: George Dunlap <dunlapg@umich.edu>
> Date: Wed, Sep 10, 2014 at 9:38 AM
> Subject: Re: [Xen-devel] [PATCH V6 5/5] xen: Handle resumed instruction
> based on previous mem_event reply
> To: Razvan Cojocaru <rcojocaru@bitdefender.com>
> Cc: "Tian, Kevin" <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>, Ian
> Campbell <ian.campbell@citrix.com>, Stefano Stabellini <
> stefano.stabellini@eu.citrix.com>, Jun Nakajima <jun.nakajima@intel.com>,
> Ian Jackson <ian.jackson@eu.citrix.com>, "Dong, Eddie" <
> eddie.dong@intel.com>, Tim Deegan <tim@xen.org>, Jan Beulich <
> jbeulich@suse.com>, Andrew Cooper <andrew.cooper3@citrix.com>, xen-devel <
> xen-devel@lists.xenproject.org>, Andres Lagar-Cavilla <
> andres@gridcentric.ca>
>
>
> On Wed, Sep 10, 2014 at 5:12 PM, Razvan Cojocaru
> <rcojocaru@bitdefender.com> wrote:
> > On 09/10/2014 07:03 PM, George Dunlap wrote:
> >> On Tue, Sep 9, 2014 at 11:28 AM, Razvan Cojocaru
> >> <rcojocaru@bitdefender.com> wrote:
> >>> In a scenario where a page fault that triggered a mem_event occurred,
> >>> p2m_mem_access_check() will now be able to either 1) emulate the
> >>> current instruction, or 2) emulate it, but don't allow it to perform
> >>> any writes.
> >>>
> >>> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
> >>> Acked-by: Kevin Tian <kevin.tian@intel.com>
> >> [snip]
> >>
> >>> +    else if ( v->arch.mem_event.emulate_flags == 0 &&
> >>> +              npfec.kind != npfec_kind_with_gla ) /* don't send a
> mem_event */
> >>> +    {
> >>> +        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
> >>> +        v->arch.mem_event.gpa = gpa;
> >>> +        v->arch.mem_event.eip = eip;
> >>> +    }
> >>
> >> It looks like, if the previous if() is true, it will never get to
> >> this point (because it will either return 0 or 1 depending on whether
> >> p2m->access_required is set).  So you don't need to make this an
> >> "else" here -- you should just add a blank line and make this a normal
> >> if().
> >>
> >> Also, maybe it's just because I'm not familiar with the mem_event
> >> interface, but I don't really see what this code is doing.  It seems
> >> to be changing the behavior even for clients that aren't using
> >> MEM_EVENT_FLAG_EMULATE*.  Is that intended?  In any case it seems like
> >> there could be a better comment here.
> >
> > Thanks, those are very good points. I'll make that a regular if(), and
> > test also if introspection monitoring is enabled (please see patch 3/5:
> > d->arch.hvm_domain.introspection_enabled) before setting the emulate
> > flag, that way we won't alter the behaviour for other clients.
>
> ...and you should also put a comment there explaining why someone with
> introspection enabled wouldn't want an event here (something I'm still
> not clear on).
>
> Are you *sure* that everyone who enables introspection will want that
> event suppressed (not just you), and that no one else will?
> Otherwise, it might make more sense to add some kind of flag to enable
> or disable it, rather than gating it on introspection.  Or it's
> possible everyone actually does want that event suppressed -- in which
> case making it universal is the best option.
>
> Andres, any opinions here?
>

My view of the mem event interface is that it should err on the side of
informing the consumer. Now, if the consumer doesn't sign up for something,
why bother (i.e. we don't inform of writes, if the access mode set for the
gfn does not mask writes, etc).

In an ideal world, the emulation of the instruction should raise all
relevant new mem events. We don't know a priori what the consumer might
learn throughout the execution of this specific instruction. Does it read
from or write to new gfns which have mem access masks set? TTBOMK, because
the emulation does not go through the EPT fault handler, no mem access
events will be generated, even if they should.

This is a long-standing shortcoming of mem event in security frameworks, in
that mem access is only defined as raising events through EPT faults. One
could conceivably craft an attack by having an instruction that through its
emulation reads/writes a massive buffer going into other gfns. Conversely,
"virtual DMA", i.e. qemu accesses via map_foreign_pages and grant accesses
from backends don't raise mem access events. So there are many (conceptual)
holes.

A decent thing to do for now would be to add a flag ..._EMULATE_SILENT,
which resolves to the current behavior, and lack of ..._EMULATE_SILENT in a
brave future would raise all the mem access events resulting from the full
emulation of this instruction. Fix the API at least, before it's set in
stone.
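
Concretely, something like this (the name and bit position are only
illustrative):

    /* Response flag: emulate, but suppress any mem access events the
     * emulation itself would otherwise generate. */
    #define MEM_EVENT_FLAG_EMULATE_SILENT   (1 << 7)

    /* Consumer side, when replying to the event: */
    rsp.flags = MEM_EVENT_FLAG_EMULATE | MEM_EVENT_FLAG_EMULATE_SILENT;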

Thanks
Andres



>
> > As for the previous if, I think that if it holds then it won't be
> > possible to send a mem_event anyway, hence the else.
>
> Sure, it won't be possible to send a mem event, because that code will
> not be executed at all. :-)
>
> Putting an "else" there sort of implies to someone reading the code
> that you think the if() block might be executed and then continue
> executing, which is misleading.  In your patch it's even more
> misleading, because the else only covers the first if() and not the
> subsequent conditionals you've added right after; which implies that
> the if() block might be executed and then execute the conditionals
> below this one, but not this one.
>
> The less things a programmer has to remember / figure out /  keep in
> her head, the less likely she is to make a mistake. :-)
>
>  -George
>
>
>
> --
> Andres Lagar-Cavilla | Google Cloud Platform | andreslc@google.com |
> 647-778-4380
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-10 18:28           ` Andres Lagar Cavilla
@ 2014-09-10 21:28             ` Razvan Cojocaru
  2014-09-11 10:09               ` George Dunlap
  2014-09-11 14:40             ` Tamas K Lengyel
  1 sibling, 1 reply; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-10 21:28 UTC (permalink / raw)
  To: Andres Lagar Cavilla
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Dong, Eddie, Ian Jackson, Tim Deegan, Jan Beulich,
	Andrew Cooper, xen-devel

On 09/10/14 21:28, Andres Lagar Cavilla wrote:
>     On Wed, Sep 10, 2014 at 5:12 PM, Razvan Cojocaru
>     <rcojocaru@bitdefender.com <mailto:rcojocaru@bitdefender.com>> wrote:
>     > On 09/10/2014 07:03 PM, George Dunlap wrote:
>     >> On Tue, Sep 9, 2014 at 11:28 AM, Razvan Cojocaru
>     >> <rcojocaru@bitdefender.com <mailto:rcojocaru@bitdefender.com>> wrote:
>     >>> In a scenario where a page fault that triggered a mem_event occurred,
>     >>> p2m_mem_access_check() will now be able to either 1) emulate the
>     >>> current instruction, or 2) emulate it, but don't allow it to perform
>     >>> any writes.
>     >>>
>     >>> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com
>     <mailto:rcojocaru@bitdefender.com>>
>     >>> Acked-by: Kevin Tian <kevin.tian@intel.com
>     <mailto:kevin.tian@intel.com>>
>     >> [snip]
>     >>
>     >>> +    else if ( v->arch.mem_event.emulate_flags == 0 &&
>     >>> +              npfec.kind != npfec_kind_with_gla ) /* don't send
>     a mem_event */
>     >>> +    {
>     >>> +        v->arch.mem_event.emulate_flags = MEM_EVENT_FLAG_EMULATE;
>     >>> +        v->arch.mem_event.gpa = gpa;
>     >>> +        v->arch.mem_event.eip = eip;
>     >>> +    }
>     >>
>     >> It looks like, if the previous if() is true, it will never get to
>     >> this point (because it will either return 0 or 1 depending on whether
>     >> p2m->access_required is set).  So you don't need to make this an
>     >> "else" here -- you should just add a blank line and make this a
>     normal
>     >> if().
>     >>
>     >> Also, maybe it's just because I'm not familiar with the mem_event
>     >> interface, but I don't really see what this code is doing.  It seems
>     >> to be changing the behavior even for clients that aren't using
>     >> MEM_EVENT_FLAG_EMULATE*.  Is that intended?  In any case it seems
>     like
>     >> there could be a better comment here.
>     >
>     > Thanks, those are very good points. I'll make that a regular if(), and
>     > test also if introspection monitoring is enabled (please see patch
>     3/5:
>     > d->arch.hvm_domain.introspection_enabled) before setting the emulate
>     > flag, that way we won't alter the behaviour for other clients.
> 
>     ...and you should also put a comment there explaining why someone with
>     introspection enabled wouldn't want an event here (something I'm still
>     not clear on).
> 
>     Are you *sure* that everyone who enables introspection will want that
>     event suppressed (not just you), and that no one else will?
>     Otherwise, it might make more sense to add some kind of flag to enable
>     or disable it, rather than gating it on introspection.  Or it's
>     possible everyone actually does want that event suppressed -- in which
>     case making it universal is the best option.
> 
>     Andres, any opinions here?
> 
> 
> My view of the mem event interface is that it should err on the side of
> informing the consumer. Now, if the consumer doesn't sign up for
> something, why bother (i.e. we don't inform of writes, if the access
> mode set for the gfn does not mask writes, etc).
> 
> In an ideal world, the emulation of the instruction should raise all
> relevant new mem events. We don't know a priori what the consumer might
> learn throughout the execution of this specific instruction. Does it
> read from or write to new gfns which have mem access masks set? TTBOMK,
> because the emulation does not go through the EPT fault handler, no mem
> access events will be generated, even if they should.
> 
> This is a long-standing shortcoming of mem event in security frameworks,
> in that mem access is only defined as raising events through EPT faults.
> One could conceivably craft an attack by having an instruction that
> through its emulation reads/writes a massive buffer going into other
> gfns. Conversely, "virtual DMA", i.e. qemu accesses via
> map_foreign_pages and grant accesses from backends don't raise mem
> access events. So there are many (conceptual) holes.
> 
> A decent thing to do for now would be to add a flag ..._EMULATE_SILENT,
> which resolves to the current behavior, and lack of ..._EMULATE_SILENT
> in a brave future would raise all the mem access events resulting from
> the full emulation of this instruction. Fix the API at least, before
> it's set in stone.

As far as I understand, George is asking about why events that have
npfec.kind != npfec_kind_with_gla are being emulated instead of being
sent out like the rest, and if that's a requirement that all memory
introspection clients might have.

To answer that question, _our_ application is not interested in events
other than npfec_kind_with_gla, and because of that it seemed worthwhile
to save a HV <-> dom0 roundtrip for events that would need to be ignored
by the application anyway, and thus keep the guest as responsive as
possible. I can't, of course, state that no other introspection client
will be interested in the other types of events.  But I can add another
parameter to xc_mem_access_enable_introspection() (please see patch 3/5
in the series) to specify whether non-npfec_kind_with_gla events should
be ignored or not (is this what the ..._EMULATE_SILENT suggestion refers
to?).
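
Something like this, perhaps (a sketch only -- the extra parameter and
its meaning are just a suggestion at this point, and the existing
parameters are abbreviated):

    /* Last argument: ignore events whose kind != npfec_kind_with_gla. */
    rc = xc_mem_access_enable_introspection(xch, domain_id, &port, 1);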


Thanks,
Razvan Cojocaru

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-10 21:28             ` Razvan Cojocaru
@ 2014-09-11 10:09               ` George Dunlap
  2014-09-11 10:44                 ` Razvan Cojocaru
  0 siblings, 1 reply; 21+ messages in thread
From: George Dunlap @ 2014-09-11 10:09 UTC (permalink / raw)
  To: Razvan Cojocaru
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Ian Jackson, Dong, Eddie, Tim Deegan, Jan Beulich,
	Andres Lagar Cavilla, Jun Nakajima, Andrew Cooper, xen-devel

On Wed, Sep 10, 2014 at 10:28 PM, Razvan Cojocaru
<rcojocaru@bitdefender.com> wrote:
> As far as I understand, George is asking about why events that have
> npfec.kind != npfec_kind_with_gla are being emulated instead of being
> sent out like the rest, and if that's a requirement that all memory
> introspection clients might have.
>
> To answer that question, _our_ application is not interested in events
> other than npfec_kind_with_gla, and because of that it seemed worthwhile
> to save a HV <-> dom0 roundtrip for events that would need to be ignored
> by the application anyway, and thus keep the guest as responsive as
> possible. I can't, of course, state that no other introspection client
> will be interested in the other types of events.  But I can add another
> parameter to xc_mem_access_enable_introspection() (please see patch 3/5
> in the series) to specify whether non-npfec_kind_with_gla events should
> be ignored or not (is this what the ..._EMULATE_SILENT suggestion refers
> to?).

I know that when you're new to the list it seems like we're being a bunch
of old stodgy skeptics nitpicking everything for no good reason.  :-)

So just to explain a bit where we're coming from: We're always happy
to have improvements from people, and we're glad to have more people
using Xen.  But there are interfaces that we're going to have to be
supporting long-term.  When you're writing a product and you own the
entire piece of code, you can make all these kinds of changes however
you want; the only people it will affect in the future are you.
Adding them to a shared project like Xen, they affect lots of other
people who aren't doing what you're doing.   Furthermore, every
addition to the interface makes it a tiny bit more complicated,
fragile, and easy to break; and it's fairly likely that we'll end up
being the ones having to fix it at some point.

So we definitely want you to be able to write your introspection
engine and have it run well.  But we also want to make sure that you
don't break anything for anyone else; and, we want to make sure that
anything you're adding to the interface is worth the extra cost in
terms of maintenance: that means pushing back and asking if you could
accomplish your goals just as well with the existing interface, or
with a simpler interface that is 1) useful generally, not just to you,
and 2) simpler and easier for us to maintain.

And we realize you may not have experience with this kind of project
before, which is why we're giving you feedback (even if we sometimes
forget what it's like to be new and get annoyed at having to say
the same thing to every new person who wants to contribute -- you'll
have to cut us some slack too).

So with that in mind:

This "don't give a notification on npfec_kind_with_gla" is a separate
feature that should have been introduced in a separate patch.  Of
course it's nice not to have to do context switches when you don't
have to; but adding a whole raft of flags for events you want to be
able to skip opens up a can of worms interface wise.  So it needs to
be justified. How expensive is it actually to just have the controller
ignore these?  How many unwanted events like this do you get on a
regular basis, and does it actually have a measurable performance
impact?

And if it does have a measurable performance impact, is it likely that
we're going to have other events that we may want to be able to filter
out in the hypervisor?  Is it better to think about a more general /
scalable way of specifying these events, rather than adding flags in
an ad-hoc fashion as they come up?
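
(Purely hypothetical, but e.g. a bitmask of npfec kinds the listener
wants delivered, set once at subscription time:)

    rc = xc_mem_access_set_event_filter(xch, domain_id,
                                        1U << npfec_kind_with_gla);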

On the whole, if you're hoping to have the less controversial bits
(patches 1-3, and the other half of this patch) in 4.5, you might want
to set this to the side and come back to it after the code freeze is
done.

 -George

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-11 10:09               ` George Dunlap
@ 2014-09-11 10:44                 ` Razvan Cojocaru
  0 siblings, 0 replies; 21+ messages in thread
From: Razvan Cojocaru @ 2014-09-11 10:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, Keir Fraser, Ian Campbell, Stefano Stabellini,
	Jun Nakajima, Dong, Eddie, Ian Jackson, Tim Deegan,
	Andres Lagar Cavilla, Jan Beulich, Andrew Cooper, xen-devel

On 09/11/2014 01:09 PM, George Dunlap wrote:
> So just to explain a bit where we're coming from: We're always happy
> to have improvements from people, and we're glad to have more people
> using Xen.  But there are interfaces that we're going to have to be
> supporting long-term.  When you're writing a product and you own the
> entire piece of code, you can make all these kinds of changes however
> you want; the only people it will affect in the future are you.
> Adding them to a shared project like Xen, they affect lots of other
> people who aren't doing what you're doing.   Furthermore, every
> addition to the interface makes it a tiny bit more complicated,
> fragile, and easy to break; and it's fairly likely that we'll end up
> being the ones having to fix it at some point.

Thank you for the message, it's appreciated! Fair enough.

> So with that in mind:
> 
> This "don't give a notification on npfec_kind_with_gla" is a separate
> feature that should have been introduced in a separate patch.  Of
> course it's nice not to have to do context switches when you don't
> have to; but adding a whole raft of flags for events you want to be
> able to skip opens up a can of worms interface wise.  So it needs to
> be justified. How expensive is it actually to just have the controller
> ignore these?  How many unwanted events like this do you get on a
> regular basis, and does it actually have a measurable performance
> impact?
> 
> And if it does have a measurable performance impact, is it likely that
> we're going to have other events that we may want to be able to filter
> out in the hypervisor?  Is it better to think about a more general /
> scalable way of specifying these events, rather than adding flags in
> an ad-hoc fashion as they come up?
> 
> On the whole, if you're hoping to have the less controversial bits
> (patches 1-3, and the other half of this patch) in 4.5, you might want
> to set this to the side and come back to it after the code freeze is
> done.

The impact is acceptable, I'll do the filtering in the application.
Thanks for the suggestion! I'll do a bit more testing, and if all goes
well I'll resubmit the series as patches 1-3, and 5 without the GLA
filtering.
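
(A sketch of the application-side filter; the helper and field names
here are assumptions, not the actual API:)

    mem_event_request_t req;

    while ( get_next_request(ring, &req) )       /* hypothetical helper */
    {
        if ( !req.fault_with_gla )               /* field name assumed */
        {
            resume_guest(&req);                  /* uninteresting event */
            continue;
        }
        process_violation(&req);
    }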


Thanks,
Razvan Cojocaru

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-10 18:28           ` Andres Lagar Cavilla
  2014-09-10 21:28             ` Razvan Cojocaru
@ 2014-09-11 14:40             ` Tamas K Lengyel
  2014-09-11 16:42               ` Andres Lagar Cavilla
  1 sibling, 1 reply; 21+ messages in thread
From: Tamas K Lengyel @ 2014-09-11 14:40 UTC (permalink / raw)
  To: Andres Lagar Cavilla; +Cc: xen-devel


I've removed the CC's as I'm going a bit off-topic here.


> In an ideal world, the emulation of the instruction should raise all
> relevant new mem events. We don't know a priori what the consumer might
> learn throughout the execution of this specific instruction. Does it read
> from or write to new gfns which have mem access masks set? TTBOMK, because
> the emulation does not go through the EPT fault handler, no mem access
> events will be generated, even if they should.
>
> This is a long-standing shortcoming of mem event in security frameworks,
> in that mem access is only defined as raising events through EPT faults.
> One could conceivably craft an attack by having an instruction that through
> its emulation reads/writes a massive buffer going into other gfns.
> Conversely, "virtual DMA", i.e. qemu accesses via map_foreign_pages and
> grant accesses from backends don't raise mem access events. So there are
> many (conceptual) holes.
>

Could you provide an example instruction that is trapped-and-emulated by
Xen which may be used in such a fashion? Also, is there any technical
reason why we couldn't hook such emulations into the mem_event system?

Thanks,
Tamas


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-11 14:40             ` Tamas K Lengyel
@ 2014-09-11 16:42               ` Andres Lagar Cavilla
  2014-09-11 18:09                 ` Tamas K Lengyel
  0 siblings, 1 reply; 21+ messages in thread
From: Andres Lagar Cavilla @ 2014-09-11 16:42 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel


On Thu, Sep 11, 2014 at 7:40 AM, Tamas K Lengyel <tamas.lengyel@zentific.com
> wrote:

> I've removed the CC's as I'm going a bit off-topic here.
>
>
>> In an ideal world, the emulation of the instruction should raise all
>> relevant new mem events. We don't know a priori what the consumer might
>> learn throughout the execution of this specific instruction. Does it read
>> from or write to new gfns which have mem access masks set? TTBOMK, because
>> the emulation does not go through the EPT fault handler, no mem access
>> events will be generated, even if they should.
>>
>> This is a long-standing shortcoming of mem event in security frameworks,
>> in that mem access is only defined as raising events through EPT faults.
>> One could conceivably craft an attack by having an instruction that through
>> its emulation reads/writes a massive buffer going into other gfns.
>> Conversely, "virtual DMA", i.e. qemu accesses via map_foreign_pages and
>> grant accesses from backends don't raise mem access events. So there are
>> many (conceptual) holes.
>>
>
> Could you provide an example instruction that is trapped-and-emulated by
> Xen which may be used in such a fashion? Also, is there any technical
> reason why we couldn't hook such emulations into the mem_event system?
>

Tamas,
I think it's safe to assume Razvan's dom0 application is powerful enough to
emulate the entire trapping instruction and not be victimized.

For the sake of argument, what I'm going at is that after the mem_event has
been handled and control is passed to hvm_emulate_one, Xen will start
resolving gfn->mfn translations needed by the instruction emulation by
internally walking the p2m (read EPT) table with get_page_from_gfn. This
will not invoke p2m_mem_access_check (only happens for actual hw faults),
so an instruction that reads or writes across pages will not have a mem
event generated for the other pages. A rep stos across page boundaries
would do that (key: the rep stos is emulated in Xen, and the eip is then
moved silently forward, so the hardware actually doesn't get to execute the
instruction).
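
For reference, the translation on the emulation path boils down to
something like this (sketch):

    p2m_type_t p2mt;
    struct page_info *page = get_page_from_gfn(d, gfn, &p2mt, P2M_UNSHARE);
    /* Note: no p2m_mem_access_check() anywhere on this path -- the
       access bits in the p2m entry are never consulted, so no mem_event
       is raised for this gfn. */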

A harder to catch example is a qemu-based driver, which grabs guest pages
via the mapcache buckets using xc_map_foreign_bulk. This resolves
to MMU_NORMAL_PT_UPDATE, which will grab the target page with ...
get_page_from_gfn. Basically, every page qemu reads/writes to/from will not
result in a mem event. This is akin to an unrestricted DMA engine that can
bypass the hardware PTE protection bits and do things behind the OS back.

Grant mapping also uses get_page_from_gfn ... no mem access checks.

The way to fix it is very laborious, that is why it hasn't happened. The
root cause is that p2m->get_entry does not check any of the access bits. It
could, and then you would be generating mem events from everywhere. But
that brings two problems. First, repeated events, as the same gfn may be
read multiple times -- I don't think anybody wants that. Second, you have
to be able to sleep on a wait queue when the event ring fills up (unless
you are comfortable dropping events). Sleeping on a wait queue pretty much
means stopping everything you are doing, carefully unrolling your stack
until you hold no spinlocks, going into the wait queue, and when you wake
up dive back into business.

HTH
Andres


> Thanks,
> Tamas
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-11 16:42               ` Andres Lagar Cavilla
@ 2014-09-11 18:09                 ` Tamas K Lengyel
  2014-09-11 18:39                   ` Andres Lagar Cavilla
  0 siblings, 1 reply; 21+ messages in thread
From: Tamas K Lengyel @ 2014-09-11 18:09 UTC (permalink / raw)
  To: Andres Lagar Cavilla; +Cc: xen-devel


On Thu, Sep 11, 2014 at 6:42 PM, Andres Lagar Cavilla <
andres@lagarcavilla.org> wrote:

> On Thu, Sep 11, 2014 at 7:40 AM, Tamas K Lengyel <
> tamas.lengyel@zentific.com> wrote:
>
>> I've removed the CC's as I'm going a bit off-topic here.
>>
>>
>>> In an ideal world, the emulation of the instruction should raise all
>>> relevant new mem events. We don't know a priori what the consumer might
>>> learn throughout the execution of this specific instruction. Does it read
>>> from or write to new gfns which have mem access masks set? TTBOMK, because
>>> the emulation does not go through the EPT fault handler, no mem access
>>> events will be generated, even if they should.
>>>
>>> This is a long-standing shortcoming of mem event in security frameworks,
>>> in that mem access is only defined as raising events through EPT faults.
>>> One could conceivably craft an attack by having an instruction that through
>>> its emulation reads/writes a massive buffer going into other gfns.
>>> Conversely, "virtual DMA", i.e. qemu accesses via map_foreign_pages and
>>> grant accesses from backends don't raise mem access events. So there are
>>> many (conceptual) holes.
>>>
>>
>> Could you provide an example instruction that is trapped-and-emulated by
>> Xen which may be used in such a fashion? Also, is there any technical
>> reason why we couldn't hook such emulations into the mem_event system?
>>
>
> Tamas,
> I think it's safe to assume Razvan's dom0 application is powerful enough
> to emulate the entire trapping instruction and not be victimized.
>
> For the sake of argument, what I'm going at is that after the mem_event
> has been handled and control is passed to hvm_emulate_one, Xen will start
> resolving gfn->mfn translations needed by the instruction emulation by
> internally walking the p2m (read EPT) table with get_page_from_gfn. This
> will not invoke p2m_mem_access_check (only happens for actual hw faults),
> so an instruction that reads or writes across pages will not have a mem
> event generated for the other pages. A rep stos across page boundaries
> would do that (key: the rep stos is emulated in Xen, and the eip is then
> moved silently forward, so the hardware actually doesn't get to execute the
> instruction).
>
> A harder to catch example is a qemu-based driver, which grabs guest pages
> via the mapcache buckets using xc_map_foreign_bulk. This resolves
> to MMU_NORMAL_PT_UPDATE, which will grab the target page with ...
> get_page_from_gfn. Basically, every page qemu reads/writes to/from will not
> result in a mem event. This is akin to an unrestricted DMA engine that can
> bypass the hardware PTE protection bits and do things behind the OS back.
>
> Grant mapping also uses get_page_from_gfn ... no mem access checks.
>
> The way to fix it is very laborious, that is why it hasn't happened. The
> root cause is that p2m->get_entry does not check any of the access bits. It
> could, and then you would be generating mem events from everywhere. But
> that brings two problems. First, repeated events, as the same gfn may be
> read multiple times -- I don't think anybody wants that. Second, you have
> to be able to sleep on a wait queue when the event ring fills up (unless
> you are comfortable dropping events). Sleeping on a wait queue pretty much
> means stopping everything you are doing, carefully unrolling your stack
> until you hold no spinlocks, going into the wait queue, and when you wake
> up dive back into business.
>
> HTH
> Andres
>

Thanks for the in-depth explanation, it certainly sheds some light on the
limitations of the mem_access system. I understand that any memory access
to mfns via mechanisms that don't use the trapped EPT (a pv domain or the
hypervisor itself) or have a mapping of the same pages via different EPTs
won't trigger the mem_event traps. For the emulation part my question was
rather if you are aware of any emulation that currently takes place
(outside this patch series) which may be used in this fashion?

Thanks,
Tamas


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-11 18:09                 ` Tamas K Lengyel
@ 2014-09-11 18:39                   ` Andres Lagar Cavilla
  2014-09-11 19:06                     ` Andrew Cooper
  0 siblings, 1 reply; 21+ messages in thread
From: Andres Lagar Cavilla @ 2014-09-11 18:39 UTC (permalink / raw)
  To: Tamas K Lengyel; +Cc: xen-devel


On Thu, Sep 11, 2014 at 11:09 AM, Tamas K Lengyel <
tamas.lengyel@zentific.com> wrote:

>
>
> On Thu, Sep 11, 2014 at 6:42 PM, Andres Lagar Cavilla <
> andres@lagarcavilla.org> wrote:
>
>> On Thu, Sep 11, 2014 at 7:40 AM, Tamas K Lengyel <
>> tamas.lengyel@zentific.com> wrote:
>>
>>> I've removed the CC's as I'm going a bit off-topic here.
>>>
>>>
>>>> In an ideal world, the emulation of the instruction should raise all
>>>> relevant new mem events. We don't know a priori what the consumer might
>>>> learn throughout the execution of this specific instruction. Does it read
>>>> from or write to new gfns which have mem access masks set? TTBOMK, because
>>>> the emulation does not go through the EPT fault handler, no mem access
>>>> events will be generated, even if they should.
>>>>
>>>> This is a long-standing shortcoming of mem event in security
>>>> frameworks, in that mem access is only defined as raising events through
>>>> EPT faults. One could conceivably craft an attack by having an instruction
>>>> that through its emulation reads/writes a massive buffer going into other
>>>> gfns. Conversely, "virtual DMA", i.e. qemu accesses via map_foreign_pages
>>>> and grant accesses from backends don't raise mem access events. So there
>>>> are many (conceptual) holes.
>>>>
>>>
>>> Could you provide an example instruction that is trapped-and-emulated by
>>> Xen which may be used in such a fashion? Also, is there any technical
>>> reason why we couldn't hook such emulations into the mem_event system?
>>>
>>
>> Tamas,
>> I think it's safe to assume Razvan's dom0 application is powerful enough
>> to emulate the entire trapping instruction and not be victimized.
>>
>> For the sake of argument, what I'm going at is that after the mem_event
>> has been handled and control is passed to hvm_emulate_one, Xen will start
>> resolving gfn->mfn translations needed by the instruction emulation by
>> internally walking the p2m (read EPT) table with get_page_from_gfn. This
>> will not invoke p2m_mem_access_check (only happens for actual hw faults),
>> so an instruction that reads or writes across pages will not have a mem
>> event generated for the other pages. A rep stos across page boundaries
>> would do that (key: the rep stos is emulated in Xen, and the eip is then
>> moved silently forward, so the hardware actually doesn't get to execute the
>> instruction).
>>
>> A harder to catch example is a qemu-based driver, which grabs guest pages
>> via the mapcache buckets using xc_map_foreign_bulk. This resolves
>> to MMU_NORMAL_PT_UPDATE, which will grab the target page with ...
>> get_page_from_gfn. Basically, every page qemu reads/writes to/from will not
>> result in a mem event. This is akin to an unrestricted DMA engine that can
>> bypass the hardware PTE protection bits and do things behind the OS back.
>>
>> Grant mapping also uses get_page_from_gfn ... no mem access checks.
>>
>> The way to fix it is very laborious, that is why it hasn't happened. The
>> root cause is that p2m->get_entry does not check any of the access bits. It
>> could, and then you would be generating mem events from everywhere. But
>> that brings two problems. First, repeated events, as the same gfn may be
>> read multiple times -- I don't think anybody wants that. Second, you have
>> to be able to sleep on a wait queue when the event ring fills up (unless
>> you are comfortable dropping events). Sleeping on a wait queue pretty much
>> means stopping everything you are doing, carefully unrolling your stack
>> until you hold no spinlocks, going into the wait queue, and when you wake
>> up dive back into business.
>>
>> HTH
>> Andres
>>
>
> Thanks for the in-depth explanation, it certainly sheds some light on the
> limitations of the mem_access system. I understand that any memory access
> to mfns via mechanisms that don't use the trapped EPT (a pv domain or the
> hypervisor itself) or have a mapping of the same pages via different EPTs
> won't trigger the mem_event traps. For the emulation part my question was
> rather if you are aware of any emulation that currently takes place
> (outside this patch series) which may be used in this fashion?
>

Uhm. Examples I can think of: MMIO access. The OS reads values from lapic
or hpet pages, and those get emulated (although there are lapic fast paths
out there). If the buffers in regular RAM fall in pages that have mem
access permission trapping set, then no event will be generated (by that
mmio instruction).

And all your PV driver frontend needs. Qemu does the RTC (IIRC), so RTC
reads also escape mem access.

Andres


> Thanks,
> Tamas
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply
  2014-09-11 18:39                   ` Andres Lagar Cavilla
@ 2014-09-11 19:06                     ` Andrew Cooper
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Cooper @ 2014-09-11 19:06 UTC (permalink / raw)
  To: Andres Lagar Cavilla, Tamas K Lengyel; +Cc: xen-devel


On 11/09/2014 19:39, Andres Lagar Cavilla wrote:
> On Thu, Sep 11, 2014 at 11:09 AM, Tamas K Lengyel
> <tamas.lengyel@zentific.com <mailto:tamas.lengyel@zentific.com>> wrote:
>
>
>
>     On Thu, Sep 11, 2014 at 6:42 PM, Andres Lagar Cavilla
>     <andres@lagarcavilla.org <mailto:andres@lagarcavilla.org>> wrote:
>
>         On Thu, Sep 11, 2014 at 7:40 AM, Tamas K Lengyel
>         <tamas.lengyel@zentific.com
>         <mailto:tamas.lengyel@zentific.com>> wrote:
>
>             I've removed the CC's as I'm going a bit off-topic here.
>              
>
>                 In an ideal world, the emulation of the instruction
>                 should raise all relevant new mem events. We don't
>                 know a priori what the consumer might learn throughout
>                 the execution of this specific instruction. Does it
>                 read from or write to new gfns which have mem access
>                 masks set? TTBOMK, because the emulation does not go
>                 through the EPT fault handler, no mem access events
>                 will be generated, even if they should.
>
>                 This is a long-standing shortcoming of mem event in
>                 security frameworks, in that mem access is only
>                 defined as raising events through EPT faults. One
>                 could conceivably craft an attack by having an
>                 instruction that through its emulation reads/writes a
>                 massive buffer going into other gfns. Conversely,
>                 "virtual DMA", i.e. qemu accesses via
>                 map_foreign_pages and grant accesses form backends
>                 don't raise mem access events. So there are many
>                 (conceptual) holes.
>
>
>             Could you provide an example instruction that is
>             trapped-and-emulated by Xen which may be used in such a
>             fashion? Also, is there any technical reason why we
>             couldn't hook such emulations into the mem_event system?
>
>
>         Tamas,
>         I think it's safe to assume Razvan's dom0 application is
>         powerful enough to emulate the entire trapping instruction and
>         not be victimized.
>
>         For the sake of argument, what I'm going at is that after the
>         mem_event has been handled and control is passed to
>         hvm_emulate_one, Xen will start resolving gfn->mfn
>         translations needed by the instruction emulation by internally
>         walking the p2m (read EPT) table with get_page_from_gfn. This
>         will not invoke p2m_mem_access_check (only happens for actual
>         hw faults), so an instruction that reads or writes across
>         pages will not have a mem event generated for the other pages.
>         A rep stos across page boundaries would do that (key: the rep
>         stos is emulated in Xen, and the eip is then moved silently
>         forward, so the hardware actually doesn't get to execute the
>         instruction).
>
>         A harder to catch example is a qemu-based driver, which grabs
>         guest pages via the mapcache buckets using
>         xc_map_foreign_bulk. This resolves to MMU_NORMAL_PT_UPDATE,
>         which will grab the target page with ... get_page_from_gfn.
>         Basically, every page qemu reads/writes to/from will not
>         result in a mem event. This is akin to an unrestricted DMA
>         engine that can bypass the hardware PTE protection bits and do
>         things behind the OS back.
>
>         Grant mapping also uses get_page_from_gfn ... no mem access
>         checks.
>
>         The way to fix it is very laborious, that is why it hasn't
>         happened. The root cause is that p2m->get_entry does not check
>         any of the access bits. It could, and then you would be
>         generating mem events from everywhere. But that brings two
>         problems. First, repeated events, as the same gfn may be read
>         multiple times -- I don't think anybody wants that. Second,
>         you have to be able to sleep on a wait queue when the event
>         ring fills up (unless you are comfortable dropping events).
>         Sleeping on a wait queue pretty much means stopping everything
>         you are doing, carefully unrolling your stack until you hold
>         no spinlocks, going into the wait queue, and when you wake up
>         dive back into business.
>
>         HTH
>         Andres
>
>
>     Thanks for the in-depth explanation, it certainly sheds some light
>     on the limitations of the mem_access system. I understand that any
>     memory access to mfn's via mechanisms that don't use the trapped
>     EPT (a pv domain or the hypervisor itself) or have a mapping of
>     the same pages via different EPTs won't trigger the mem_event
>     traps. For the emulation part my question was rather if you are
>     aware of any emulation that currently takes place (outside this
>     patch series) which may be used in this fashion?
>
>
> Uhm. Examples I can think of: MMIO access. The OS reads values from
> lapic or hpet pages, and those get emulated (although there are lapic
> fast paths out there). If the buffers in regular RAM fall in pages
> that have mem access permission trapping set, then no event will be
> generated (by that mmio instruction).
>
> And all your PV driver frontend needs. Qemu does the RTC (IIRC), so
> RTC reads also escape mem access.

Xen does all forms of timer and interrupt emulation (so off the top of
my head, RTC, PIT, HPET, PMTimer, PIC, IOAPIC and LAPIC) but all other
legacy devices are handled by Qemu.  There is now a fastpath for
anything emulated by Xen, for performance/scalability reasons with
many-vcpu guests.

~Andrew


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2014-09-11 19:06 UTC | newest]

Thread overview: 21+ messages
2014-09-09 10:28 [PATCH V6 0/5] Basic guest memory introspection support Razvan Cojocaru
2014-09-09 10:28 ` [PATCH V6 1/5] xen: Emulate with no writes Razvan Cojocaru
2014-09-09 10:28 ` [PATCH V6 2/5] xen: Optimize introspection access to guest state Razvan Cojocaru
2014-09-09 10:28 ` [PATCH V6 3/5] xen, libxc: Force-enable relevant MSR events Razvan Cojocaru
2014-09-09 10:28 ` [PATCH V6 4/5] xen, libxc: Request page fault injection via libxc Razvan Cojocaru
2014-09-10 14:38   ` Jan Beulich
2014-09-09 10:28 ` [PATCH V6 5/5] xen: Handle resumed instruction based on previous mem_event reply Razvan Cojocaru
2014-09-10 16:03   ` George Dunlap
2014-09-10 16:12     ` Razvan Cojocaru
2014-09-10 16:38       ` George Dunlap
     [not found]         ` <CAJu=L59+C7+byC0UJVcriERf-cueiz8CYcCeBOZfXX8ZLjTBRQ@mail.gmail.com>
2014-09-10 18:28           ` Andres Lagar Cavilla
2014-09-10 21:28             ` Razvan Cojocaru
2014-09-11 10:09               ` George Dunlap
2014-09-11 10:44                 ` Razvan Cojocaru
2014-09-11 14:40             ` Tamas K Lengyel
2014-09-11 16:42               ` Andres Lagar Cavilla
2014-09-11 18:09                 ` Tamas K Lengyel
2014-09-11 18:39                   ` Andres Lagar Cavilla
2014-09-11 19:06                     ` Andrew Cooper
2014-09-09 10:34 ` [PATCH V6 0/5] Basic guest memory introspection support Jan Beulich
2014-09-09 10:39   ` Razvan Cojocaru
