* [RFC PATCH 00/19] Guest introspection
@ 2017-06-16 13:43 Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 01/19] kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte() Adalbert Lazar
                   ` (19 more replies)
  0 siblings, 20 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

This patch series proposes an interface that allows a guest
introspection tool to monitor and control other guests, in order to
protect them against various forms of exploitation. A similar interface
is already present in the Xen hypervisor.

With the current implementation, the introspection tool connects to
KVMi (the introspection subsystem of KVM) over a vsock socket and
establishes a main communication channel, used for a few messages
(KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
KVMI_GET_VERSION).

On every KVMI_EVENT_GUEST_ON notification, the introspection tool
establishes a new connection, used to monitor and control that guest.

In order to control the guests, we found that the following list
of introspection commands/events is required:

Commands - messages sent from introspection tool to KVMi
========

- KVMI_GET_GUEST_INFO

	Get the number of online VCPUs and the TSC speed.

- KVMI_PAUSE_GUEST, KVMI_UNPAUSE_GUEST

	Pause/unpause all VCPUs.

- KVMI_SHUTDOWN_GUEST

- KVMI_GET_REGISTERS

	Get the general purpose registers, the special registers and a
	small subset of MSRs (those controlling syscall behaviour).

- KVMI_SET_REGISTERS

	Set the general purpose registers.

- KVMI_GET_MTRR_TYPE

	Get the guest memory type for a specific physical address.

- KVMI_GET_MTRRS

	Get MSR_IA32_CR_PAT, MSR_MTRRcap and MSR_MTRRdefType.

- KVMI_GET_XSAVE_INFO

	Get vcpu->arch.guest_xstate_size.

- KVMI_GET_PAGE_ACCESS, KVMI_SET_PAGE_ACCESS

	Get/set the SPTE access flags (r/w/x, mapped to the present,
	writable and user bits).

- KVMI_INJECT_PAGE_FAULT, KVMI_INJECT_BREAKPOINT

	Inject a page fault (e.g. to instruct the guest OS to page in a
	swapped-out page) or a breakpoint.

- KVMI_READ_PHYSICAL, KVMI_WRITE_PHYSICAL

- KVMI_MAP_PHYSICAL_PAGE_TO_SVA, KVMI_UNMAP_PHYSICAL_PAGE_FROM_SVA

	A faster alternative to read/write messages (above).

- KVMI_EVENT_CONTROL

	Enable event reports (see the event list below).

- KVMI_CR_CONTROL, KVMI_MSR_CONTROL

	Filter VCPU events for specific CR and MSR registers
	(if enabled with KVMI_EVENT_CONTROL).

Events - messages sent from KVMi to introspection tool
======

- KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF

	Send the guest UUID.
	On KVMI_EVENT_GUEST_ON, the introspection tool connects back with the UUID,
	in order to establish a control channel for this guest.

- KVMI_EVENT_VCPU

	This message is used to send one of the following events (if
	enabled with KVMI_EVENT_CONTROL - see above), together with
	the registers (see KVMI_GET_REGISTERS). The introspection tool
	can reply with the KVMI_EVENT_SET_REGS flag set and provide new
	values for the registers, as with the KVMI_SET_REGISTERS command
	(see the sketch after this list).
	
	- KVMI_EVENT_CR

	  A CR register was modified. If the event reporting for this
	  specific CR was enabled with KVMI_CR_CONTROL, send a message to
	  the introspection tool with the CR number, the old value, the
	  new value, and wait for a reply with one or more actions/flags:

	  + KVMI_EVENT_ALLOW    (allow the new value to be set)
	  + KVMI_EVENT_SET_REGS (override the registers)

	  otherwise, block this modification.

	- KVMI_EVENT_MSR

	  Similar to KVMI_EVENT_CR. Filtered with KVMI_MSR_CONTROL.

	- KVMI_EVENT_XSETBV

	  An extended control register was modified. Send the value.
	  The introspection tool can reply with KVMI_EVENT_SET_REGS.

	- KVMI_EVENT_BREAKPOINT

	  A breakpoint was reached. Send the guest address.
	  The introspection tool can reply with KVMI_EVENT_SET_REGS
	  and KVMI_EVENT_ALLOW.

	- KVMI_EVENT_USER_CALL

	  User hypercall.
	  The introspection tool can reply with KVMI_EVENT_SET_REGS.

	- KVMI_EVENT_TRAP

	  A trap will be delivered to the guest (#PF, INT3 etc.).
	  The introspection tool can reply with KVMI_EVENT_SET_REGS.

	- KVMI_EVENT_PAGE_FAULT

	  A hypervisor-level page fault (e.g. an EPT violation) was
	  encountered.
	  The introspection tool can reply with:
	  + KVMI_EVENT_ALLOW (otherwise EMULATE_FAIL will be returned)
	  + KVMI_EVENT_NOEMU (EMULATE_DONE)
	  + KVMI_EVENT_SET_REGS
	  + KVMI_EVENT_SET_CTX (change the emulation context)
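
As mentioned for KVMI_EVENT_VCPU, the introspection tool replies with a
set of action flags and, optionally, new register values. For instance,
a reply that allows a CR write to proceed while also rewriting the
registers could be filled in like this (a sketch built on struct
kvmi_event_reply from patch 08; new_regs is a hypothetical variable
holding the desired register state):

	struct kvmi_event_reply reply = { };

	reply.event = KVMI_EVENT_ALLOW | KVMI_EVENT_SET_REGS;
	reply.regs = new_regs;	/* written back to the VCPU */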

The control channels are handled by workqueue jobs, which receive
messages from the introspection tool and signal the proper VCPU threads
to act on them.

Currently, all the commands pause/unpause the guest, but we would like
to avoid this when possible.
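
To make the wire format concrete, below is a minimal sketch of a
command/reply exchange as seen from the introspection tool, assuming an
already-connected vsock socket. kvmi_socket_hdr, the message IDs and
KVMI_FLAG_RESPONSE come from include/uapi/linux/kvmi.h (patch 08); the
layout of the KVMI_GET_VERSION reply payload (a single __u32) is an
assumption, as it is not spelled out in this series:

	/* Sketch only: short reads/writes and errors are glossed over.
	 * Needs <unistd.h> and the new <linux/kvmi.h> uapi header. */
	static int kvmi_get_version(int fd, __u32 *version)
	{
		struct kvmi_socket_hdr req = {
			.msg_id = KVMI_GET_VERSION,
			.size = 0,	/* no payload after the header */
			.seq = 1,	/* maintained by the sending party */
		};
		struct kvmi_socket_hdr rsp;

		if (write(fd, &req, sizeof(req)) != sizeof(req))
			return -1;
		if (read(fd, &rsp, sizeof(rsp)) != sizeof(rsp))
			return -1;
		/* a reply has the same seq and msg_id | KVMI_FLAG_RESPONSE */
		if (rsp.seq != req.seq ||
		    rsp.msg_id != (KVMI_GET_VERSION | KVMI_FLAG_RESPONSE))
			return -1;
		if (read(fd, version, sizeof(*version)) != sizeof(*version))
			return -1;
		return 0;
	}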

This patch series is not complete. Your input would be greatly appreciated.


Adalbert Lazar (2):
  kvm: Add the introspection subsystem
  kvm: x86: Handle KVM_REQ_INTROSPECTION

Mihai Dontu (17):
  kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte()
  kvm: x86: Add kvm_arch_vcpu_set_regs()
  mm: Add vm_replace_page()
  kvm: Add kvm_enum()
  kvm: Add uuid member in struct kvm + support for KVM_CAP_VM_UUID
  kvm: Add kvm_vm_shutdown()
  kvm: x86: Add kvm_arch_msr_intercept()
  kvm: Hook in kvmi on VM on/off events
  kvm: vmx: Hook in kvmi_page_fault()
  kvm: x86: Hook in kvmi_breakpoint_event()
  kvm: x86: Hook in kvmi_trap_event()
  kvm: x86: Hook in kvmi_cr_event()
  kvm: x86: Hook in kvmi_xsetbv_event()
  kvm: x86: Hook in kvmi_msr_event()
  kvm: x86: Change the emulation context
  kvm: x86: Hook in kvmi_vmcall_event()
  kvm: x86: Set the new spte flags before entering the guest

 arch/x86/include/asm/kvm_host.h |   11 +-
 arch/x86/kvm/Kconfig            |    2 +
 arch/x86/kvm/Makefile           |    1 +
 arch/x86/kvm/mmu.c              |  126 ++-
 arch/x86/kvm/mmu.h              |    3 +
 arch/x86/kvm/svm.c              |   10 +
 arch/x86/kvm/vmx.c              |   79 +-
 arch/x86/kvm/x86.c              |  148 ++-
 include/linux/kvm_host.h        |   36 +
 include/linux/mm.h              |    1 +
 include/uapi/linux/kvm.h        |    2 +
 include/uapi/linux/kvm_para.h   |    4 +
 include/uapi/linux/kvmi.h       |  263 +++++
 mm/memory.c                     |   69 ++
 virt/kvm/kvm_main.c             |   81 ++
 virt/kvm/kvmi.c                 | 2252 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi.h                 |   42 +
 virt/kvm/kvmi_socket.c          |  412 +++++++
 virt/kvm/kvmi_socket.h          |   33 +
 19 files changed, 3566 insertions(+), 9 deletions(-)
 create mode 100644 include/uapi/linux/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi.h
 create mode 100644 virt/kvm/kvmi_socket.c
 create mode 100644 virt/kvm/kvmi_socket.h

-- 
2.12.2


* [RFC PATCH 01/19] kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 02/19] kvm: x86: Add kvm_arch_vcpu_set_regs() Adalbert Lazar
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

These are helpers used by the introspection subsystem to adjust the SPTE
rwx flags. At present, the code assumes we're dealing with 4-level
hardware shadow page tables.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/mmu.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu.h |  3 +++
 2 files changed, 78 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cb8225969255..12e4c33ff879 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5350,3 +5350,78 @@ void kvm_mmu_module_exit(void)
 	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
+u64 kvm_mmu_get_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, gpa_t gpa)
+{
+	u64 spte = -1;
+	unsigned int c = 0;
+	const u64 mask = PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
+	struct kvm_shadow_walk_iterator iterator;
+
+	spin_lock(&kvm->mmu_lock);
+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+		goto error;
+	for_each_shadow_entry(vcpu, gpa & PAGE_MASK, iterator) {
+		u64 __spte = *iterator.sptep;
+
+		if (!(__spte & mask))
+			break;
+		else if (++c == PT64_ROOT_LEVEL) {
+			spte = __spte;
+			break;
+		}
+	}
+	if (spte == (u64) -1)
+		goto error;
+	spin_unlock(&kvm->mmu_lock);
+	return spte & mask;
+error:
+	spin_unlock(&kvm->mmu_lock);
+	return -ENOENT;
+}
+
+int kvm_mmu_set_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, gpa_t gpa,
+		     unsigned int r, unsigned int w, unsigned int x)
+{
+	int flush = 0;
+	u64 *pspte[4] = { };
+	u64 spte;
+	u64 old_spte;
+	unsigned int c = 0;
+	const u64 mask = PT_PRESENT_MASK | PT_WRITABLE_MASK | PT_USER_MASK;
+	struct kvm_shadow_walk_iterator iterator;
+
+	spin_lock(&kvm->mmu_lock);
+	if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
+		goto error;
+	for_each_shadow_entry(vcpu, gpa & PAGE_MASK, iterator) {
+		u64 __spte = *iterator.sptep;
+
+		if (!(__spte & mask))
+			break;
+		pspte[c++] = iterator.sptep;
+	}
+	if (c < PT64_ROOT_LEVEL || !pspte[c - 1])
+		goto error;
+	c--;
+	old_spte = *pspte[c];
+	spte = old_spte & ~mask;
+	if (r)
+		spte |= PT_PRESENT_MASK;
+	if (w)
+		spte |= PT_WRITABLE_MASK;
+	if (x)
+		spte |= PT_USER_MASK;
+	if (old_spte != spte)
+		flush |= mmu_spte_update(pspte[c], spte);
+	while (c-- > 0) {
+		spte = *pspte[c];
+		if ((spte & mask) != mask)
+			flush |= mmu_spte_update(pspte[c], spte | mask);
+	}
+	spin_unlock(&kvm->mmu_lock);
+	return flush;
+error:
+	spin_unlock(&kvm->mmu_lock);
+	return -ENOENT;
+}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 330bf3a811fb..82246fdc0479 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -204,4 +204,7 @@ void kvm_mmu_gfn_allow_lpage(struct kvm_memory_slot *slot, gfn_t gfn);
 bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
 				    struct kvm_memory_slot *slot, u64 gfn);
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
+u64 kvm_mmu_get_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, gpa_t gpa);
+int kvm_mmu_set_spte(struct kvm *kvm, struct kvm_vcpu *vcpu, gpa_t gpa,
+		     unsigned int r, unsigned int w, unsigned int x);
 #endif
-- 
2.12.2
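
As an illustration of the intended use, a hypothetical in-kernel caller
could write-protect a guest physical page with these helpers (r and x
set, w clear; the flags map to the present/writable/user bits as
implemented above). A sketch, not code from this series:

	static int write_protect_gpa(struct kvm *kvm, struct kvm_vcpu *vcpu,
				     gpa_t gpa)
	{
		int flush = kvm_mmu_set_spte(kvm, vcpu, gpa,
					     1 /* r */, 0 /* w */, 1 /* x */);

		if (flush < 0)
			return flush;	/* -ENOENT: no final-level SPTE */
		if (flush)
			kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
		return 0;
	}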


* [RFC PATCH 02/19] kvm: x86: Add kvm_arch_vcpu_set_regs()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 01/19] kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 03/19] mm: Add vm_replace_page() Adalbert Lazar
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

This is a version of kvm_arch_vcpu_ioctl_set_regs() which does not touch
the exceptions vector.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c       | 34 ++++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h |  1 +
 2 files changed, 35 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87d3cb901935..1a7493982310 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7291,6 +7291,40 @@ int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	return 0;
 }
 
+/*
+ * Similar to kvm_arch_vcpu_ioctl_set_regs() but it does not reset
+ * the exceptions
+ */
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
+{
+	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
+	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
+	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
+	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
+	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
+	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
+	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
+	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
+	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
+#ifdef CONFIG_X86_64
+	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
+	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
+	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
+	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
+	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
+	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
+	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
+	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
+#endif
+
+	kvm_rip_write(vcpu, regs->rip);
+	kvm_set_rflags(vcpu, regs->rflags);
+
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
+}
+
 void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	struct kvm_segment cs;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8c0664309815..48cd2d856132 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -748,6 +748,7 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
+void kvm_arch_vcpu_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs);
 int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 				  struct kvm_sregs *sregs);
 int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
-- 
2.12.2


* [RFC PATCH 03/19] mm: Add vm_replace_page()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 01/19] kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte() Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 02/19] kvm: x86: Add kvm_arch_vcpu_set_regs() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 04/19] kvm: Add kvm_enum() Adalbert Lazar
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

This function is used to get two processes to share a page. It's inspired
by replace_page() from KSM.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b892e95d4929..9cd088ef9d0c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2296,6 +2296,7 @@ int vm_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn, pgprot_t pgprot);
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			pfn_t pfn);
+int vm_replace_page(struct vm_area_struct *vma, struct page *page);
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
 
 
diff --git a/mm/memory.c b/mm/memory.c
index 2e65df1831d9..ae7716ffe6e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1776,6 +1776,75 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+/**
+ * vm_replace_page - given a page-sized VMA, drop the currently
+ *                   referenced page and place the specified one
+ *                   in its stead
+ * @vma: the remote user VMA in which the replace takes place
+ * @page: the page with which we make the replacement
+ */
+int vm_replace_page(struct vm_area_struct *vma, struct page *page)
+{
+	unsigned long mmun_start;
+	unsigned long mmun_end;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	struct page *old_page;
+	pte_t *ptep;
+	spinlock_t *ptl;
+
+	/* Make sure the area is page aligned */
+	if (vma->vm_start % PAGE_SIZE)
+		return -EINVAL;
+
+	/* Make sure the area is page-sized */
+	if ((vma->vm_end - vma->vm_start) != PAGE_SIZE)
+		return -EINVAL;
+
+	old_page = follow_page(vma, vma->vm_start, 0);
+	if (IS_ERR_OR_NULL(old_page))
+		return old_page ? PTR_ERR(old_page) : -ENOENT;
+
+	pmd = mm_find_pmd(mm, vma->vm_start);
+	if (!pmd)
+		return -ENOENT;
+
+	mmun_start = vma->vm_start;
+	mmun_end = mmun_start + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
+	ptep = pte_offset_map_lock(mm, pmd, vma->vm_start, &ptl);
+
+	get_page(page);
+	page_add_anon_rmap(page, vma, vma->vm_start, false);
+
+	flush_cache_page(vma, vma->vm_start, pte_pfn(*ptep));
+	ptep_clear_flush_notify(vma, vma->vm_start, ptep);
+
+	/*
+	 * TODO: Find why we can't do:
+	 *       set_pte_at_notify(mm, vma->vm_start, ptep,
+	 *                         mk_pte(page, vma->vm_page_prot))
+	 */
+	set_pte_at_notify(mm, vma->vm_start, ptep,
+			  mk_pte(page,
+				 __pgprot(_PAGE_PRESENT | _PAGE_RW |
+					  _PAGE_BIT_NX)));
+
+	/* Drop the old page */
+	page_remove_rmap(old_page, false);
+	if (!page_mapped(old_page))
+		try_to_free_swap(old_page);
+	put_page(old_page);
+
+	pte_unmap_unlock(ptep, ptl);
+
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vm_replace_page);
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
2.12.2


* [RFC PATCH 04/19] kvm: Add kvm_enum()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (2 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 03/19] mm: Add vm_replace_page() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 05/19] kvm: Add uuid member in struct kvm + support for KVM_CAP_VM_UUID Adalbert Lazar
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

This is a helper used by the introspection subsystem to find a specific
VM by UUID or to build a list of UUIDs.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 48cd2d856132..88d4e4cbaba5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -549,6 +549,8 @@ static inline void kvm_irqfd_exit(void)
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 		  struct module *module);
 void kvm_exit(void);
+void kvm_enum(int (*enum_cb) (const struct kvm *kvm, void *param),
+	      void *param);
 
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f0fe9d02f6bb..cfd2d1bf8ac4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4046,3 +4046,15 @@ void kvm_exit(void)
 	kvm_vfio_ops_exit();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);
+
+void kvm_enum(int (*enum_cb) (const struct kvm *kvm, void *param), void *param)
+{
+	struct kvm *kvm;
+
+	spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		if (enum_cb(kvm, param))
+			break;
+	}
+	spin_unlock(&kvm_lock);
+}
-- 
2.12.2
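
The callback contract is the usual one: return non-zero to stop the
enumeration. A minimal sketch, in the spirit of the cnt_cb() and
copy_guest_cb() callbacks used by the introspection subsystem in a
later patch:

	static int count_vms_cb(const struct kvm *kvm, void *param)
	{
		unsigned int *count = param;

		(*count)++;
		return 0;	/* keep iterating */
	}

	static unsigned int count_vms(void)
	{
		unsigned int count = 0;

		kvm_enum(count_vms_cb, &count);
		return count;
	}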


* [RFC PATCH 05/19] kvm: Add uuid member in struct kvm + support for KVM_CAP_VM_UUID
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (3 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 04/19] kvm: Add kvm_enum() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 06/19] kvm: Add kvm_vm_shutdown() Adalbert Lazar
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

The introspection subsystem uses UUIDs to identify VMs.

This patch lets QEMU query the KVM_CAP_VM_UUID capability and set the
UUID with KVM_SET_VM_UUID.

The kvm_from_uuid() helper is used to search and 'get' a kvm struct by
UUID, in order to link the VM and the guest introspection tool with a
control socket connection.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 include/linux/kvm_host.h |  4 ++++
 include/uapi/linux/kvm.h |  2 ++
 virt/kvm/kvm_main.c      | 26 ++++++++++++++++++++++++++
 3 files changed, 32 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 88d4e4cbaba5..545964ed6a63 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -27,6 +27,7 @@
 #include <linux/irqbypass.h>
 #include <linux/swait.h>
 #include <linux/refcount.h>
+#include <linux/uuid.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -438,6 +439,8 @@ struct kvm {
 	struct kvm_stat_data **debugfs_stat_data;
 	struct srcu_struct srcu;
 	struct srcu_struct irq_srcu;
+
+	uuid_le uuid;
 };
 
 #define kvm_err(fmt, ...) \
@@ -551,6 +554,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 void kvm_exit(void);
 void kvm_enum(int (*enum_cb) (const struct kvm *kvm, void *param),
 	      void *param);
+struct kvm *kvm_from_uuid(const uuid_le *uuid);
 
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 577429a95ad8..9b5813597d71 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -895,6 +895,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_SPAPR_TCE_VFIO 142
 #define KVM_CAP_X86_GUEST_MWAIT 143
 #define KVM_CAP_ARM_USER_IRQ 144
+#define KVM_CAP_VM_UUID 145
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1318,6 +1319,7 @@ struct kvm_s390_ucas_mapping {
 #define KVM_S390_GET_IRQ_STATE	  _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
 /* Available with KVM_CAP_X86_SMM */
 #define KVM_SMI                   _IO(KVMIO,   0xb7)
+#define KVM_SET_VM_UUID           _IOW(KVMIO, 0xb8, uuid_le)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 #define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cfd2d1bf8ac4..31bcdc92f1ea 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2924,6 +2924,7 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 	case KVM_CAP_IOEVENTFD_ANY_LENGTH:
 	case KVM_CAP_CHECK_EXTENSION_VM:
+	case KVM_CAP_VM_UUID:
 		return 1;
 #ifdef CONFIG_KVM_MMIO
 	case KVM_CAP_COALESCED_MMIO:
@@ -3106,6 +3107,13 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
+	case KVM_SET_VM_UUID:
+		r = -EFAULT;
+		if (copy_from_user(&kvm->uuid, argp, sizeof(kvm->uuid)))
+			goto out;
+
+		r = 0;
+		break;
 	default:
 		r = kvm_arch_vm_ioctl(filp, ioctl, arg);
 	}
@@ -4058,3 +4066,21 @@ void kvm_enum(int (*enum_cb) (const struct kvm *kvm, void *param), void *param)
 	}
 	spin_unlock(&kvm_lock);
 }
+
+/* Make sure to call kvm_put_kvm() when done */
+struct kvm *kvm_from_uuid(const uuid_le *uuid)
+{
+	struct kvm *kvm;
+	struct kvm *found = NULL;
+
+	spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		if (!memcmp(&kvm->uuid, uuid, sizeof(kvm->uuid))) {
+			kvm_get_kvm(kvm);
+			found = kvm;
+			break;
+		}
+	}
+	spin_unlock(&kvm_lock);
+	return found;
+}
-- 
2.12.2
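
From userspace, a VMM could tag its VM right after creating it; a
sketch, where vm_fd is the fd returned by KVM_CREATE_VM and uuid points
to 16 bytes in the uuid_le layout:

	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int set_vm_uuid(int vm_fd, const unsigned char uuid[16])
	{
		if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_UUID) != 1)
			return -1;	/* capability not present */
		return ioctl(vm_fd, KVM_SET_VM_UUID, uuid);
	}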


* [RFC PATCH 06/19] kvm: Add kvm_vm_shutdown()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (4 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 05/19] kvm: Add uuid member in struct kvm + support for KVM_CAP_VM_UUID Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 07/19] kvm: x86: Add kvm_arch_msr_intercept() Adalbert Lazar
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

This function is used by the introspection subsystem to shutdown
a specific VM.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 27 +++++++++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 545964ed6a63..2f00b5c64632 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -555,6 +555,7 @@ void kvm_exit(void);
 void kvm_enum(int (*enum_cb) (const struct kvm *kvm, void *param),
 	      void *param);
 struct kvm *kvm_from_uuid(const uuid_le *uuid);
+void kvm_vm_shutdown(struct kvm *kvm);
 
 void kvm_get_kvm(struct kvm *kvm);
 void kvm_put_kvm(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 31bcdc92f1ea..52d92fcf39ff 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4084,3 +4084,30 @@ struct kvm *kvm_from_uuid(const uuid_le *uuid)
 	spin_unlock(&kvm_lock);
 	return found;
 }
+
+static int kvm_vcpu_kill(int sig, struct kvm_vcpu *vcpu)
+{
+	int err = -ESRCH;
+	struct pid *pid;
+	struct siginfo siginfo[1] = { };
+
+	rcu_read_lock();
+	pid = rcu_dereference(vcpu->pid);
+	if (pid)
+		err = kill_pid_info(sig, siginfo, pid);
+	rcu_read_unlock();
+
+	return err;
+}
+
+void kvm_vm_shutdown(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvm_vcpu_kill(SIGTERM, vcpu);
+	}
+	mutex_unlock(&kvm->lock);
+}
-- 
2.12.2
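
Together with kvm_from_uuid() from the previous patch, serving a
KVMI_SHUTDOWN_GUEST command could then look like this (a sketch; note
that kvm_from_uuid() takes a reference which must be dropped with
kvm_put_kvm()):

	static int shutdown_vm_by_uuid(const uuid_le *uuid)
	{
		struct kvm *kvm = kvm_from_uuid(uuid);

		if (!kvm)
			return -ENOENT;
		kvm_vm_shutdown(kvm);
		kvm_put_kvm(kvm);
		return 0;
	}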


* [RFC PATCH 07/19] kvm: x86: Add kvm_arch_msr_intercept()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (5 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 06/19] kvm: Add kvm_vm_shutdown() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 08/19] kvm: Add the introspection subsystem Adalbert Lazar
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

This function is used by the introspection subsystem to enable/disable
MSR interception.

The patch adds back the __vmx_enable_intercept_for_msr() function
removed by commit 40d8338d095e ("KVM: VMX: remove functions that
enable msr intercepts").

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/svm.c              |  7 ++++++
 arch/x86/kvm/vmx.c              | 52 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  6 +++++
 4 files changed, 67 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 695605eb1dfb..ff94a3512347 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1048,6 +1048,7 @@ struct kvm_x86_ops {
 	void (*cancel_hv_timer)(struct kvm_vcpu *vcpu);
 
 	void (*setup_mce)(struct kvm_vcpu *vcpu);
+	void (*msr_intercept)(unsigned int msr, bool enable);
 };
 
 struct kvm_arch_async_pf {
@@ -1429,4 +1430,5 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #endif
 }
 
+void kvm_arch_msr_intercept(unsigned int msr, bool enable);
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ba9891ac5c56..7f1b00b74199 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5262,6 +5262,11 @@ static void svm_setup_mce(struct kvm_vcpu *vcpu)
 	vcpu->arch.mcg_cap &= 0x1ff;
 }
 
+static void svm_msr_intercept(unsigned int msr, bool enable)
+{
+	set_msr_interception(msrpm, msr, enable, enable);
+}
+
 static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -5374,6 +5379,8 @@ static struct kvm_x86_ops svm_x86_ops __ro_after_init = {
 	.deliver_posted_interrupt = svm_deliver_avic_intr,
 	.update_pi_irte = svm_update_pi_irte,
 	.setup_mce = svm_setup_mce,
+
+	.msr_intercept = svm_msr_intercept,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index ca5d2b93385c..7a594cfcb2ea 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -11457,6 +11457,56 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 			~FEATURE_CONTROL_LMCE;
 }
 
+static void __vmx_enable_intercept_for_msr(unsigned long *msr_bitmap,
+						u32 msr, int type)
+{
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap())
+		return;
+
+	/*
+	 * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals
+	 * have the write-low and read-high bitmap offsets the wrong way round.
+	 * We can control MSRs 0x00000000-0x00001fff and 0xc0000000-0xc0001fff.
+	 */
+	if (msr <= 0x1fff) {
+		if (type & MSR_TYPE_R)
+			/* read-low */
+			__set_bit(msr, msr_bitmap + 0x000 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-low */
+			__set_bit(msr, msr_bitmap + 0x800 / f);
+
+	} else if ((msr >= 0xc0000000) && (msr <= 0xc0001fff)) {
+		msr &= 0x1fff;
+		if (type & MSR_TYPE_R)
+			/* read-high */
+			__set_bit(msr, msr_bitmap + 0x400 / f);
+
+		if (type & MSR_TYPE_W)
+			/* write-high */
+			__set_bit(msr, msr_bitmap + 0xc00 / f);
+
+	}
+}
+
+static void vmx_msr_intercept(unsigned int msr, bool enabled)
+{
+	if (enabled) {
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_longmode, msr,
+					       MSR_TYPE_W);
+		__vmx_enable_intercept_for_msr(vmx_msr_bitmap_legacy, msr,
+					       MSR_TYPE_W);
+	} else {
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_legacy, msr,
+						MSR_TYPE_W);
+		__vmx_disable_intercept_for_msr(vmx_msr_bitmap_longmode, msr,
+						MSR_TYPE_W);
+	}
+}
+
 static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -11584,6 +11634,8 @@ static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 #endif
 
 	.setup_mce = vmx_setup_mce,
+
+	.msr_intercept = vmx_msr_intercept,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1a7493982310..9a47f640a7b5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8734,6 +8734,12 @@ bool kvm_vector_hashing_enabled(void)
 }
 EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);
 
+void kvm_arch_msr_intercept(unsigned int msr, bool enable)
+{
+	kvm_x86_ops->msr_intercept(msr, enable);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
-- 
2.12.2
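
For example, the introspection subsystem could start reporting writes
to one of the syscall MSRs (a sketch; the real caller, wired to
KVMI_MSR_CONTROL, arrives in the next patch):

	/* intercept writes to MSR_LSTAR so KVMI_EVENT_MSR fires */
	kvm_arch_msr_intercept(MSR_LSTAR, true);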


* [RFC PATCH 08/19] kvm: Add the introspection subsystem
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (6 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 07/19] kvm: x86: Add kvm_arch_msr_intercept() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-21 11:54   ` Paolo Bonzini
  2017-06-16 13:43 ` [RFC PATCH 09/19] kvm: Hook in kvmi on VM on/off events Adalbert Lazar
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

On kvmi_init(), a workqueue job is added to listen on a vsock port.
Connections from the guest introspection tool are handled by
accept_socket_cb(), which should check that the other end is indeed the
introspection tool. Based on the first 16 bytes (UUID) read from the
socket, the connection is used either as the main connection (if
UUID=={0}) or as a VM connection, and a workqueue job is added to wait
for introspection commands.

With the exception of the first 16 bytes sent by the introspection
tool, and another 16 bytes sent on the main connection (the UUID of the
guest introspection tool), all messages between the introspection
subsystem and the guest introspection tool share a common header:
	struct kvmi_socket_hdr {
		__u16 msg_id;
		__u16 size; /* msg. data following this hdr */
		__u32 seq;  /* maintained by the sending party */
	};
With the exception of KVMI_EVENT_GUEST_ON and KVMI_EVENT_GUEST_OFF,
every message must have a reply. The reply carries the same sequence
number, with KVMI_FLAG_RESPONSE OR-ed into msg_id.

Because the introspection commands are received on a different thread,
the VCPU threads have to be signaled with a
	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
call, which in turn invokes kvmi_handle_controller_request() to handle
the request: REQ_PAUSE, REQ_RESUME, REQ_CMD (an introspection command,
dispatched via guest_responses[]), REQ_REPLY (a reply from the guest
introspection tool, signaled from handle_event_reply()) or REQ_CLOSE
(socket closed, uninit).

The introspection subsystem needs VHOST_VSOCK, but only KVM_INTEL
depends on it for now. Moving this subsystem into a separate module
would probably be better.
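
As a concrete illustration of the reply convention, either side could
match a reply to the request it answers like this (a sketch, not code
from this patch):

	static bool kvmi_is_reply_for(const struct kvmi_socket_hdr *req,
				      const struct kvmi_socket_hdr *rsp)
	{
		return rsp->seq == req->seq &&
		       rsp->msg_id == (req->msg_id | KVMI_FLAG_RESPONSE);
	}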

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
Signed-off-by: Adalbert Lazar <alazar@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |    3 +
 arch/x86/kvm/Kconfig            |    2 +
 arch/x86/kvm/Makefile           |    1 +
 include/linux/kvm_host.h        |   28 +
 include/uapi/linux/kvmi.h       |  263 +++++
 virt/kvm/kvm_main.c             |   13 +
 virt/kvm/kvmi.c                 | 2252 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvmi.h                 |   42 +
 virt/kvm/kvmi_socket.c          |  412 +++++++
 virt/kvm/kvmi_socket.h          |   33 +
 10 files changed, 3049 insertions(+)
 create mode 100644 include/uapi/linux/kvmi.h
 create mode 100644 virt/kvm/kvmi.c
 create mode 100644 virt/kvm/kvmi.h
 create mode 100644 virt/kvm/kvmi_socket.c
 create mode 100644 virt/kvm/kvmi_socket.h

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ff94a3512347..40d1ee68474a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -70,6 +70,7 @@
 #define KVM_REQ_HV_RESET          28
 #define KVM_REQ_HV_EXIT           29
 #define KVM_REQ_HV_STIMER         30
+#define KVM_REQ_INTROSPECTION     31
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -678,6 +679,8 @@ struct kvm_vcpu_arch {
 
 	/* GPA available (AMD only) */
 	bool gpa_available;
+
+	atomic_t next_interrupt_enabled;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 760433b2574a..a84f9de2e4b0 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -60,6 +60,8 @@ config KVM
 config KVM_INTEL
 	tristate "KVM for Intel processors support"
 	depends on KVM
+	# for kvmi
+	depends on VHOST_VSOCK
 	# for perf_guest_get_msrs():
 	depends on CPU_SUP_INTEL
 	---help---
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 09d4b17be022..aee76a0a74fb 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -8,6 +8,7 @@ CFLAGS_vmx.o := -I.
 KVM := ../../../virt/kvm
 
 kvm-y			+= $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
+				$(KVM)/kvmi_socket.o $(KVM)/kvmi.o \
 				$(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
 kvm-$(CONFIG_KVM_ASYNC_PF)	+= $(KVM)/async_pf.o
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2f00b5c64632..aeda9e1d7a45 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -28,6 +28,8 @@
 #include <linux/swait.h>
 #include <linux/refcount.h>
 #include <linux/uuid.h>
+#include <linux/mutex.h>
+#include <linux/radix-tree.h>
 #include <asm/signal.h>
 
 #include <linux/kvm.h>
@@ -268,6 +270,19 @@ struct kvm_vcpu {
 	bool preempted;
 	struct kvm_vcpu_arch arch;
 	struct dentry *debugfs_dentry;
+	size_t pause_count;
+	u8 ctx_data[256];
+	u32 ctx_size;
+	u32 ctx_pos;
+	struct semaphore sock_sem;
+	unsigned long sem_requests;
+	u8 sock_cmd_buf[960];
+	void *sock_cmd_ctx;
+	void *sock_rsp_buf;
+	size_t sock_rsp_size;
+	size_t sock_rsp_received;
+	u32 sock_rsp_seq;
+	bool sock_rsp_waiting;
 };
 
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -441,6 +456,19 @@ struct kvm {
 	struct srcu_struct irq_srcu;
 
 	uuid_le uuid;
+	atomic_t event_mask;
+	unsigned long cr_mask;
+	struct {
+		unsigned long low[BITS_TO_LONGS(8192)];
+		unsigned long hypervisor[BITS_TO_LONGS(8192)];
+		unsigned long high[BITS_TO_LONGS(8192)];
+	} msr_mask;
+	unsigned long introduced;
+	struct radix_tree_root access_tree;
+	struct mutex access_tree_lock;
+	struct list_head access_list;
+	void *socket_ctx;
+	rwlock_t socket_ctx_lock;
 };
 
 #define kvm_err(fmt, ...) \
diff --git a/include/uapi/linux/kvmi.h b/include/uapi/linux/kvmi.h
new file mode 100644
index 000000000000..c823b937bd4e
--- /dev/null
+++ b/include/uapi/linux/kvmi.h
@@ -0,0 +1,263 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with the GNU C Library; if not, see
+ * <http://www.gnu.org/licenses/>
+ */
+#ifndef __KVMI_H_INCLUDED__
+#define __KVMI_H_INCLUDED__
+
+#include "asm/kvm.h"
+
+#define KVMI_VERSION 0x00000001
+
+#define KVMI_EVENT_CR         (1 << 1)	/* control register was modified */
+#define KVMI_EVENT_MSR        (1 << 2)	/* model specific reg. was modified */
+#define KVMI_EVENT_XSETBV     (1 << 3)	/* ext. control register was modified */
+#define KVMI_EVENT_BREAKPOINT (1 << 4)	/* breakpoint was reached */
+#define KVMI_EVENT_USER_CALL  (1 << 5)	/* user hypercall */
+#define KVMI_EVENT_PAGE_FAULT (1 << 6)	/* hyp. page fault was encountered */
+#define KVMI_EVENT_TRAP       (1 << 7)	/* trap was injected */
+#define KVMI_EVENT_SET_CTX    (1 << 28)	/* set the emulation context */
+#define KVMI_EVENT_NOEMU      (1 << 29)	/* return to guest without emulation */
+#define KVMI_EVENT_SET_REGS   (1 << 30)	/* registers need to be written back */
+#define KVMI_EVENT_ALLOW      (1 << 31)	/* used in replies */
+
+#define KVMI_KNOWN_EVENTS (KVMI_EVENT_CR | \
+			   KVMI_EVENT_MSR | \
+			   KVMI_EVENT_XSETBV | \
+			   KVMI_EVENT_BREAKPOINT | \
+			   KVMI_EVENT_USER_CALL | \
+			   KVMI_EVENT_PAGE_FAULT | \
+			   KVMI_EVENT_TRAP)
+
+#define KVMI_FLAG_RESPONSE 0x8000
+
+#define KVMI_GET_VERSION                  1
+#define KVMI_GET_GUESTS                   2
+#define KVMI_GET_GUEST_INFO               3
+#define KVMI_PAUSE_GUEST                  4
+#define KVMI_UNPAUSE_GUEST                5
+#define KVMI_GET_REGISTERS                6
+#define KVMI_SET_REGISTERS                7
+#define KVMI_SHUTDOWN_GUEST               8
+#define KVMI_GET_MTRR_TYPE                9
+#define KVMI_GET_MTRRS                    10
+#define KVMI_GET_XSAVE_INFO               11
+#define KVMI_GET_PAGE_ACCESS              12
+#define KVMI_SET_PAGE_ACCESS              13
+#define KVMI_INJECT_PAGE_FAULT            14
+#define KVMI_READ_PHYSICAL                15
+#define KVMI_WRITE_PHYSICAL               16
+#define KVMI_MAP_PHYSICAL_PAGE_TO_SVA     17
+#define KVMI_UNMAP_PHYSICAL_PAGE_FROM_SVA 18
+#define KVMI_EVENT_CONTROL                19
+#define KVMI_CR_CONTROL                   20
+#define KVMI_MSR_CONTROL                  21
+#define KVMI_INJECT_BREAKPOINT            22
+#define KVMI_EVENT_GUEST_ON               23
+#define KVMI_EVENT_GUEST_OFF              24
+#define KVMI_EVENT_VCPU                   25
+#define KVMI_REPLY_EVENT_VCPU             26
+
+struct kvmi_socket_hdr {
+	__u16 msg_id;
+	__u16 size;
+	__u32 seq;
+};
+
+struct kvmi_event_reply {
+	struct kvm_regs regs;
+	__u64 new_val;
+	__u32 event;
+	__u32 padding1;
+	__u8 ctx_data[256];
+	__u32 ctx_size;
+	__u32 padding2;
+};
+
+struct kvmi_guest {
+	__u8 uuid[16];
+};
+
+struct kvmi_guests {
+	__u32 size;		/* in: the size of the entire structure */
+	struct kvmi_guest guests[1];
+};
+
+struct kvmi_event_cr {
+	__u16 cr;
+	__u16 padding1;
+	__u32 padding2;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_msr {
+	__u32 msr;
+	__u32 padding;
+	__u64 old_value;
+	__u64 new_value;
+};
+
+struct kvmi_event_xsetbv {
+	__u64 xcr0;
+};
+
+struct kvmi_event_breakpoint {
+	__u64 gpa;
+};
+
+struct kvmi_event_page_fault {
+	__u64 gva;
+	__u64 gpa;
+	__u32 mode;
+	__u32 padding;
+};
+
+struct kvmi_event_trap {
+	__u32 vector;
+	__u32 type;
+	__u32 err;
+	__u32 padding;
+	__u64 cr2;
+};
+
+struct kvmi_event {
+	__u16 vcpu;
+	__u8 mode;		/* 2, 4 or 8 */
+	__u8 padding1;
+	__u32 event;
+	struct kvm_regs regs;	/* in/out */
+	struct kvm_sregs sregs;	/* in */
+	struct {
+		__u64 sysenter_cs;
+		__u64 sysenter_esp;
+		__u64 sysenter_eip;
+		__u64 efer;
+		__u64 star;
+		__u64 lstar;
+	} msrs;
+	union {
+		struct kvmi_event_cr cr;
+		struct kvmi_event_msr msr;
+		struct kvmi_event_xsetbv xsetbv;
+		struct kvmi_event_breakpoint breakpoint;
+		struct kvmi_event_page_fault page_fault;
+		struct kvmi_event_trap trap;
+	};			/* out */
+};
+
+struct kvmi_event_control {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 events;
+};
+
+struct kvmi_cr_control {
+	__u8 enable;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 cr;
+};
+
+struct kvmi_msr_control {
+	__u8 enable;
+	__u8 padding1;
+	__u16 padding2;
+	__u32 msr;
+};
+
+struct kvmi_page_access {
+	__u16 vcpu;
+	__u16 padding;
+	int err;
+	__u64 gpa;
+	__u64 access;
+};
+
+struct kvmi_mtrr_type {
+	int err;
+	__u32 padding;
+	__u64 gpa;
+	__u64 type;
+};
+
+struct kvmi_mtrrs {
+	__u16 vcpu;
+	__u16 padding;
+	int err;
+	__u64 pat;
+	__u64 cap;
+	__u64 type;
+};
+
+struct kvmi_guest_info {
+	__u16 vcpu_count;
+	__u16 padding1;
+	__u32 padding2;
+	__u64 tsc_speed;
+};
+
+struct kvmi_xsave_info {
+	__u16 vcpu;
+	__u16 padding;
+	int err;
+	__u64 size;
+};
+
+struct kvmi_page_fault {
+	__u16 vcpu;
+	__u16 padding;
+	__u32 error;
+	__u64 gva;
+};
+
+struct kvmi_rw_physical_info {
+	__u64 gpa;
+	__u64 buffer;
+	__u64 size;
+};
+
+struct kvmi_map_physical_to_sva_info {
+	__u64 gpa_src;
+	__u64 gfn_dest;
+};
+
+struct kvmi_unmap_physical_from_sva_info {
+	__u64 gfn_dest;
+};
+
+struct kvmi_get_registers {
+	__u16 vcpu;
+	__u16 nmsrs;
+	__u32 msrs_idx[0];
+};
+
+struct kvmi_get_registers_r {
+	int err;
+	__u32 mode;
+	struct kvm_regs regs;
+	struct kvm_sregs sregs;
+	struct kvm_msrs msrs;
+};
+
+struct kvmi_set_registers {
+	__u16 vcpu;
+	__u16 padding1;
+	__u32 padding2;
+	struct kvm_regs regs;
+};
+
+#endif /* __KVMI_H_INCLUDED__ */
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52d92fcf39ff..c819b6b0a36e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -61,6 +61,7 @@
 #include "coalesced_mmio.h"
 #include "async_pf.h"
 #include "vfio.h"
+#include "kvmi.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kvm.h>
@@ -279,6 +280,8 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
 
+	sema_init(&vcpu->sock_sem, 0);
+
 	r = kvm_arch_vcpu_init(vcpu);
 	if (r < 0)
 		goto fail_free_run;
@@ -690,6 +693,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
 
 	preempt_notifier_inc();
 
+	INIT_LIST_HEAD(&kvm->access_list);
+	mutex_init(&kvm->access_tree_lock);
+	rwlock_init(&kvm->socket_ctx_lock);
+	INIT_RADIX_TREE(&kvm->access_tree, GFP_KERNEL);
+
 	return kvm;
 
 out_err:
@@ -728,6 +736,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	int i;
 	struct mm_struct *mm = kvm->mm;
 
+	mutex_destroy(&kvm->access_tree_lock);
 	kvm_destroy_vm_debugfs(kvm);
 	kvm_arch_sync_events(kvm);
 	spin_lock(&kvm_lock);
@@ -4011,6 +4020,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 	r = kvm_vfio_ops_init();
 	WARN_ON(r);
 
+	r = kvmi_init();
+	WARN_ON(r);
+
 	return 0;
 
 out_undebugfs:
@@ -4039,6 +4051,7 @@ EXPORT_SYMBOL_GPL(kvm_init);
 
 void kvm_exit(void)
 {
+	kvmi_uninit();
 	debugfs_remove_recursive(kvm_debugfs_dir);
 	misc_deregister(&kvm_dev);
 	kmem_cache_destroy(kvm_vcpu_cache);
diff --git a/virt/kvm/kvmi.c b/virt/kvm/kvmi.c
new file mode 100644
index 000000000000..6b34e0fe06df
--- /dev/null
+++ b/virt/kvm/kvmi.c
@@ -0,0 +1,2252 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with the GNU C Library; if not, see
+ * <http://www.gnu.org/licenses/>
+ */
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/list.h>
+#include <linux/uuid.h>
+#include <linux/poll.h>
+#include <linux/vmalloc.h>
+#include <linux/anon_inodes.h>
+#include <linux/uaccess.h>
+#include <asm/pgtable_types.h>
+#include <linux/mmu_context.h>
+#include <uapi/linux/kvmi.h>
+#include <linux/uuid.h>
+#include <linux/hashtable.h>
+#include <linux/kconfig.h>
+#include "../../arch/x86/kvm/x86.h"
+#include "../../arch/x86/kvm/mmu.h"
+#include <net/sock.h>
+#include <net/af_vsock.h>
+#include "kvmi_socket.h"
+
+struct kvmi_mem_access {
+	struct list_head link;
+	gfn_t gfn;
+	unsigned int access;
+};
+
+struct kvm_enum_param {
+	unsigned int k;
+	unsigned int n;
+	struct kvmi_guests *guests;
+};
+
+struct resp_info {
+	size_t to_read;
+	int vcpu_req;
+	int (*cb)(void *s, struct kvm *, struct kvmi_socket_hdr *req,
+		  void *i);
+};
+
+struct ev_recv {
+	struct hlist_node list;
+	struct completion ready;
+	struct kvmi_socket_hdr h;
+	void *buf;
+	size_t buf_size;
+	bool processing;
+	bool received;
+};
+
+static bool accept_socket_cb(void *ctx, kvmi_socket_read_cb read_cb,
+			     void *cb_ctx);
+static bool consume_bytes_from_socket(size_t n, kvmi_socket_read_cb read_cb,
+				      void *ctx);
+static bool guest_recv_cb(void *ctx, kvmi_socket_read_cb read_cb, void *cb_ctx);
+static bool main_recv_cb(void *ctx, kvmi_socket_read_cb read_cb, void *cb_ctx);
+static bool send_vcpu_event_and_wait(struct kvm_vcpu *vcpu, void *ev,
+				     size_t ev_size, void *resp,
+				     size_t resp_size);
+static const char *id2str(int i);
+static int cnt_cb(const struct kvm *kvm, void *p);
+static int connect_handler_if_missing(void *s, struct kvm *kvm,
+				      kvmi_socket_use_cb recv_cb);
+static int copy_guest_cb(const struct kvm *kvm, void *param);
+static int get_msr_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int get_mttr_memory_type_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int get_page_info_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int get_registers_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int get_tsc_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int get_vcpu(struct kvm *kvm, int vcpu_id, struct kvm_vcpu **vcpu);
+static int get_xstate_size_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int inject_breakpoint_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int inject_pf_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int query_locked_vcpu(struct kvm *kvm, int vcpu_id,
+			     int (*cb)(struct kvm_vcpu *, void *), void *ctx);
+static int query_paused_vcpu(struct kvm *kvm, int vcpu_id,
+			     int (*cb)(struct kvm_vcpu *, void *), void *ctx);
+static int query_paused_vm(struct kvm *kvm, int (*cb) (struct kvm *, void *),
+			   void *ctx);
+static int respond_cr_control(void *s, struct kvm *kvm,
+			      struct kvmi_socket_hdr *req, void *_i);
+static int respond_event_control(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *_i);
+static int respond_get_guest_info(void *s, struct kvm *kvm,
+				  struct kvmi_socket_hdr *req, void *i);
+static int respond_get_guests(void *s, struct kvmi_socket_hdr *req);
+static int respond_get_mtrr_type(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *i);
+static int respond_get_mtrrs(void *s, struct kvm *kvm,
+			     struct kvmi_socket_hdr *req, void *i);
+static int respond_get_page_access(void *s, struct kvm *kvm,
+				   struct kvmi_socket_hdr *req, void *_i);
+static int respond_get_registers(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *i);
+static int respond_get_version(void *s, struct kvm *kvm,
+			       struct kvmi_socket_hdr *req, void *i);
+static int respond_get_xsave_info(void *s, struct kvm *kvm,
+				  struct kvmi_socket_hdr *req, void *i);
+static int respond_inject_breakpoint(void *s, struct kvm *kvm,
+				     struct kvmi_socket_hdr *req, void *_i);
+static int respond_inject_page_fault(void *s, struct kvm *kvm,
+				     struct kvmi_socket_hdr *req, void *_i);
+static int respond_map_physical_page_to_sva(void *s, struct kvm *kvm,
+					    struct kvmi_socket_hdr *req,
+					    void *_i);
+static int respond_unmap_physical_page_from_sva(void *s, struct kvm *kvm,
+						struct kvmi_socket_hdr *req,
+						void *_i);
+static int respond_msr_control(void *s, struct kvm *kvm,
+			       struct kvmi_socket_hdr *req, void *_i);
+static int respond_pause_guest(void *s, struct kvm *kvm,
+			       struct kvmi_socket_hdr *req, void *i);
+static int respond_read_physical(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *_i);
+static int respond_set_page_access(void *s, struct kvm *kvm,
+				   struct kvmi_socket_hdr *req, void *_i);
+static int respond_set_registers(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *i);
+static int respond_shutdown_guest(void *s, struct kvm *kvm,
+				  struct kvmi_socket_hdr *req, void *i);
+static int respond_to_request(void *s, struct kvmi_socket_hdr *req, void *buf,
+			      size_t size);
+static int respond_to_request_buf(void *s, struct kvmi_socket_hdr *req,
+				  const void *buf, size_t size);
+static int respond_unpause_guest(void *s, struct kvm *kvm,
+				 struct kvmi_socket_hdr *req, void *i);
+static int respond_with_error_code(void *s, int err, struct kvmi_socket_hdr *h);
+static int respond_write_physical(void *s, struct kvm *kvm,
+				  struct kvmi_socket_hdr *req, void *_i);
+static int send_async_event_to_socket(struct kvm *kvm, struct kvec *i, size_t n,
+				      size_t bytes);
+static int set_cr_control(struct kvm *kvm, void *ctx);
+static int set_msr_control(struct kvm *kvm, void *ctx);
+static int set_page_info_cb(struct kvm_vcpu *vcpu, void *ctx);
+static int set_registers_cb(struct kvm_vcpu *vcpu, void *ctx);
+static u32 new_seq(void);
+static void __release_kvm_socket(struct kvm *kvm);
+static void send_event(struct kvm *kvm, int msg_id, void *data, size_t size);
+static void wakeup_events(struct kvm *kvm);
+
+static struct kvm dummy;
+static struct kvm *sva;
+static atomic_t seq_ev = ATOMIC_INIT(0);
+static struct resp_info guest_responses[] = {
+	{0, 0, NULL},
+	{0, 0, respond_get_version},
+	{0, 0, NULL},		/* KVMI_GET_GUESTS */
+	{0, 2, respond_get_guest_info},
+	{0, 0, respond_pause_guest},
+	{0, 0, respond_unpause_guest},
+	{-1, 1, respond_get_registers},
+	{sizeof(struct kvmi_set_registers), 1, respond_set_registers},
+	{0, 0, respond_shutdown_guest},
+	{sizeof(__u64), 2, respond_get_mtrr_type},
+	{sizeof(__u16), 1, respond_get_mtrrs},
+	{sizeof(__u16), 1, respond_get_xsave_info},
+	{sizeof(struct kvmi_page_access), 1, respond_get_page_access},
+	{sizeof(struct kvmi_page_access), 1, respond_set_page_access},
+	{sizeof(struct kvmi_page_fault), 1, respond_inject_page_fault},
+	{sizeof(struct kvmi_rw_physical_info), 0, respond_read_physical},
+	{-1, 0, respond_write_physical},	/* TODO: avoid kalloc+memcpy */
+	{sizeof(struct kvmi_map_physical_to_sva_info), 0,
+	 respond_map_physical_page_to_sva},
+	{sizeof(struct kvmi_unmap_physical_from_sva_info), 0,
+	 respond_unmap_physical_page_from_sva},
+	{sizeof(struct kvmi_event_control), 1, respond_event_control},
+	{sizeof(struct kvmi_cr_control), 0, respond_cr_control},
+	{sizeof(struct kvmi_msr_control), 0, respond_msr_control},
+	{sizeof(__u16), 1, respond_inject_breakpoint},
+};
+
+static char *IDs[] = {
+	"KVMI_NULL???",
+	"KVMI_GET_VERSION",
+	"KVMI_GET_GUESTS",
+	"KVMI_GET_GUEST_INFO",
+	"KVMI_PAUSE_GUEST",
+	"KVMI_UNPAUSE_GUEST",
+	"KVMI_GET_REGISTERS",
+	"KVMI_SET_REGISTERS",
+	"KVMI_SHUTDOWN_GUEST",
+	"KVMI_GET_MTRR_TYPE",
+	"KVMI_GET_MTRRS",
+	"KVMI_GET_XSAVE_INFO",
+	"KVMI_GET_PAGE_ACCESS",
+	"KVMI_SET_PAGE_ACCESS",
+	"KVMI_INJECT_PAGE_FAULT",
+	"KVMI_READ_PHYSICAL",
+	"KVMI_WRITE_PHYSICAL",
+	"KVMI_MAP_PHYSICAL_PAGE_TO_SVA",
+	"KVMI_UNMAP_PHYSICAL_PAGE_TO_SVA",
+	"KVMI_EVENT_CONTROL",
+	"KVMI_CR_CONTROL",
+	"KVMI_MSR_CONTROL",
+	"KVMI_INJECT_BREAKPOINT",
+	"KVMI_EVENT_GUEST_ON",
+	"KVMI_EVENT_GUEST_OFF",
+	"KVMI_EVENT_VCPU",
+	"KVMI_REPLY_EVENT_VCPU",
+};
+
+#define REQ_PAUSE  0
+#define REQ_RESUME 1
+#define REQ_CMD    2
+#define REQ_REPLY  3
+#define REQ_CLOSE  4
+
+static void set_sem_req(int req, struct kvm_vcpu *vcpu)
+{
+	set_bit(req, &vcpu->sem_requests);
+	/* Make sure the bit is set when the worker wakes up */
+	smp_wmb();
+	up(&vcpu->sock_sem);
+}
+
+static void clear_sem_req(int req, struct kvm_vcpu *vcpu)
+{
+	clear_bit(req, &vcpu->sem_requests);
+}
+
+static int vm_pause(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		size_t cnt = READ_ONCE(vcpu->pause_count);
+
+		WRITE_ONCE(vcpu->pause_count, cnt + 1);
+		if (!cnt) {
+			set_sem_req(REQ_PAUSE, vcpu);
+			kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+			kvm_vcpu_kick(vcpu);
+			while (test_bit(REQ_PAUSE, &vcpu->sem_requests))
+				;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+	return 0;
+}
+
+static int vm_resume(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		size_t cnt = READ_ONCE(vcpu->pause_count);
+
+		WARN_ON(cnt == 0);
+		WRITE_ONCE(vcpu->pause_count, cnt - 1);
+		if (cnt == 1) {
+			set_sem_req(REQ_RESUME, vcpu);
+			while (test_bit(REQ_RESUME, &vcpu->sem_requests))
+				;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+	return 0;
+}
+
+static int kvmi_set_mem_access(struct kvm *kvm, unsigned long gpa,
+			       unsigned int access)
+{
+	struct kvmi_mem_access *m;
+	struct kvmi_mem_access *__m;
+
+	m = kzalloc(sizeof(struct kvmi_mem_access), GFP_KERNEL);
+	if (!m)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&m->link);
+	m->gfn = gpa_to_gfn(gpa);
+	m->access = access;
+
+	mutex_lock(&kvm->access_tree_lock);
+	__m = radix_tree_lookup(&kvm->access_tree, m->gfn);
+	if (__m) {
+		__m->access = m->access;
+		if (list_empty(&__m->link))
+			list_add_tail(&__m->link, &kvm->access_list);
+	} else {
+		radix_tree_insert(&kvm->access_tree, m->gfn, m);
+		list_add_tail(&m->link, &kvm->access_list);
+		m = NULL;
+	}
+	mutex_unlock(&kvm->access_tree_lock);
+
+	kfree(m);
+
+	return 0;
+}
+
+static bool kvmi_test_mem_access(struct kvm *kvm, unsigned long gpa,
+				 unsigned int exception_flags)
+{
+	struct kvmi_mem_access *m;
+	bool report = false;
+
+	mutex_lock(&kvm->access_tree_lock);
+	m = radix_tree_lookup(&kvm->access_tree, gpa_to_gfn(gpa));
+	mutex_unlock(&kvm->access_tree_lock);
+
+	if (m) {
+		bool missing_ept_paging_structs =
+		    (((exception_flags >> 3) & 7) == 0);
+		report = !missing_ept_paging_structs;
+	}
+
+	return report;
+}
+
+static void kvmi_apply_mem_access(struct kvm_vcpu *vcpu, gfn_t gfn,
+				  unsigned int access)
+{
+	int err;
+	gpa_t gpa = gfn << PAGE_SHIFT;
+	struct kvm *kvm = vcpu->kvm;
+
+	err = kvm_mmu_set_spte(kvm, vcpu, gpa,
+			       access & 1, access & 2, access & 4);
+	if (err < 0) {
+		u32 error_code = PFERR_PRESENT_MASK;
+
+		/* The entry is not present. Tell the MMU to create it */
+		err = vcpu->arch.mmu.page_fault(vcpu, gpa, error_code, false);
+
+		if (!err) {
+			err = kvm_mmu_set_spte(kvm, vcpu, gpa,
+					       access & 1,
+					       access & 2, access & 4);
+		}
+
+		if (err < 0)
+			kvm_err("%s: page_fault: %d (gpa:%llX)\n", __func__,
+				err, gpa);
+	}
+
+	if (err > 0)
+		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
+}
+
+void kvmi_flush_mem_access(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	mutex_lock(&kvm->access_tree_lock);
+	while (!list_empty(&kvm->access_list)) {
+		struct kvmi_mem_access *m =
+		    list_first_entry(&kvm->access_list, struct kvmi_mem_access,
+				     link);
+
+		list_del(&m->link);
+		INIT_LIST_HEAD(&m->link);
+
+		kvmi_apply_mem_access(vcpu, m->gfn, m->access);
+	}
+	mutex_unlock(&kvm->access_tree_lock);
+}
+
+static void kvmi_free_mem_access(struct kvm *kvm)
+{
+	void **slot;
+	struct radix_tree_iter iter;
+
+	radix_tree_for_each_slot(slot, &kvm->access_tree, &iter, 0) {
+		struct kvmi_mem_access *m = *slot;
+
+		radix_tree_delete(&kvm->access_tree, m->gfn);
+		kfree(m);
+	}
+}
+
+static unsigned long *msr_mask(struct kvm *kvm, unsigned int *msr)
+{
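+	/*
+	 * Three 0x2000-bit bitmaps cover the MSR ranges of interest:
+	 * 0x00000000-0x00001fff, 0x40000000-0x40001fff (hypervisor) and
+	 * 0xc0000000-0xc0001fff.
+	 */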
+	switch (*msr) {
+	case 0 ... 0x1fff:
+		return kvm->msr_mask.low;
+	case 0x40000000 ... 0x40001fff:
+		*msr &= 0x1fff;
+		return kvm->msr_mask.hypervisor;
+	case 0xc0000000 ... 0xc0001fff:
+		*msr &= 0x1fff;
+		return kvm->msr_mask.high;
+	}
+	return NULL;
+}
+
+static int msr_control(struct kvm *kvm, unsigned int msr, bool enable)
+{
+	unsigned long *mask = msr_mask(kvm, &msr);
+
+	if (!mask)
+		return -EINVAL;
+	if (enable)
+		set_bit(msr, mask);
+	else
+		clear_bit(msr, mask);
+	return 0;
+}
+
+static void kvmi_cleanup(struct kvm *kvm)
+{
+	write_lock(&kvm->socket_ctx_lock);
+	__release_kvm_socket(kvm);
+	write_unlock(&kvm->socket_ctx_lock);
+
+	kvmi_free_mem_access(kvm);
+	kvm->introduced = 0;
+	/* TODO */
+	smp_wmb();
+}
+
+static unsigned int kvmi_vcpu_mode(const struct kvm_vcpu *vcpu,
+				   const struct kvm_sregs *sregs)
+{
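+	/* CS operand size in bytes: 2 (16-bit), 4 (32-bit) or 8 (64-bit) */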
+	unsigned int mode = 0;
+
+	if (is_long_mode((struct kvm_vcpu *) vcpu)) {
+		if (sregs->cs.l)
+			mode = 8;
+		else if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (sregs->cr0 & X86_CR0_PE) {
+		if (!sregs->cs.db)
+			mode = 2;
+		else
+			mode = 4;
+	} else if (!sregs->cs.db)
+		mode = 2;
+	else
+		mode = 4;
+
+	return mode;
+}
+
+int kvmi_init(void)
+{
+	rwlock_init(&dummy.socket_ctx_lock);
+	dummy.introduced = 1;
+
+	/* TODO: change ANY to a specific CID */
+	return kvmi_socket_start_vsock(VMADDR_CID_ANY, 1234, accept_socket_cb,
+				       &dummy);
+}
+
+void kvmi_uninit(void)
+{
+	dummy.introduced = 0;
+
+	__release_kvm_socket(&dummy);
+	kvmi_socket_stop();
+}
+
+void kvmi_vm_powered_on(struct kvm *kvm)
+{
+	if (sva)
+		send_event(&dummy, KVMI_EVENT_GUEST_ON, &kvm->uuid,
+			   sizeof(kvm->uuid));
+}
+
+void kvmi_vm_powered_off(struct kvm *kvm)
+{
+	if (sva && kvm != sva)
+		send_event(&dummy, KVMI_EVENT_GUEST_OFF, &kvm->uuid,
+			   sizeof(kvm->uuid));
+	kvmi_cleanup(kvm);
+}
+
+static void kvm_get_msrs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
+{
+	struct msr_data msr;
+
+	msr.host_initiated = true;
+
+	msr.index = MSR_IA32_SYSENTER_CS;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_cs = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_ESP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_esp = msr.data;
+
+	msr.index = MSR_IA32_SYSENTER_EIP;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.sysenter_eip = msr.data;
+
+	msr.index = MSR_EFER;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.efer = msr.data;
+
+	msr.index = MSR_STAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.star = msr.data;
+
+	msr.index = MSR_LSTAR;
+	kvm_get_msr(vcpu, &msr);
+	event->msrs.lstar = msr.data;
+}
+
+static void kvmi_load_regs(struct kvm_vcpu *vcpu, struct kvmi_event *event)
+{
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, &event->regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &event->sregs);
+	kvm_get_msrs(vcpu, event);
+
+	event->mode = kvmi_vcpu_mode(vcpu, &event->sregs);
+}
+
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value)
+{
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_CR,
+		.cr.cr = cr,
+		.cr.old_value = old_value,
+		.cr.new_value = *new_value
+	};
+	struct kvmi_event_reply r;
+
+	/* Is anyone interested in this event? */
+	if (!(KVMI_EVENT_CR & event_mask))
+		return true;
+	if (!test_bit(cr, &kvm->cr_mask))
+		return true;
+	if (old_value == *new_value)
+		return true;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return true;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+
+	if (r.event & KVMI_EVENT_ALLOW) {
+		*new_value = r.new_val;
+		return true;
+	}
+
+	return false;
+}
+
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, unsigned int msr, u64 old_value,
+		    u64 *new_value)
+{
+	unsigned long event_mask;
+	unsigned long *mask;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_MSR,
+		.msr.msr = msr,
+		.msr.old_value = old_value,
+		.msr.new_value = *new_value
+	};
+	struct kvmi_event_reply r;
+
+	/* Is anyone interested in this event? */
+	event_mask = atomic_read(&kvm->event_mask);
+	if (!(KVMI_EVENT_MSR & event_mask))
+		return true;
+	mask = msr_mask(kvm, &msr);
+	if (!mask)
+		return true;
+	if (!test_bit(msr, mask))
+		return true;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return true;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+
+	if (r.event & KVMI_EVENT_ALLOW) {
+		*new_value = r.new_val;
+		return true;
+	}
+
+	return false;
+}
+
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu, u64 value)
+{
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_XSETBV,
+		.xsetbv.xcr0 = value
+	};
+	struct kvmi_event_reply r;
+
+	/* Is anyone interested in this event? */
+	if (!(KVMI_EVENT_XSETBV & event_mask))
+		return;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+}
+
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gpa)
+{
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_BREAKPOINT,
+		.breakpoint.gpa = gpa
+	};
+	struct kvmi_event_reply r;
+
+	/* Is anyone interested in this event? */
+	if (!(KVMI_EVENT_BREAKPOINT & event_mask))
+		return true;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return true;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+
+	if (r.event & KVMI_EVENT_ALLOW)
+		return true;
+
+	return false;
+}
+
+void kvmi_vmcall_event(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_USER_CALL
+	};
+	struct kvmi_event_reply r;
+
+	/* Is anyone interested in this event? */
+	if (!(KVMI_EVENT_USER_CALL & event_mask))
+		return;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+}
+
+bool kvmi_page_fault(struct kvm_vcpu *vcpu, unsigned long gpa,
+		     unsigned long gva, unsigned int mode, unsigned int *opts)
+{
+	struct kvm *kvm = vcpu->kvm;
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_PAGE_FAULT,
+		.page_fault.gpa = gpa,
+		.page_fault.gva = gva,
+		.page_fault.mode = mode
+	};
+	struct kvmi_event_reply r;
+	bool emulate = false;
+
+	/* Is anyone interested in this event? */
+	if (!(KVMI_EVENT_PAGE_FAULT & event_mask))
+		return emulate;
+
+	/* Have we shown interest in this page? */
+	if (!kvmi_test_mem_access(kvm, gpa, mode))
+		return emulate;
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return emulate;
+
+	emulate = (r.event & KVMI_EVENT_ALLOW);
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+
+	*opts = r.event & (KVMI_EVENT_NOEMU | KVMI_EVENT_SET_CTX);
+
+	if (r.event & KVMI_EVENT_SET_CTX) {
+		u32 size = min(sizeof(vcpu->ctx_data), sizeof(r.ctx_data));
+
+		memcpy(vcpu->ctx_data, r.ctx_data, size);
+		vcpu->ctx_size = size;
+		vcpu->ctx_pos = 0;
+	} else {
+		vcpu->ctx_size = 0;
+		vcpu->ctx_pos = 0;
+	}
+
+	return emulate;
+}
+
+void kvmi_trap_event(struct kvm_vcpu *vcpu, unsigned int vector,
+		     unsigned int type, unsigned int err, u64 cr2)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi_event vm_event = {
+		.vcpu = vcpu->vcpu_id,
+		.event = KVMI_EVENT_TRAP,
+		.trap.vector = vector,
+		.trap.type = type,
+		.trap.err = err,
+		.trap.cr2 = cr2
+	};
+	struct kvmi_event_reply r;
+
+	unsigned long event_mask = atomic_read(&kvm->event_mask);
+
+	if (!(KVMI_EVENT_TRAP & event_mask))
+		return;
+
+	if (!atomic_read(&vcpu->arch.next_interrupt_enabled))
+		return;
+	atomic_set(&vcpu->arch.next_interrupt_enabled, 0);
+
+	kvmi_load_regs(vcpu, &vm_event);
+
+	if (!send_vcpu_event_and_wait
+	    (vcpu, &vm_event, sizeof(vm_event), &r, sizeof(r)))
+		return;
+
+	if (r.event & KVMI_EVENT_SET_REGS)
+		kvm_arch_vcpu_set_regs(vcpu, &r.regs);
+}
+
+bool accept_socket_cb(void *ctx, kvmi_socket_read_cb read_cb, void *cb_ctx)
+{
+	int is_main;
+	uuid_le id;
+	struct kvm *kvm = ctx;	/* &dummy */
+	int err;
+	bool closing = (read_cb == NULL);
+
+	if (closing) {
+		kvm_info("%s: closing\n", __func__);
+		return false;
+	}
+
+	/* TODO: validate sva */
+	err = read_cb(cb_ctx, &id, sizeof(id));
+
+	if (err) {
+		kvm_err("%s: read: %d\n", __func__, err);
+		return false;
+	}
+
+	is_main = (uuid_le_cmp(id, NULL_UUID_LE) == 0);
+
+	/* TODO: use kvm_get with every new connection */
+
+	if (is_main) {
+		err = connect_handler_if_missing(cb_ctx, kvm, main_recv_cb);
+	} else if (sva && uuid_le_cmp(id, sva->uuid) == 0) {
+		kvm_info("Avoid self-introspection\n");
+		err = -EPERM;
+	} else {
+		struct kvm *g = kvm_from_uuid(&id);
+
+		if (g) {
+			err = connect_handler_if_missing(cb_ctx, g,
+							 guest_recv_cb);
+			kvm_put_kvm(g);
+		} else {
+			err = -ENOENT;
+		}
+	}
+
+	if (err)
+		kvm_err("%s: connect %s: %d\n", __func__,
+			is_main ? "main" : "guest", err);
+
+	return (err == 0);
+}
+
+int connect_handler_if_missing(void *s, struct kvm *kvm,
+			       kvmi_socket_use_cb recv_cb)
+{
+	void *ctx;
+	int err = 0;
+
+	write_lock(&kvm->socket_ctx_lock);
+
+	if (kvm->socket_ctx && kvmi_socket_is_active(kvm->socket_ctx)) {
+		err = -EEXIST;
+		goto unlock;
+	}
+
+	/*
+	 * We can lose a new connection if the old one hasn't finished
+	 * closing, but we expect another connection attempt.
+	 */
+
+	__release_kvm_socket(kvm);
+	ctx = kvmi_socket_monitor(s, recv_cb, kvm);
+
+	if (IS_ERR(ctx)) {
+		err = (int) PTR_ERR(ctx);
+		goto unlock;
+	}
+
+	kvm->socket_ctx = ctx;
+unlock:
+	write_unlock(&kvm->socket_ctx_lock);
+	return err;
+}
+
+/*
+ * The other side must use one send/write call
+ * in order to avoid the need for reconstruction in this function.
+ */
+bool main_recv_cb(void *ctx, kvmi_socket_read_cb read_cb, void *cb_ctx)
+{
+	struct kvmi_socket_hdr h;
+	int err;
+	bool closing = (read_cb == NULL);
+	static bool first = true;
+
+	if (closing) {
+		kvm_info("%s: closing\n", __func__);
+		first = true;
+		if (sva) {
+			kvm_put_kvm(sva);
+			sva = NULL;
+		}
+		return false;
+	}
+
+	if (first) {		/* TODO: pack it into a KVMI_ message */
+		uuid_le sva_id;
+
+		err = read_cb(cb_ctx, &sva_id, sizeof(sva_id));
+		if (err) {
+			kvm_err("%s: error getting sva err:%d\n", __func__,
+				err);
+			return false;
+		}
+		sva = kvm_from_uuid(&sva_id);	/* TODO: lock ? */
+		if (!sva) {
+			kvm_err("%s: can't find sva\n", __func__);
+			return false;
+		}
+		first = false;
+	}
+
+	err = read_cb(cb_ctx, &h, sizeof(h));
+
+	if (err) {
+		kvm_err("%s/%p: id:%d (%s) size:%u seq:%u err:%d\n", __func__,
+			cb_ctx, h.msg_id, id2str(h.msg_id), h.size, h.seq, err);
+		return false;
+	}
+
+	kvm_debug("%s: id:%d (%s) size:%u\n", __func__, h.msg_id,
+		  id2str(h.msg_id), h.size);
+
+	switch (h.msg_id) {
+	case KVMI_GET_VERSION:
+		err = respond_get_version(cb_ctx, &dummy, &h, NULL);
+		break;
+
+	case KVMI_GET_GUESTS:
+		err = respond_get_guests(cb_ctx, &h);
+		break;
+
+	default:
+		kvm_err("%s: unknown message 0x%x of %u bytes\n", __func__,
+			h.msg_id, h.size);
+		return consume_bytes_from_socket(h.size, read_cb, cb_ctx);
+	}
+
+	if (err) {
+		kvm_err("%s: id:%d (%s) err:%d\n", __func__, h.msg_id,
+			id2str(h.msg_id), err);
+		return false;
+	}
+
+	return true;
+}
+
+const char *id2str(int i)
+{
+	return (i > 0 && i < ARRAY_SIZE(IDs) ? IDs[i] : "unknown");
+}
+
+static bool handle_event_reply(struct kvm *kvm, struct kvmi_socket_hdr *h,
+			       kvmi_socket_read_cb read_cb, void *cb_ctx)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+	bool found_seq = false;
+	bool ok = false;
+
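+	/* Find the vCPU whose pending event matches this reply's sequence */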
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (READ_ONCE(vcpu->sock_rsp_waiting)
+		    && h->seq == vcpu->sock_rsp_seq) {
+			found_seq = true;
+			break;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+
+	if (!found_seq) {
+		kvm_err("%s: unexpected event reply (seq=%u)\n", __func__,
+			h->seq);
+		return false;
+	}
+
+	if (h->size > vcpu->sock_rsp_size) {
+		kvm_err("%s: event reply too big (max=%zu, recv=%u)\n",
+			__func__, vcpu->sock_rsp_size, h->size);
+	} else {
+		int err = read_cb(cb_ctx, vcpu->sock_rsp_buf, h->size);
+
+		if (!err)
+			ok = true;
+		else
+			kvm_err("%s: reply err: %d\n", __func__, err);
+	}
+
+	WARN_ON(h->size == 0);
+
+	WRITE_ONCE(vcpu->sock_rsp_received, ok ? h->size : -1);
+
+	set_sem_req(REQ_REPLY, vcpu);
+
+	return ok;
+}
+
+/*
+ * The other side must use one send/write call
+ * in order to avoid the need for reconstruction in this function.
+ */
+bool guest_recv_cb(void *ctx, kvmi_socket_read_cb read_cb, void *cb_ctx)
+{
+	struct kvm *kvm = ctx;
+	struct kvmi_socket_hdr h;
+	struct resp_info *r;
+	u8 tmp[256];
+	void *i = (void *) tmp;
+	int err;
+	bool closing = (read_cb == NULL);
+
+	if (closing) {
+		kvm_info("%s: closing\n", __func__);
+
+		/* We are no longer interested in any kind of event */
+		atomic_set(&kvm->event_mask, 0);
+		kvm->cr_mask = 0;
+		memset(&kvm->msr_mask, 0, sizeof(kvm->msr_mask));
+		/* TODO */
+		smp_wmb();
+
+		wakeup_events(kvm);
+
+		return false;
+	}
+
+	err = read_cb(cb_ctx, &h, sizeof(h));
+
+	if (err) {
+		kvm_err("%s/%p: id:%d (%s) size:%u seq:%u err:%d\n", __func__,
+			cb_ctx, h.msg_id, id2str(h.msg_id), h.size, h.seq, err);
+		return false;
+	}
+
+	kvm_debug("%s: id:%d (%s) size:%u\n", __func__, h.msg_id,
+		  id2str(h.msg_id), h.size);
+
+	if (h.msg_id == KVMI_REPLY_EVENT_VCPU)
+		return handle_event_reply(kvm, &h, read_cb, cb_ctx);
+
+	if (h.msg_id >= ARRAY_SIZE(guest_responses)
+	    || !guest_responses[h.msg_id].cb) {
+		kvm_err("%s: unknown message 0x%x of %u bytes\n", __func__,
+			h.msg_id, h.size);
+		return consume_bytes_from_socket(h.size, read_cb, cb_ctx);
+	}
+
+	r = guest_responses + h.msg_id;
+
+	if (r->to_read != h.size && r->to_read != (size_t) -1) {
+		kvm_err("%s: %u instead of %u bytes\n", __func__, h.size,
+			(unsigned int) r->to_read);
+		return false;
+	}
+
+	if (r->to_read) {
+		size_t chunk = r->to_read;
+
+		if (chunk == (size_t) -1)
+			chunk = h.size;
+
+		if (chunk > sizeof(tmp))
+			i = kmalloc(chunk, GFP_KERNEL);
+
+		if (!i)
+			return false;
+
+		err = read_cb(cb_ctx, i, chunk);
+		if (err)
+			goto out;
+	}
+
+	if (r->vcpu_req == 0) {
+		err = r->cb(cb_ctx, kvm, &h, i);
+	} else {
+		u16 vcpu_id;
+		struct kvm_vcpu *vcpu;
+
+		if (r->vcpu_req > 1) {
+			vcpu_id = 0;
+		} else {
+			if (h.size < sizeof(vcpu_id)) {
+				kvm_err("%s: invalid message\n", __func__);
+				err = -E2BIG;
+				goto out;
+			}
+			vcpu_id = *((u16 *) i);
+		}
+		err = get_vcpu(kvm, vcpu_id, &vcpu);
+		if (err) {
+			kvm_err("%s: invalid vcpu:%d err:%d\n", __func__,
+				vcpu_id, err);
+			goto out;
+		}
+		if (test_bit(REQ_CMD, &vcpu->sem_requests)) {
+			kvm_err("%s: vcpu %d is busy\n", __func__, vcpu_id);
+			err = -EBUSY;
+			goto out;
+		}
+		if (h.size > sizeof(vcpu->sock_cmd_buf) - sizeof(h)) {
+			kvm_err("%s: message too big: %u\n", __func__, h.size);
+			err = -E2BIG;
+			goto out;
+		}
+		memcpy(vcpu->sock_cmd_buf, &h, sizeof(h));
+		memcpy(vcpu->sock_cmd_buf + sizeof(h), i, h.size);
+		vcpu->sock_cmd_ctx = cb_ctx;
+		set_sem_req(REQ_CMD, vcpu);
+		kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
+		kvm_vcpu_kick(vcpu);
+	}
+
+out:
+	if (i != (void *) tmp)
+		kfree(i);
+
+	if (err) {
+		kvm_err("%s: id:%d (%s) err:%d\n", __func__, h.msg_id,
+			id2str(h.msg_id), err);
+		return false;
+	}
+
+	return true;
+}
+
+void handle_request(struct kvm_vcpu *vcpu)
+{
+	struct resp_info *r;
+	struct kvmi_socket_hdr h;
+	u8 req[960];
+	int err;
+
+	memcpy(&h, vcpu->sock_cmd_buf, sizeof(h));
+	memcpy(req, vcpu->sock_cmd_buf + sizeof(h), h.size);
+
+	clear_sem_req(REQ_CMD, vcpu);
+
+	r = guest_responses + h.msg_id;
+	/* TODO: vcpu->sock_cmd_ctx might be invalid ? */
+	err = r->cb(vcpu->sock_cmd_ctx, vcpu->kvm, &h, req);
+	if (err)
+		kvm_err("%s: id:%d (%s) err:%d\n", __func__, h.msg_id,
+			id2str(h.msg_id), err);
+}
+
+void kvmi_handle_controller_request(struct kvm_vcpu *vcpu)
+{
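+	/*
+	 * Serve requests for as long as the vCPU is paused, has a pending
+	 * command or awaits an event reply; each iteration handles one.
+	 */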
+	while (READ_ONCE(vcpu->pause_count)
+	       || READ_ONCE(vcpu->sock_rsp_waiting)
+	       || READ_ONCE(vcpu->sem_requests)) {
+
+		down(&vcpu->sock_sem);
+
+		if (test_bit(REQ_PAUSE, &vcpu->sem_requests)) {
+			clear_sem_req(REQ_PAUSE, vcpu);
+		} else if (test_bit(REQ_RESUME, &vcpu->sem_requests)) {
+			clear_sem_req(REQ_RESUME, vcpu);
+		} else if (test_bit(REQ_CMD, &vcpu->sem_requests)) {
+			handle_request(vcpu);	/* it will clear REQ_CMD bit */
+		} else if (test_bit(REQ_REPLY, &vcpu->sem_requests)) {
+			clear_sem_req(REQ_REPLY, vcpu);
+			WARN_ON(!READ_ONCE(vcpu->sock_rsp_waiting));
+			WRITE_ONCE(vcpu->sock_rsp_waiting, false);
+		} else if (test_bit(REQ_CLOSE, &vcpu->sem_requests)) {
+			clear_sem_req(REQ_CLOSE, vcpu);
+			break;
+		} else {
+			WARN_ON(1);
+		}
+	}
+}
+
+bool consume_bytes_from_socket(size_t n, kvmi_socket_read_cb read_cb, void *s)
+{
+	while (n) {
+		u8 buf[128];
+		size_t chunk = min(n, sizeof(buf));
+		int err = read_cb(s, buf, chunk);
+
+		if (err) {
+			kvm_err("%s: read_cb failed: %d\n", __func__, err);
+			return false;
+		}
+
+		n -= chunk;
+	}
+
+	return true;
+}
+
+int respond_get_version(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		unsigned int version;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.version = KVMI_VERSION;
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int respond_to_request(void *s, struct kvmi_socket_hdr *req, void *buf,
+		       size_t size)
+{
+	struct kvmi_socket_hdr *h = buf;
+	struct kvec i = {
+		.iov_base = buf,
+		.iov_len = size
+	};
+	int err;
+
+	h->msg_id = req->msg_id | KVMI_FLAG_RESPONSE;
+	h->seq = req->seq;
+	h->size = (__u16) (size - sizeof(*h));
+
+	err = kvmi_socket_send(s, &i, 1, size);
+
+	if (err)
+		kvm_err("%s: kvmi_socket_send() => %d\n", __func__, err);
+
+	return err;
+}
+
+int respond_to_request_buf(void *s, struct kvmi_socket_hdr *req,
+			   const void *buf, size_t size)
+{
+	struct kvmi_socket_hdr h;
+	struct kvec i[2] = {
+		{.iov_base = &h, .iov_len = sizeof(h)},
+		{.iov_base = (void *) buf, .iov_len = size},
+	};
+	int err;
+
+	memset(&h, 0, sizeof(h));
+	h.msg_id = req->msg_id | KVMI_FLAG_RESPONSE;
+	h.seq = req->seq;
+	h.size = size;
+
+	err = kvmi_socket_send(s, i, size ? 2 : 1, sizeof(h) + size);
+
+	if (err)
+		kvm_err("%s: kvmi_socket_send() => %d\n", __func__, err);
+
+	return err;
+}
+
+int respond_get_guests(void *s, struct kvmi_socket_hdr *req)
+{
+	struct kvm_enum_param p = { };
+	u8 *resp;
+	size_t resp_size;
+	struct kvmi_guests *g;
+	int err;
+
+	kvm_enum(cnt_cb, &p.n);
+
+	/* TODO: make struct kvmi_guests easy to use: (size -> cnt, guest[0]) */
+
+	resp_size = sizeof(struct kvmi_socket_hdr) + sizeof(struct kvmi_guests);
+
+	if (p.n)
+		resp_size += sizeof(struct kvmi_guest) * (p.n - 1);
+	else
+		resp_size -= sizeof(struct kvmi_guest);
+
+	resp = kzalloc(resp_size, GFP_KERNEL);
+
+	if (!resp)
+		return -ENOMEM;
+
+	g = (struct kvmi_guests *) (resp + sizeof(struct kvmi_socket_hdr));
+
+	if (p.n) {
+		p.guests = g;
+		kvm_enum(copy_guest_cb, &p);
+	}
+
+	g->size = sizeof(g->size) + sizeof(struct kvmi_guest) * p.k;
+
+	err =
+	    respond_to_request(s, req, resp,
+			       sizeof(struct kvmi_socket_hdr) + g->size);
+
+	kfree(resp);
+
+	return err;
+}
+
+int cnt_cb(const struct kvm *kvm, void *param)
+{
+	unsigned int *n = param;
+
+	if (test_bit(0, &kvm->introduced))
+		*n += 1;
+
+	return 0;
+}
+
+int copy_guest_cb(const struct kvm *kvm, void *param)
+{
+	struct kvm_enum_param *p = param;
+
+	if (test_bit(0, &kvm->introduced))
+		memcpy(p->guests->guests + p->k++, &kvm->uuid,
+		       sizeof(kvm->uuid));
+
+	return (p->k == p->n ? -1 : 0);
+}
+
+int respond_get_guest_info(void *s, struct kvm *kvm,
+			   struct kvmi_socket_hdr *req, void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_guest_info m;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+
+	resp.m.vcpu_count = atomic_read(&kvm->online_vcpus);
+
+	query_paused_vcpu(kvm, 0, get_tsc_cb, &resp.m.tsc_speed);
+
+	resp.m.tsc_speed *= 1000UL;
+
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int get_tsc_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	__u64 *tsc = ctx;
+
+	*tsc = vcpu->arch.virtual_tsc_khz;
+	return 0;
+}
+
+int get_vcpu(struct kvm *kvm, int vcpu_id, struct kvm_vcpu **vcpu)
+{
+	struct kvm_vcpu *v;
+
+	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
+		return -EINVAL;
+
+	v = kvm_get_vcpu(kvm, vcpu_id);
+
+	if (!v)
+		return -EINVAL;
+
+	if (vcpu)
+		*vcpu = v;
+
+	return 0;
+}
+
+int query_paused_vcpu(struct kvm *kvm, int vcpu_id,
+		      int (*cb)(struct kvm_vcpu *, void *), void *ctx)
+{
+	return query_locked_vcpu(kvm, vcpu_id, cb, ctx);
+}
+
+int query_locked_vcpu(struct kvm *kvm, int vcpu_id,
+		      int (*cb)(struct kvm_vcpu *, void *), void *ctx)
+{
+	struct kvm_vcpu *vcpu;
+
+	if (vcpu_id >= atomic_read(&kvm->online_vcpus))
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, vcpu_id);
+
+	if (!vcpu)
+		return -EINVAL;
+
+	return cb(vcpu, ctx);
+}
+
+int respond_pause_guest(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			void *i)
+{
+	return respond_with_error_code(s, vm_pause(kvm), req);
+}
+
+int respond_unpause_guest(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *i)
+{
+	return respond_with_error_code(s, vm_resume(kvm), req);
+}
+
+int respond_with_error_code(void *s, int err, struct kvmi_socket_hdr *req)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		int err;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.err = err;
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int respond_get_registers(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_get_registers_r m;
+	} empty;
+	struct kvmi_get_registers *r = i;
+	struct kvmi_get_registers_r *c = NULL;
+	u8 *resp;
+	size_t sz_resp;
+	__u16 k;
+	int err;
+
+	if (req->size < sizeof(*r)
+	    || req->size != sizeof(*r) + sizeof(__u32) * r->nmsrs) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	sz_resp =
+	    sizeof(empty.h) + sizeof(empty.m) +
+	    sizeof(struct kvm_msr_entry) * r->nmsrs;
+
+	resp = kzalloc(sz_resp, GFP_KERNEL);
+
+	if (!resp) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	c = (struct kvmi_get_registers_r *) (resp + sizeof(empty.h));
+	c->msrs.nmsrs = r->nmsrs;
+
+	for (k = 0; k < r->nmsrs; k++)
+		c->msrs.entries[k].index = r->msrs_idx[k];
+
+	err = query_locked_vcpu(kvm, r->vcpu, get_registers_cb, c);
+
+	if (!err) {
+		err = respond_to_request(s, req, resp, sz_resp);
+		kfree(resp);
+		return err;
+	}
+
+	kfree(resp);
+
+out_err:
+	memset(&empty, 0, sizeof(empty));
+	empty.m.err = err;
+	respond_to_request(s, req, &empty, sizeof(empty));
+	return err;
+}
+
+int get_registers_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_get_registers_r *c = ctx;
+	struct kvm_msr_entry *msr = c->msrs.entries + 0;
+	unsigned int n = c->msrs.nmsrs;
+
+	for (; n--; msr++) {
+		struct msr_data m = {.index = msr->index };
+		int err = kvm_get_msr(vcpu, &m);
+
+		if (err)
+			return err;
+
+		msr->data = m.data;
+	}
+
+	kvm_arch_vcpu_ioctl_get_regs(vcpu, &c->regs);
+	kvm_arch_vcpu_ioctl_get_sregs(vcpu, &c->sregs);
+	c->mode = kvmi_vcpu_mode(vcpu, &c->sregs);
+
+	return 0;
+}
+
+int respond_set_registers(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *i)
+{
+	struct kvmi_set_registers *r = i;
+	int err = query_locked_vcpu(kvm, r->vcpu, set_registers_cb,
+				    (void *) &r->regs);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int set_registers_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvm_regs *regs = ctx;
+
+	kvm_arch_vcpu_set_regs(vcpu, regs);
+	return 0;
+}
+
+int respond_shutdown_guest(void *s, struct kvm *kvm,
+			   struct kvmi_socket_hdr *req, void *i)
+{
+	kvm_vm_shutdown(kvm);
+	return 0;
+}
+
+int respond_get_mtrr_type(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_mtrr_type m;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.m.gpa = *((__u64 *) i);
+	resp.m.err =
+	    query_paused_vcpu(kvm, 0, get_mttr_memory_type_cb, &resp.m);
+
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int get_mttr_memory_type_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_mtrr_type *c = ctx;
+
+	c->type = kvm_mtrr_get_guest_memory_type(vcpu, c->gpa);
+	return 0;
+}
+
+int respond_get_mtrrs(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+		      void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_mtrrs m;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.m.vcpu = *((__u16 *) i);
+	resp.m.err = query_paused_vcpu(kvm, resp.m.vcpu, get_msr_cb, &resp.m);
+
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int get_msr_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_mtrrs *c = ctx;
+
+	if (kvm_mtrr_get_msr(vcpu, MSR_IA32_CR_PAT, &c->pat)
+	    || kvm_mtrr_get_msr(vcpu, MSR_MTRRcap, &c->cap)
+	    || kvm_mtrr_get_msr(vcpu, MSR_MTRRdefType, &c->type))
+		return -EINVAL;
+
+	return 0;
+}
+
+int respond_get_xsave_info(void *s, struct kvm *kvm,
+			   struct kvmi_socket_hdr *req, void *i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_xsave_info m;
+	} resp;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.m.vcpu = *((__u16 *) i);
+	resp.m.err =
+	    query_paused_vcpu(kvm, resp.m.vcpu, get_xstate_size_cb,
+			      &resp.m.size);
+
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int get_xstate_size_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	__u64 *size = ctx;
+
+	*size = vcpu->arch.guest_xstate_size;
+
+	return 0;
+}
+
+int respond_get_page_access(void *s, struct kvm *kvm,
+			    struct kvmi_socket_hdr *req, void *_i)
+{
+	struct {
+		struct kvmi_socket_hdr h;
+		struct kvmi_page_access m;
+	} resp;
+	struct kvmi_page_access *i = _i;
+
+	memset(&resp, 0, sizeof(resp));
+	resp.m.vcpu = i->vcpu;	/* ? */
+	resp.m.gpa = i->gpa;
+	resp.m.err = query_paused_vcpu(kvm, i->vcpu, get_page_info_cb, &resp.m);
+
+	return respond_to_request(s, req, &resp, sizeof(resp));
+}
+
+int get_page_info_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_page_access *c = ctx;
+
+	c->access = kvm_mmu_get_spte(vcpu->kvm, vcpu, c->gpa);
+
+	return 0;
+}
+
+int respond_set_page_access(void *s, struct kvm *kvm,
+			    struct kvmi_socket_hdr *req, void *_i)
+{
+	int err;
+	struct kvmi_page_access *i = _i;
+
+	if (i->access & ~7ULL) {
+		err = -EINVAL;
+	} else {
+		err =
+		    query_paused_vcpu(kvm, i->vcpu, set_page_info_cb,
+				      (void *) i);
+	}
+
+	return respond_with_error_code(s, err, req);
+}
+
+int set_page_info_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_page_access *c = ctx;
+
+	return kvmi_set_mem_access(vcpu->kvm, c->gpa, c->access);
+}
+
+int respond_inject_page_fault(void *s, struct kvm *kvm,
+			      struct kvmi_socket_hdr *req, void *_i)
+{
+	struct kvmi_page_fault *i = _i;
+	int err;
+
+	err = query_paused_vcpu(kvm, i->vcpu, inject_pf_cb, i);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int inject_pf_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvmi_page_fault *c = ctx;
+	struct x86_exception fault = {
+		.address = c->gva,
+		.error_code = c->error
+	};
+
+	kvm_inject_page_fault(vcpu, &fault);
+
+	/*
+	 * Generate an event to let the client know if the injection
+	 * worked
+	 */
+	atomic_set(&vcpu->arch.next_interrupt_enabled, 1);
+	return 0;
+}
+
+static unsigned long gfn_to_hva_safe(struct kvm *kvm, gfn_t gfn)
+{
+	unsigned long hva;
+
+	mutex_lock(&kvm->slots_lock);
+	hva = gfn_to_hva(kvm, gfn);
+	mutex_unlock(&kvm->slots_lock);
+	return hva;
+}
+
+static long get_user_pages_remote_unlocked(struct mm_struct *mm,
+					   unsigned long start,
+					   unsigned long nr_pages,
+					   unsigned int gup_flags,
+					   struct page **pages)
+{
+	long ret;
+	struct task_struct *tsk = NULL;
+	struct vm_area_struct **vmas = NULL;
+	int locked = 1;
+
+	down_read(&mm->mmap_sem);
+	ret =
+	    get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				  vmas, &locked);
+	if (locked)
+		up_read(&mm->mmap_sem);
+	return ret;
+}
+
+int respond_read_physical(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *_i)
+{
+	struct kvmi_rw_physical_info *i = _i;
+	int err;
+	unsigned long hva;
+	struct page *page;
+	void *ptr;
+	struct kvm_vcpu *vcpu;
+
+	if (!i->size || i->size > PAGE_SIZE) {
+		err = -EINVAL;
+		goto out_err_no_resume;
+	}
+
+	err = get_vcpu(kvm, 0, &vcpu);
+
+	if (err)
+		goto out_err_no_resume;
+
+	vm_pause(kvm);
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(i->gpa));
+
+	if (kvm_is_error_hva(hva)) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	if (((i->gpa & ~PAGE_MASK) + i->size) > PAGE_SIZE) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	err = get_user_pages_remote_unlocked(kvm->mm, hva, 1, 0, &page);
+	if (err != 1) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	ptr = kmap_atomic(page);
+
+	err =
+	    respond_to_request_buf(s, req, ptr + (i->gpa & ~PAGE_MASK),
+				   i->size);
+
+	kunmap_atomic(ptr);
+	put_page(page);
+
+	vm_resume(kvm);
+
+	return err;
+
+out_err:
+	vm_resume(kvm);
+
+out_err_no_resume:
+	return respond_to_request_buf(s, req, NULL, 0);
+}
+
+int respond_write_physical(void *s, struct kvm *kvm,
+			   struct kvmi_socket_hdr *req, void *_i)
+{
+	struct kvmi_rw_physical_info *i = _i;
+	int err;
+	unsigned long hva;
+	struct page *page;
+	void *ptr;
+	struct kvm_vcpu *vcpu;
+
+	if (req->size != sizeof(struct kvmi_rw_physical_info) + i->size) {
+		err = -EINVAL;
+		goto out_err_no_resume;
+	}
+
+	if (!i->size || i->size > PAGE_SIZE) {
+		err = -EINVAL;
+		goto out_err_no_resume;
+	}
+
+	err = get_vcpu(kvm, 0, &vcpu);
+
+	if (err)
+		goto out_err_no_resume;
+
+	vm_pause(kvm);
+
+	hva = gfn_to_hva_safe(kvm, gpa_to_gfn(i->gpa));
+
+	if (kvm_is_error_hva(hva)) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	if (((i->gpa & ~PAGE_MASK) + i->size) > PAGE_SIZE) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	err =
+	    get_user_pages_remote_unlocked(kvm->mm, hva, 1, FOLL_WRITE, &page);
+	if (err != 1) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	ptr = kmap_atomic(page);
+
+	memcpy(ptr + (i->gpa & ~PAGE_MASK), (i + 1), i->size);
+
+	kunmap_atomic(ptr);
+	put_page(page);
+
+	err = 0;
+
+out_err:
+	vm_resume(kvm);
+
+out_err_no_resume:
+	return respond_with_error_code(s, err, req);
+}
+
+static struct vm_area_struct *get_one_page_vma(struct kvm *kvm,
+					       unsigned long addr)
+{
+	struct vm_area_struct *v =
+	    find_vma_intersection(kvm->mm, addr, addr + PAGE_SIZE);
+
+	if (!v) {
+		kvm_err("%s: find_vma(%lX) = NULL\n", __func__, addr);
+		return NULL;
+	}
+
+	if (addr != v->vm_start) {
+		int err = split_vma(kvm->mm, v, addr, 0);
+
+		if (err) {
+			kvm_err("%s: split_vma(cut above): %d\n", __func__,
+				err);
+			return NULL;
+		}
+		v = find_vma(kvm->mm, addr);
+	}
+
+	if (v->vm_end - v->vm_start != PAGE_SIZE) {
+		int err = split_vma(kvm->mm, v, addr + PAGE_SIZE, 0);
+
+		if (err) {
+			kvm_err("%s: split_vma(cut below): %d\n", __func__,
+				err);
+			return NULL;
+		}
+	}
+
+	return v;
+}
+
+int respond_map_physical_page_to_sva(void *s, struct kvm *kvm,
+				     struct kvmi_socket_hdr *req, void *_i)
+{
+	struct kvmi_map_physical_to_sva_info *i = _i;
+	int err;
+	unsigned long hva_src, hva_dest;
+	struct vm_area_struct *vma_dest;
+	struct page *page;
+	struct kvm_vcpu *vcpu;
+
+	err = get_vcpu(kvm, 0, &vcpu);
+
+	if (err)
+		goto out_err_no_resume;
+
+	vm_pause(kvm);
+
+	hva_src = gfn_to_hva_safe(kvm, gpa_to_gfn(i->gpa_src));
+	hva_dest = gfn_to_hva_safe(sva, i->gfn_dest);
+
+	if (kvm_is_error_hva(hva_src) || kvm_is_error_hva(hva_dest)) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	if (get_user_pages_remote_unlocked
+	    (kvm->mm, hva_src, 1, FOLL_WRITE, &page) != 1) {
+		err = -ENOENT;
+		goto out_err;
+	}
+
+	down_write(&sva->mm->mmap_sem);
+	vma_dest = get_one_page_vma(sva, hva_dest);
+	if (vma_dest) {
+		err = vm_replace_page(vma_dest, page);
+		if (err)
+			kvm_err("%s: vm_replace_page: %d\n", __func__, err);
+	} else
+		err = -ENOENT;
+	up_write(&sva->mm->mmap_sem);
+
+	put_page(page);
+
+out_err:
+	vm_resume(kvm);
+
+out_err_no_resume:
+	if (err)
+		kvm_err("%s: %d\n", __func__, err);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int respond_unmap_physical_page_from_sva(void *s, struct kvm *kvm,
+					 struct kvmi_socket_hdr *req, void *_i)
+{
+	int err;
+	unsigned long hva;
+	struct kvmi_unmap_physical_from_sva_info *i = _i;
+	struct vm_area_struct *vma;
+	struct page *page;
+	struct kvm_vcpu *vcpu;
+
+	err = get_vcpu(kvm, 0, &vcpu);
+
+	if (err)
+		goto out_err_no_resume;
+
+	vm_pause(kvm);
+
+	page = alloc_page(GFP_HIGHUSER_MOVABLE);
+	if (!page) {
+		err = -ENOMEM;
+		goto out_err;
+	}
+
+	hva = gfn_to_hva_safe(sva, i->gfn_dest);
+	if (kvm_is_error_hva(hva)) {
+		err = -EFAULT;
+		goto out_err;
+	}
+
+	down_write(&sva->mm->mmap_sem);
+
+	vma = find_vma(sva->mm, hva);
+	if (!vma || vma->vm_start != hva
+			|| (vma->vm_end - vma->vm_start) != PAGE_SIZE) {
+		kvm_err("%s: invalid vma\n", __func__);
+		err = -EINVAL;
+	} else {
+		err = vm_replace_page(vma, page);
+		if (err)
+			kvm_err("%s: vm_replace_page: %d\n", __func__, err);
+		else
+			put_page(page);
+	}
+
+	up_write(&sva->mm->mmap_sem);
+
+out_err:
+	if (err) {
+		if (page)
+			__free_pages(page, 0);
+		kvm_err("%s: %d\n", __func__, err);
+	}
+
+	vm_resume(kvm);
+
+out_err_no_resume:
+
+	return respond_with_error_code(s, err, req);
+}
+
+int respond_event_control(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			  void *_i)
+{
+	struct kvmi_event_control *i = _i;
+	int err;
+	struct kvm_vcpu *vcpu;
+
+	if (i->events & ~KVMI_KNOWN_EVENTS) {
+		err = -EINVAL;
+		goto out_err;
+	}
+
+	err = get_vcpu(kvm, i->vcpu, &vcpu);
+
+	if (err)
+		goto out_err;
+
+	if (i->events & KVMI_EVENT_BREAKPOINT) {
+		unsigned int event_mask = atomic_read(&kvm->event_mask);
+
+		if (!(event_mask & KVMI_EVENT_BREAKPOINT)) {
+			struct kvm_guest_debug dbg = { };
+
+			dbg.control =
+			    KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
+
+			err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
+		}
+	}
+
+	if (!err)
+		atomic_set(&kvm->event_mask, i->events);
+
+out_err:
+	return respond_with_error_code(s, err, req);
+}
+
+int respond_cr_control(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+		       void *i)
+{
+	int err = query_paused_vm(kvm, set_cr_control, i);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int set_cr_control(struct kvm *kvm, void *ctx)
+{
+	struct kvmi_cr_control *i = ctx;
+
+	switch (i->cr) {
+	case 0:
+	case 3:
+	case 4:
+		if (i->enable)
+			set_bit(i->cr, &kvm->cr_mask);
+		else
+			clear_bit(i->cr, &kvm->cr_mask);
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+int respond_msr_control(void *s, struct kvm *kvm, struct kvmi_socket_hdr *req,
+			void *i)
+{
+	int err = query_paused_vm(kvm, set_msr_control, i);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int query_paused_vm(struct kvm *kvm, int (*cb)(struct kvm *kvm, void *),
+		    void *ctx)
+{
+	struct kvm_vcpu *vcpu;
+	int err;
+
+	err = get_vcpu(kvm, 0, &vcpu);
+	if (err) {
+		kvm_err("%s: get_vcpu: %d\n", __func__, err);
+		return err;
+	}
+
+	vm_pause(kvm);
+
+	err = cb(kvm, ctx);
+
+	vm_resume(kvm);
+
+	return err;
+}
+
+int set_msr_control(struct kvm *kvm, void *ctx)
+{
+	struct kvmi_msr_control *i = ctx;
+
+	int err = msr_control(kvm, i->msr, i->enable);
+
+	if (!err)
+		kvm_arch_msr_intercept(i->msr, i->enable);
+
+	return err;
+}
+
+int respond_inject_breakpoint(void *s, struct kvm *kvm,
+			      struct kvmi_socket_hdr *req, void *i)
+{
+	int err;
+
+	err =
+	    query_locked_vcpu(kvm, *((__u16 *) i), inject_breakpoint_cb, NULL);
+
+	return respond_with_error_code(s, err, req);
+}
+
+int inject_breakpoint_cb(struct kvm_vcpu *vcpu, void *ctx)
+{
+	struct kvm_guest_debug dbg = {.control = KVM_GUESTDBG_INJECT_BP };
+
+	int err = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);
+
+	/*
+	 * Generate an event to let the client know if the injection
+	 * worked
+	 */
+
+	/* if (!err) */
+	atomic_set(&vcpu->arch.next_interrupt_enabled, 1);
+	return err;
+}
+
+void send_event(struct kvm *kvm, int msg_id, void *data, size_t size)
+{
+	struct kvmi_socket_hdr h;
+	struct kvec i[2] = {
+		{.iov_base = &h, .iov_len = sizeof(h)},
+		{.iov_base = (void *) data, .iov_len = size}
+	};
+	size_t n = size ? 2 : 1;
+	size_t total = sizeof(h) + size;
+
+	memset(&h, 0, sizeof(h));
+	h.msg_id = msg_id;
+	h.seq = new_seq();
+	h.size = size;
+
+	send_async_event_to_socket(kvm, i, n, total);
+}
+
+u32 new_seq(void)
+{
+	return atomic_inc_return(&seq_ev);
+}
+
+static const char *event_str(unsigned int e)
+{
+	switch (e) {
+	case KVMI_EVENT_CR:
+		return "CR";
+	case KVMI_EVENT_MSR:
+		return "MSR";
+	case KVMI_EVENT_XSETBV:
+		return "XSETBV";
+	case KVMI_EVENT_BREAKPOINT:
+		return "BREAKPOINT";
+	case KVMI_EVENT_USER_CALL:
+		return "USER_CALL";
+	case KVMI_EVENT_PAGE_FAULT:
+		return "PAGE_FAULT";
+	case KVMI_EVENT_TRAP:
+		return "TRAP";
+	default:
+		return "EVENT?";
+	}
+}
+
+static void inspect_kvmi_event(struct kvmi_event *ev, u32 seq)
+{
+	switch (ev->event) {
+	case KVMI_EVENT_CR:
+		kvm_debug("%s: seq:%u %-11s(%d) cr:%x old:%llx new:%llx\n",
+			  __func__, seq, event_str(ev->event), ev->vcpu,
+			  ev->cr.cr, ev->cr.old_value, ev->cr.new_value);
+		break;
+	case KVMI_EVENT_MSR:
+		kvm_debug("%s: seq:%u %-11s(%d) msr:%x old:%llx new:%llx\n",
+			  __func__, seq, event_str(ev->event), ev->vcpu,
+			  ev->msr.msr, ev->msr.old_value, ev->msr.new_value);
+		break;
+	case KVMI_EVENT_XSETBV:
+		kvm_debug("%s: seq:%u %-11s(%d) xcr0:%llx\n", __func__, seq,
+			  event_str(ev->event), ev->vcpu, ev->xsetbv.xcr0);
+		break;
+	case KVMI_EVENT_BREAKPOINT:
+		kvm_debug("%s: seq:%u %-11s(%d) gpa:%llx\n", __func__, seq,
+			  event_str(ev->event), ev->vcpu, ev->breakpoint.gpa);
+		break;
+	case KVMI_EVENT_USER_CALL:
+		kvm_debug("%s: seq:%u %-11s(%d)\n", __func__, seq,
+			  event_str(ev->event), ev->vcpu);
+		break;
+	case KVMI_EVENT_PAGE_FAULT:
+		kvm_debug("%s: seq:%u %-11s(%d) gpa:%llx gva:%llx mode:%x\n",
+			  __func__, seq, event_str(ev->event), ev->vcpu,
+			  ev->page_fault.gpa, ev->page_fault.gva,
+			  ev->page_fault.mode);
+		break;
+	case KVMI_EVENT_TRAP:
+		kvm_debug
+		    ("%s: seq:%u %-11s(%d) vector:%x type:%x err:%x cr2:%llx\n",
+		     __func__, seq, event_str(ev->event), ev->vcpu,
+		     ev->trap.vector, ev->trap.type, ev->trap.err,
+		     ev->trap.cr2);
+		break;
+	}
+}
+
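+/*
+ * Send the event on the guest's socket and block, servicing controller
+ * requests, until the reply with the matching sequence number arrives
+ * (see handle_event_reply()).
+ */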
+bool send_vcpu_event_and_wait(struct kvm_vcpu *vcpu, void *ev, size_t ev_size,
+			      void *resp, size_t resp_size)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmi_socket_hdr h;
+	struct kvec i[2] = {
+		{.iov_base = &h, .iov_len = sizeof(h)},
+		{.iov_base = ev, .iov_len = ev_size}
+	};
+	size_t total = sizeof(h) + ev_size;
+	struct kvmi_event *e = ev;
+	bool ok = false;
+
+	memset(&h, 0, sizeof(h));
+	h.msg_id = KVMI_EVENT_VCPU;
+	h.seq = new_seq();
+	h.size = ev_size;
+
+	inspect_kvmi_event(e, h.seq);
+
+	vcpu->sock_rsp_buf = resp;
+	vcpu->sock_rsp_size = resp_size;
+	vcpu->sock_rsp_seq = h.seq;
+	WRITE_ONCE(vcpu->sock_rsp_received, 0);
+	WRITE_ONCE(vcpu->sock_rsp_waiting, true);
+
+	if (send_async_event_to_socket(kvm, i, 2, total) == 0)
+		kvmi_handle_controller_request(vcpu);
+
+	kvm_debug("%s: reply for vcpu:%d event:%d (%s)\n", __func__, e->vcpu,
+		  e->event, event_str(e->event));
+
+	ok = (READ_ONCE(vcpu->sock_rsp_received) > 0);
+	return ok;
+}
+
+int send_async_event_to_socket(struct kvm *kvm, struct kvec *i, size_t n,
+			       size_t bytes)
+{
+	int err;
+
+	read_lock(&kvm->socket_ctx_lock);
+
+	if (kvm->socket_ctx)
+		err = kvmi_socket_send(kvm->socket_ctx, i, n, bytes);
+	else
+		err = -ENOENT;
+
+	read_unlock(&kvm->socket_ctx_lock);
+
+	if (err)
+		kvm_err("%s: kvmi_socket_send() => %d\n", __func__, err);
+
+	return err;
+}
+
+void wakeup_events(struct kvm *kvm)
+{
+	int i;
+	struct kvm_vcpu *vcpu;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		set_sem_req(REQ_CLOSE, vcpu);
+		while (test_bit(REQ_CLOSE, &vcpu->sem_requests))
+			cpu_relax();
+	}
+	mutex_unlock(&kvm->lock);
+}
+
+void __release_kvm_socket(struct kvm *kvm)
+{
+	if (kvm->socket_ctx) {
+		kvmi_socket_release(kvm->socket_ctx);
+		kvm->socket_ctx = NULL;
+	}
+}
+
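+/*
+ * Feed the emulator the custom context bytes installed by a
+ * KVMI_EVENT_SET_CTX reply (see kvmi_page_fault()).
+ */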
+int kvmi_patch_emul_instr(struct kvm_vcpu *vcpu, void *val, unsigned int bytes)
+{
+	u32 size;
+
+	if (bytes > vcpu->ctx_size) {
+		kvm_err("%s: requested %u bytes(s) but only %u available\n",
+			__func__, bytes, vcpu->ctx_size);
+		return X86EMUL_UNHANDLEABLE;
+	}
+	size = min(vcpu->ctx_size, bytes);
+	memcpy(val, &vcpu->ctx_data[vcpu->ctx_pos], size);
+	vcpu->ctx_size -= size;
+	vcpu->ctx_pos += size;
+	return X86EMUL_CONTINUE;
+}
diff --git a/virt/kvm/kvmi.h b/virt/kvm/kvmi.h
new file mode 100644
index 000000000000..736a28862857
--- /dev/null
+++ b/virt/kvm/kvmi.h
@@ -0,0 +1,42 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with the GNU C Library; if not, see
+ * <http://www.gnu.org/licenses/>
+ */
+#ifndef __KVMI_H__
+#define __KVMI_H__
+
+#include <linux/kvm_host.h>
+
+int kvmi_init(void);
+void kvmi_uninit(void);
+void kvmi_vm_powered_on(struct kvm *kvm);
+void kvmi_vm_powered_off(struct kvm *kvm);
+bool kvmi_cr_event(struct kvm_vcpu *vcpu, unsigned int cr,
+		   unsigned long old_value, unsigned long *new_value);
+bool kvmi_msr_event(struct kvm_vcpu *vcpu, unsigned int msr,
+		    u64 old_value, u64 *new_value);
+void kvmi_xsetbv_event(struct kvm_vcpu *vcpu, u64 value);
+bool kvmi_breakpoint_event(struct kvm_vcpu *vcpu, u64 gpa);
+void kvmi_vmcall_event(struct kvm_vcpu *vcpu);
+bool kvmi_page_fault(struct kvm_vcpu *vcpu, unsigned long gpa,
+		     unsigned long gva, unsigned int mode, unsigned int *opts);
+void kvmi_trap_event(struct kvm_vcpu *vcpu, unsigned int vector,
+		     unsigned int type, unsigned int err, u64 cr2);
+void kvmi_flush_mem_access(struct kvm_vcpu *vcpu);
+void kvmi_handle_controller_request(struct kvm_vcpu *vcpu);
+int kvmi_patch_emul_instr(struct kvm_vcpu *vcpu, void *val, unsigned int bytes);
+
+#endif
diff --git a/virt/kvm/kvmi_socket.c b/virt/kvm/kvmi_socket.c
new file mode 100644
index 000000000000..7e88693efdf4
--- /dev/null
+++ b/virt/kvm/kvmi_socket.c
@@ -0,0 +1,412 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with the GNU C Library; if not, see
+ * <http://www.gnu.org/licenses/>
+ */
+#include <linux/kernel.h>
+#include <linux/net.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/un.h>
+#include <linux/namei.h>
+#include <linux/kvm_host.h>
+#include <linux/kconfig.h>
+#include <net/sock.h>
+#include <net/net_namespace.h>
+#include <net/vsock_addr.h>
+
+#include "kvmi_socket.h"
+
+#define SEND_TIMEOUT_SECS 2
+
+struct worker {
+	struct work_struct work;
+	wait_queue_head_t wait;	/* accept_cb */
+	struct completion finished;
+	struct socket *s;
+	kvmi_socket_use_cb cb;
+	void *cb_ctx;
+	void (*orig_sk_state_change)(struct sock *sk);	/* accept_cb */
+	void (*orig_sk_data_ready)(struct sock *sk);
+	atomic_t knocks;	/* accept_cb */
+	bool stopping;
+};
+
+static struct workqueue_struct *wq;
+static struct kmem_cache *cache;
+static struct worker *awork;
+
+static bool should_accept(struct worker *w, struct socket **newsock);
+static int __recv(struct socket *s, void *buf, size_t len);
+static int __send(struct socket *s, struct kvec *i, size_t n, size_t size);
+static int init(int proto, struct sockaddr *addr, size_t addr_len,
+		kvmi_socket_use_cb cb, void *cb_ctx);
+static int init_socket(int proto, struct sockaddr *addr, size_t addr_len,
+		       kvmi_socket_use_cb cb, void *cb_ctx);
+static int read_worker_cb(void *_w, void *buf, size_t len);
+static int read_socket_cb(void *_s, void *buf, size_t len);
+static struct worker *alloc_worker(struct socket *s, kvmi_socket_use_cb cb,
+				   void *cb_ctx, work_func_t fct);
+static void __socket_close(struct socket *s);
+static void accept_cb(struct work_struct *work);
+static void data_ready_cb(struct sock *sk);
+static void restore_socket_callbacks(struct worker *w);
+static void set_socket_callbacks(struct worker *w, bool with_data_ready);
+static void state_change_cb(struct sock *sk);
+static void stop_cb_on_error(struct worker *w, int err);
+static void wakeup_worker(struct worker *w);
+static void work_cb(struct work_struct *work);
+
+int kvmi_socket_start_vsock(unsigned int cid, unsigned int port,
+			    kvmi_socket_use_cb cb, void *cb_ctx)
+{
+	struct sockaddr_vm sa;
+
+	vsock_addr_init(&sa, cid, port);
+
+	return init(PF_VSOCK, (struct sockaddr *) &sa, sizeof(sa), cb, cb_ctx);
+}
+
+int init(int proto, struct sockaddr *addr, size_t addr_len,
+	 kvmi_socket_use_cb cb, void *cb_ctx)
+{
+	int err;
+
+	wq = alloc_workqueue("kvmi/socket", WQ_CPU_INTENSIVE, 0);
+	cache = kmem_cache_create("kvmi/socket", sizeof(struct worker), 0, 0,
+				  NULL);
+
+	if (!wq || !cache) {
+		kvmi_socket_stop();
+		return -ENOMEM;
+	}
+
+	err = init_socket(proto, addr, addr_len, cb, cb_ctx);
+
+	if (err) {
+		kvm_err("kvmi_socket init: %d\n", err);
+		kvmi_socket_stop();
+		return err;
+	}
+
+	return 0;
+}
+
+void kvmi_socket_stop(void)
+{
+	if (!IS_ERR_OR_NULL(awork)) {
+		kvmi_socket_release(awork);
+		awork = NULL;
+	}
+
+	if (wq) {
+		destroy_workqueue(wq);
+		wq = NULL;
+	}
+
+	kmem_cache_destroy(cache);
+	cache = NULL;
+}
+
+static void signal_stop(struct worker *w)
+{
+	WRITE_ONCE(w->stopping, 1);
+}
+
+/*
+ * !!! MUST NOT be called from use_cb !!!
+ */
+void kvmi_socket_release(void *_w)
+{
+	struct worker *w = _w;
+
+	restore_socket_callbacks(w);
+
+	signal_stop(w);
+	wakeup_worker(w);
+
+	wait_for_completion(&w->finished);
+
+	if (w->s)
+		__socket_close(w->s);
+
+	kmem_cache_free(cache, w);
+}
+
+void wakeup_worker(struct worker *w)
+{
+	if (w == awork)
+		wake_up_interruptible(&w->wait);
+}
+
+void __socket_close(struct socket *s)
+{
+	kernel_sock_shutdown(s, SHUT_RDWR);
+	sock_release(s);
+}
+
+int init_socket(int proto, struct sockaddr *addr, size_t addr_len,
+		kvmi_socket_use_cb cb, void *cb_ctx)
+{
+	struct socket *s;
+	int err = sock_create_kern(&init_net, proto, SOCK_STREAM, 0,
+				   &s);
+
+	if (err)
+		return err;
+
+	err = kernel_bind(s, addr, addr_len);
+
+	if (!err)
+		err = kernel_listen(s, 256);
+
+	if (!err) {
+		awork = alloc_worker(s, cb, cb_ctx, accept_cb);
+
+		if (IS_ERR(awork)) {
+			err = PTR_ERR(awork);
+		} else {
+			init_waitqueue_head(&awork->wait);
+			atomic_set(&awork->knocks, 0);
+			set_socket_callbacks(awork, true);
+			queue_work(wq, &awork->work);
+		}
+	}
+
+	if (err)
+		sock_release(s);
+
+	return err;
+}
+
+struct worker *alloc_worker(struct socket *s, kvmi_socket_use_cb cb,
+			    void *cb_ctx, work_func_t fct)
+{
+	struct worker *w = kmem_cache_zalloc(cache, GFP_KERNEL);
+
+	if (!w)
+		return ERR_PTR(-ENOMEM);
+
+	w->s = s;
+	w->cb = cb;
+	w->cb_ctx = cb_ctx;
+
+	init_completion(&w->finished);
+	INIT_WORK(&w->work, fct);
+
+	return w;
+}
+
+void set_socket_callbacks(struct worker *w, bool with_data_ready)
+{
+	struct sock *sk = w->s->sk;
+
+	sk->sk_user_data = w;
+
+	write_lock_bh(&sk->sk_callback_lock);
+
+	if (with_data_ready) {
+		w->orig_sk_data_ready = sk->sk_data_ready;
+		sk->sk_data_ready = data_ready_cb;
+	}
+
+	w->orig_sk_state_change = sk->sk_state_change;
+	sk->sk_state_change = state_change_cb;
+
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+void restore_socket_callbacks(struct worker *w)
+{
+	struct sock *sk = w->s->sk;
+
+	write_lock_bh(&sk->sk_callback_lock);
+
+	if (w->orig_sk_data_ready)
+		sk->sk_data_ready = w->orig_sk_data_ready;
+
+	sk->sk_state_change = w->orig_sk_state_change;
+
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+void data_ready_cb(struct sock *sk)
+{
+	struct worker *w = sk->sk_user_data;
+
+	atomic_inc(&w->knocks);
+	wakeup_worker(w);
+}
+
+void state_change_cb(struct sock *sk)
+{
+	struct worker *w = sk->sk_user_data;
+
+	signal_stop(w);
+	wakeup_worker(w);
+}
+
+void accept_cb(struct work_struct *work)
+{
+	struct worker *w = container_of(work, struct worker, work);
+
+	while (1) {
+		struct socket *s = NULL;
+
+		wait_event_interruptible(w->wait, should_accept(w, &s));
+
+		if (READ_ONCE(w->stopping))
+			break;
+
+		s->sk->sk_sndtimeo = SEND_TIMEOUT_SECS * HZ;
+
+		if (!w->cb(w->cb_ctx, read_socket_cb, s)) {
+			kvm_info("%s(%p) drop the last accepted socket\n",
+				 __func__, w);
+			__socket_close(s);
+		}
+	}
+
+	w->cb(w->cb_ctx, NULL, NULL);
+	complete_all(&w->finished);
+}
+
+bool should_accept(struct worker *w, struct socket **newsock)
+{
+	if (READ_ONCE(w->stopping))
+		return true;
+
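+	/*
+	 * Consume one queued data-ready notification, then try a
+	 * non-blocking accept.
+	 */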
+	if (__atomic_add_unless(&w->knocks, -1, 0))
+		return (kernel_accept(w->s, newsock, O_NONBLOCK) != -EAGAIN);
+
+	return false;
+}
+
+int read_socket_cb(void *s, void *buf, size_t len)
+{
+	return __recv((struct socket *) s, buf, len);
+}
+
+int __recv(struct socket *s, void *buf, size_t len)
+{
+	struct kvec i = {
+		.iov_base = buf,
+		.iov_len = len
+	};
+	struct msghdr m = { };
+
+	int rc = kernel_recvmsg(s, &m, &i, 1, i.iov_len, MSG_WAITALL);
+
+	if (unlikely(rc != len)) {
+		struct worker *w = s->sk->sk_user_data;
+		int err = (rc >= 0) ? -ETIMEDOUT : rc;
+
+		kvm_info("%s(%p, %u): %d -> %d\n", __func__, w,
+			 (unsigned int) len, rc, err);
+		return err;
+	}
+
+	return 0;
+}
+
+void *kvmi_socket_monitor(void *s, kvmi_socket_use_cb cb, void *cb_ctx)
+{
+	struct worker *w = alloc_worker((struct socket *) s, cb, cb_ctx,
+					work_cb);
+
+	if (!IS_ERR(w)) {
+		set_socket_callbacks(w, false);
+		queue_work(wq, &w->work);
+	}
+
+	return w;
+}
+
+void work_cb(struct work_struct *work)
+{
+	struct worker *w = container_of(work, struct worker, work);
+
+	while (w->cb(w->cb_ctx, read_worker_cb, w))
+		;
+
+	w->cb(w->cb_ctx, NULL, NULL);
+	complete_all(&w->finished);
+}
+
+void stop_cb_on_error(struct worker *w, int err)
+{
+	if (err != -EAGAIN)
+		signal_stop(w);
+}
+
+int read_worker_cb(void *_w, void *buf, size_t len)
+{
+	struct worker *w = _w;
+	int err;
+
+	if (READ_ONCE(w->stopping))
+		return -ENOENT;
+
+	err = __recv(w->s, buf, len);
+
+	if (unlikely(err)) {
+		kvm_info("%s(%p): %d\n", __func__, w, err);
+		stop_cb_on_error(w, err);
+	}
+
+	return err;
+}
+
+int kvmi_socket_send(void *_w, struct kvec *i, size_t n, size_t size)
+{
+	struct worker *w = _w;
+	int err;
+
+	if (READ_ONCE(w->stopping))
+		return -ENOENT;
+
+	err = __send(w->s, i, n, size);
+
+	if (unlikely(err)) {
+		kvm_info("%s(%p): %d\n", __func__, w, err);
+		stop_cb_on_error(w, err);
+	}
+
+	return err;
+}
+
+int __send(struct socket *s, struct kvec *i, size_t n, size_t size)
+{
+	struct msghdr m = { };
+	int rc = kernel_sendmsg(s, &m, i, n, size);
+
+	if (unlikely(rc != size)) {
+		int err = (rc > 0) ? -ETIMEDOUT : rc;
+		struct worker *w = s->sk->sk_user_data;
+
+		kvm_info("%s(%p): %d -> %d\n", __func__, w, rc, err);
+		return err;
+	}
+
+	return 0;
+}
+
+bool kvmi_socket_is_active(void *_w)
+{
+	struct worker *w = _w;
+	bool running = !completion_done(&w->finished);
+
+	return running;
+}
diff --git a/virt/kvm/kvmi_socket.h b/virt/kvm/kvmi_socket.h
new file mode 100644
index 000000000000..0a89ada84804
--- /dev/null
+++ b/virt/kvm/kvmi_socket.h
@@ -0,0 +1,33 @@
+/*
+ * Copyright (C) 2017 Bitdefender S.R.L.
+ *
+ * The KVMI Library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * The KVMI Library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with the GNU C Library; if not, see
+ * <http://www.gnu.org/licenses/>
+ */
+#ifndef __KVMI_SOCKET_H__
+#define __KVMI_SOCKET_H__
+
+typedef int (*kvmi_socket_read_cb) (void *, void *buf, size_t len);
+typedef bool(*kvmi_socket_use_cb) (void *ctx, kvmi_socket_read_cb read_cb,
+				   void *read_ctx);
+
+int kvmi_socket_start_vsock(unsigned int cid, unsigned int port,
+			    kvmi_socket_use_cb cb, void *cb_ctx);
+void kvmi_socket_stop(void);
+void *kvmi_socket_monitor(void *s, kvmi_socket_use_cb cb, void *cb_ctx);
+int kvmi_socket_send(void *s, struct kvec *i, size_t n, size_t size);
+void kvmi_socket_release(void *s);
+bool kvmi_socket_is_active(void *s);
+
+#endif
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 09/19] kvm: Hook in kvmi on VM on/off events
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (7 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 08/19] kvm: Add the introspection subsystem Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 10/19] kvm: vmx: Hook in kvmi_page_fault() Adalbert Lazar
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Notify the guest introspection tool when a VM is created
(KVMI_EVENT_GUEST_ON) or destroyed (KVMI_EVENT_GUEST_OFF).

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 virt/kvm/kvm_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c819b6b0a36e..179b688a8aef 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -784,6 +784,7 @@ static int kvm_vm_release(struct inode *inode, struct file *filp)
 {
 	struct kvm *kvm = filp->private_data;
 
+	kvmi_vm_powered_off(kvm);
 	kvm_irqfd_release(kvm);
 
 	kvm_put_kvm(kvm);
@@ -2574,6 +2575,8 @@ static long kvm_vcpu_ioctl(struct file *filp,
 				synchronize_rcu();
 			put_pid(oldpid);
 		}
+		if (!test_and_set_bit(0, &vcpu->kvm->introduced))
+			kvmi_vm_powered_on(vcpu->kvm);
 		r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 10/19] kvm: vmx: Hook in kvmi_page_fault()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (8 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 09/19] kvm: Hook in kvmi on VM on/off events Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event() Adalbert Lazar
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Notify the guest introspection tool when a #PF occurs due to a failed
permission check in the shadow page tables.

This call and the code involved in managing the shadow page table
permissions are the essence of a security solution built on guest
introspection facilities.

The shadow page tables are used to guarantee the purpose of memory
areas inside the guest (code, rodata, stack, heap etc.). Each attempt
at an operation unfitting for a certain memory range (e.g. executing
code in the heap) triggers a #PF and gives the introspection tool the
chance to audit the code attempting the operation. The possible
responses (see the sketch below) can be:

 * allow it
 * allow it via emulation
 * allow it via emulation and with custom input (see the 'Change
 emulation context' patch)
 * deny it by skipping the instruction

The #PF event is generated only for pages in which the guest
introspection tool has shown interest (i.e. pages whose permissions it
has previously adjusted).

Page size is essential for performance (the smaller the better), which
is why huge pages should be split. At the time of writing this patch,
they are disabled with CONFIG_TRANSPARENT_HUGEPAGE=n.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/mmu.c              | 51 +++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx.c              | 24 ++++++++++++++++---
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40d1ee68474a..8d1d80bd2230 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1238,8 +1238,8 @@ void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu);
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
-int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva, u64 error_code,
-		       void *insn, int insn_len);
+int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
+		       void *insn, int insn_len, unsigned long gva, bool pf);
 void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva);
 void kvm_mmu_new_cr3(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 12e4c33ff879..3d2527626694 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -40,6 +40,8 @@
 #include <linux/uaccess.h>
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
+#include <linux/kvmi.h>
+#include "../../../../virt/kvm/kvmi.h"
 
 #include <asm/page.h>
 #include <asm/cmpxchg.h>
@@ -4723,11 +4725,46 @@ static void make_mmu_pages_available(struct kvm_vcpu *vcpu)
 	kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
 }
 
+static enum emulation_result __kvm_mmu_page_fault(struct kvm_vcpu *vcpu,
+						  gpa_t gpa, unsigned long gva,
+						  bool *again)
+{
+	unsigned int opts = 0;
+	unsigned long eq = vcpu->arch.exit_qualification;
+	u64 spte = kvm_mmu_get_spte(vcpu->kvm, vcpu, gpa);
+	enum emulation_result er = EMULATE_FAIL;
+
+	if (spte == -ENOENT) {
+		/* The SPTE is not present */
+		*again = true;
+		return EMULATE_FAIL;
+	}
+
+	if (!kvmi_page_fault(vcpu, gpa, gva, eq, &opts))
+		return EMULATE_FAIL;
+
+	if (opts & KVMI_EVENT_NOEMU)
+		er = EMULATE_DONE;
+	else {
+		er = x86_emulate_instruction(vcpu, gpa, 0, NULL, 0);
+
+		vcpu->ctx_size = 0;
+		vcpu->ctx_pos = 0;
+
+		if (er != EMULATE_DONE)
+			kvm_err("%s: emulate failed (err: %d, gpa: %llX)\n",
+			     __func__, er, gpa);
+	}
+
+	return er;
+}
+
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
-		       void *insn, int insn_len)
+		       void *insn, int insn_len, unsigned long gva, bool pf)
 {
 	int r, emulation_type = EMULTYPE_RETRY;
 	enum emulation_result er;
+	bool again = false;
 	bool direct = vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu);
 
 	if (unlikely(error_code & PFERR_RSVD_MASK)) {
@@ -4742,12 +4779,21 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 			return r;
 	}
 
+	if (pf) {
+		er = __kvm_mmu_page_fault(vcpu, cr2, gva, &again);
+		if (er != EMULATE_FAIL)
+			goto check_er;
+	}
+
 	r = vcpu->arch.mmu.page_fault(vcpu, cr2, lower_32_bits(error_code),
 				      false);
 	if (r < 0)
 		return r;
-	if (!r)
+	if (!r) {
+		if (again)
+			__kvm_mmu_page_fault(vcpu, cr2, gva, &again);
 		return 1;
+	}
 
 	/*
 	 * Before emulating the instruction, check if the error code
@@ -4769,6 +4815,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u64 error_code,
 emulate:
 	er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
 
+check_er:
 	switch (er) {
 	case EMULATE_DONE:
 		return 1;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7a594cfcb2ea..f99fcc86f141 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5653,7 +5653,8 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 
 		if (kvm_event_needs_reinjection(vcpu))
 			kvm_mmu_unprotect_page_virt(vcpu, cr2);
-		return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0);
+		return kvm_mmu_page_fault(vcpu, cr2, error_code, NULL, 0, 0,
+					  false);
 	}
 
 	ex_no = intr_info & INTR_INFO_VECTOR_MASK;
@@ -6204,6 +6205,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
+	bool pf = false;
+	unsigned long gla = 0;
 	unsigned long exit_qualification;
 	gpa_t gpa;
 	u32 error_code;
@@ -6234,6 +6237,21 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(gpa, exit_qualification);
 
+	if ((exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)) {
+		pf  = true;
+		gla = vmcs_readl(GUEST_LINEAR_ADDRESS);
+
+		/*
+		 * It can happen that kvm_read_cr3() returns 0 even though
+		 * the page fault took place as a result of a guest page
+		 * table translation.
+		 *
+		 * TODO: Fix kvm_read_cr3(). The problem is in is_paging()
+		 */
+		vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+		__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+	}
+
 	/* Is it a read fault? */
 	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
 		     ? PFERR_USER_MASK : 0;
@@ -6252,7 +6270,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	vcpu->arch.gpa_available = true;
 	vcpu->arch.exit_qualification = exit_qualification;
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0, gla, pf);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
@@ -6273,7 +6291,7 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
 					      EMULATE_DONE;
 
 	if (unlikely(ret == RET_MMIO_PF_INVALID))
-		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
+		return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0, 0, false);
 
 	if (unlikely(ret == RET_MMIO_PF_RETRY))
 		return 1;
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (9 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 10/19] kvm: vmx: Hook in kvmi_page_fault() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-21 11:48   ` Paolo Bonzini
  2017-06-16 13:43 ` [RFC PATCH 12/19] kvm: x86: Hook in kvmi_trap_event() Adalbert Lazar
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Inform the guest introspection tool that a breakpoint instruction (INT3)
is being executed. These one-byte instructions are placed in the slack
space of various functions and used as notifications for when the OS or
an application has reached a certain state or is trying to perform a
certain operation (like creating a process).
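
A rough sketch of the tool side; hook_lookup() and note_guest_state()
are placeholders, and the convention that a handled event keeps the #BP
away from the guest is an assumption:

  struct hook *hook_lookup(__u64 gpa);		/* placeholder */
  void note_guest_state(struct hook *h);	/* placeholder */

  /* Sketch: tool-side handling of a reported INT3. */
  static bool on_breakpoint(__u64 gpa)
  {
  	struct hook *h = hook_lookup(gpa);	/* one of our INT3s? */

  	if (!h)
  		return false;	/* not ours: deliver #BP to the guest */

  	note_guest_state(h);	/* e.g. "a process is being created" */
  	return true;		/* consumed, resume the guest */
  }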

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/svm.c              |  3 +++
 arch/x86/kvm/vmx.c              |  3 +++
 arch/x86/kvm/x86.c              | 14 ++++++++++++++
 4 files changed, 22 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8d1d80bd2230..7024f8e3962b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1434,4 +1434,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 }
 
 void kvm_arch_msr_intercept(unsigned int msr, bool enable);
+int kvm_breakpoint(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7f1b00b74199..69d4d5c9e469 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2133,6 +2133,9 @@ static int bp_interception(struct vcpu_svm *svm)
 {
 	struct kvm_run *kvm_run = svm->vcpu.run;
 
+	if (kvm_breakpoint(&svm->vcpu))
+		return 1;
+
 	kvm_run->exit_reason = KVM_EXIT_DEBUG;
 	kvm_run->debug.arch.pc = svm->vmcb->save.cs.base + svm->vmcb->save.rip;
 	kvm_run->debug.arch.exception = BP_VECTOR;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f99fcc86f141..405b739cd07b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5682,6 +5682,9 @@ static int handle_exception(struct kvm_vcpu *vcpu)
 		kvm_run->debug.arch.dr7 = vmcs_readl(GUEST_DR7);
 		/* fall through */
 	case BP_VECTOR:
+		if (kvm_breakpoint(vcpu))
+			return 1;
+
 		/*
 		 * Update instruction length as we may reinject #BP from
 		 * user space while in guest debugging mode. Reading it for
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a47f640a7b5..3a50710629b5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -54,6 +54,7 @@
 #include <linux/kvm_irqfd.h>
 #include <linux/irqbypass.h>
 #include <linux/sched/stat.h>
+#include "../../../../virt/kvm/kvmi.h"
 
 #include <trace/events/kvm.h>
 
@@ -8740,6 +8741,19 @@ void kvm_arch_msr_intercept(unsigned int msr, bool enable)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_msr_intercept);
 
+int kvm_breakpoint(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa;
+	struct kvm_segment cs;
+
+	kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
+	gpa = kvm_mmu_gva_to_gpa_read(vcpu, cs.base + kvm_rip_read(vcpu), NULL);
+	if (kvmi_breakpoint_event(vcpu, gpa))
+		return 0;
+	return 1;
+}
+EXPORT_SYMBOL_GPL(kvm_breakpoint);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 12/19] kvm: x86: Hook in kvmi_trap_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (10 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 13/19] kvm: x86: Hook in kvmi_cr_event() Adalbert Lazar
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Inform the guest introspection tool that a trap was successfully
injected.

The tool may queue a page fault only to have it overwritten by an
interrupt picked up during guest re-entry. kvmi_trap_event() informs
the tool of all pending traps, giving it a chance to determine whether
it should try again later.
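
On the tool side this could amount to the sketch below; the vcpu_ctx
bookkeeping and inject_pf() (a wrapper around KVMI_INJECT_PAGE_FAULT)
are placeholders:

  #define PF_VECTOR 14	/* architectural #PF vector */

  struct vcpu_ctx {
  	bool pending_pf;	/* a #PF we queued is outstanding */
  	/* ... */
  };

  void inject_pf(struct vcpu_ctx *vc);	/* placeholder */

  /* Sketch: re-queue a #PF displaced by another injected event. */
  static void on_trap(struct vcpu_ctx *vc, unsigned int vector,
  		    unsigned int type, __u32 error_code, __u64 cr2)
  {
  	if (!vc->pending_pf)
  		return;

  	if (vector == PF_VECTOR)
  		vc->pending_pf = false;	/* our #PF went through */
  	else
  		inject_pf(vc);		/* overwritten, try again */
  }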

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3a50710629b5..29d07f8aa7fa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6928,6 +6928,30 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
 	}
 
+	if (atomic_read(&vcpu->arch.next_interrupt_enabled)) {
+		if (vcpu->arch.exception.pending) {
+			unsigned int nr = vcpu->arch.exception.nr;
+			unsigned int type;
+
+			if (kvm_exception_is_soft(nr))
+				type = INTR_TYPE_SOFT_EXCEPTION;
+			else
+				type = INTR_TYPE_HARD_EXCEPTION;
+			kvmi_trap_event(vcpu, nr, type,
+					vcpu->arch.exception.error_code,
+					vcpu->arch.cr2);
+		} else if (vcpu->arch.interrupt.pending) {
+			unsigned int nr = vcpu->arch.interrupt.nr;
+			unsigned int type;
+
+			if (vcpu->arch.interrupt.soft)
+				type = INTR_TYPE_SOFT_INTR;
+			else
+				type = INTR_TYPE_EXT_INTR;
+			kvmi_trap_event(vcpu, nr, type, 0, vcpu->arch.cr2);
+		}
+	}
+
 	kvm_x86_ops->run(vcpu);
 
 	/*
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 13/19] kvm: x86: Hook in kvmi_cr_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (11 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 12/19] kvm: x86: Hook in kvmi_trap_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 14/19] kvm: x86: Hook in kvmi_xsetbv_event() Adalbert Lazar
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Notify the guest introspection tool that cr{0,3,4} is about to be
changed. kvmi_cr_event() lets the (possibly tool-adjusted) new value
be loaded into crX only if the tool permits the change.
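
For illustration, a tool-side policy could deny dangerous transitions,
e.g. clearing CR4.SMEP (on_cr_event() and its reply convention are
placeholders; returning 0, i.e. no KVMI_EVENT_ALLOW, blocks the write):

  #define X86_CR4_SMEP (1UL << 20)	/* architectural bit */

  /* Sketch: refuse attempts to clear CR4.SMEP. */
  static unsigned int on_cr_event(unsigned int cr, __u64 old_val,
  				__u64 new_val)
  {
  	if (cr == 4 && (old_val & X86_CR4_SMEP) && !(new_val & X86_CR4_SMEP))
  		return 0;		/* block the modification */

  	return KVMI_EVENT_ALLOW;
  }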

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29d07f8aa7fa..5298f93412db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -648,6 +648,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
 		return 1;
 
+	if (old_cr0 != cr0 && !kvmi_cr_event(vcpu, 0, old_cr0, &cr0))
+		return 1;
+
 	kvm_x86_ops->set_cr0(vcpu, cr0);
 
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
@@ -785,6 +788,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 	}
 
+	if (!kvmi_cr_event(vcpu, 4, old_cr4, &cr4))
+		return 1;
+
 	if (kvm_x86_ops->set_cr4(vcpu, cr4))
 		return 1;
 
@@ -801,11 +807,13 @@ EXPORT_SYMBOL_GPL(kvm_set_cr4);
 
 int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
+	unsigned long old_cr3 = kvm_read_cr3(vcpu);
+
 #ifdef CONFIG_X86_64
 	cr3 &= ~CR3_PCID_INVD;
 #endif
 
-	if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
+	if (cr3 == old_cr3 && !pdptrs_changed(vcpu)) {
 		kvm_mmu_sync_roots(vcpu);
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		return 0;
@@ -818,6 +826,9 @@ int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
 		return 1;
 
+	if (!kvmi_cr_event(vcpu, 3, old_cr3, &cr3))
+		return 1;
+
 	vcpu->arch.cr3 = cr3;
 	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
 	kvm_mmu_new_cr3(vcpu);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 14/19] kvm: x86: Hook in kvmi_xsetbv_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (12 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 13/19] kvm: x86: Hook in kvmi_cr_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 15/19] kvm: x86: Hook in kvmi_msr_event() Adalbert Lazar
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Notify the guest introspection tool that the extended control register
(XCR0) is being changed.

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5298f93412db..248fb7e99423 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -728,6 +728,10 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 		if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
 			return 1;
 	}
+
+	if (xcr0 != old_xcr0)
+		kvmi_xsetbv_event(vcpu, xcr);
+
 	vcpu->arch.xcr0 = xcr0;
 
 	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 15/19] kvm: x86: Hook in kvmi_msr_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (13 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 14/19] kvm: x86: Hook in kvmi_xsetbv_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 16/19] kvm: x86: Change the emulation context Adalbert Lazar
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Inform the guest introspection tool that an MSR is going to be changed.

The kvmi_msr_event() function checks a bitmap of MSRs of interest
(configured via a KVMI_EVENT_CONTROL(KVMI_MSR_CONTROL) request) and, if
the new value differs from the previous one, generates a notification.
The introspection tool can respond by allowing the guest to continue
with normal execution or by discarding the change.

This is meant to prevent malicious changes to MSRs such as
MSR_IA32_SYSENTER_EIP.
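
A tool-side sketch of such a policy; since the kernel side only reports
writes that actually change the value, a protected MSR can simply be
denied (the reply convention mirrors the CR case and is assumed):

  #define MSR_IA32_SYSENTER_EIP	0x00000176	/* architectural */
  #define MSR_LSTAR		0xc0000082	/* architectural */

  /* Sketch: discard changes to the syscall entry point MSRs. */
  static unsigned int on_msr_event(__u32 msr, __u64 old_val, __u64 new_val)
  {
  	switch (msr) {
  	case MSR_IA32_SYSENTER_EIP:
  	case MSR_LSTAR:
  		return 0;		/* discard the change */
  	default:
  		return KVMI_EVENT_ALLOW;
  	}
  }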

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 248fb7e99423..b7d2a9901665 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1090,6 +1090,23 @@ EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
  */
 int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
+	if (!msr->host_initiated) {
+		struct msr_data __msr;
+
+		memset(&__msr, 0, sizeof(__msr));
+		__msr.host_initiated = true;
+		__msr.index = msr->index;
+
+		if (!kvm_get_msr(vcpu, &__msr)) {
+			u64 data = msr->data;
+
+			if (kvmi_msr_event(vcpu, msr->index, __msr.data, &data))
+				msr->data = data;
+			else
+				return 0;
+		}
+	}
+
 	switch (msr->index) {
 	case MSR_FS_BASE:
 	case MSR_GS_BASE:
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 16/19] kvm: x86: Change the emulation context
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (14 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 15/19] kvm: x86: Hook in kvmi_msr_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 17/19] kvm: x86: Hook in kvmi_vmcall_event() Adalbert Lazar
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Certain instructions that generate a #PF because the read bit is unset
in the corresponding spte need to be emulated with a given input
(usually 8 bytes in length).

This is used to hide code injected by the introspection tool from
integrity checkers running inside the guest.
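
kvmi_patch_emul_instr() itself belongs to the introspection subsystem
patch; a plausible shape, with ctx_data assumed as the companion of the
ctx_size/ctx_pos fields used in the previous patch, would be:

  /* Sketch only: serve emulated reads from the tool-provided buffer. */
  static int patch_emul_instr(struct kvm_vcpu *vcpu, void *val,
  			    unsigned int bytes)
  {
  	unsigned int n = min_t(unsigned int, bytes,
  			       vcpu->ctx_size - vcpu->ctx_pos);

  	memcpy(val, vcpu->ctx_data + vcpu->ctx_pos, n);
  	vcpu->ctx_pos += n;

  	return X86EMUL_CONTINUE;
  }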

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b7d2a9901665..9465856a9e37 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4480,6 +4480,10 @@ static int kvm_read_guest_virt_system(struct x86_emulate_ctxt *ctxt,
 				      struct x86_exception *exception)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+	if (vcpu->ctx_size)
+		return kvmi_patch_emul_instr(vcpu, val, bytes);
+
 	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
@@ -4487,7 +4491,12 @@ static int kvm_read_guest_phys_system(struct x86_emulate_ctxt *ctxt,
 		unsigned long addr, void *val, unsigned int bytes)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	int r = kvm_vcpu_read_guest(vcpu, addr, val, bytes);
+	int r;
+
+	if (vcpu->ctx_size)
+		return kvmi_patch_emul_instr(vcpu, val, bytes);
+
+	r = kvm_vcpu_read_guest(vcpu, addr, val, bytes);
 
 	return r < 0 ? X86EMUL_IO_NEEDED : X86EMUL_CONTINUE;
 }
@@ -4773,6 +4782,11 @@ static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
 				  unsigned int bytes,
 				  struct x86_exception *exception)
 {
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+
+	if (vcpu->ctx_size)
+		return kvmi_patch_emul_instr(vcpu, val, bytes);
+
 	return emulator_read_write(ctxt, addr, val, bytes,
 				   exception, &read_emultor);
 }
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 17/19] kvm: x86: Hook in kvmi_vmcall_event()
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (15 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 16/19] kvm: x86: Change the emulation context Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 18/19] kvm: x86: Set the new spte flags before entering the guest Adalbert Lazar
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Code residing inside the introspected guest can call the introspection
tool to report certain details about its operation. For example, a
classic antimalware remediation tool can report what it has found during
a scan.

The VMCALL convention is the one used on Xen (hypercall number plus
subop, matching __HYPERVISOR_hvm_op). This code is largely untested;
its purpose is only to show how guest code can communicate with the
introspection tool.
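
For a 64-bit guest the call site would look roughly as follows; the
register layout (hypercall number in RAX, subop in RDI) matches the
handler below, and a 32-bit guest would pass the subop in EBX instead:

  /* Guest-side sketch: report an event to the introspection tool. */
  static inline long guest_request_vm_event(void)
  {
  	long ret = KVM_HC_XEN_HVM_OP;

  	asm volatile("vmcall"
  		     : "+a" (ret)
  		     : "D" ((unsigned long)KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT)
  		     : "memory");

  	return ret;
  }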

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c            | 15 +++++++++++++++
 include/uapi/linux/kvm_para.h |  4 ++++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9465856a9e37..cafe878ba148 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6302,6 +6302,21 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = kvm_pv_clock_pairing(vcpu, a0, a1);
 		break;
 #endif
+	case KVM_HC_XEN_HVM_OP:{
+		unsigned long subop;
+
+		if (op_64_bit) {
+			subop = kvm_register_read(vcpu, VCPU_REGS_RDI);
+			subop &= 0xFFFFFFFF;
+		} else
+			subop = a0;
+
+		if (subop == KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT)
+			kvmi_vmcall_event(vcpu);
+
+		ret = kvm_register_read(vcpu, VCPU_REGS_RAX);
+		break;
+	}
 	default:
 		ret = -KVM_ENOSYS;
 		break;
diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h
index fed506aeff62..297b75435831 100644
--- a/include/uapi/linux/kvm_para.h
+++ b/include/uapi/linux/kvm_para.h
@@ -25,6 +25,10 @@
 #define KVM_HC_MIPS_EXIT_VM		7
 #define KVM_HC_MIPS_CONSOLE_OUTPUT	8
 #define KVM_HC_CLOCK_PAIRING		9
+#define KVM_HC_XEN_HVM_OP		34
+/* Matches Xen's __HYPERVISOR_hvm_op */
+
+#define KVM_HC_XEN_HVM_OP_GUEST_REQUEST_VM_EVENT 24
 
 /*
  * hypercalls use architecture specific
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 18/19] kvm: x86: Set the new spte flags before entering the guest
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (16 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 17/19] kvm: x86: Hook in kvmi_vmcall_event() Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 13:43 ` [RFC PATCH 19/19] kvm: x86: Handle KVM_REQ_INTROSPECTION Adalbert Lazar
  2017-06-16 14:45 ` [RFC PATCH 00/19] Guest introspection Jan Kiszka
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

From: Mihai Dontu <mdontu@bitdefender.com>

Apply, before entering the guest, the changes made to the shadow page
tables by the guest introspection tool. These changes involve only the
page permission bits.
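
A plausible shape for kvmi_flush_mem_access(), with the pending list
and its locking left out as assumptions and kvm_mmu_set_spte() being
the helper added at the start of this series:

  struct pending_access {	/* assumed bookkeeping */
  	struct list_head link;
  	gpa_t gpa;
  	unsigned int access;	/* rwx bits */
  };

  /* Sketch only: apply pending permission changes before VM entry. */
  static void flush_mem_access(struct kvm_vcpu *vcpu)
  {
  	struct pending_access *p, *tmp;

  	list_for_each_entry_safe(p, tmp, &vcpu->kvm->access_list, link) {
  		kvm_mmu_set_spte(vcpu->kvm, vcpu, p->gpa, p->access);
  		list_del(&p->link);
  		kfree(p);
  	}
  }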

Signed-off-by: Mihai Dontu <mdontu@bitdefender.com>
---
 arch/x86/kvm/x86.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cafe878ba148..30f4d301453c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6793,6 +6793,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	bool req_immediate_exit = false;
 
+	kvmi_flush_mem_access(vcpu);
+
 	if (vcpu->requests) {
 		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
 			kvm_mmu_unload(vcpu);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH 19/19] kvm: x86: Handle KVM_REQ_INTROSPECTION
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (17 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 18/19] kvm: x86: Set the new spte flags before entering the guest Adalbert Lazar
@ 2017-06-16 13:43 ` Adalbert Lazar
  2017-06-16 14:45 ` [RFC PATCH 00/19] Guest introspection Jan Kiszka
  19 siblings, 0 replies; 38+ messages in thread
From: Adalbert Lazar @ 2017-06-16 13:43 UTC (permalink / raw)
  To: kvm; +Cc: Paolo Bonzini, Radim Krčmář, alazar, mdontu

This VCPU request is needed to handle introspection requests: pause,
unpause, command, reply, close.
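
The kvmi side would raise it with the usual request/kick pattern,
roughly:

  /* Sketch: pull a vcpu out of guest mode to service a kvmi request. */
  static void kvmi_kick_vcpu(struct kvm_vcpu *vcpu)
  {
  	kvm_make_request(KVM_REQ_INTROSPECTION, vcpu);
  	kvm_vcpu_kick(vcpu);
  }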

Signed-off-by: Adalbert Lazar <alazar@bitdefender.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 30f4d301453c..32a757939474 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6880,6 +6880,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		 */
 		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
 			kvm_hv_process_stimers(vcpu);
+
+		if (kvm_check_request(KVM_REQ_INTROSPECTION, vcpu))
+			kvmi_handle_controller_request(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
                   ` (18 preceding siblings ...)
  2017-06-16 13:43 ` [RFC PATCH 19/19] kvm: x86: Handle KVM_REQ_INTROSPECTION Adalbert Lazar
@ 2017-06-16 14:45 ` Jan Kiszka
  2017-06-16 15:18   ` Mihai Donțu
  19 siblings, 1 reply; 38+ messages in thread
From: Jan Kiszka @ 2017-06-16 14:45 UTC (permalink / raw)
  To: Adalbert Lazar, kvm; +Cc: Paolo Bonzini, Radim Krčmář, mdontu

On 2017-06-16 15:43, Adalbert Lazar wrote:
> This patch series proposes an interface that will allow a guest
> introspection tool to monitor and control other guests, in order to
> protect them against different forms of exploits. This type of interface
> is already present in the XEN hypervisor.
> 
> With the current implementation, the introspection tool connects to
> the KVMi (the introspection subsystem from KVM) using a vsock socket,
> establishes a main communication channel, used for a few messages
> (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
> KVMI_GET_VERSION).
> 
> Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
> establish a new connection, used to monitor and control that guest.
> 

What prevented building this on top of the already existing guest debug
interfaces of KVM, maybe extending it where needed? Could be win-win.

Also, this looks like as if it can easily work against the userspace
part of the hypervisor - bad idea.

API/ABI documentation is missing.

Did you check if the concept is portable to other architectures? Another
reason to try hard to reuse existing interfaces.

Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 14:45 ` [RFC PATCH 00/19] Guest introspection Jan Kiszka
@ 2017-06-16 15:18   ` Mihai Donțu
  2017-06-16 15:34     ` Jan Kiszka
  2017-06-16 17:05     ` Paolo Bonzini
  0 siblings, 2 replies; 38+ messages in thread
From: Mihai Donțu @ 2017-06-16 15:18 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Paolo Bonzini, Radim Krčmář, Adalbert Lazar, kvm

Hi Jan,

On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
> On 2017-06-16 15:43, Adalbert Lazar wrote:
> > This patch series proposes an interface that will allow a guest
> > introspection tool to monitor and control other guests, in order to
> > protect them against different forms of exploits. This type of interface
> > is already present in the XEN hypervisor.
> > 
> > With the current implementation, the introspection tool connects to
> > the KVMi (the introspection subsystem from KVM) using a vsock socket,
> > establishes a main communication channel, used for a few messages
> > (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
> > KVMI_GET_VERSION).
> > 
> > Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
> > establish a new connection, used to monitor and control that guest.

Thank you very much for taking a look over this series!

> What prevented building this on top of the already existing guest debug
> interfaces of KVM, maybe extending it where needed? Could be win-win.

I might be mistaken, but this would require the application using the
introspection capabilities to run on the host. If so, what we are
trying to do is to isolate the application into its own VM. This is why
we use vSock to communicate with the host.

If instead you are suggesting we integrate the kernel-side API into the
debug framework, I see no problem with that right now. We'll need a bit
more time to look into what that entails.

> Also, this looks like as if it can easily work against the userspace
> part of the hypervisor - bad idea.

The way it is implemented right now, it works behind its back (qemu
specifically), in that it intercepts and handles certain events before
it. It should be possible to put some code in qemu and move part of the
logic in it, but we're trying hard to avoid context switches as guest
exits themselves are currently quite expensive. The experience comes
from working with Xen. We have no benchmark numbers for KVM.

> API/ABI documentation is missing.

Understood. We will try to put something together in the coming weeks.

> Did you check if the concept is portable to other architectures? Another
> reason to try hard to reuse existing interfaces.

The API that we propose is the result of work done for x86 and ARM,
though for the latter we're still working on a PoC. It's fairly
generic.

> Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.

Good catch! We'll make the adjustment.

Thank-you!

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 15:18   ` Mihai Donțu
@ 2017-06-16 15:34     ` Jan Kiszka
  2017-06-16 15:59       ` Mihai Donțu
  2017-06-19  9:39       ` Stefan Hajnoczi
  2017-06-16 17:05     ` Paolo Bonzini
  1 sibling, 2 replies; 38+ messages in thread
From: Jan Kiszka @ 2017-06-16 15:34 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: Paolo Bonzini, Radim Krčmář, Adalbert Lazar, kvm

On 2017-06-16 17:18, Mihai Donțu wrote:
> Hi Jan,
> 
> On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
>> On 2017-06-16 15:43, Adalbert Lazar wrote:
>>> This patch series proposes an interface that will allow a guest
>>> introspection tool to monitor and control other guests, in order to
>>> protect them against different forms of exploits. This type of interface
>>> is already present in the XEN hypervisor.
>>>
>>> With the current implementation, the introspection tool connects to
>>> the KVMi (the introspection subsystem from KVM) using a vsock socket,
>>> establishes a main communication channel, used for a few messages
>>> (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
>>> KVMI_GET_VERSION).
>>>
>>> Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
>>> establish a new connection, used to monitor and control that guest.
> 
> Thank you very much for taking a look over this series!
> 
>> What prevented building this on top of the already existing guest debug
>> interfaces of KVM, maybe extending it where needed? Could be win-win.
> 
> I might be mistaking, but this would require the application using the
> introspection capabilities to run on the host. If so, what we are
> trying to do is to isolate the application into its own VM. This is why
> we use vSock to communicate with the host.

Communication alone does not require isolation. Interpretation of what
can be seen may benefit from that, though.

> 
> If instead you are suggesting we integrate the kernel-side API into the
> debug framework, I see no problem with that right now. We'll need a bit
> more time to look into what that entails.

The hypervisor process could terminate your link, providing that other
VM the introspection access. Or you even have a gdb-speaking process
running on the host, just reusing the existing gdbstub of QEMU. Just
wild ideas, I didn't look into details, and you may further elaborate on
your requirements.

> 
>> Also, this looks like as if it can easily work against the userspace
>> part of the hypervisor - bad idea.
> 
> The way it is implemented right now, it works behind its back (qemu
> specifically), in that it intercepts and handles certain events before
> it. It should be possible to put some code in qemu and move part of the
> logic in it, but we're trying hard to avoid context switches as guest
> exits themselves are currently quite expensive. The experience comes
> from working with Xen. We have no benchmark numbers for KVM.

Even if you don't run the hot-paths through QEMU, you should inform it
about what is going on. Starting/stopping behind its back is bad, so is
fiddling with guest stats. Keep in mind that your introspection VM is,
well, just another VM that could be scheduled, suspended or even
migrated away, and then you leave the original VM rather clueless behind.

Migration is actually an interesting topic of its own...

> 
>> API/ABI documentation is missing.
> 
> Understood. We will try to put something together in the coming weeks.
> 
>> Did you check if the concept is portable to other architectures? Another
>> reason to try hard to reuse existing interfaces.
> 
> The API that we propose is the result of work done for x86 and ARM,
> though for the latter we're still working on a PoC. It's fairly
> generic.
> 
>> Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.
> 
> Good catch! We'll make the adjustment.
> 
> Thank-you!
> 

BTW, I remember that there was/is some larger research community
interested in such kind of interfaces as well, or they even have their
own out-of-tree tooling. Hope they will speak up and review your
proposals as well so that the result is of general use.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 15:34     ` Jan Kiszka
@ 2017-06-16 15:59       ` Mihai Donțu
  2017-06-19  9:39       ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Mihai Donțu @ 2017-06-16 15:59 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Paolo Bonzini, Radim Krčmář, Adalbert Lazar, kvm

On Fri, 2017-06-16 at 17:34 +0200, Jan Kiszka wrote:
> On 2017-06-16 17:18, Mihai Donțu wrote:
> > On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
> > > On 2017-06-16 15:43, Adalbert Lazar wrote:
> > > > This patch series proposes an interface that will allow a guest
> > > > introspection tool to monitor and control other guests, in order to
> > > > protect them against different forms of exploits. This type of interface
> > > > is already present in the XEN hypervisor.
> > > > 
> > > > With the current implementation, the introspection tool connects to
> > > > the KVMi (the introspection subsystem from KVM) using a vsock socket,
> > > > establishes a main communication channel, used for a few messages
> > > > (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
> > > > KVMI_GET_VERSION).
> > > > 
> > > > Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
> > > > establish a new connection, used to monitor and control that guest.
> > 
> > Thank you very much for taking a look over this series!
> > 
> > > What prevented building this on top of the already existing guest debug
> > > interfaces of KVM, maybe extending it where needed? Could be win-win.
> > 
> > I might be mistaken, but this would require the application using the
> > introspection capabilities to run on the host. If so, what we are
> > trying to do is to isolate the application into its own VM. This is why
> > we use vSock to communicate with the host.
> 
> Communication alone does not require isolation. Interpretation of what
> can be seen may benefit from that, though.
> 
> > If instead you are suggesting we integrate the kernel-side API into the
> > debug framework, I see no problem with that right now. We'll need a bit
> > more time to look into what that entails.
> 
> The hypervisor process could terminate your link, providing that other
> VM the introspection access. Or you even have a gdb-speaking process
> running on the host, just reusing the existing gdbstub of QEMU. Just
> wild ideas, I didn't look into details, and you may further elaborate on
> your requirements.
> 
> > > Also, this looks like as if it can easily work against the userspace
> > > part of the hypervisor - bad idea.
> > 
> > The way it is implemented right now, it works behind its back (qemu
> > specifically), in that it intercepts and handles certain events before
> > it. It should be possible to put some code in qemu and move part of the
> > logic in it, but we're trying hard to avoid context switches as guest
> > exits themselves are currently quite expensive. The experience comes
> > from working with Xen. We have no benchmark numbers for KVM.
> 
> Even if you don't run the hot-paths through QEMU, you should inform it
> about what is going on. Starting/stopping behind its back is bad, so is
> fiddling with guest stats. Keep in mind that your introspection VM is,
> well, just another VM that could be scheduled, suspended or even
> migrated away, and then you leave the original VM rather clueless behind.
> 
> Migration is actually an interesting topic of its own...

On this topic specifically, a complete security solution using guest
introspection interfaces will need to tap into the management
application and be notified when a migration is taking place. The
reason being, and I'm talking from our perspective only, that the
introspected guest is patched and those patches "talk" with the
security solution. A guest that reaches the destination in this state
will remain in limbo and (so far) can only be recovered through a
reboot.

> > > API/ABI documentation is missing.
> > 
> > Understood. We will try to put something together in the coming weeks.
> > 
> > > Did you check if the concept is portable to other architectures? Another
> > > reason to try hard to reuse existing interfaces.
> > 
> > The API that we propose is the result of work done for x86 and ARM,
> > though for the latter we're still working on a PoC. It's fairly
> > generic.
> > 
> > > Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.
> > 
> > Good catch! We'll make the adjustment.
> > 
> > Thank-you!
> > 
> 
> BTW, I remember that there was/is some larger research community
> interested in such kind of interfaces as well, or they even have their
> own out-of-tree tooling. Hope they will speak up and review your
> proposals as well so that the result is of general use.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 15:18   ` Mihai Donțu
  2017-06-16 15:34     ` Jan Kiszka
@ 2017-06-16 17:05     ` Paolo Bonzini
  2017-06-16 17:27       ` Jan Kiszka
  1 sibling, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-16 17:05 UTC (permalink / raw)
  To: Mihai Donțu, Jan Kiszka
  Cc: Radim Krčmář, Adalbert Lazar, kvm



On 16/06/2017 17:18, Mihai Donțu wrote:
>> Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.
> Good catch! We'll make the adjustment.

LGPL is GPL-compatible.  If there's anything LGPL to share this code
with, I have no issue with that.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 17:05     ` Paolo Bonzini
@ 2017-06-16 17:27       ` Jan Kiszka
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Kiszka @ 2017-06-16 17:27 UTC (permalink / raw)
  To: Paolo Bonzini, Mihai Donțu
  Cc: Radim Krčmář, Adalbert Lazar, kvm

On 2017-06-16 19:05, Paolo Bonzini wrote:
> 
> 
> On 16/06/2017 17:18, Mihai Donțu wrote:
>>> Last but not least: LGPL slipped into your kernel parts - the kernel is GPL.
>> Good catch! We'll make the adjustment.
> 
> LGPL is GPL-compatible.  If there's anything LGPL to share this code
> with, I have no issue with that.

It can falsely suggest that linking against this removes certain
obligations of the GPL that the rest is licensed under. For core code,
it's therefore not a good choice.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-16 15:34     ` Jan Kiszka
  2017-06-16 15:59       ` Mihai Donțu
@ 2017-06-19  9:39       ` Stefan Hajnoczi
  2017-06-20 14:58         ` alazar
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-06-19  9:39 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Mihai Donțu, Paolo Bonzini, Radim Krčmář,
	Adalbert Lazar, kvm

[-- Attachment #1: Type: text/plain, Size: 2655 bytes --]

On Fri, Jun 16, 2017 at 05:34:48PM +0200, Jan Kiszka wrote:
> On 2017-06-16 17:18, Mihai Donțu wrote:
> > Hi Jan,
> > 
> > On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
> >> On 2017-06-16 15:43, Adalbert Lazar wrote:
> >>> This patch series proposes an interface that will allow a guest
> >>> introspection tool to monitor and control other guests, in order to
> >>> protect them against different forms of exploits. This type of interface
> >>> is already present in the XEN hypervisor.
> >>>
> >>> With the current implementation, the introspection tool connects to
> >>> the KVMi (the introspection subsystem from KVM) using a vsock socket,
> >>> establishes a main communication channel, used for a few messages
> >>> (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
> >>> KVMI_GET_VERSION).
> >>>
> >>> Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
> >>> establish a new connection, used to monitor and control that guest.
> > 
> > Thank you very much for taking a look over this series!
> > 
> >> What prevented building this on top of the already existing guest debug
> >> interfaces of KVM, maybe extending it where needed? Could be win-win.
> > 
> > I might be mistaken, but this would require the application using the
> > introspection capabilities to run on the host. If so, what we are
> > trying to do is to isolate the application into its own VM. This is why
> > we use vSock to communicate with the host.
> 
> Communication alone does not require isolation. Interpretation of what
> can be seen may benefit from that, though.
> 
> > 
> > If instead you are suggesting we integrate the kernel-side API into the
> > debug framework, I see no problem with that right now. We'll need a bit
> > more time to look into what that entails.
> 
> The hypervisor process could terminate your link, providing that other
> VM the introspection access. Or you even have a gdb-speaking process
> running on the host, just reusing the existing gdbstub of QEMU. Just
> wild ideas, I didn't look into details, and you may further elaborate on
> your requirements.

QEMU userspace can provide the interface on an AF_VSOCK listen socket.
Here is a rough idea:

  qemu --chardev socket,id=chardev0,type=vsock,port=1234,server,nowait \
       --guest-introspection chardev=chardev0,allowed-cids=10

Since it uses a chardev this means the guest introspection API is also
available via UNIX domain sockets to local applications, etc.

In this example only CID 10 has access to the guest introspection API.
Connections from other VMs will be dropped.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-19  9:39       ` Stefan Hajnoczi
@ 2017-06-20 14:58         ` alazar
  2017-06-20 15:03           ` Jan Kiszka
  2017-06-21 11:04           ` Stefan Hajnoczi
  0 siblings, 2 replies; 38+ messages in thread
From: alazar @ 2017-06-20 14:58 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jan Kiszka
  Cc: Mihai Dontu, Paolo Bonzini, Radim Krčmář, kvm

On Mon, 19 Jun 2017 10:39:28 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Jun 16, 2017 at 05:34:48PM +0200, Jan Kiszka wrote:
> > On 2017-06-16 17:18, Mihai Donțu wrote:
> > > On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
> > >> On 2017-06-16 15:43, Adalbert Lazar wrote:
> > >>> This patch series proposes an interface that will allow a guest
> > >>> introspection tool to monitor and control other guests, in order to
> > >>> protect them against different forms of exploits. This type of interface
> > >>> is already present in the XEN hypervisor.
> > >>>
> > >>> With the current implementation, the introspection tool connects to
> > >>> the KVMi (the introspection subsystem from KVM) using a vsock socket,
> > >>> establishes a main communication channel, used for a few messages
> > >>> (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
> > >>> KVMI_GET_VERSION).
> > >>>
> > >>> Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
> > >>> establish a new connection, used to monitor and control that guest.
> > > 
> > >> What prevented building this on top of the already existing guest debug
> > >> interfaces of KVM, maybe extending it where needed? Could be win-win.
> > > 
> > > I might be mistaken, but this would require the application using the
> > > introspection capabilities to run on the host. If so, what we are
> > > trying to do is to isolate the application into its own VM. This is why
> > > we use vSock to communicate with the host.
> > 
> > The hypervisor process could terminate your link, providing that other
> > VM the introspection access. Or you even have a gdb-speaking process
> > running on the host, just reusing the existing gdbstub of QEMU. Just
> > wild ideas, I didn't look into details, and you may further elaborate on
> > your requirements.
> 
> QEMU userspace can provide the interface on an AF_VSOCK listen socket.
> Here is a rough idea:
> 
>   qemu --chardev socket,id=chardev0,type=vsock,port=1234,server,nowait \
>        --guest-introspection chardev=chardev0,allowed-cids=10
> 
> Since it uses a chardev this means the guest introspection API is also
> available via UNIX domain sockets to local applications, etc.

Exposing the introspection commands and events as IOCTLs can be done easily.

However, dropping the vsock access from the kernel, or having
the IOCTL <-> vsock adapter only in userspace will change more things.
With the proposed API, the guest introspection looks like:

 ------------------                  -----------------------------
|                  |<-- /dev/kvm -->| qemu        VM1             |
|                  |                |-------                      |
|                  |                | Linux |                     |
| KVM              |                 -----------------------------
|                  |<-- /dev/kvm -->| qemu        VM2             |
|                  |                |---------                    |
|                  |                | Windows |                   |
|                  |                 -----------------------------
|                  |<-- /dev/kvm -->| qemu        VM3             |
|    --------------|                |---------------------------  |
|   | kvmi <- vhost-vsock -----------> guest introspection tool | |
 ------------------                  -----------------------------

The introspection tool connects directly to the introspection subsystem (kvmi).

Moving the vsock to userland will change this:

                                     -----------------------------
                 /----- /dev/kvm -->| new_tool (guest on/off/list)|<-- vsock -->\
                 |                   -----------------------------              |
                 |                                                              |
 ----------------v-                  -----------------------------              |
|                  |<-- /dev/kvm -->| qemu        VM1             |<-- vsock -->|
|                  |                |-------                      |             |
|                  |                | Linux |                     |             |
| KVM              |                 -----------------------------              |
|                  |<-- /dev/kvm -->| qemu        VM2             |<-- vsock -->|
|                  |                |---------                    |             |
|                  |                | Windows |                   |             |
|                  |                 -----------------------------              |
|                  |<-- /dev/kvm -->| qemu        VM3      /----->|<-- vsock -->/
|           -------|                |---------------------v----   |
|          | kvmi  |                | guest introspection tool |  |
 ------------------                  -----------------------------

There will be a need for a new tool (and/or a modified libvirt) to get
the guest events (on/off/list) and to change the VM1, VM2 invocations
(to make them connect with the introspection tool). This might also be
a problem with products having the host locked down (e.g. RHEV).

Moving more code to QEMU will make the introspection harder with other
hw emulators (from kvmtool to Google's implementation) because they will
need more than just:
   - migration notification: the introspection tool needs to unhook
     from the introspected guests [minor change]
   - KVM_SET_VM_UUID support [minor change]
   - vsock [medium change], but only if this particular emulator is used
     to start the guest introspection tool

Also, the path between kvmi and the introspection tool (when running
isolated in a guest) will be longer, adding some overhead, which is
a big problem for live introspectors. The EXTERIOR VMI paper shows[1]
a 5-20 times slower execution for small programs (ps, uptime).
So, a shorter path will help keep the overhead low.

My colleague will follow up with some stats collected during an
introspection session. Hopefully they will shed more light on the
performance required from the tool-KVM communication channel.

[1]: https://www.utdallas.edu/~zhiqiang.lin/file/VEE13-Slides.pdf

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-20 14:58         ` alazar
@ 2017-06-20 15:03           ` Jan Kiszka
  2017-06-21 11:04           ` Stefan Hajnoczi
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Kiszka @ 2017-06-20 15:03 UTC (permalink / raw)
  To: alazar, Stefan Hajnoczi
  Cc: Mihai Dontu, Paolo Bonzini, Radim Krčmář, kvm

On 2017-06-20 16:58, alazar@bitdefender.com wrote:
> On Mon, 19 Jun 2017 10:39:28 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Fri, Jun 16, 2017 at 05:34:48PM +0200, Jan Kiszka wrote:
>>> On 2017-06-16 17:18, Mihai Donțu wrote:
>>>> On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
>>>>> On 2017-06-16 15:43, Adalbert Lazar wrote:
>>>>>> This patch series proposes an interface that will allow a guest
>>>>>> introspection tool to monitor and control other guests, in order to
>>>>>> protect them against different forms of exploits. This type of interface
>>>>>> is already present in the XEN hypervisor.
>>>>>>
>>>>>> With the current implementation, the introspection tool connects to
>>>>>> the KVMi (the introspection subsystem from KVM) using a vsock socket,
>>>>>> establishes a main communication channel, used for a few messages
>>>>>> (KVMI_EVENT_GUEST_ON, KVMI_EVENT_GUEST_OFF, KVMI_GET_GUESTS and
>>>>>> KVMI_GET_VERSION).
>>>>>>
>>>>>> Every KVMI_EVENT_GUEST_ON notification, makes the introspection tool
>>>>>> establish a new connection, used to monitor and control that guest.
>>>>
>>>>> What prevented building this on top of the already existing guest debug
>>>>> interfaces of KVM, maybe extending it where needed? Could be win-win.
>>>>
>>>> I might be mistaken, but this would require the application using the
>>>> introspection capabilities to run on the host. If so, what we are
>>>> trying to do is to isolate the application into its own VM. This is why
>>>> we use vSock to communicate with the host.
>>>
>>> The hypervisor process could terminate your link, providing that other
>>> VM the introspection access. Or you even have a gdb-speaking process
>>> running on the host, just reusing the existing gdbstub of QEMU. Just
>>> wild ideas, I didn't look into details, and you may further elaborate on
>>> your requirements.
>>
>> QEMU userspace can provide the interface on an AF_VSOCK listen socket.
>> Here is a rough idea:
>>
>>   qemu --chardev socket,id=chardev0,type=vsock,port=1234,server,nowait \
>>        --guest-introspection chardev=chardev0,allowed-cids=10
>>
>> Since it uses a chardev this means the guest introspection API is also
>> available via UNIX domain sockets to local applications, etc.
> 
> Exposing the introspection commands and events as IOCTLs can be done easily.
> 
> However, dropping the vsock access from the kernel, or having
> the IOCTL <-> vsock adapter only in userspace will change more things.
> With the proposed API, the guest introspection looks like:
> 
>  ------------------                  -----------------------------
> |                  |<-- /dev/kvm -->| qemu        VM1             |
> |                  |                |-------                      |
> |                  |                | Linux |                     |
> | KVM              |                 -----------------------------
> |                  |<-- /dev/kvm -->| qemu        VM2             |
> |                  |                |---------                    |
> |                  |                | Windows |                   |
> |                  |                 -----------------------------
> |                  |<-- /dev/kvm -->| qemu        VM3             |
> |    --------------|                |---------------------------  |
> |   | kvmi <- vhost-vsock -----------> guest introspection tool | |
>  ------------------                  -----------------------------
> 
> The introspection tool connects directly to the introspection subsystem (kvmi).
> 
> Moving the vsock to userland will change this:
> 
>                                      -----------------------------
>                  /----- /dev/kvm -->| new_tool (guest on/off/list)|<-- vsock -->\
>                  |                   -----------------------------              |
>                  |                                                              |
>  ----------------v-                  -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM1             |<-- vsock -->|
> |                  |                |-------                      |             |
> |                  |                | Linux |                     |             |
> | KVM              |                 -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM2             |<-- vsock -->|
> |                  |                |---------                    |             |
> |                  |                | Windows |                   |             |
> |                  |                 -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM3      /----->|<-- vsock -->/
> |           -------|                |---------------------v----   |
> |          | kvmi  |                | guest introspection tool |  |
>  ------------------                  -----------------------------
> 
> There will be a need for a new tool (and/or a modified libvirt) to get
> the guest events (on/off/list) and to change the VM1, VM2 invocations
> (to make them connect with the introspection tool). This might also be
> a problem with products where the host is locked down (e.g. RHEV).
> 
> Moving more code to QEMU will make the introspection harder with other
> hw emulators (from kvmtool to Google's implementation) because they will

I would bet that (not only) Google folks will look rather skeptically at
any proposal to widen the existing kvm user interface anyway, because of
the security implications. Reducing that surface should be a primary
design goal, IMHO.

Jan

> need more than just:
>    - migration notification: the introspection tool needs to unhook
>      from the introspected guests [minor change]
>    - KVM_SET_VM_UUID support [minor change]
>    - vsock [medium change], but only if this particular emulator is used
>      to start the guest introspection tool
> 
> Also, the path between kvmi and the introspection tool (when running
> isolated in a guest) will be longer, adding some overhead, which is
> a big problem for live introspectors. The EXTERIOR VMI paper[1] shows
> a 5-20 times slowdown for small programs (ps, uptime). So, a shorter
> path will help keep the overhead low.
> 
> My colleague will follow up with some stats collected during an
> introspection session. Hopefully they will shed more light on the
> performance required from the tool-KVM communication channel.
> 
> [1]: https://www.utdallas.edu/~zhiqiang.lin/file/VEE13-Slides.pdf
> 

-- 
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-20 14:58         ` alazar
  2017-06-20 15:03           ` Jan Kiszka
@ 2017-06-21 11:04           ` Stefan Hajnoczi
  2017-06-21 13:25             ` Paolo Bonzini
  1 sibling, 1 reply; 38+ messages in thread
From: Stefan Hajnoczi @ 2017-06-21 11:04 UTC (permalink / raw)
  To: alazar
  Cc: Jan Kiszka, Mihai Dontu, Paolo Bonzini, Radim Krčmář, kvm

On Tue, Jun 20, 2017 at 05:58:41PM +0300, alazar@bitdefender.com wrote:
> On Mon, 19 Jun 2017 10:39:28 +0100, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Fri, Jun 16, 2017 at 05:34:48PM +0200, Jan Kiszka wrote:
> > > On 2017-06-16 17:18, Mihai Donțu wrote:
> > > > On Fri, 2017-06-16 at 16:45 +0200, Jan Kiszka wrote:
> > > >> On 2017-06-16 15:43, Adalbert Lazar wrote:
> Moving the vsock to userland will change this:
> 
>                                      -----------------------------
>                  /----- /dev/kvm -->| new_tool (guest on/off/list)|<-- vsock -->\
>                  |                   -----------------------------              |
>                  |                                                              |
>  ----------------v-                  -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM1             |<-- vsock -->|
> |                  |                |-------                      |             |
> |                  |                | Linux |                     |             |
> | KVM              |                 -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM2             |<-- vsock -->|
> |                  |                |---------                    |             |
> |                  |                | Windows |                   |             |
> |                  |                 -----------------------------              |
> |                  |<-- /dev/kvm -->| qemu        VM3      /----->|<-- vsock -->/
> |           -------|                |---------------------v----   |
> |          | kvmi  |                | guest introspection tool |  |
>  ------------------                  -----------------------------
> 
> There will be a need for a new tool (and/or libvirt modified) to get
> the guest events (on/off/list) and change the VM1, VM2 invocations (to
> make them connect with the introspection tool). This might also be a
> problem with products having the host locked down (eg. RHEV).

I think that is desirable in fact.  kvmi should be an explicit feature
that is controlled by the management tools.  This way the policy can be
decided by the administrator.  Libvirt changes will be necessary.

Some KVM users do not want kvmi.  Think of the new memory encryption
hardware support that is coming out - the point is to prevent the
hypervisor from looking inside the VMs!  What you are doing is the
opposite of that.

Also, anyone who doesn't actually use kvmi would be better off disabling
the feature to minimize the attack surface.

I'm not sure if kvmi should be inside the QEMU process though.  If a
guest is compromised and escapes into QEMU, then kvmi is defeated.  It
may be a better design for kvmi to be an isolated component.

Stefan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event()
  2017-06-16 13:43 ` [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event() Adalbert Lazar
@ 2017-06-21 11:48   ` Paolo Bonzini
  2017-06-21 12:37     ` Mihai Donțu
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-21 11:48 UTC (permalink / raw)
  To: Adalbert Lazar, kvm; +Cc: Radim Krčmář, mdontu



On 16/06/2017 15:43, Adalbert Lazar wrote:
> +int kvm_breakpoint(struct kvm_vcpu *vcpu)
> +{
> +	gpa_t gpa;
> +	struct kvm_segment cs;
> +
> +	kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
> +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, cs.base + kvm_rip_read(vcpu), NULL);
> +	if (kvmi_breakpoint_event(vcpu, gpa))
> +		return 0;
> +	return 1;
> +}
> +EXPORT_SYMBOL_GPL(kvm_breakpoint);
> +

Please create a separate file with all these functions.
x86.c/vmx.c/svm.c are already too big, let's not make it worse.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 08/19] kvm: Add the introspection subsystem
  2017-06-16 13:43 ` [RFC PATCH 08/19] kvm: Add the introspection subsystem Adalbert Lazar
@ 2017-06-21 11:54   ` Paolo Bonzini
  2017-06-21 12:36     ` Mihai Donțu
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-21 11:54 UTC (permalink / raw)
  To: Adalbert Lazar, kvm; +Cc: Radim Krčmář, mdontu



On 16/06/2017 15:43, Adalbert Lazar wrote:
> +	while (!list_empty(&kvm->access_list)) {
> +		struct kvmi_mem_access *m =
> +		    list_first_entry(&kvm->access_list, struct kvmi_mem_access,
> +				     link);
> +
> +		list_del(&m->link);
> +		INIT_LIST_HEAD(&m->link);
> +
> +		kvmi_apply_mem_access(vcpu, m->gfn, m->access);
> +	}

How does this work when multiple VCPUs are running with different MMU
roles?  One VCPU is emptying the access_list for all, but
kvm_mmu_set_spte is using for_each_shadow_entry per-VCPU.

I'm really afraid of introducing subtle bugs, with possible security
effects.  I'm not really able to provide a suggestion yet, since I
haven't grasped the protocol entirely.

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 08/19] kvm: Add the introspection subsystem
  2017-06-21 11:54   ` Paolo Bonzini
@ 2017-06-21 12:36     ` Mihai Donțu
  2017-06-21 12:57       ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Mihai Donțu @ 2017-06-21 12:36 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Radim Krčmář, Adalbert Lazar, kvm

Hi Paolo,

On Wed, 2017-06-21 at 13:54 +0200, Paolo Bonzini wrote:
> On 16/06/2017 15:43, Adalbert Lazar wrote:
> > +	while (!list_empty(&kvm->access_list)) {
> > +		struct kvmi_mem_access *m =
> > +		    list_first_entry(&kvm->access_list, struct kvmi_mem_access,
> > +				     link);
> > +
> > +		list_del(&m->link);
> > +		INIT_LIST_HEAD(&m->link);
> > +
> > +		kvmi_apply_mem_access(vcpu, m->gfn, m->access);
> > +	}

Thank you for taking a look at this!

The entire mechanism for altering spte-s will need a separate
discussion, because right now it interferes with other mmu features
like dirty page logging and possibly others that I might not be aware
of.

The present code merely illustrates what we're really trying to
achieve: control the page permissions in the shadow page tables. The
mechanics are quite simple: while a VCPU is paused, the introspection
tool creates a list of all spte-s it wants to alter and then it
unpauses the VCPU which, on its way back into the guest, walks the
list and applies the changes. This list-based approach is used because
sometimes accessing spte-s requires access to qemu's address space,
which is tricky to do when the introspection tool runs as a separate
process. We tried using something on top of switch_mm() but it proved
unreliable. It also requires exporting some very deep x86-specific data
in order to build kvm-intel.ko.
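
To make that flow concrete, here is roughly how the producer side could
look (a sketch only; the lock and the helper name below are not part of
the posted patch):

static int kvmi_queue_mem_access(struct kvm *kvm, gfn_t gfn, u8 access)
{
	struct kvmi_mem_access *m = kzalloc(sizeof(*m), GFP_KERNEL);

	if (!m)
		return -ENOMEM;

	m->gfn = gfn;
	m->access = access;

	/* queued while the VCPUs are paused ... */
	spin_lock(&kvm->access_list_lock);
	list_add_tail(&m->link, &kvm->access_list);
	spin_unlock(&kvm->access_list_lock);

	/* ... and drained by each VCPU on its way back into the guest,
	 * in the loop quoted above */
	return 0;
}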

> How does this work when multiple VCPUs are running with different MMU
> roles?  One VCPU is emptying the access_list for all, but
> kvm_mmu_set_spte is using for_each_shadow_entry per-VCPU.

Indeed, here we assume all VCPU-s' spt pointers are loaded with the
same address. A correct approach would iterate over all VCPU-s, and I
suspect it would need to be a bit more complex with Intel's multiple
EPT views.

> I'm really afraid of introducing subtle bugs, with possible security
> effects.  I'm not really able to provide a suggestion yet, since I
> haven't grasped the protocol entirely.

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event()
  2017-06-21 11:48   ` Paolo Bonzini
@ 2017-06-21 12:37     ` Mihai Donțu
  0 siblings, 0 replies; 38+ messages in thread
From: Mihai Donțu @ 2017-06-21 12:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Radim Krčmář, Adalbert Lazar, kvm

On Wed, 2017-06-21 at 13:48 +0200, Paolo Bonzini wrote:
> On 16/06/2017 15:43, Adalbert Lazar wrote:
> > +int kvm_breakpoint(struct kvm_vcpu *vcpu)
> > +{
> > +	gpa_t gpa;
> > +	struct kvm_segment cs;
> > +
> > +	kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
> > +	gpa = kvm_mmu_gva_to_gpa_read(vcpu, cs.base + kvm_rip_read(vcpu), NULL);
> > +	if (kvmi_breakpoint_event(vcpu, gpa))
> > +		return 0;
> > +	return 1;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_breakpoint);
> > +
> 
> Please create a separate file with all these functions.
> x86.c/vmx.c/svm.c are already too big, let's not make it worse.

Noted. Thank you!

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 08/19] kvm: Add the introspection subsystem
  2017-06-21 12:36     ` Mihai Donțu
@ 2017-06-21 12:57       ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-21 12:57 UTC (permalink / raw)
  To: Mihai Donțu; +Cc: Radim Krčmář, Adalbert Lazar, kvm



On 21/06/2017 14:36, Mihai Donțu wrote:
> The entire mechanism for altering spte-s will need a separate
> discussion, because right now it interferes with other mmu features
> like dirty page logging and possibly other that I might not be aware
> of.

Okay, this is the stuff that should be put in the cover letter.  The
current cover letter is instead an attempt at documentation that should,
of course, live in Documentation/. :)

I think the right way to do it is to rebuild the MMU when the radix tree
is modified, with KVM_REQ_MMU_SYNC (or maybe even kvm_mmu_reset_context
is enough, I am not sure).  kvm_mmu_get_page can modify role.access
according to the result of the radix tree lookup.  When you make the
page read-only, the role (which is part of the hash key) changes and the
EPT tables will also be made read-only.
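
Roughly something like this, I suppose (a sketch only; the radix tree
member and the helper name are made up):

static int kvmi_set_gfn_access(struct kvm *kvm, gfn_t gfn, u8 access)
{
	struct kvmi_mem_access *m = kzalloc(sizeof(*m), GFP_KERNEL);
	int err;

	if (!m)
		return -ENOMEM;

	m->gfn = gfn;
	m->access = access;

	/* remember the desired permissions for this gfn */
	err = radix_tree_insert(&kvm->kvmi_access_tree, gfn, m);
	if (err) {
		kfree(m);
		return err;
	}

	/*
	 * Make every VCPU resync its MMU; kvm_mmu_get_page() would then
	 * consult the tree and adjust role.access, so a gfn made
	 * read-only hashes to a different, read-only shadow page.
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_MMU_SYNC);
	return 0;
}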

Paolo

> The present code merely illustrates what we're really trying to
> achieve: control the page permissions in the shadow page tables. The
> mechanics are quite simple: while a VCPU is paused the introspection
> tool creates a list of all spte's it wants to alter and then it
> unpauses the VCPU which, on its wait back into the guest, walks the
> list and applies the changes. This list-based approach is used because
> sometimes accessing spte-s requires access to qemu's address space
> which is tricky to do when the introspection tool runs as a separate
> process. We tried using something on top of switch_mm() but it proved
> unreliable. It also requires exporting some very deep x86-specific data
> in order to build kvm-intel.ko.
> 
>> How does this work when multiple VCPUs are running with different MMU
>> roles?  One VCPU is emptying the access_list for all, but
>> kvm_mmu_set_spte is using for_each_shadow_entry per-VCPU.
> Indeed, here we assume all VCPU-s' spt pointers are loaded with the
> same address. A correct approach would iterate over all vCPU-s and I
> suspect it would need to be a bit more complex with Intel's multiple
> EPT views.
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-21 11:04           ` Stefan Hajnoczi
@ 2017-06-21 13:25             ` Paolo Bonzini
  2017-06-27 16:12               ` Mihai Donțu
  0 siblings, 1 reply; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-21 13:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: alazar, Jan Kiszka, Mihai Dontu, Radim Krčmář, kvm



On 21/06/2017 13:04, Stefan Hajnoczi wrote:
> On Tue, Jun 20, 2017 at 05:58:41PM +0300, alazar@bitdefender.com wrote:
>> Moving the vsock to userland will change this:
>>
>>                                      -----------------------------
>>                  /----- /dev/kvm -->| new_tool (guest on/off/list)|<-- vsock -->\
>>                  |                   -----------------------------              |
>>                  |                                                              |
>>  ----------------v-                  -----------------------------              |
>> |                  |<-- /dev/kvm -->| qemu        VM1             |<-- vsock -->|
>> |                  |                |-------                      |             |
>> |                  |                | Linux |                     |             |
>> | KVM              |                 -----------------------------              |
>> |                  |<-- /dev/kvm -->| qemu        VM2             |<-- vsock -->|
>> |                  |                |---------                    |             |
>> |                  |                | Windows |                   |             |
>> |                  |                 -----------------------------              |
>> |                  |<-- /dev/kvm -->| qemu        VM3      /----->|<-- vsock -->/
>> |           -------|                |---------------------v----   |
>> |          | kvmi  |                | guest introspection tool |  |
>>  ------------------                  -----------------------------
>>
>> There will be a need for a new tool (and/or libvirt modified) to get
>> the guest events (on/off/list) and change the VM1, VM2 invocations (to
>> make them connect with the introspection tool).

This kind of event should be provided directly by QEMU to the guest
introspection tool---see below.

>> This might also be a
>> problem with products having the host locked down (eg. RHEV).
> I think that is desirable in fact.  kvmi should be an explicit feature
> that is controlled by the management tools.  This way the policy can be
> decided by the administrator.  Libvirt changes will be necessary.
> 
> Some KVM users do not want kvmi.  Think of the new memory encryption
> hardware support that is coming out - the point is to prevent the
> hypervisor from looking inside the VMs!  What you are doing is the
> opposite of that.

I think Stefan has made quite a point here.  The policy manager for
kvmi should definitely be on the host, not on the introspector machine.
There can be multiple introspectors, some on the host and some on an
appliance, though I suppose a limit of one introspector per VM is
acceptable.

And this should be the starting point of the design.

Compared to Stefan's proposed command line:

  qemu --chardev socket,id=chardev0,type=vsock,port=1234,server,nowait \
       --guest-introspection chardev=chardev0,allowed-cids=10

I would do it in the opposite direction.  The introspector is the one that
presents a server socket; QEMU connects to the introspection VM, possibly
does some handshaking, and passes the file descriptor to KVM.  With another
small change, replacing --guest-introspection with the generic --object, that
gives the following:

  qemu --chardev socket,id=chardev0,type=vsock,cid=10,port=1234,nowait \
       --object introspection,chardev=chardev0,allow=all,id=kvmi \
       --accel kvm,introspection=kvmi

The policy is specified via kvmi-{allow,deny} parameters and passed to KVM
via ioctls together with the socket file descriptor.
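
Roughly, on the userspace side (the ioctl, its number and the structure
are all made up to illustrate the handoff):

#include <err.h>
#include <stdint.h>
#include <sys/ioctl.h>

struct kvm_introspection {
	int32_t  fd;        /* connected socket from the handshake */
	uint32_t padding;
	uint64_t commands;  /* bitmask: allowed introspection commands */
	uint64_t events;    /* bitmask: events that may be intercepted */
};

/* hypothetical ioctl number in the KVMIO (0xAE) space */
#define KVM_SET_INTROSPECTION _IOW(0xAE, 0xf0, struct kvm_introspection)

static void kvmi_attach(int vm_fd, int sock_fd,
			uint64_t commands, uint64_t events)
{
	struct kvm_introspection intro = {
		.fd = sock_fd,
		.commands = commands,
		.events = events,
	};

	/* QEMU has already connected and done the handshaking; from
	 * here on, KVM owns the socket */
	if (ioctl(vm_fd, KVM_SET_INTROSPECTION, &intro) < 0)
		err(1, "KVM_SET_INTROSPECTION");
}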

This lets you reuse common POSIX concepts and simplify the kernel code.
KVMI_EVENT_GUEST_ON is just POLLIN on the server socket (plus handshaking
on the client socket); KVMI_EVENT_GUEST_OFF is POLLHUP on the client socket.
There's no need for KVM to know a UUID, as the introspection application
can just have your usual poll() event loop or thread, and look up the VM
from the file descriptor.
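
For example, with nothing but POSIX (accept_vm(), vm_for_fd(),
handle_event() and detach_vm() stand in for the introspector's own
bookkeeping; they are not part of any proposed API):

#include <poll.h>
#include <unistd.h>

struct vm;
int accept_vm(int listen_fd);       /* accept() + handshake */
struct vm *vm_for_fd(int fd);       /* fd -> VM lookup */
void handle_event(struct vm *vm);   /* read and process one message */
void detach_vm(struct vm *vm);      /* drop the VM's state */

static void introspector_loop(int listen_fd)
{
	struct pollfd fds[64] = { { .fd = listen_fd, .events = POLLIN } };
	int nfds = 1;

	for (;;) {
		int i;

		if (poll(fds, nfds, -1) < 0)
			break;

		/* POLLIN on the server socket == "guest on" */
		if ((fds[0].revents & POLLIN) && nfds < 64) {
			fds[nfds].fd = accept_vm(listen_fd);
			fds[nfds].events = POLLIN;
			nfds++;
		}

		for (i = 1; i < nfds; i++) {
			/* POLLHUP on a client socket == "guest off" */
			if (fds[i].revents & POLLHUP) {
				detach_vm(vm_for_fd(fds[i].fd));
				close(fds[i].fd);
				fds[i].fd = -1; /* poll() skips negative fds */
			} else if (fds[i].revents & POLLIN) {
				handle_event(vm_for_fd(fds[i].fd));
			}
		}
	}
}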

QEMU supports socket reconnection, so you don't need KVMI_GET_GUESTS either.
If KVM cannot write to the socket, it should exit to userspace with a new
KVM_EXIT_KVMI vmexit (which can have multiple subcodes, one of them being
KVM_EXIT_KVMI_SOCKET_ERROR).
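
On the QEMU side that would be handled along these lines (again, all of
it hypothetical: the exit reason, the run->kvmi member and the reconnect
helper are only the proposal above):

static void handle_kvmi_exit(struct kvm_run *run, int vm_fd)
{
	switch (run->exit_reason) {
	case KVM_EXIT_KVMI:
		if (run->kvmi.subcode == KVM_EXIT_KVMI_SOCKET_ERROR) {
			/* reconnect and pass the new socket fd back to
			 * KVM, as on the initial setup */
			kvmi_reconnect(vm_fd);
		}
		break;
	}
}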

Of course the link need not even be VSOCK-based.  It can be a Unix socket
as Stefan has already mentioned, which is always nice when debugging or
writing unit tests.  I assume you'll want later some VMFUNC-based access
to the guest's memory; local introspection tools could use an alternative
way via file descriptor passing, similar to what is used already by vhost-user.
And dually, a hypothetical vhost-user server living in a VM could use VMFUNC
to access guest memory without being able to do all the kinds of ugly traps
that your current use case does.  This is another reason why the policy has
to be in userspace.

Also, as a matter of fact: this series does not include either documentation
or unit tests.  That's seriously bad.

Patch 1 should explain the socket protocol in English and only affect
Documentation/ and possibly arch/x86/include/uapi.  There's no way that
I can review 2000 lines of code without even knowing what it is supposed
to be like.  In fact, for the next RFC, perhaps you should only submit
patch 1. :)

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-21 13:25             ` Paolo Bonzini
@ 2017-06-27 16:12               ` Mihai Donțu
  2017-06-27 16:23                 ` Paolo Bonzini
  0 siblings, 1 reply; 38+ messages in thread
From: Mihai Donțu @ 2017-06-27 16:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: alazar, Jan Kiszka, Radim Krčmář, kvm, Stefan Hajnoczi

On Wed, 2017-06-21 at 09:25 -0400, Paolo Bonzini wrote:
> On 21/06/2017 13:04, Stefan Hajnoczi wrote:
> > On Tue, Jun 20, 2017 at 05:58:41PM +0300, alazar@bitdefender.com wrote:
> > > Moving the vsock to userland will change this:
> > > 
> > >                                      -----------------------------
> > >                  /----- /dev/kvm -->| new_tool (guest on/off/list)|<-- vsock -->\
> > >                  |                   -----------------------------              |
> > >                  |                                                              |
> > >  ----------------v-                  -----------------------------              |
> > > |                  |<-- /dev/kvm -->| qemu        VM1             |<-- vsock -->|
> > > |                  |                |-------                      |             |
> > > |                  |                | Linux |                     |             |
> > > | KVM              |                 -----------------------------              |
> > > |                  |<-- /dev/kvm -->| qemu        VM2             |<-- vsock -->|
> > > |                  |                |---------                    |             |
> > > |                  |                | Windows |                   |             |
> > > |                  |                 -----------------------------              |
> > > |                  |<-- /dev/kvm -->| qemu        VM3      /----->|<-- vsock -->/
> > > |           -------|                |---------------------v----   |
> > > |          | kvmi  |                | guest introspection tool |  |
> > >  ------------------                  -----------------------------
> > > 
> > > There will be a need for a new tool (and/or libvirt modified) to get
> > > the guest events (on/off/list) and change the VM1, VM2 invocations (to
> > > make them connect with the introspection tool).
> 
> This kind of event should be provided directly by QEMU to the guest
> introspection tool---see below.
> 
> > > This might also be a
> > > problem with products having the host locked down (eg. RHEV).
> > 
> > I think that is desirable in fact.  kvmi should be an explicit feature
> > that is controlled by the management tools.  This way the policy can be
> > decided by the administrator.  Libvirt changes will be necessary.
> > 
> > Some KVM users do not want kvmi.  Think of the new memory encryption
> > hardware support that is coming out - the point is to prevent the
> > hypervisor from looking inside the VMs!  What you are doing is the
> > opposite of that.

Apologies for the late reply.

> I think Stefan has made quite a point here.  The policy manager for
> kvmi should definitely be on the host, not on the introspector machine.
> There can be multiple introspectors, some on the host and some on an
> appliance, though I suppose a limit of one introspector per VM is
> acceptable.

The host should, indeed, control whether the introspection feature
should be made available. I can see this being a checkbox in, say,
virt-manager.

Assuming the feature is enabled, the only policy we are interested in
is whether our application should indeed try to introspect a guest,
and this is connected to libvirt. For example, our management solution
will query libvirt about the running VM-s and then, depending on the
configuration made by an administrator, will tell our application which
VM-s to actually introspect. This is where the UUID comes into play:
the management solution refers to VM-s by their UUID and, in turn, the
application must be able to convert those to an actual handle (a file
descriptor or something else).

The flow you described below seems to make room for this: during the
initial handshake, qemu could tell our application the UUID of the
guest and we'd keep a map of sorts. No need to put that in the kernel.
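
As an illustration only (no wire format has been proposed yet), the
handshake could carry as little as:

#include <stdint.h>

struct kvmi_handshake {
	uint32_t version;   /* protocol version, cf. KVMI_GET_VERSION */
	uint32_t padding;
	uint8_t  uuid[16];  /* the VM's UUID, as known to qemu/libvirt */
};

The tool would read this once per accepted connection and use the UUID
to index its socket-to-VM map.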

> And this should be the starting point of the design.
> 
> Compared to Stefan's proposed command line:
> 
>   qemu --chardev socket,id=chardev0,type=vsock,port=1234,server,nowait \
>        --guest-introspection chardev=chardev0,allowed-cids=10
> 
> I would do it in the opposite direction.  The introspector is the one that
> presents a server socket; QEMU connects to the introspection VM, possibly
> does some handshaking, and passes the file descriptor to KVM.  With another
> small change, replacing --guest-introspection with the generic --object, that
> gives the following:
> 
>   qemu --chardev socket,id=chardev0,type=vsock,cid=10,port=1234,nowait \
>        --object introspection chardev=chardev0,allow=all,id=kvmi \
>        --accel kvm,introspection=kvmi
> 
> The policy is specified via kvmi-{allow,deny} parameters and passed to KVM
> via ioctls together with the socket file descriptor.

I understand from this that the policy controls whether a certain VM
can be introspected. I'd imagine that it will default to "false" and
be set to "true" whenever an "introspection" object is specified.

> This lets you reuse common POSIX concepts and simplify the kernel code.
> KVMI_EVENT_GUEST_ON is just POLLIN on the server socket (plus handshaking
> on the client socket); KVMI_EVENT_GUEST_OFF is POLLHUP on the client socket.
> There's no need for KVM to know a UUID, as the introspection application
> can just have your usual poll() event loop or thread, and look up the VM
> from the file descriptor.
> 
> QEMU supports socket reconnection, so you don't need KVMI_GET_GUESTS either.
> If KVM cannot write to the socket, it should exit to userspace with a new
> KVM_EXIT_KVMI vmexit (which can have multiple subcodes, one of them being
> KVM_EXIT_KVMI_SOCKET_ERROR).

If I understand all of the above correctly, qemu will initiate the
connection to the introspection tool and, after a handshake, pass the
file descriptor to KVM, thus making further communication take place
only between the tool and the host kernel (no need to pass through the
host user space).

> Of course the link need not even be VSOCK-based.  It can be a Unix socket
> as Stefan has already mentioned, which is always nice when debugging or
> writing unit tests.  I assume you'll want later some VMFUNC-based access
> to the guest's memory; local introspection tools could use an alternative
> way via file descriptor passing, similar to what is used already by vhost-user.
> And dually, a hypothetical vhost-user server living in a VM could use VMFUNC
> to access guest memory without being able to do all the kind of ugly traps
> that your current usecase does.  This is another reason why policy has to
> be in userspace.
> 
> Also, as a matter of fact: this series does not include either documentation
> or unit tests.  That's seriously bad.
> 
> Patch 1 should explain the socket protocol in English and only affect
> Documentation/ and possibly arch/x86/include/uapi.  There's no way that
> I can review 2000 lines of code without even knowing what it is supposed
> to be like.  In fact, for the next RFC, perhaps you should only submit
> patch 1. :)

Noted! Thank you,

-- 
Mihai Donțu

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH 00/19] Guest introspection
  2017-06-27 16:12               ` Mihai Donțu
@ 2017-06-27 16:23                 ` Paolo Bonzini
  0 siblings, 0 replies; 38+ messages in thread
From: Paolo Bonzini @ 2017-06-27 16:23 UTC (permalink / raw)
  To: Mihai Donțu
  Cc: alazar, Jan Kiszka, Radim Krčmář, kvm, Stefan Hajnoczi



On 27/06/2017 18:12, Mihai Donțu wrote:
> The host should, indeed, control whether the introspection feature
> should be made available. I can see this being a checkbox in, say,
> virt-manager.
> 
> Assuming the feature is enabled, the only policy we are interested in
> is whether our application should indeed try and introspect a guest,
> and this is connected to libvirt. For example, our management solution
> will query libvirt about running VM-s and then, depending on
> configuration made by an administrator, will tell our application which
> VM-s to actually introspect. This is where the UUID comes into play:
> the management solution refers to VM-s by their UUID and in turn the
> application must be able to convert those to an actual handle (a file
> descriptor or something else).
> 
> The flow you described below seems to make room for this: during the
> initial handshake, qemu could tell our application the UUID of the
> guest and we'd keep a map of sorts. No need to put that in kernel.

Exactly.  QEMU already knows the UUID.

>>
>>   qemu --chardev socket,id=chardev0,type=vsock,cid=10,port=1234,nowait \
>>        --object introspection,chardev=chardev0,allow=all,id=kvmi \
>>        --accel kvm,introspection=kvmi
>>
>> The policy is specified via kvmi-{allow,deny} parameters and passed to KVM
>> via ioctls together with the socket file descriptor.
> 
> I understand from this that the policy controls whether a certain VM
> can be introspected. I'd imagine that it will be default "false" and
> set to "true" respectively whenever an "introspection" object is
> specified.

It also controls what operations can be intercepted for introspection
purposes.  For example, a VM may be okay with allowing memory access, but
would not be okay with allowing the introspector to trap instructions or
page faults.  (Or vice versa, since granting memory access to an untrusted
introspector effectively breaks security.)

>> QEMU supports socket reconnection, so you don't need KVMI_GET_GUESTS either.
>> If KVM cannot write to the socket, it should exit to userspace with a new
>> KVM_EXIT_KVMI vmexit (which can have multiple subcodes, one of them being
>> KVM_EXIT_KVMI_SOCKET_ERROR).
> 
> If I understand all of the above correctly, qemu will initiate the
> connection to the introspection tool and after a handshake pass the
> file descritor to KVM thus making further communication take place only
> between the tool and the host kernel (no need to pass through the host 
> user space).

Right, though host user space is invoked at least for the error case, in
order to reconnect to the introspector.  This is useful in case the
introspector is restarted after an update or crash.

I'm not sure if other cases will need userspace cooperation.  We'll see!

Paolo

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2017-06-27 16:23 UTC | newest]

Thread overview: 38+ messages
2017-06-16 13:43 [RFC PATCH 00/19] Guest introspection Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 01/19] kvm: x86: mmu: Add kvm_mmu_get_spte() and kvm_mmu_set_spte() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 02/19] kvm: x86: Add kvm_arch_vcpu_set_regs() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 03/19] mm: Add vm_replace_page() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 04/19] kvm: Add kvm_enum() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 05/19] kvm: Add uuid member in struct kvm + support for KVM_CAP_VM_UUID Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 06/19] kvm: Add kvm_vm_shutdown() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 07/19] kvm: x86: Add kvm_arch_msr_intercept() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 08/19] kvm: Add the introspection subsystem Adalbert Lazar
2017-06-21 11:54   ` Paolo Bonzini
2017-06-21 12:36     ` Mihai Donțu
2017-06-21 12:57       ` Paolo Bonzini
2017-06-16 13:43 ` [RFC PATCH 09/19] kvm: Hook in kvmi on VM on/off events Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 10/19] kvm: vmx: Hook in kvmi_page_fault() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 11/19] kvm: x86: Hook in kvmi_breakpoint_event() Adalbert Lazar
2017-06-21 11:48   ` Paolo Bonzini
2017-06-21 12:37     ` Mihai Donțu
2017-06-16 13:43 ` [RFC PATCH 12/19] kvm: x86: Hook in kvmi_trap_event() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 13/19] kvm: x86: Hook in kvmi_cr_event() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 14/19] kvm: x86: Hook in kvmi_xsetbv_event() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 15/19] kvm: x86: Hook in kvmi_msr_event() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 16/19] kvm: x86: Change the emulation context Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 17/19] kvm: x86: Hook in kvmi_vmcall_event() Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 18/19] kvm: x86: Set the new spte flags before entering the guest Adalbert Lazar
2017-06-16 13:43 ` [RFC PATCH 19/19] kvm: x86: Handle KVM_REQ_INTROSPECTION Adalbert Lazar
2017-06-16 14:45 ` [RFC PATCH 00/19] Guest introspection Jan Kiszka
2017-06-16 15:18   ` Mihai Donțu
2017-06-16 15:34     ` Jan Kiszka
2017-06-16 15:59       ` Mihai Donțu
2017-06-19  9:39       ` Stefan Hajnoczi
2017-06-20 14:58         ` alazar
2017-06-20 15:03           ` Jan Kiszka
2017-06-21 11:04           ` Stefan Hajnoczi
2017-06-21 13:25             ` Paolo Bonzini
2017-06-27 16:12               ` Mihai Donțu
2017-06-27 16:23                 ` Paolo Bonzini
2017-06-16 17:05     ` Paolo Bonzini
2017-06-16 17:27       ` Jan Kiszka
